Lecture Notes in Artificial Intelligence Edited by R. Goebel, J. Siekmann, and W. Wahlster
Subseries of Lecture Notes in Computer Science
6635
Joshua Zhexue Huang Longbing Cao Jaideep Srivastava (Eds.)
Advances in Knowledge Discovery and Data Mining 15th Pacific-Asia Conference, PAKDD 2011 Shenzhen, China, May 24-27, 2011 Proceedings, Part II
Series Editors Randy Goebel, University of Alberta, Edmonton, Canada Jörg Siekmann, University of Saarland, Saarbrücken, Germany Wolfgang Wahlster, DFKI and University of Saarland, Saarbrücken, Germany Volume Editors Joshua Zhexue Huang Chinese Academy of Sciences Shenzhen Institutes of Advanced Technology (SIAT) Shenzhen 518055, China E-mail:
[email protected] Longbing Cao University of Technology Sydney Faculty of Engineering and Information Technology Advanced Analytics Institute Center for Quantum Computation and Intelligent Systems Sydney, NSW 2007, Australia E-mail:
[email protected] Jaideep Srivastava University of Minnesota Department of Computer Science and Engineering Minneapolis, MN 55455, USA E-mail:
[email protected] ISSN 0302-9743 e-ISSN 1611-3349 ISBN 978-3-642-20846-1 e-ISBN 978-3-642-20847-8 DOI 10.1007/978-3-642-20847-8 Springer Heidelberg Dordrecht London New York Library of Congress Control Number: 2011926132 CR Subject Classification (1998): I.2, H.3, H.4, H.2.8, I.4, C.2 LNCS Sublibrary: SL 7 – Artificial Intelligence © Springer-Verlag Berlin Heidelberg 2011 This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law. The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India Printed on acid-free paper Springer is part of Springer Science+Business Media (www.springer.com)
Preface
PAKDD has been recognized as a major international conference in the areas of data mining (DM) and knowledge discovery in databases (KDD). It provides an international forum for researchers and industry practitioners to share their new ideas, original research results and practical development experiences from all KDD-related areas including data mining, machine learning, artificial intelligence and pattern recognition, data warehousing and databases, statistics, knowledge engineering, behavioral sciences, visualization, and emerging areas such as social network analysis.
The 15th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 2011) was held in Shenzhen, China, during May 24–27, 2011. PAKDD 2011 introduced a double-blind review process. It received 331 submissions after checking for validity. Submissions came from 45 countries and regions, a significant improvement in internationalization over PAKDD 2010 (34 countries and regions). All papers were assigned to at least four Program Committee members. Most papers received more than three review reports. As a result of the deliberation process, only 90 papers were accepted, with 32 papers (9.7%) for long presentation and 58 (17.5%) for short presentation.
The PAKDD 2011 conference program also included five workshops: the Workshop on Behavior Informatics (BI 2011), the Workshop on Advances and Issues in Traditional Chinese Medicine Clinical Data Mining (AI-TCM), Quality Issues, Measures of Interestingness and Evaluation of Data Mining Models (QIMIE 2011), Biologically Inspired Techniques for Data Mining (BDM 2011), and the Workshop on Data Mining for Healthcare Management (DMHM 2011). PAKDD 2011 also featured talks by three distinguished invited speakers, six tutorials, and a Doctoral Symposium on Data Mining.
The conference would not have been successful without the support of the Program Committee members (203), external reviewers (168), Organizing Committee members, invited speakers, authors, tutorial presenters, workshop organizers, and the conference attendees. We highly appreciate the conscientious reviews provided by the Program Committee members and external reviewers. We are indebted to the members of the PAKDD Steering Committee for their invaluable suggestions and support throughout the organization process. Our special thanks go to the local arrangements team and volunteers. We would also like to thank all those who contributed to the success of PAKDD 2011 but whose names cannot be listed. We greatly appreciate Springer LNCS for continuing to publish the main conference and workshop proceedings. Thanks also to Andrei Voronkov for hosting the entire PAKDD reviewing process on the EasyChair.org site.
Finally, we greatly appreciate the support from various sponsors and institutions. The conference was organized by the Shenzhen Institutes of Advanced
Technology, Chinese Academy of Sciences, China, and co-organized by the University of Hong Kong, China and the University of Technology Sydney, Australia. We hope you enjoy the proceedings of PAKDD 2011, which present cutting-edge research in data mining and knowledge discovery. We also hope all participants took this opportunity to exchange ideas with each other and enjoyed the modern city of Shenzhen!
May 2011
Joshua Huang Longbing Cao Jaideep Srivastava
Organization
Organizing Committee Honorary Chair Philip S. Yu General Co-chairs Jianping Fan David Cheung
University of Illinois at Chicago, USA
Shenzhen Institutes of Advanced Technology, CAS, China University of Hong Kong, China
Program Committee Co-chairs Joshua Huang Longbing Cao Jaideep Srivastava
Shenzhen Institutes of Advanced Technology, CAS, China University of Technology Sydney, Australia University of Minnesota, USA
Workshop Co-chairs James Bailey Yun Sing Koh Tutorial Co-chairs Xiong Hui Sanjay Chawla
The University of Melbourne, Australia The University of Auckland, New Zealand
Rutgers, the State University of New Jersey, USA The University of Sydney, Australia
Local Arrangements Co-chairs Shengzhong Feng Jun Luo
Shenzhen Institutes of Advanced Technology, CAS, China Shenzhen Institutes of Advanced Technology, CAS, China
Sponsorship Co-chairs Yalei Bi Zhong Ming
Shenzhen Institutes of Advanced Technology, CAS, China Shenzhen University, China
Publicity Co-chairs Jian Yang Ye Li Yuming Ou
Beijing University of Technology, China Shenzhen Institutes of Advanced Technology, CAS, China University of Technology Sydney, Australia
Publication Chair Longbing Cao
University of Technology Sydney, Australia
Steering Committee Co-chairs Rao Kotagiri Graham Williams
University of Melbourne, Australia Australian National University, Australia
Life Members David Cheung Masaru Kitsuregawa Rao Kotagiri Hiroshi Motoda Graham Williams (Treasurer) Ning Zhong
University of Hong Kong, China Tokyo University, Japan University of Melbourne, Australia AFOSR/AOARD and Osaka University, Japan Australian National University, Australia Maebashi Institute of Technology, Japan
Members Ming-Syan Chen Tu Bao Ho Ee-Peng Lim Huan Liu Jaideep Srivastava Takashi Washio Thanaruk Theeramunkong Kyu-Young Whang Chengqi Zhang Zhi-Hua Zhou Krishna Reddy
National Taiwan University, Taiwan, ROC Japan Advanced Institute of Science and Technology, Japan Singapore Management University, Singapore Arizona State University, USA University of Minnesota, USA Institute of Scientific and Industrial Research, Osaka University, Japan Thammasat University, Thailand Korea Advanced Institute of Science and Technology, Korea University of Technology Sydney, Australia Nanjing University, China IIIT, Hyderabad, India
Program Committee Adrian Pearce Aijun An Aixin Sun Akihiro Inokuchi
The University of Melbourne, Australia York University, Canada Nanyang Technological University, Singapore Osaka University, Japan
Akira Shimazu
Japan Advanced Institute of Science and Technology, Japan Alfredo Cuzzocrea University of Calabria, Italy Andrzej Skowron Warsaw University, Poland Anirban Mondal IIIT Delhi, India Aoying Zhou Fudan University, China Arbee Chen National Chengchi University, Taiwan, ROC Aristides Gionis Yahoo Research Labs, Spain Atsuyoshi Nakamura Hokkaido University, Japan Bart Goethals University of Antwerp, Belgium Bernhard Pfahringer University of Waikato, New Zealand Bo Liu University of Technology, Sydney, Australia Bo Zhang Tsinghua University, China Boonserm Kijsirikul Chulalongkorn University, Thailand Bruno Cremilleux Universit´e de Caen, France Chandan Reddy Wayne State University, USA Chang-Tien Lu Virginia Tech, USA Chaveevan Pechsiri Dhurakijpundit University, Thailand Chengqi Zhang University of Technology, Australia Chih-Jen Lin National Taiwan University, Taiwan, ROC Choochart Haruechaiyasak NECTEC, Thailand Chotirat Ann Ratanamahatana Chulalongkorn University, Thailand Chun-hung Li Hong Kong Baptist University, Hong Kong, China Chunsheng Yang NRC Institute for Information Technology, Canada Clement Yu University of Illinois at Chicago, USA Dacheng Tao The Hong Kong Polytechnic University, Hongkong, China Daisuke Ikeda Kyushu University, Japan Dan Luo University of Technology, Sydney, Australia Daoqiang Zhang Nanjing University of Aeronautics and Astronautics, China Dao-Qing Dai Sun Yat-Sen University, China David Albrecht Monash University, Australia David Taniar Monash University, Australia Di Wu Chinese University of Hong Kong, China Diane Cook Washington State University, USA Dit-Yan Yeung Hong Kong University of Science and Technology, China Dragan Gamberger Rudjer Boskovic Institute, Croatia Du Zhang California State University, USA Ee-Peng Lim Nanyang Technological University, Singapore Eibe Frank University of Waikato, New Zealand Evaggelia Pitoura University of Ioannina, Greece
Floriana Esposito Gang Li George Karypis Graham Williams Guozhu Dong Hai Wang Hanzi Wang Harry Zhang Hideo Bannai Hiroshi Nakagawa Hiroyuki Kawano Hiroyuki Kitagawa Hua Lu Huan Liu Hui Wang Hui Xiong Hui Yang Huiping Cao Irena Koprinska Ivor Tsang James Kwok Jason Wang Jean-Marc Petit Jeffrey Ullman Jiacai Zhang Jialie Shen Jian Yin Jiawei Han Jiuyong Li Joao Gama Jun Luo Junbin Gao Junping Zhang K. Selcuk Candan Kaiq huang Kennichi Yoshida Kitsana Waiyamai Kouzou Ohara Liang Wang Ling Chen Lisa Hellerstein
Universit` a di Bari, Italy Deakin University, Australia University of Minnesota, USA Australian Taxation Office, Australia Wright State University, USA University of Aston, UK University of Adelaide, Australia University of New Brunswick, Canada Kyushu University, Japan University of Tokyo, Japan Nanzan University, Japan University of Tsukuba, Japan Aalborg University, Denmark Arizona State University, USA University of Ulster, UK Rutgers University, USA San Francisco State University, USA New Mexico State University, USA University of Sydney, Australia Hong Kong University of Science and Technology, China Hong Kong University of Science and Technology, China New Jersey Science and Technology University, USA INSA Lyon, France Stanford University, USA Beijing Normal University, China Singapore Management University, Singapore Sun Yat-Sen University, China University of Illinois at Urbana-Champaign, USA University of South Australia University of Porto, Portugal Chinese Academy of Sciences, China Charles Sturt University, Australia Fudan University, China Arizona State University, USA Chinese Academy of Sciences, China Tsukuba University, Japan Kasetsart University, Thailand Osaka University, Japan The University of Melbourne, Australia University of Technology Sydney, Australia Polytechnic Institute of NYU, USA
Longbing Cao Manabu Okumura Marco Maggini Marut Buranarach Marzena Kryszkiewicz Masashi Shimbo Masayuki Numao Maurice van Keulen Xiaofeng Meng Mengjie Zhang Michael Berthold Michael Katehakis Michalis Vazirgiannis Min Yao Mingchun Wang Mingli Song Mohamed Mokbel Naren Ramakrishnan Ngoc Thanh Nguyen Ning Zhong Ninghui Li Olivier de Vel Pabitra Mitra Panagiotis Karras Pang-Ning Tan Patricia Riddle Panagiotis Karras Jialie Shen Pang-Ning Tan Patricia Riddle Peter Christen Peter Triantafillou Philip Yu Philippe Lenca Pohsiang Tsai Prasanna Desikan Qingshan Liu Rao Kotagiri Richi Nayak Rui Camacho Ruoming Jin
University of Technology Sydney, Australia Tokyo Institute of Technology, Japan University of Siena, Italy NECTEC, Thailand Warsaw University of Technology, Poland Nara Institute of Science and Technology, Japan Osaka University, Japan University of Twente, The Netherlands Renmin University of China, China Victoria University of Wellington, New Zealand University of Konstanz, Germany Rutgers Business School, USA Athens University of Economics and Business, Greece Zhejiang University, China Tianjin University of Technology and Education, China Hong Kong Polytechnical University, China University of Minnesota, USA Virginia Tech, USA Wroclaw University of Technology, Poland Maebashi Institute of Technology, Japan Purdue University, USA DSTO, Australia Indian Institute of Technology Kharagpur, India University of Zurich, Switzerland Michigan State University, USA University of Auckland, New Zealand National University of Singapore, Singapore Singapore Management University, Singapore Michigan State University, USA University of Auckland, New Zealand Australian National University, Australia University of Patras, Greece IBM T.J. Watson Research Center, USA Telecom Bretagne, France National Formosa University, Taiwan, ROC University of Minnesota, USA Chinese Academy of Sciences, China The University of Melbourne, Australia Queensland University of Technology, Australia LIACC/FEUP University of Porto, Portugal Kent State University, USA
S.K. Gupta Salvatore Orlando Sameep Mehta Sanjay Chawla Sanjay Jain Sanjay Ranka San-Yih Hwang Seiji Yamada Sheng Zhong Shichao Zhang Shiguang Shan Shoji Hirano Shu-Ching Chen Shuigeng Zhou Songcan Chen Srikanta Tirthapura Stefan Rueping Suman Nath Sung Ho Ha Sungzoon Cho Szymon Jaroszewicz Tadashi Nomoto Taizhong Hu Takashi Washio Takeaki Uno Takehisa Yairi Tamir Tassa Taneli Mielikainen Tao Chen Tao Li Tao Mei Tao Yang Tetsuya Yoshida Thepchai Supnithi Thomas Seidl Tie-Yan Liu Toshiro Minami
Indian Institute of Technology, India University of Venice, Italy IBM, India Research Labs, India University of Sydney, Australia National University of Singapore, Singapore University of Florida, USA National Sun Yat-Sen University, Taiwan, ROC National Institute of Informatics, Japan State University of New York at Buffalo, USA University of Technology at Sydney, Australia Digital Media Research Center, ICT Shimane University, Japan Florida International University, USA Fudan University, China Nanjing University of Aeronautics and Astronautics, China Iowa State University, USA Fraunhofer IAIS, Germany Networked Embedded Computing Group, Microsoft Research Kyungpook National University, Korea Seoul National University, Korea Technical University of Szczecin, Poland National Institute of Japanese Literature, Tokyo, Japan University of Science and Technology of China Osaka University, Japan National Institute of Informatics (NII), Japan University of Tokyo, Japan The Open University, Israel Nokia Research Center, USA Shenzhen Institutes of Advanced Technology, Chinese Academy of Science, China Florida International University, USA Microsoft Research Asia Shenzhen Institutes of Advanced Technology, Chinese Academy of Science, China Hokkaido University, Japan National Electronics and Computer Technology Center, Thailand RWTH Aachen University, Germany Microsoft Research Asia, China Kyushu Institute of Information Sciences (KIIS) and Kyushu University Library, Japan
Tru Cao Tsuyoshi Murata Vincent Lee Vincent S. Tseng Vincenzo Piuri Wagner Meira Jr. Wai Lam Warren Jin Wei Fan Weining Qian Wen-Chih Peng Wenjia Wang Wilfred Ng Wlodek Zadrozny Woong-Kee Loh Wynne Hsu Xia Cui Xiangjun Dong Xiaofang Zhou Xiaohua Hu Xin Wang Xindong Wu Xingquan Zhu Xintao Wu Xuelong Li Xuemin Lin Yan Zhou Yang-Sae Moon Yao Tao Yasuhiko Morimoto Yi Chen Yi-Dong Shen Yifeng Zeng Yihua Wu Yi-Ping Phoebe Chen Yiu-ming Cheung Yong Guan Yonghong Peng Yu Jian Yuan Yuan Yun Xiong Yunming Ye
Ho Chi Minh City University of Technology, Vietnam Tokyo Institute of Technology, Japan Monash University, Australia National Cheng Kung University, Taiwan, ROC University of Milan, Italy Universidade Federal de Minas Gerais, Brazil The Chinese University of Hong Kong, China Australian National University, Australia IBM T.J.Watson Research Center, USA East China Normal University, China National Chiao Tung University, Taiwan, ROC University of East Anglia, UK Hong Kong University of Science and Technology, China IBM Research Sungkyul University, Korea National University of Singapore, Singapore Chinese Academy of Sciences, China Shandong Institute of Light Industry, China The University of Queensland, Australia Drexel University, USA Calgary University, Canada University of Vermont, USA Florida Atlantic University, USA University of North Carolina at Charlotte, USA University of London, UK University of New South Wales, Australia University of South Alabama, USA Kangwon National University, Korea The University of Auckland, New Zealand Hiroshima University, Japan Arizona State University, USA Chinese Academy of Sciences, China Aalborg University, Denmark Google Inc. Deakin University, Australia Hong Kong Baptist University, Hong Kong, China Iowa State University, USA University of Bradford, UK Beijing Jiaotong University, China Aston University, UK Fudan University, China Harbin Institute of Technology, China
Zheng Chen Zhi-Hua Zhou Zhongfei (Mark) Zhang Zhongzhi Shi Zili Zhang
Microsoft Research Asia, China Nanjing University, China SUNY Binghamton, USA Chinese Academy of Sciences, China Deakin University, Australia
External Reviewers Ameeta Agrawal Arnaud Soulet Axel Poigne Ben Tan Bian Wei Bibudh Lahiri Bin Yang Bin Zhao Bing Bai Bojian Xu Can Wang Carlos Ferreira Chao Li Cheqing Jin Christian Beecks Chun-Wei Seah De-Chuan Zhan Elnaz Delpisheh Erez Shmueli Fausto Fleites Fei Xie Gaoping Zhu Gongqing Wu Hardy Kremer Hideyuki Kawashima Hsin-Yu Ha Ji Zhou Jianbo Yang Jinfei Jinfeng Zhuang Jinjiu Li Jun Wang Jun Zhang Ke Zhu Keli Xiao Ken-ichi Fukui
York University, Canada Universit´e Francois Rabelais Tours, France Fraunhofer IAIS, Germany Fudan University, China University of Technology, Sydney, Australia Iowa State University Aalborg University, Denmark East China Normal University, China Google Inc. Iowa State University, USA University of Technology, Sydney, Australia University of Porto, Portugal Shenzhen Institutes of Advanced Technology, CAS, China East China Normal University, China RWTH Aachen University, Germany Nanyang Technological University, Singapore Nanjing University, China York University, Canada The Open University, Israel Florida International University, USA University of Vermont, USA University of New South Wales, Australia University of Vermont, USA RWTH Aachen University, Germany Nanzan University, Japan Florida International University, USA Fudan University, China Nanyang Technological University, Singapore Shenzhen Institutes of Advanced Technology, CAS, China Microsoft Research Asia, China University of Technology, Sydney Southwest University, China Charles Sturt University, Australia University of New South Wales, Australia Rutgers University, USA Osaka University, Japan
Kui Yu Leonard K.M. Poon Leting Wu Liang Du Lin Zhu Ling Chen Linhao Xu Mangesh Gupte Mao Qi Marc Plantevit Marcia Oliveira Ming Li Mingkui Tan Natalja Friesen Nguyen Le Minh Ning Zhang Nuno A. Fonseca Omar Odibat Peipei Li Peng Cai Penjie Ye Peter Tischer Petr Kosina Philipp Kranen Qiao-Liang Xiang Rajul Anand Roberto Legaspi Romain Vuillemot Sergej Fries Smriti Bhagat Stephane Lallich Supaporn Spanurattana Vitor Santos Costa Wang Xinchao Ecole Weifeng Su Weiren Yu Wenjun Zhou Xiang Zhao Xiaodan Wang Xiaowei Ying Xin Liu
University of Vermont, USA Shenzhen Institutes of Advaced Technology, Chinese Academy of Science, China University of North Carolina at Charlotte, USA Chinese Academy of Sciences, China Shanghai Jiaotong University, China University of Technology Sydney, Australia Aalborg University, Denmark Google Inc. Nanyang Technological University, Singapore Universit´e Lyon 1, France University Porto, Portugal Nanjing University, China Nanyang Technological University, Singapore Fraunhofer IAIS, Germany Japan Advanced Institute of Science and Technology, Japan Microsoft Research Asia, China LIACC/FEUP University of Porto, Portugal IIIT, Hyderabad, India University of Vermont, USA East China Normal University, China University of New South Wales, Australia Monash University, Australia University of Porto, Portugal RWTH Aachen University, Germany Nanyang Technological University, Singapore IIIT, Hyderabad, India Osaka University, Japan INSA Lyon, France RWTH Aachen University, Germany Google Inc. Telecom Bretagne, France Tokyo Institute of Technology, Japan LIACC/FEUP University of Porto, Portugal Polytechnique Federale de Lausanne (EPFL), Switzerland Shenzhen Institutes of Advaced Technology, Chinese Academy of Science, China University of New South Wales, Australia Rutgers University, USA University of New South Wales, Australia Fudan University, China University of North Carolina at Charlotte, USA Tokyo Institute of Technology, Japan
Xuan Li Xu-Ying Liu Yannick Le Yasayuki Okabe Yasufumi Takama Yi Guo Yi Wang Yi Xu Yiling Zeng Yimin Yang Yoan Renaud Yong Deng Yong Ge Yuan YUAN Zhao Zhang Zhenjie Zhang Zhenyu Lu Zhigang Zheng Zhitao Shen Zhiyong Cheng Zhiyuan Chen Zhongmou Li Zhongqiu Zhao Zhou Tianyi
Chinese Academy of Sciences, China Nanjing University, China Bras Telecom Bretagne, France National Institute of Informatics, Japan National Institute of Informatics, Japan Charles Sturt University, Australia Shenzhen Institutes of Advanced Technology, Chinese Academy of Science, China SUNY Binghamton, USA University of Technology, Sydney Florida International University, USA INSA Lyon, France Southwest University, China Rutgers University, USA Aston University, UK East China Normal University, China Aalborg University, Denmark University of Vermont, USA University of Technology, Sydney, Australia University of New South Wales, Australia Singapore Management University The Open University, Israel Rutgers University, USA University of Vermont, USA University of Technology, Sydney, Australia
Table of Contents – Part II
Graph Mining Spectral Analysis of k-Balanced Signed Graphs . . . . . . . . . . . . . . . . . . . . . . Leting Wu, Xiaowei Ying, Xintao Wu, Aidong Lu, and Zhi-Hua Zhou Spectral Analysis for Billion-Scale Graphs: Discoveries and Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . U Kang, Brendan Meeder, and Christos Faloutsos
1
13
LGM: Mining Frequent Subgraphs from Linear Graphs . . . . . . . . . . . . . . . Yasuo Tabei, Daisuke Okanohara, Shuichi Hirose, and Koji Tsuda
26
Efficient Centrality Monitoring for Time-Evolving Graphs . . . . . . . . . . . . . Yasuhiro Fujiwara, Makoto Onizuka, and Masaru Kitsuregawa
38
Graph-Based Clustering with Constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . Rajul Anand and Chandan K. Reddy
51
Social Network/Online Analysis A Partial Correlation-Based Bayesian Network Structure Learning Algorithm under SEM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jing Yang and Lian Li
63
Predicting Friendship Links in Social Networks Using a Topic Modeling Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Rohit Parimi and Doina Caragea
75
Info-Cluster Based Regional Influence Analysis in Social Networks . . . . . Chao Li, Zhongying Zhao, Jun Luo, and Jianping Fan Utilizing Past Relations and User Similarities in a Social Matching System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Richi Nayak On Sampling Type Distribution from Heterogeneous Social Networks . . . Jhao-Yin Li and Mi-Yen Yeh Ant Colony Optimization with Markov Random Walk for Community Detection in Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Di Jin, Dayou Liu, Bo Yang, Carlos Baquero, and Dongxiao He
87
99 111
123
Time Series Analysis Faster and Parameter-Free Discord Search in Quasi-Periodic Time Series . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Wei Luo and Marcus Gallagher
135
INSIGHT: Efficient and Effective Instance Selection for Time-Series Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Krisztian Buza, Alexandros Nanopoulos, and Lars Schmidt-Thieme
149
Multiple Time-Series Prediction through Multiple Time-Series Relationships Profiling and Clustered Recurring Trends . . . . . . . . . . . . . . . Harya Widiputra, Russel Pears, and Nikola Kasabov
161
Probabilistic Feature Extraction from Multivariate Time Series Using Spatio-Temporal Constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Michal Lewandowski, Dimitrios Makris, and Jean-Christophe Nebel
173
Sequence Analysis Real-Time Change-Point Detection Using Sequentially Discounting Normalized Maximum Likelihood Coding . . . . . . . . . . . . . . . . . . . . . . . . . . . Yasuhiro Urabe, Kenji Yamanishi, Ryota Tomioka, and Hiroki Iwai
185
Compression for Anti-Adversarial Learning . . . . . . . . . . . . . . . . . . . . . . . . . . Yan Zhou, Meador Inge, and Murat Kantarcioglu
198
Mining Sequential Patterns from Probabilistic Databases . . . . . . . . . . . . . . Muhammad Muzammal and Rajeev Raman
210
Large Scale Real-Life Action Recognition Using Conditional Random Fields with Stochastic Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Xu Sun, Hisashi Kashima, Ryota Tomioka, and Naonori Ueda
222
Packing Alignment: Alignment for Sequences of Various Length Events . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Atsuyoshi Nakamura and Mineichi Kudo
234
Outlier Detection Multiple Distribution Data Description Learning Algorithm for Novelty Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Trung Le, Dat Tran, Wanli Ma, and Dharmendra Sharma
246
RADAR: Rare Category Detection via Computation of Boundary Degree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Hao Huang, Qinming He, Jiangfeng He, and Lianhang Ma
258
RKOF: Robust Kernel-Based Local Outlier Detection . . . . . . . . . . . . . . . . Jun Gao, Weiming Hu, Zhongfei (Mark) Zhang, Xiaoqin Zhang, and Ou Wu
270
Chinese Categorization and Novelty Mining . . . . . . . . . . . . . . . . . . . . . . . . . Flora S. Tsai and Yi Zhang
284
Finding Rare Classes: Adapting Generative and Discriminative Models in Active Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Timothy M. Hospedales, Shaogang Gong, and Tao Xiang
296
Imbalanced Data Analysis Margin-Based Over-Sampling Method for Learning From Imbalanced Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Xiannian Fan, Ke Tang, and Thomas Weise
309
Improving k Nearest Neighbor with Exemplar Generalization for Imbalanced Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yuxuan Li and Xiuzhen Zhang
321
Sample Subset Optimization for Classifying Imbalanced Biological Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Pengyi Yang, Zili Zhang, Bing B. Zhou, and Albert Y. Zomaya
333
Class Confidence Weighted k NN Algorithms for Imbalanced Data Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Wei Liu and Sanjay Chawla
345
Agent Mining Multi-agent Based Classification Using Argumentation from Experience . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Maya Wardeh, Frans Coenen, Trevor Bench-Capon, and Adam Wyner Agent-Based Subspace Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Chao Luo, Yanchang Zhao, Dan Luo, Chengqi Zhang, and Wei Cao
357
370
Evaluation (Similarity, Ranking, Query) Evaluating Pattern Set Mining Strategies in a Constraint Programming Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Tias Guns, Siegfried Nijssen, and Luc De Raedt Asking Generalized Queries with Minimum Cost . . . . . . . . . . . . . . . . . . . . . Jun Du and Charles X. Ling
382 395
Ranking Individuals and Groups by Influence Propagation . . . . . . . . . . . . Pei Li, Jeffrey Xu Yu, Hongyan Liu, Jun He, and Xiaoyong Du Dynamic Ordering-Based Search Algorithm for Markov Blanket Discovery . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yifeng Zeng, Xian He, Yanping Xiang, and Hua Mao
407
420
Mining Association Rules for Label Ranking . . . . . . . . . . . . . . . . . . . . . . . . . Cl´ audio Rebelo de S´ a, Carlos Soares, Al´ıpio M´ ario Jorge, Paulo Azevedo, and Joaquim Costa
432
Tracing Evolving Clusters by Subspace and Value Similarity . . . . . . . . . . . Stephan G¨ unnemann, Hardy Kremer, Charlotte Laufk¨ otter, and Thomas Seidl
444
An IFS-Based Similarity Measure to Index Electroencephalograms . . . . . Ghita Berrada and Ander de Keijzer
457
DISC: Data-Intensive Similarity Measure for Categorical Data . . . . . . . . . Aditya Desai, Himanshu Singh, and Vikram Pudi
469
ListOPT: Learning to Optimize for XML Ranking . . . . . . . . . . . . . . . . . . . Ning Gao, Zhi-Hong Deng, Hang Yu, and Jia-Jian Jiang
482
Item Set Mining Based on Cover Similarity . . . . . . . . . . . . . . . . . . . . . . . . . . Marc Segond and Christian Borgelt
493
Applications Learning to Advertise: How Many Ads Are Enough? . . . . . . . . . . . . . . . . . Bo Wang, Zhaonan Li, Jie Tang, Kuo Zhang, Songcan Chen, and Liyun Ru
506
TeamSkill: Modeling Team Chemistry in Online Multi-player Games . . . Colin DeLong, Nishith Pathak, Kendrick Erickson, Eric Perrino, Kyong Shim, and Jaideep Srivastava
519
Learning the Funding Momentum of Research Projects . . . . . . . . . . . . . . . Dan He and D.S. Parker
532
Local Feature Based Tensor Kernel for Image Manifold Learning . . . . . . . Yi Guo and Junbin Gao
544
Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
555
Table of Contents – Part I
Feature Extraction An Instance Selection Algorithm Based on Reverse Nearest Neighbor . . . Bi-Ru Dai and Shu-Ming Hsu A Game Theoretic Approach for Feature Clustering and Its Application to Feature Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Dinesh Garg, Sellamanickam Sundararajan, and Shirish Shevade
1
13
Feature Selection Strategy in Text Classification . . . . . . . . . . . . . . . . . . . . . Pui Cheong Gabriel Fung, Fred Morstatter, and Huan Liu
26
Unsupervised Feature Weighting Based on Local Feature Relatedness . . . Jiali Yun, Liping Jing, Jian Yu, and Houkuan Huang
38
An Effective Feature Selection Method for Text Categorization . . . . . . . . Xipeng Qiu, Jinlong Zhou, and Xuanjing Huang
50
Machine Learning A Subpath Kernel for Rooted Unordered Trees . . . . . . . . . . . . . . . . . . . . . . Daisuke Kimura, Tetsuji Kuboyama, Tetsuo Shibuya, and Hisashi Kashima
62
Classification Probabilistic PCA with Application in Domain Adaptation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Victor Cheng and Chun-Hung Li
75
Probabilistic Matrix Factorization Leveraging Contexts for Unsupervised Relation Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Shingo Takamatsu, Issei Sato, and Hiroshi Nakagawa
87
The Unsymmetrical-Style Co-training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Bin Wang, Harry Zhang, Bruce Spencer, and Yuanyuan Guo Balance Support Vector Machines Locally Using the Structural Similarity Kernel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jianxin Wu Using Classifier-Based Nominal Imputation to Improve Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Xiaoyuan Su, Russell Greiner, Taghi M. Khoshgoftaar, and Amri Napolitano
100
112
124
A Bayesian Framework for Learning Shared and Individual Subspaces from Multiple Data Sources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Sunil Kumar Gupta, Dinh Phung, Brett Adams, and Svetha Venkatesh Are Tensor Decomposition Solutions Unique? On the Global Convergence HOSVD and ParaFac Algorithms . . . . . . . . . . . . . . . . . . . . . . . Dijun Luo, Chris Ding, and Heng Huang Improved Spectral Hashing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Sanparith Marukatat and Wasin Sinthupinyo
136
148 160
Clustering High-Order Co-clustering Text Data on Semantics-Based Representation Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Liping Jing, Jiali Yun, Jian Yu, and Joshua Huang The Role of Hubness in Clustering High-Dimensional Data . . . . . . . . . . . . Nenad Tomaˇsev, Miloˇs Radovanovi´c, Dunja Mladeni´c, and Mirjana Ivanovi´c Spatial Entropy-Based Clustering for Mining Data with Spatial Correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Baijie Wang and Xin Wang Self-adjust Local Connectivity Analysis for Spectral Clustering . . . . . . . . Hui Wu, Guangzhi Qu, and Xingquan Zhu An Effective Density-Based Hierarchical Clustering Technique to Identify Coherent Patterns from Gene Expression Data . . . . . . . . . . . . . . . Sauravjyoti Sarmah, Rosy Das Sarmah, and Dhruba Kumar Bhattacharyya Nonlinear Discriminative Embedding for Clustering via Spectral Regularization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yubin Zhan and Jianping Yin An Adaptive Fuzzy k -Nearest Neighbor Method Based on Parallel Particle Swarm Optimization for Bankruptcy Prediction . . . . . . . . . . . . . . Hui-Ling Chen, Da-You Liu, Bo Yang, Jie Liu, Gang Wang, and Su-Jing Wang Semi-supervised Parameter-Free Divisive Hierarchical Clustering of Categorical Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Tengke Xiong, Shengrui Wang, Andr´e Mayers, and Ernest Monga
171 183
196 209
225
237
249
265
Classification Identifying Hidden Contexts in Classification . . . . . . . . . . . . . . . . . . . . . . . . ˇ Indr˙e Zliobait˙ e
277
Cross-Lingual Sentiment Classification via Bi-view Non-negative Matrix Tri-Factorization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Junfeng Pan, Gui-Rong Xue, Yong Yu, and Yang Wang
289
A Sequential Dynamic Multi-class Model and Recursive Filtering by Variational Bayesian Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Xiangyun Qing and Xingyu Wang
301
Random Ensemble Decision Trees for Learning Concept-Drifting Data Streams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Peipei Li, Xindong Wu, Qianhui Liang, Xuegang Hu, and Yuhong Zhang Collaborative Data Cleaning for Sentiment Classification with Noisy Training Corpus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Xiaojun Wan
313
326
Pattern Mining Using Constraints to Generate and Explore Higher Order Discriminative Patterns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Michael Steinbach, Haoyu Yu, Gang Fang, and Vipin Kumar
338
Mining Maximal Co-located Event Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jin Soung Yoo and Mark Bow
351
Pattern Mining for a Two-Stage Information Filtering System . . . . . . . . . Xujuan Zhou, Yuefeng Li, Peter Bruza, Yue Xu, and Raymond Y.K. Lau
363
Efficiently Retrieving Longest Common Route Patterns of Moving Objects By Summarizing Turning Regions . . . . . . . . . . . . . . . . . . . . . . . . . . Guangyan Huang, Yanchun Zhang, Jing He, and Zhiming Ding
375
Automatic Assignment of Item Weights for Pattern Mining on Data Streams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yun Sing Koh, Russel Pears, and Gillian Dobbie
387
Prediction Predicting Private Company Exits Using Qualitative Data . . . . . . . . . . . . Harish S. Bhat and Daniel Zaelit A Rule-Based Method for Customer Churn Prediction in Telecommunication Services . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ying Huang, Bingquan Huang, and M.-T. Kechadi
399
411
Text Mining Adaptive and Effective Keyword Search for XML . . . . . . . . . . . . . . . . . . . . Weidong Yang, Hao Zhu, Nan Li, and Guansheng Zhu Steering Time-Dependent Estimation of Posteriors with Hyperparameter Indexing in Bayesian Topic Models . . . . . . . . . . . . . . . . . . Tomonari Masada, Atsuhiro Takasu, Yuichiro Shibata, and Kiyoshi Oguri Constrained LDA for Grouping Product Features in Opinion Mining . . . Zhongwu Zhai, Bing Liu, Hua Xu, and Peifa Jia
423
435
448
Semantic Dependent Word Pairs Generative Model for Fine-Grained Product Feature Mining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Tian-Jie Zhan and Chun-Hung Li
460
Grammatical Dependency-Based Relations for Term Weighting in Text Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Dat Huynh, Dat Tran, Wanli Ma, and Dharmendra Sharma
476
XML Documents Clustering Using a Tensor Space Model . . . . . . . . . . . . . Sangeetha Kutty, Richi Nayak, and Yuefeng Li An Efficient Pre-processing Method to Identify Logical Components from PDF Documents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ying Liu, Kun Bai, and Liangcai Gao Combining Proper Name-Coreference with Conditional Random Fields for Semi-supervised Named Entity Recognition in Vietnamese Text . . . . . Rathany Chan Sam, Huong Thanh Le, Thuy Thanh Nguyen, and Thien Huu Nguyen Topic Analysis of Web User Behavior Using LDA Model on Proxy Logs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Hiroshi Fujimoto, Minoru Etoh, Akira Kinno, and Yoshikazu Akinaga SizeSpotSigs: An Effective Deduplicate Algorithm Considering the Size of Page Content . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Xianling Mao, Xiaobing Liu, Nan Di, Xiaoming Li, and Hongfei Yan
488
500
512
525
537
Knowledge Transfer across Multilingual Corpora via Latent Topics . . . . . Wim De Smet, Jie Tang, and Marie-Francine Moens
549
Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
561
Spectral Analysis of k-Balanced Signed Graphs
Leting Wu1, Xiaowei Ying1, Xintao Wu1, Aidong Lu1, and Zhi-Hua Zhou2
1 University of North Carolina at Charlotte, USA
{lwu8,xying,xwu,alu1}@uncc.edu
2 National Key Lab for Novel Software Technology, Nanjing University, China
[email protected]
Abstract. Previous studies on social networks are often focused on networks with only positive relations between individual nodes. As a significant extension, we conduct the spectral analysis on graphs with both positive and negative edges. Specifically, we investigate the impacts of introducing negative edges and examine patterns in the spectral space of the graph’s adjacency matrix. Our theoretical results show that communities in a k-balanced signed graph are distinguishable in the spectral space of its signed adjacency matrix even if connections between communities are dense. This is quite different from recent findings on unsigned graphs, where communities tend to mix together in the spectral space when connections between communities increase. We further conduct theoretical studies based on graph perturbation to examine spectral patterns of general unbalanced signed graphs. We illustrate our theoretical findings with various empirical evaluations.
1 Introduction
Signed networks were originally used in anthropology and sociology to model friendship and enmity [2, 4]. The motivation for signed networks arose from the fact that psychologists use -1, 0, and 1 to represent disliking, indifference, and liking, respectively. Graph topology of signed networks can then be expressed as an adjacency matrix where the entry is 1 (or −1) if the relationship is positive (or negative) and 0 if the relationship is absent. Spectral analysis that considers 0-1 matrices associated with a given network has been well developed. As a significant extension, in this paper we investigate the impacts of introducing negative edges in the graph topology and examine community patterns in the spectral space of its signed adjacency matrix. We start from k-balanced signed graphs which have been extensively examined in social psychology, especially from the stability of sentiments perspective [5]. Our theoretical results show that communities in a k-balanced signed graph are distinguishable in the spectral space of its signed adjacency matrix even if connections between communities are dense. This is very different from recent findings on unsigned graphs [12, 9], where communities tend to mix together when connections between communities increase. We give a theoretical explanation by treating the k-balanced signed graph as a perturbed one from a disconnected
k-block network. We further conduct theoretical studies based on graph perturbation to examine spectral patterns of general unbalanced signed graphs. We illustrate our theoretical findings with various empirical evaluations.
2 Notation
A signed graph G can be represented as the symmetric adjacency matrix A_{n×n} with a_ij = 1 if there is a positive edge between node i and j, a_ij = −1 if there is a negative edge between node i and j, and a_ij = 0 otherwise. A has n real eigenvalues. Let λ_i be the i-th largest eigenvalue of A with eigenvector x_i, λ_1 ≥ λ_2 ≥ · · · ≥ λ_n. Let x_ij denote the j-th entry of x_i. The spectral decomposition of A is A = \sum_i λ_i x_i x_i^T. The eigenvectors can be arranged as the columns of the matrix

(x_1, \dots, x_i, \dots, x_k, \dots, x_n) = \begin{pmatrix} x_{11} & \cdots & x_{i1} & \cdots & x_{k1} & \cdots & x_{n1} \\ \vdots & & \vdots & & \vdots & & \vdots \\ x_{1u} & \cdots & x_{iu} & \cdots & x_{ku} & \cdots & x_{nu} \\ \vdots & & \vdots & & \vdots & & \vdots \\ x_{1n} & \cdots & x_{in} & \cdots & x_{kn} & \cdots & x_{nn} \end{pmatrix},   (1)

where the u-th row is marked as α_u (defined below).
Formula (1) illustrates our notation. The eigenvector x_i is represented as a column vector. There usually exist k leading eigenvalues that are significantly greater than the rest for networks with k well-separated communities. We call the row vector α_u = (x_1u, x_2u, · · · , x_ku) the spectral coordinate of node u in the k-dimensional subspace spanned by (x_1, · · · , x_k). This subspace reflects most topological information of the original graph. The eigenvectors x_i (i = 1, . . . , k) naturally form the canonical basis of the subspace, denoted by ξ_i = (0, . . . , 0, 1, 0, . . . , 0), where the i-th entry of ξ_i is 1. Let E be a symmetric perturbation matrix, and B be the adjacency matrix after perturbation, B = A + E. Similarly, let μ_i be the i-th largest eigenvalue of B with eigenvector y_i, and y_ij be the j-th entry of y_i. The row vector α̃_u = (y_1u, . . . , y_ku) is the spectral coordinate of node u after perturbation.
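To make the notation concrete, the following small NumPy sketch (ours, not part of the paper; the function name spectral_coordinates and the toy graph are illustrative assumptions) computes the k leading eigenpairs of a signed adjacency matrix and returns the spectral coordinates α_u as rows.

import numpy as np

def spectral_coordinates(A, k):
    # eigh handles symmetric matrices; eigenvalues come back in ascending order
    vals, vecs = np.linalg.eigh(A)
    order = np.argsort(vals)[::-1]            # descending: lambda_1 >= ... >= lambda_n
    lam = vals[order][:k]                     # k leading eigenvalues
    X = vecs[:, order][:, :k]                 # columns x_1, ..., x_k; row u is alpha_u
    return lam, X

# toy signed graph: nodes 0-2 form a positive triangle, node 3 is disliked by all
A = np.array([[ 0,  1,  1, -1],
              [ 1,  0,  1, -1],
              [ 1,  1,  0, -1],
              [-1, -1, -1,  0]], dtype=float)
lam, X = spectral_coordinates(A, k=2)
print(lam)   # two leading eigenvalues
print(X)     # 4 x 2 matrix whose u-th row is the spectral coordinate of node u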
3 The Spectral Property of k-Balanced Graph
The k-balanced graph is a type of signed graph that has received extensive examination in social psychology. It was shown that the stability of sentiments is equivalent to being k-balanced (clusterable). A necessary and sufficient condition for a signed graph to be k-balanced is that the signed graph does not contain any cycle with exactly one negative edge [2].

Definition 1. Graph G is a k-balanced graph if the node set V can be divided into k non-trivial disjoint subsets V_1, . . . , V_k such that edges connecting any two nodes from the same subset are all positive, and edges connecting any two nodes from different subsets are all negative.
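As an aside, Definition 1 can be checked mechanically: by the characterization quoted above, a signed graph is clusterable exactly when no negative edge joins two nodes of the same connected component of its positive subgraph. The sketch below is our own illustration of that test (it assumes a dense NumPy adjacency matrix; the union-find helper is our own choice).

import numpy as np

def is_clusterable(A):
    # grow components along positive edges only (simple union-find), then
    # require every negative edge to run between different components
    n = A.shape[0]
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u in range(n):
        for v in range(u + 1, n):
            if A[u, v] > 0:
                parent[find(u)] = find(v)
    for u in range(n):
        for v in range(u + 1, n):
            if A[u, v] < 0 and find(u) == find(v):
                return False   # a positive path plus this negative edge closes a
                               # cycle with exactly one negative edge
    return True

balanced = np.array([[0, 1, -1], [1, 0, -1], [-1, -1, 0]])    # clusterable (2-balanced)
unbalanced = np.array([[0, 1, -1], [1, 0, 1], [-1, 1, 0]])    # triangle with one negative edge
print(is_clusterable(balanced), is_clusterable(unbalanced))   # True False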
The k node sets, V_1, . . . , V_k, naturally form k communities denoted by C_1, . . . , C_k respectively. Let n_i = |V_i| (\sum_i n_i = n), and A_i be the n_i × n_i adjacency matrix of community C_i. After re-numbering the nodes properly, the adjacency matrix B of a k-balanced graph is

B = A + E,  where  A = \begin{pmatrix} A_1 & & 0 \\ & \ddots & \\ 0 & & A_k \end{pmatrix},   (2)

and E represents the negative edges across communities. More generally, e_uv = 1 (−1) if a positive (negative) edge is added between node u and v, and e_uv = 0 otherwise.

3.1 Non-negative Block-Wise Diagonal Matrix
For a graph with k disconnected communities, its adjacency matrix A is shown in (2). Let ν_i be the largest eigenvalue of A_i with eigenvector z_i of dimension n_i × 1. Without loss of generality, we assume ν_1 > · · · > ν_k. Since the entries of A_i are all non-negative, by the Perron-Frobenius theorem [10], ν_i is positive and all the entries of its eigenvector z_i are non-negative. When the k communities are comparable in size, ν_i is the i-th largest eigenvalue of A (i.e., λ_i = ν_i), and the eigenvectors of the A_i can be naturally extended to the eigenvectors of A as follows:

(x_1, x_2, \cdots, x_k) = \begin{pmatrix} z_1 & 0 & \cdots & 0 \\ 0 & z_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & z_k \end{pmatrix}   (3)
Now, consider node u in community C_i. Note that all the entries in x_i are non-negative, and the spectral coordinate of node u is just the u-th row of the matrix in (3). Then, we have

α_u = (0, \cdots, 0, x_{iu}, 0, \cdots, 0),   (4)
where x_{iu} > 0 is the only non-zero entry of α_u. In other words, for a graph with k disconnected comparable communities, the spectral coordinates of all nodes lie on the k positive half-axes of ξ_1, · · · , ξ_k, and nodes from the same community lie on the same half-axis.

3.2 A General Perturbation Result
Let Γui (i = 1, . . . , k) be the set of nodes in Ci that are newly connected to node u by perturbation E: Γui = {v : v ∈ Ci , euv = ±1}. In [11], we derived several theoretical results on general graph perturbation. We include the approximation of spectral coordinates below as a basis for our spectral analysis of signed graphs. Please refer to [11] for proof details.
Theorem 1. Let A be a block-wise diagonal matrix as shown in (2), and E be a symmetric perturbation matrix satisfying \|E\|_2 \ll λ_k. Let β_ij = x_i^T E x_j. For a graph with the adjacency matrix B = A + E, the spectral coordinate of an arbitrary node u ∈ C_i can be approximated as

\tilde{α}_u ≈ x_{iu} r_i + \left( \frac{1}{λ_1} \sum_{v∈Γ_u^1} e_{uv} x_{1v}, \; \dots, \; \frac{1}{λ_k} \sum_{v∈Γ_u^k} e_{uv} x_{kv} \right),   (5)

where scalar x_{iu} is the only non-zero entry in its original spectral coordinate shown in (4), and r_i is the i-th row of matrix R in (6):

R = \begin{pmatrix} 1 & \frac{β_{12}}{λ_2−λ_1} & \cdots & \frac{β_{1k}}{λ_k−λ_1} \\ \frac{β_{21}}{λ_1−λ_2} & 1 & \cdots & \frac{β_{2k}}{λ_k−λ_2} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{β_{k1}}{λ_1−λ_k} & \frac{β_{k2}}{λ_2−λ_k} & \cdots & 1 \end{pmatrix}.   (6)

3.3 Moderate Inter-community Edges
Proposition 1. Let B = A + E, where A has k disconnected communities, \|E\|_2 \ll λ_k, and E is non-positive. We have the following properties:

1. If node u ∈ C_i is not connected to any C_j (j ≠ i), \tilde{α}_u lies on the half-line r_i that starts from the origin, where r_i is the i-th row of matrix R shown in (6). The k half-lines are approximately orthogonal to each other.
2. If node u ∈ C_i is connected to node v ∈ C_j (j ≠ i), \tilde{α}_u deviates from r_i. Moreover, the angle between \tilde{α}_u and r_j is an obtuse angle.

To illustrate Proposition 1, we now consider a 2-balanced graph. Suppose that a graph has two communities and we add some sparse edges between the two communities. For node u ∈ C_1 and v ∈ C_2, with (5), the spectral coordinates can be approximated as

\tilde{α}_u ≈ x_{1u} r_1 + \left( 0, \; \frac{1}{λ_2} \sum_{v∈Γ_u^2} e_{uv} x_{2v} \right),   (7)

\tilde{α}_v ≈ x_{2v} r_2 + \left( \frac{1}{λ_1} \sum_{u∈Γ_v^1} e_{uv} x_{1u}, \; 0 \right),   (8)

where r_1 = (1, \frac{β_{12}}{λ_2−λ_1}) and r_2 = (\frac{β_{21}}{λ_1−λ_2}, 1).

Item 1 of Proposition 1 is apparent from (7) and (8). For those nodes with no inter-community edges, the second parts of the right hand side (RHS) of (7) and (8) are 0 since all e_{uv} are 0, and hence they lie on the two half-lines r_1 (nodes in C_1) and r_2 (nodes in C_2). Note that r_1 and r_2 are orthogonal since r_1 r_2^T = 0.
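A quick numerical illustration of Item 1 (our own sketch; the community sizes, edge probabilities, and random seed are arbitrary choices, not the paper's Synth-2): nodes of C_1 that receive no inter-community edges should share essentially one common direction r_1 in the 2-D spectral space, so their normalized coordinates should all have cosine close to 1 with that direction.

import numpy as np

rng = np.random.default_rng(0)
n1, n2 = 60, 40
n = n1 + n2
A = np.zeros((n, n))

def add_random_edges(nodes_a, nodes_b, prob, sign):
    for u in nodes_a:
        for v in nodes_b:
            if u < v and rng.random() < prob:
                A[u, v] = A[v, u] = sign

C1, C2 = range(n1), range(n1, n)
add_random_edges(C1, C1, 0.30, +1)    # positive edges inside community 1
add_random_edges(C2, C2, 0.30, +1)    # positive edges inside community 2
add_random_edges(C1, C2, 0.01, -1)    # sparse negative inter-community edges

vals, vecs = np.linalg.eigh(A)
alpha = vecs[:, np.argsort(vals)[::-1][:2]]          # 2-D spectral coordinates

# C1 nodes that received no inter-community edge should share one direction r_1
clean = [u for u in C1 if not np.any(A[u, n1:])]
U = alpha[clean] / np.linalg.norm(alpha[clean], axis=1, keepdims=True)
mean_dir = U.mean(axis=0)
mean_dir /= np.linalg.norm(mean_dir)
print("min cosine to the common direction:", (U @ mean_dir).min())   # close to 1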
(Three scatter plots of spectral coordinates omitted: (a) Disconnected, (b) Add negative edges, (c) Add positive edges.)
Fig. 1. Synth-2: rotation and deviation with inter-community edges (p = 0.05)
Next, we explain Item 2 of Proposition 1. Consider the inner product

⟨\tilde{α}_u, r_2⟩ = \tilde{α}_u r_2^T = \frac{1}{λ_2} \sum_{v∈Γ_u^2} e_{uv} x_{2v}.

If node u ∈ C_1 has some negative links to C_2 (e_{uv} = −1), ⟨\tilde{α}_u, r_2⟩ is thus negative. In other words, \tilde{α}_u lies outside the two half-lines r_1 and r_2. Another interesting pattern is the direction of rotation of the two half-lines. For the 2-balanced graph, r_1 and r_2 rotate counter-clockwise from the axes ξ_1 and ξ_2. Notice that all the added edges are negative (e_{uv} = −1), and hence β_{12} = β_{21} = x_1^T E x_2 = \sum_{u,v=1}^{n} e_{uv} x_{1u} x_{2v} < 0. Therefore, \frac{β_{12}}{λ_2−λ_1} > 0 and \frac{β_{21}}{λ_1−λ_2} < 0, which implies that r_1 and r_2 have a counter-clockwise rotation from the basis.

Comparison with adding positive edges. When the added edges are all positive (e_{uv} = 1), we can deduce the following two properties in a similar manner:

1. Nodes with no inter-community edges lie on the k half-lines. (When k = 2, the two half-lines exhibit a clockwise rotation from the axes.)
2. For node u ∈ C_i that connects to node v ∈ C_j, \tilde{α}_u and r_j form an acute angle.

Figure 1 shows the scatter plot of the spectral coordinates for a synthetic graph, Synth-2. Synth-2 is a 2-balanced graph with 600 and 400 nodes in each community. We generate Synth-2 and modify its inter-community edges via the same method as the synthetic data set Synth-3 in Section 5.1. As we can see in Figure 1(a), when the two communities are disconnected, the nodes from C_1 and C_2 lie on the positive parts of axes ξ_1 and ξ_2 respectively. We then add a small number of edges connecting the two communities (p = 0.05). When the added edges are all negative, as shown in Figure 1(b), the spectral coordinates of the nodes from the two communities form two half-lines respectively. The two quasi-orthogonal half-lines rotate counter-clockwise from the axes. Those nodes having negative inter-community edges lie outside the two half-lines. On the contrary, if we add positive inter-community edges, as shown in Figure 1(c), the nodes from the two communities display two half-lines with a clockwise rotation from the axes, and nodes with inter-community edges lie between the two half-lines.
6
L. Wu et al.
3.4 Increase the Magnitude of Inter-community Edges
Theorem 1 holds when the magnitude of the perturbation is moderate. When dealing with a perturbation of large magnitude, we can divide the perturbation matrix into several perturbation matrices of small magnitude and approximate the eigenvectors step by step. More generally, the perturbed spectral coordinate of a node u can be approximated as

\tilde{α}_u ≈ α_u R + \sum_{v=1}^{n} e_{uv} α_v Λ^{-1},   (9)
where Λ = diag(λ_1, . . . , λ_k). One property implied by (9) is that, after adding negative inter-community edges, nodes from different communities are still separable in the spectral space. Note that R is close to an orthogonal matrix, and hence the first part of the RHS of (9) specifies an orthogonal transformation. The second part of the RHS of (9) specifies a deviation away from the position after the transformation. Note that when the inter-community edges are all negative (e_{uv} = −1), the deviation of α_u is just towards the negative direction of α_v (each dimension is weighted by λ_i^{-1}). Therefore, after perturbation, nodes u and v are further separated from each other in the spectral space. The consequence of this repellency caused by adding negative edges is that nodes from different communities stay away from each other in the spectral space. As the magnitude of the noise increases, more nodes deviate from the half-lines r_i, and the line pattern eventually disappears.
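The repellency argument can be probed numerically. The sketch below is ours (the purity measure, sizes, and densities are arbitrary proxies, not anything used in the paper): it builds two communities whose cross edges are as dense as the inner edges (p = 1) and compares how well a nearest-centroid rule in the 2-D spectral space recovers the communities when the cross edges are negative versus positive; on typical runs the former stays near 1 while the latter drops markedly.

import numpy as np

rng = np.random.default_rng(1)
n1, n2 = 60, 40
n = n1 + n2
labels = np.array([0] * n1 + [1] * n2)

def two_communities(p_in, p_out, sign_out):
    A = np.zeros((n, n))
    for u in range(n):
        for v in range(u + 1, n):
            if labels[u] == labels[v]:
                if rng.random() < p_in:
                    A[u, v] = A[v, u] = 1
            elif rng.random() < p_out:
                A[u, v] = A[v, u] = sign_out
    return A

def centroid_purity(A):
    # fraction of nodes whose nearest community centroid (in the 2-D spectral
    # space) is their own community's centroid -- a crude separability proxy
    vals, vecs = np.linalg.eigh(A)
    alpha = vecs[:, np.argsort(vals)[::-1][:2]]
    cents = np.vstack([alpha[labels == c].mean(axis=0) for c in (0, 1)])
    d = np.linalg.norm(alpha[:, None, :] - cents[None, :, :], axis=2)
    return np.mean(d.argmin(axis=1) == labels)

print("negative cross edges, p = 1:", centroid_purity(two_communities(0.30, 0.30, -1)))
print("positive cross edges, p = 1:", centroid_purity(two_communities(0.30, 0.30, +1)))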
(Six scatter plots of spectral coordinates omitted: (a) Negative edges (p = 0.1), (b) Negative edges (p = 0.3), (c) Negative edges (p = 1), (d) Positive edges (p = 0.1), (e) Positive edges (p = 0.3), (f) Positive edges (p = 1).)

Fig. 2. Synth-2 with different types and magnitude of inter-community edges
Positive large perturbation. When the added edges are positive, we can similarly conclude the opposite phenomenon: more nodes from the two communities are "pulled" closer to each other by the positive inter-community edges and are finally mixed together, indicating that the well-separated communities merge into one community. Figure 2 shows the spectral coordinates of Synth-2 when we increase the magnitude of inter-community edges (p = 0.1, 0.3 and 1). For the first row (Figure 2(a) to 2(c)), we add negative inter-community edges in Synth-2, and for the second row (Figure 2(d) to 2(f)), we add positive inter-community edges. As we add more and more inter-community edges, no matter positive or negative, more and more nodes deviate from the two half-lines, and finally the line pattern diminishes in Figure 2(c) or 2(f). When adding positive inter-community edges, the nodes deviate from the lines and hence finally mix together as shown in Figure 2(f), indicating that the two communities merge into one community. Different from adding positive edges, which mixes the two communities in the spectral space, adding negative inter-community edges "pushes" the two communities away from each other. This is because nodes with negative inter-community edges lie outside the two half-lines, as shown in Figure 2(a) and 2(b). Even when p = 1, as shown in Figure 2(c), the two communities are still clearly separable in the spectral space.
4 Unbalanced Signed Graph
Signed networks in general are unbalanced and their topologies can be considered as perturbations on balanced graphs with some negative connections within communities and some positive connections across communities. Therefore, we can divide an unbalanced signed graph into three parts

B = A + E_in + E_out,   (10)
where A is a non-negative block-wise diagonal matrix as shown in (2), E_in represents the negative edges within communities, and E_out represents both the negative and positive inter-community edges.

Add negative inner-community edges. For the block-wise diagonal matrix A, we first discuss the case where a small number of negative edges are added within the communities. E_in is also block-wise diagonal. Hence β_ij = x_i^T E_in x_j = 0 for all i ≠ j, and the matrix R caused by E_in in (6) is reduced to the identity matrix I. Consider that we add one negative inner-community edge between nodes u, v ∈ C_i. Since R = I, only λ_i and x_i are involved in approximating \tilde{α}_u and \tilde{α}_v:

\tilde{α}_u = (0, . . . , 0, y_{iu}, 0, . . . , 0),  y_{iu} ≈ x_{iu} − \frac{x_{iv}}{λ_i},

\tilde{α}_v = (0, . . . , 0, y_{iv}, 0, . . . , 0),  y_{iv} ≈ x_{iv} − \frac{x_{iu}}{λ_i}.

Without loss of generality, assume x_{iu} > x_{iv}, and we have the following properties when adding e_uv = −1:
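Before those properties are listed (after Figure 3 below), the first-order approximation of y_iu and y_iv can be checked directly. In this sketch (ours; the sizes, densities, and chosen node pair are arbitrary), we build a block-diagonal A, add a single negative edge inside the first community, and compare the exact perturbed eigenvector entries with x_iu − x_iv/λ_i and x_iv − x_iu/λ_i; they typically agree to a few decimal places.

import numpy as np

rng = np.random.default_rng(2)
n1, n2 = 50, 40
n = n1 + n2
A = np.zeros((n, n))
for u in range(n):
    for v in range(u + 1, n):
        if v < n1 and rng.random() < 0.4:          # inside community 1
            A[u, v] = A[v, u] = 1
        elif u >= n1 and rng.random() < 0.3:       # inside community 2
            A[u, v] = A[v, u] = 1

vals, vecs = np.linalg.eigh(A)
lam1, x1 = vals[-1], vecs[:, -1]                   # leading eigenpair (community 1)
if x1.sum() < 0:                                   # fix the arbitrary eigenvector sign
    x1 = -x1

# join two community-1 nodes that are not yet connected by a negative edge
u, v = next((a, b) for a in range(n1) for b in range(a + 1, n1) if A[a, b] == 0)
B = A.copy()
B[u, v] = B[v, u] = -1

valsB, vecsB = np.linalg.eigh(B)
y1 = vecsB[:, -1]
if y1 @ x1 < 0:
    y1 = -y1

print("exact  :", y1[u], y1[v])
print("approx :", x1[u] - x1[v] / lam1, x1[v] - x1[u] / lam1)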
Fig. 3. Spectral coordinates of unbalanced graphs generated from Synth-2: (a) two disconnected communities, q = 0.1; (b) p = 0.1, q = 0.1; (c) p = 0.1, q = 0.2. Axes: e1 vs. e2; communities C1 and C2.
1. Both nodes u and v move towards the negative part of axis ξ_i after perturbation: y_iu < x_iu and y_iv < x_iv.
2. Node v moves farther than u after perturbation: |y_iv − x_iv| > |y_iu − x_iu|.
The two preceding properties imply that, for those nodes close to the origin, adding negative edges would “push” them towards the negative part of axis ξ_i, and a small number of nodes can thus lie on the negative part of axis ξ_i (i.e., y_iu < 0).
Add inter-community edges. The spectral perturbation caused by adding Eout onto matrix A + Ein can be complicated. Notice that (A + Ein) is still a block-wise matrix, and we can still apply Theorem 1 and conclude that, when Eout is moderate, the majority of nodes from the k communities form k lines in the spectral space and nodes with inter-community edges deviate from the lines. It is difficult to give the explicit form of the lines and the deviations, because x_iu and the inter-community edges can be either positive or negative. However, we expect that the effect of adding negative edges on positive nodes is still dominant in determining the spectral pattern, because most nodes lie along the positive part of the axes and the majority of inter-community edges are negative. Communities are still distinguishable in the spectral space. The majority of nodes in one community lie on the positive part of the line, while a small number of nodes may lie on the negative part due to negative connections within the community.
We make graph Synth-2 unbalanced by flipping the signs of a small proportion q of the edges. When the two communities are disconnected, as shown in Figure 3(a), after flipping q = 0.1 of the inner-community edges, a small number of nodes lie on the negative parts of the two axes. Figure 3(b) shows the spectral coordinates of the unbalanced graph generated from the balanced graph Synth-2 (p = 0.1, q = 0.1). Since the magnitude of the inter-community edges is small, we can still observe two orthogonal lines in the scatter plots. The negative edges within the communities cause a small number of nodes to lie on the negative parts of the two lines. Nodes with inter-community edges deviate from the two lines. For Figure 3(c), we flip more edge signs (p = 0.1, q = 0.2). We can observe that more nodes lie on the negative parts of the lines, since more inner-community edges are changed to negative. The rotation angles of the two lines are smaller than those
in Figure 3(b). This is because the positive inter-community edges “pull” the rotation clockwise a little, and the rotation we observe depends on the effects from both positive and negative inter-community edges.
5 Evaluation
5.1 Synthetic Balanced Graph
Data set Synth-3 is a synthetic 3-balanced graph generated from a power-law degree distribution with parameter 2.5. The 3 communities of Synth-3 contain 600, 500, and 400 nodes, and 4131, 3179, and 2037 edges, respectively. All the 13,027 inter-community edges are set to be negative. We delete the inter-community edges randomly until a proportion p of them remain in the graph. The parameter p is the ratio of the magnitude of inter-community edges to that of the inner-community edges. If p = 0, there are no inter-community edges. If p = 1, inner- and inter-community edges have the same magnitude. Figure 4 shows the change of spectral coordinates of Synth-3 as we increase the magnitude of inter-community edges. When there are no inter-community links (p = 0), the scatter plot of the spectral coordinates is shown in Figure 4(a). The disconnected communities display 3 orthogonal half-lines. Figure 4(b) shows the spectral coordinates when the magnitude of inter-community edges is moderate (p = 0.1). We can see the nodes form three half-lines rotated by a certain angle, and some of the nodes deviate from the lines. Figures 4(c) and 4(d) show the spectral coordinates when we increase the magnitude of inter-community edges (p = 0.3, 1). We can observe that, as more inter-community edges are added, more and more nodes deviate from the lines. However, nodes from different communities are still separable from each other in the spectral space. We also add positive inter-community edges to Synth-3 for comparison, and the spectral coordinates are shown in Figures 4(e) and 4(f). We can observe that, different from adding negative edges, as the magnitude of inter-community edges (p) increases, nodes from the three communities get closer to each other, and completely mix into one community in Figure 4(f).
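The exact generator behind Synth-3 is not spelled out beyond the power-law exponent, so the sketch below shows one plausible way to build a k-balanced signed graph in the same spirit. Community sizes, the Chung-Lu style intra-community wiring, and the way p is turned into an edge probability are all assumptions of this illustration.

```python
# Hedged sketch of a Synth-3-like k-balanced signed graph: power-law
# expected degrees inside each community, negative inter-community edges
# whose abundance grows with the ratio parameter p.
import numpy as np

rng = np.random.default_rng(1)

def chung_lu_block(n, exponent=2.5, avg_deg=7.0):
    """Positive intra-community block with power-law expected degrees."""
    w = np.arange(1, n + 1) ** (-1.0 / (exponent - 1.0))
    w = w / w.sum() * avg_deg * n          # expected degrees
    P = np.clip(np.outer(w, w) / w.sum(), 0.0, 1.0)
    A = np.triu((rng.random((n, n)) < P).astype(float), 1)
    return A + A.T

def k_balanced(sizes=(600, 500, 400), p=0.1, inter_prob=0.05):
    """Block-diagonal positive communities plus negative inter-community
    edges kept with probability p * inter_prob (p mimics the paper's ratio)."""
    n = sum(sizes)
    A = np.zeros((n, n))
    offs = np.cumsum((0,) + sizes)
    for lo, hi in zip(offs[:-1], offs[1:]):
        A[lo:hi, lo:hi] = chung_lu_block(hi - lo)
    inter = np.triu(rng.random((n, n)) < p * inter_prob, 1)
    for lo, hi in zip(offs[:-1], offs[1:]):
        inter[lo:hi, lo:hi] = False        # only keep between-community pairs
    A[inter] = -1.0
    A[inter.T] = -1.0
    return A

A = k_balanced(p=0.1)
print("nodes:", A.shape[0], "negative edges:", int((A < 0).sum() // 2))
```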
5.2 Synthetic Unbalanced Graph
To generate an unbalanced graph, we randomly flip the signs of a small proportion q of the inner- and inter-community edges of a balanced graph, i.e., the parameter q is the proportion of unbalanced edges given the partition. We first flip edge signs on the graph with small-magnitude inter-community edges. Figures 5(a) and 5(b) show the spectral coordinates after we flip q = 10% and q = 20% of the edge signs on Synth-3 with p = 0.1. We can observe that, even though the graph is unbalanced, nodes from the three communities exhibit three lines starting from the origin, and some nodes deviate from the lines due to the inter-community edges. We then flip edge signs on the graph with large-magnitude inter-community edges. Figure 5(c) shows the spectral coordinates after we flip q = 20% of the edge signs on Synth-3 with p = 1. We can observe that the line pattern diminishes because of the large number of inter-community edges. However, the nodes from the three communities are still separable in the spectral space, indicating that the unbalanced edges do not greatly change the patterns in the spectral space.
Fig. 4. The spectral coordinates of the 3-balanced graph Synth-3: (a) three disconnected communities; (b)-(d) negative inter-community edges with p = 0.1, 0.3, 1; (e)-(f) positive inter-community edges with p = 0.1, 1. Axes: e1, e2, e3; communities C1, C2, C3.
Fig. 5. The spectral coordinates of an unbalanced synthetic graph generated by flipping signs of inner- and inter-community edges of Synth-3 with p = 0.1 or 1: (a) p = 0.1, q = 0.1; (b) p = 0.1, q = 0.2; (c) p = 1, q = 0.2. Axes: e1, e2, e3; communities C1, C2, C3.
5.3 Comparison with Laplacian Spectrum
The signed Laplacian matrix is defined as L̄ = D̄ − A, where D̄ ∈ R^{n×n} is a diagonal degree matrix with D̄_ii = Σ_{j=1}^{n} |A_ij| [7]. Note that the unsigned Laplacian matrix is defined as L = D − A, where D ∈ R^{n×n} is a diagonal degree matrix with D_ii = Σ_{j=1}^{n} A_ij. The eigenvectors corresponding to the k smallest eigenvalues of the Laplacian matrix also reflect the community structure of a signed graph: the k communities form k clusters in the Laplacian spectral space. However, eigenvectors associated with the smallest eigenvalues are generally unstable to noise according to matrix perturbation theory [10]. Hence, when it comes to real-world networks, the communities may no longer form distinguishable clusters in the Laplacian spectral space.
Fig. 6. The Laplacian spectral space of signed graphs: (a) p = 0.1, q = 0 (balanced); (b) p = 0.1, q = 0.2; (c) p = 1, q = 0.2. Axes: e1, e2, e3; communities C1, C2, C3.
Figure 6(a) shows the Laplacian spectrum of a balanced graph, Synth-3 with p = 0.1. We can see that the nodes from the three communities form three clusters in the spectral space. However, the Laplacian spectrum is less stable to noise. Figures 6(b) and 6(c) plot the Laplacian spectra of the unbalanced graphs generated from Synth-3. We can observe that C1 and C2 are mixed together in Figure 6(b), and none of the three communities are separable from each other in Figure 6(c). For comparison, the adjacency spectra of the corresponding graphs are shown in Figure 5(b) and Figure 5(c), respectively, where we can observe that the three communities are well separable in the adjacency spectral space.
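The two embeddings being compared can be reproduced on a toy graph with the short sketch below. It is our own illustration (not the authors' experiment): the six-node graph and its single negative edge are assumptions chosen only to make the two constructions concrete.

```python
# Sketch: adjacency spectral space (rows of the top-k eigenvectors of A)
# vs. signed Laplacian spectral space (rows of the eigenvectors of
# L_bar = D_bar - A for its k smallest eigenvalues, D_bar built from |A|).
import numpy as np

def adjacency_embedding(A, k=2):
    vals, vecs = np.linalg.eigh(A)
    return vecs[:, np.argsort(vals)[::-1][:k]]   # largest eigenvalues

def signed_laplacian_embedding(A, k=2):
    D_bar = np.diag(np.abs(A).sum(axis=1))
    vals, vecs = np.linalg.eigh(D_bar - A)
    return vecs[:, np.argsort(vals)[:k]]         # smallest eigenvalues

# Tiny 2-community signed graph: triangles {0,1,2} and {3,4,5},
# joined by one negative inter-community edge.
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]:
    A[i, j] = A[j, i] = 1.0
A[2, 3] = A[3, 2] = -1.0

print("adjacency embedding:\n", np.round(adjacency_embedding(A), 3))
print("signed Laplacian embedding:\n", np.round(signed_laplacian_embedding(A), 3))
```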
6 Related Work
There are several studies on community partitioning in social networks with negative (or negatively weighted) edges [1, 3]. In [1], Bansal et al. introduced correlation clustering and showed that partitioning a complete signed graph is an NP-hard problem. In [3], Demaine and Immorlica gave an approximation algorithm and showed that the problem is APX-hard. Kunegis et al. in [6] presented a case study on the signed Slashdot Zoo corpus and analyzed various measures (including signed clustering coefficients and signed centrality measures). Leskovec et al. in [8] studied several signed online social networks and developed a theory of status to explain the observed edge signs. Laplacian graph kernels that apply to signed graphs were described in [7]. However, the authors only focused on 2-balanced signed graphs, and many results (such as signed graphs' definiteness property) do not hold for general k-balanced graphs.
7 Conclusion
We conducted theoretical studies based on graph perturbation to examine spectral patterns of signed graphs. Our results showed that communities in a k-balanced signed graph are distinguishable in the spectral space of its signed
adjacency matrix even if connections between communities are dense. To the best of our knowledge, these are the first reported findings showing the separability of communities in the spectral space of the signed adjacency matrix. In our future work, we will evaluate our findings using various real signed social networks. We will also develop community partition algorithms exploiting our theoretical findings and compare them with other clustering methods for signed networks.
Acknowledgment
This work was supported in part by the U.S. National Science Foundation (CCF-1047621, CNS-0831204) for L. Wu, X. Wu, and A. Lu, and by the Jiangsu Science Foundation (BK2008018) and the National Science Foundation of China (61073097) for Z.-H. Zhou.
References
1. Bansal, N., Chawla, S.: Correlation clustering. Machine Learning 56, 238–247 (2002)
2. Davis, J.A.: Clustering and structural balance in graphs. Human Relations 20, 181–187 (1967)
3. Demaine, E.D., Immorlica, N.: Correlation clustering with partial information. In: Working Notes of the 6th International Workshop on Approximation Algorithms for Combinatorial Optimization Problems, pp. 1–13 (2003)
4. Hage, P., Harary, F.: Structural models in anthropology, pp. 56–60. Cambridge University Press, Cambridge (1983)
5. Inohara, T.: Characterization of clusterability of signed graph in terms of Newcomb's balance of sentiments. Applied Mathematics and Computation 133, 93–104 (2002)
6. Kunegis, J., Lommatzsch, A., Bauckhage, C.: The Slashdot Zoo: mining a social network with negative edges. In: WWW 2009, pp. 741–750 (2009)
7. Kunegis, J., Schmidt, S., Lommatzsch, A., Lerner, J., Luca, E.W.D., Albayrak, S.: Spectral analysis of signed graphs for clustering, prediction and visualization. In: SDM, pp. 559–570 (2010)
8. Leskovec, J., Huttenlocher, D., Kleinberg, J.: Signed networks in social media. In: CHI, pp. 1361–1370 (2010)
9. Prakash, B.A., Sridharan, A., Seshadri, M., Machiraju, S., Faloutsos, C.: EigenSpokes: Surprising patterns and scalable community chipping in large graphs. In: Zaki, M.J., Yu, J.X., Ravindran, B., Pudi, V. (eds.) PAKDD 2010. LNCS, vol. 6119, pp. 435–448. Springer, Heidelberg (2010)
10. Stewart, G.W., Sun, J.: Matrix perturbation theory. Academic Press, London (1990)
11. Wu, L., Ying, X., Wu, X., Zhou, Z.-H.: Line orthogonality in adjacency eigenspace with application to community partition. Technical Report, UNC Charlotte (2010)
12. Ying, X., Wu, X.: On randomness measures for social networks. In: SDM, pp. 709–720 (2009)
Spectral Analysis for Billion-Scale Graphs: Discoveries and Implementation U Kang, Brendan Meeder, and Christos Faloutsos Carnegie Mellon University, School of Computer Science {ukang,bmeeder,christos}@cs.cmu.edu
Abstract. Given a graph with billions of nodes and edges, how can we find patterns and anomalies? Are there nodes that participate in too many or too few triangles? Are there close-knit near-cliques? These questions are expensive to answer unless we have the first several eigenvalues and eigenvectors of the graph adjacency matrix. However, eigensolvers suffer from subtle problems (e.g., convergence) for large sparse matrices, let alone for billion-scale ones. We address this problem with the proposed HEigen algorithm, which we carefully design to be accurate, efficient, and able to run on the highly scalable MapReduce (Hadoop) environment. This enables HEigen to handle matrices more than 1000× larger than those which can be analyzed by existing algorithms. We implement HEigen and run it on the M45 cluster, one of the top 50 supercomputers in the world. We report important discoveries about near-cliques and triangles on several real-world graphs, including a snapshot of the Twitter social network (38Gb, 2 billion edges) and the “YahooWeb” dataset, one of the largest publicly available graphs (120Gb, 1.4 billion nodes, 6.6 billion edges).
1 Introduction
Graphs with billions of edges, or billion-scale graphs, are becoming common; Facebook boasts about 0.5 billion active users, who-calls-whom networks can reach similar sizes in large countries, and web crawls can easily reach billions of nodes. Given a billion-scale graph, how can we find near-cliques, the count of triangles, and related graph properties? As we discuss later, triangle counting and related expensive operations can be computed quickly, provided we have the first several eigenvalues and eigenvectors. In general, spectral analysis is a fundamental tool not only for graph mining, but also for other areas of data mining. Eigenvalues and eigenvectors are at the heart of numerous algorithms such as triangle counting, singular value decomposition (SVD), spectral clustering, and tensor analysis [10]. In spite of their importance, existing eigensolvers do not scale well. As described in Section 6, the maximum order and size of input matrices feasible for these solvers is million-scale. In this paper, we discover patterns on near-cliques and triangles, on several real-world graphs including a Twitter dataset (38Gb, over 2 billion edges) and the “YahooWeb” dataset, one of the largest publicly available graphs (120Gb, 1.4 billion nodes, 6.6 billion edges). To enable discoveries, we propose HEigen, an eigensolver for billion-scale, sparse symmetric matrices built on top of Hadoop, an open-source MapReduce framework. Our contributions are the following:
1. Effectiveness: With HEigen we analyze billion-scale real-world graphs and report discoveries, including a high triangle vs. degree ratio for adult sites and web pages that participate in billions of triangles.
2. Careful Design: We choose among several serial algorithms and selectively parallelize operations for better efficiency.
3. Scalability: We use the Hadoop platform for its excellent scalability and implement several optimizations for HEigen, such as cache-based multiplications and skewness exploitation. This results in linear scalability in the number of edges, the same accuracy as standard eigensolvers for small matrices, and more than a 76× performance improvement over a naive implementation.
Due to our focus on scalability, HEigen can handle sparse symmetric matrices with billions of nodes and edges, surpassing the capability of previous eigensolvers (e.g., [20] [16]) by more than 1,000×. Note that HEigen is different from Google's PageRank algorithm since HEigen computes the top k eigenvectors while PageRank computes only the first eigenvector. Designing a top k eigensolver is much more difficult and subtle than designing a first-eigenvector solver, as we will see in Section 4. With this powerful tool we are able to study several billion-scale graphs, and we report fascinating patterns on the near-cliques and triangle distributions in Section 2. The HEigen algorithm (implemented in Hadoop) is available at http://www.cs.cmu.edu/∼ukang/HEIGEN. The rest of the paper presents the discoveries in real-world networks, design decisions and details of our method, experimental results, and conclusions.
2 Discoveries
In this section, we show discoveries on billion-scale graphs using HEigen. We focus on the structural properties of networks: spotting near-cliques and finding triangles. The graphs we used in this section and in Section 5 are described in Table 1.¹
2.1 Spotting Near-Cliques
In a large, sparse network, how can we find tightly connected nodes, such as those in near-cliques or bipartite cores? Surprisingly, eigenvectors can be used for this purpose [14]. Given an adjacency matrix W and its SVD W = U Σ V^T, an EE-plot is defined to be the scatter plot of the vectors U_i and U_j for any i and j. EE-plots of some real-world graphs contain clear separate lines (or ‘spokes’), and the nodes with the largest values in each spoke are separated from the other nodes by forming near-cliques or bipartite cores. Figure 1 shows several EE-plots and spyplots (i.e., adjacency matrices of induced subgraphs) of the top 100 nodes in the top eigenvectors of the YahooWeb graph. In Figures 1 (a) - (d), we observe clear ‘spokes,’ or outstanding nodes, in the top eigenvectors. Moreover, the top 100 nodes with the largest values in U1, U2, and U4 form a
¹ YahooWeb, LinkedIn: released under NDA. Twitter: http://www.twitter.com/ Kronecker: http://www.cs.cmu.edu/∼ukang/dataset Epinions: not public data.
Table 1. Order and size of networks

Name        Nodes           Edges               Description
YahooWeb    1,413 M         6,636 M             WWW pages in 2002
Twitter     62.5 M          2,780 M             who follows whom in 2009/11
LinkedIn    7.5 M           58 M                person-person in 2006
Kronecker   59 K ∼ 177 K    282 M ∼ 1,977 M     synthetic graph
Epinions    75 K            508 K               who trusts whom
Fig. 1 (panels: (a) U4 vs. U1, (b) U3 vs. U2, (c) U7 vs. U5, (d) U8 vs. U6, (e) U1 spoke, (f) U2 spoke, (g) U3 spoke, (h) U4 spoke, (i) structure of bi-clique). EE-plots and spyplots from YahooWeb. (a)-(d): EE-plots showing the values of nodes in the ith eigenvector vs. in the jth eigenvector. Notice the clear ‘spokes’ in top eigenvectors signify the existence of a strongly related group of nodes in near-cliques or bi-cliques as depicted in (i). (e)-(h): Spyplots of the top 100 largest nodes from each eigenvector. Notice that we see a near clique in U3, and bi-cliques in U1, U2, and U4. (i): The structure of ‘bi-clique’ in (e), (f), and (h).
‘bi-clique’, shown in (e), (f), and (h), which is defined to be the combination of a clique and a bipartite core as depicted in Figure 1 (i). Another observation is that the top seven nodes shown in Figure 1 (g) belong to indymedia.org, which is the site with the maximum number of triangles in Figure 2.
Observation 1 (Eigenspokes). EE-plots of YahooWeb show clear spokes. Additionally, the extreme nodes in the spokes belong to cliques or bi-cliques.
2.2 Triangle Counting
Given a particular node in a graph, how are its neighbors connected? Do they form stars? Cliques? The above questions about the community structure of networks can be answered by studying triangles (three nodes connected to each other). However, directly counting triangles in graphs with billions of nodes and edges is prohibitively expensive [19]. Fortunately, we can approximate triangle counts with high accuracy using HEigen by exploiting their connection to eigenvalues [18]. In a nutshell, the total number of triangles in a graph is related to the sum of cubes of eigenvalues, and the first few eigenvalues provide extremely good approximations. A slightly more elaborate analysis approximates the number of triangles in which a node participates, using the cubes of the first few eigenvalues and the corresponding eigenvectors.
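A small sketch of the eigenvalue-based counts referred to above is given below. It is an illustration under standard formulas (global count = Σ λ_i³ / 6, per-node count for v = Σ_i λ_i³ u_i(v)² / 2), not the HEigen implementation; the toy graph is an assumption.

```python
# Sketch: triangle counts from eigenpairs of the adjacency matrix.
# Exact when all eigenpairs are used; approximate with only the top k.
import numpy as np

def triangle_counts(A, k=None):
    vals, vecs = np.linalg.eigh(A)
    if k is not None:
        idx = np.argsort(np.abs(vals))[::-1][:k]   # largest-magnitude eigenvalues
        vals, vecs = vals[idx], vecs[:, idx]
    total = (vals ** 3).sum() / 6.0                # trace(A^3) / 6
    per_node = (vecs ** 2 @ vals ** 3) / 2.0       # diag(A^3) / 2
    return total, per_node

# 4-clique on nodes 0..3 plus a pendant node 4 attached to node 3.
A = np.ones((5, 5)) - np.eye(5)
A[4, :] = A[:, 4] = 0.0
A[3, 4] = A[4, 3] = 1.0
total, per_node = triangle_counts(A)
print("total triangles:", round(total, 2))        # 4 triangles in a 4-clique
print("per-node counts:", np.round(per_node, 2))  # 3 each for clique nodes, 0 for node 4
```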
Fig. 2 (panels: (a) LinkedIn (58M edges), (b) Twitter (2.8B edges), (c) YahooWeb (6.6B edges)). The distribution of the number of participating triangles of real graphs. In general, they obey the “triangle power-law.” Moreover, well-known U.S. politicians participate in many triangles, demonstrating that their followers are well-connected. In the YahooWeb graph, we observe several anomalous spikes which possibly come from cliques.
Using the top k eigenvalues computed with HEigen, we analyze the distribution of triangle counts of real graphs, including the LinkedIn, Twitter, and YahooWeb graphs, in Figure 2. We first observe that there exist several nodes with extremely large triangle counts. In Figure 2 (b), Barack Obama is the person with the fifth largest number of participating triangles, and has many more than other U.S. politicians. In Figure 2 (c), the web page lists.indymedia.org contains the largest number of triangles; this page is a list of mailing lists which apparently point to each other. We also observe regularities in triangle distributions and note that the beginning part of the distributions follows a power-law.
Observation 2 (Triangle power law). The beginning part of the triangle count distribution of real graphs follows a power-law.
In the YahooWeb graph in Figure 2 (c), we observe many spikes. One possible reason for the spikes is that they come from cliques: a k-clique generates k nodes, each with $\binom{k-1}{2}$ triangles.
Observation 3 (Spikes in triangle distribution). In the Web graph, there exist several spikes which possibly come from cliques.
The rightmost spike in Figure 2 (c) contains 125 web pages that each have about 1 million triangles in their neighborhoods. They all belong to the news site ucimc.org, and are connected to a tightly coupled group of pages. Triangle counts exhibit even more interesting patterns when combined with the degree information, as shown in the degree-triangle plot of Figure 3. We see that celebrities have high degree and mildly connected followers, while accounts for adult sites have many fewer, but extremely well connected, followers. Degree-triangle plots can be used to spot and eliminate harmful accounts such as those of adult advertisers and spammers.
Observation 4 (Anomalous Triangles vs. Degree Ratio). In Twitter, anomalous accounts have a very high triangles vs. degree ratio compared to other regular accounts.
All of the above observations need a fast, scalable eigensolver. This is exactly what HEigen does, and we describe our proposed design next.
Fig. 3. The degree vs. participating triangles of some ‘celebrities’ (rest: omitted, for clarity) in Twitter accounts. Also shown are accounts of adult sites which have smaller degree, but belong to an abnormally large number of triangles (= many, well connected followers - probably, ‘robots’).
3 Background - Sequential Algorithms
In the next two sections, we describe our method of computing eigenvalues and eigenvectors of billion-scale graphs. We first describe sequential algorithms to find eigenvalues and eigenvectors of matrices. We limit our attention to symmetric matrices due to computational difficulties; even the best non-symmetric eigensolvers require much more computation than symmetric eigensolvers. We list the alternatives for computing the eigenvalues of a symmetric matrix and the reasoning behind our choice.
– Power method: the simplest and most famous method for computing the topmost eigenvalue. However, it cannot find the top k eigenvalues.
– Simultaneous iteration (or QR): an extension of the Power method to find the top k eigenvalues. It requires large matrix-matrix multiplications that are prohibitively expensive for billion-scale graphs.
– Lanczos-NO (No Orthogonalization): the basic Lanczos algorithm [5], which approximates the top k eigenvalues in the subspace composed of intermediate vectors from the Power method. The problem is that while computing the eigenvalues, they can ‘jump’ up to larger eigenvalues, thereby outputting spurious eigenvalues.
Although all of the above algorithms are not suitable for calculations on billion-scale graphs using MapReduce, we present a tractable, MapReduce-based algorithm for computing the top k eigenvectors and eigenvalues in the next section.
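For reference, the sketch below is a textbook power method (our own illustration, not HEigen code). It makes the limitation stated above concrete: the iteration converges to a single dominant eigenpair, which is why a different algorithm is needed to obtain the top k.

```python
# Sketch: the power method converges to the single largest eigenpair.
import numpy as np

def power_method(A, iters=200, tol=1e-10, seed=0):
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(A.shape[0])
    v /= np.linalg.norm(v)
    lam = 0.0
    for _ in range(iters):
        w = A @ v
        lam_new = v @ w                 # Rayleigh quotient estimate
        v = w / np.linalg.norm(w)
        if abs(lam_new - lam) < tol:
            break
        lam = lam_new
    return lam, v

A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
lam, v = power_method(A)
print("power method:", round(lam, 4), "| numpy:", round(np.linalg.eigvalsh(A)[-1], 4))
```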
4 Proposed Method
In this section we describe HEigen, a parallel algorithm for computing the top k eigenvalues and eigenvectors of symmetric matrices in MapReduce.
4.1 Summary of the Contributions
Efficient top k eigensolvers for billion-scale graphs require careful algorithmic considerations. The main challenge is to carefully design algorithms that work well on distributed systems and exploit the inherent structure of data, including block structure and skewness, in order to be efficient. We summarize the algorithmic contributions here and describe each in detail in later sections.
1. Careful Algorithm Choice: We carefully choose a sequential eigensolver algorithm that is efficient for MapReduce and gives accurate results.
2. Selective Parallelization: We group operations into expensive and inexpensive ones based on input sizes. Expensive operations are done in parallel for scalability, while inexpensive operations are performed faster on a single machine.
3. Blocking: We reduce the running time by decreasing the input data size and the amount of network traffic among machines.
4. Exploiting Skewness: We decrease the running time by exploiting skewness of data.
4.2 Careful Algorithm Choice
In Section 3, we considered three algorithms that are not tractable for analyzing billion-scale graphs with MapReduce. Fortunately, there is an algorithm suitable for such a purpose. Lanczos-SO (Selective Orthogonalization) improves on Lanczos-NO by selectively reorthogonalizing vectors instead of performing full reorthogonalizations.

Algorithm 1. Lanczos-SO (Selective Orthogonalization)
Input: Matrix A^{n×n}, random n-vector b, maximum number of steps m, error threshold ε
Output: Top k eigenvalues λ[1..k], eigenvectors Y^{n×k}
1:  β_0 ← 0, v_0 ← 0, v_1 ← b/||b||;
2:  for i = 1..m do
3:    v ← A v_i;                        // Find a new basis vector
4:    α_i ← v_i^T v;
5:    v ← v − β_{i−1} v_{i−1} − α_i v_i;  // Orthogonalize against two previous basis vectors
6:    β_i ← ||v||;
7:    T_i ← (build tri-diagonal matrix from α and β);
8:    QDQ^T ← EIG(T_i);                 // Eigen decomposition of T_i
9:    for j = 1..i do
10:     if β_i |Q[i, j]| ≤ √ε ||T_i|| then
11:       r ← V_i Q[:, j];
12:       v ← v − (r^T v) r;            // Selectively orthogonalize
13:     end if
14:   end for
15:   if (v was selectively orthogonalized) then
16:     β_i ← ||v||;                    // Recompute normalization constant β_i
17:   end if
18:   if β_i = 0 then
19:     break for loop;
20:   end if
21:   v_{i+1} ← v/β_i;
22: end for
23: T ← (build tri-diagonal matrix from α and β);
24: QDQ^T ← EIG(T);                     // Eigen decomposition of T
25: λ[1..k] ← top k diagonal elements of D;  // Compute eigenvalues
26: Y ← V_m Q_k;                        // Compute eigenvectors. Q_k is the columns of Q corresponding to λ
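To make the structure of the algorithm tangible, the following is a compact single-machine sketch. For simplicity it uses full reorthogonalization rather than the selective test of line 10, so it is only an illustration of the Lanczos idea, not a substitute for Algorithm 1 or for HEigen; the matrix size and the number of steps are arbitrary choices.

```python
# Sketch: in-memory Lanczos with full reorthogonalization for the top-k
# eigenpairs of a small symmetric matrix (illustration only).
import numpy as np

def lanczos_topk(A, k, m=40, seed=0):
    n = A.shape[0]
    rng = np.random.default_rng(seed)
    V = np.zeros((n, m))
    alpha, beta = np.zeros(m), np.zeros(m)
    v = rng.standard_normal(n)
    V[:, 0] = v / np.linalg.norm(v)
    for i in range(m):
        w = A @ V[:, i]
        alpha[i] = V[:, i] @ w
        w -= V[:, :i + 1] @ (V[:, :i + 1].T @ w)   # full reorthogonalization
        beta[i] = np.linalg.norm(w)
        if beta[i] < 1e-12 or i == m - 1:
            m = i + 1
            break
        V[:, i + 1] = w / beta[i]
    T = np.diag(alpha[:m]) + np.diag(beta[:m - 1], 1) + np.diag(beta[:m - 1], -1)
    d, Q = np.linalg.eigh(T)                       # eigen decomposition of T
    top = np.argsort(d)[::-1][:k]
    return d[top], V[:, :m] @ Q[:, top]            # eigenvalues, eigenvectors

A = np.random.default_rng(1).standard_normal((80, 80))
A = (A + A.T) / 2
vals, _ = lanczos_topk(A, k=3)
print("Lanczos:", np.round(vals, 4))
print("exact:  ", np.round(np.linalg.eigvalsh(A)[-3:][::-1], 4))
```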
The main idea of Lanczos-SO is as follows: We start with a random initial basis vector b which comprises a rank-1 subspace. For each iteration, a new basis vector
is computed by multiplying the input matrix with the previous basis vector. The new basis vector is then orthogonalized against the last two basis vectors and is added to the previous rank-(m − 1) subspace, forming a rank-m subspace. Let m be the number of the current iteration, Q_m be the n × m matrix whose ith column is the ith basis vector, and A be the matrix for which we want to compute eigenvalues. We also define T_m = Q_m^∗ A Q_m to be an m × m matrix. Then, the eigenvalues of T_m are good approximations of the eigenvalues of A. Furthermore, multiplying Q_m by the eigenvectors of T_m gives good approximations of the eigenvectors of A. We refer to [17] for further details. If we used exact arithmetic, the newly computed basis vector would be orthogonal to all previous basis vectors. However, rounding errors from floating-point calculations compound and result in the loss of orthogonality. This is the cause of the spurious eigenvalues in Lanczos-NO. Orthogonality can be recovered once the new basis vector is fully re-orthogonalized to all previous vectors. However, doing this becomes expensive as it requires O(m^2) re-orthogonalizations, where m is the number of iterations. A better approach uses a quick test (line 10 of Algorithm 1) to selectively choose vectors that need to be re-orthogonalized to the new basis [6]. This selective-reorthogonalization idea is shown in Algorithm 1. Lanczos-SO has all the properties that we need: it finds the top k largest eigenvalues and eigenvectors, it produces no spurious eigenvalues, and its most expensive operation, a matrix-vector multiplication, is tractable in MapReduce. Therefore, we choose Lanczos-SO as the sequential algorithm to parallelize.
4.3 Selective Parallelization
Among many sub-operations in Algorithm 1, which operations should we parallelize? A naive approach is to parallelize all the operations; however, some operations run more quickly on a single machine rather than on multiple machines in parallel. The reason is that the overhead incurred by using MapReduce exceeds the gains made by parallelizing the task; simple tasks where the input data is very small complete faster on a single machine. Thus, we divide the sub-operations into two groups: those to be parallelized and those to be run on a single machine. Table 2 summarizes our choice for each sub-operation. Note that the last two operations in the table can be done with a single-machine standard eigensolver since the input matrices are tiny; they have m rows and columns, where m is the number of iterations.
4.4 Blocking
Minimizing the volume of information sent between nodes is important to designing efficient distributed algorithms. In HEigen, we decrease the amount of network traffic by using block-based operations. Normally, one would put each edge "(source, destination)" on one line; Hadoop treats each line as a data element for its map() function. Instead, we propose to divide the adjacency matrix into blocks (and, of course, the corresponding vectors also into blocks), put the edges of each block on a single line, and compress the source- and destination-ids. This makes the map() function a bit more complicated to process blocks, but it saves significant data transfer time over the network. We use these edge-blocks and the vector-blocks for many parallel operations in Table 2, including matrix-vector multiplication, vector update, vector dot product,
Table 2. Parallelization choices. The last column (P?) indicates whether the operation is parallelized in HEigen. Some operations are better run in parallel since the input size is very large, while others are better run on a single machine since the input size is small and the overhead of parallel execution overshadows its decreased running time.

Operation                      Description                                                     Input   P?
y ← y + ax                     vector update                                                   Large   Yes
γ ← x^T x                      vector dot product                                              Large   Yes
y ← αy                         vector scale                                                    Large   Yes
||y||                          vector L2 norm                                                  Large   Yes
y ← M^{n×n} x                  large matrix - large, dense vector multiplication               Large   Yes
y ← M_s^{n×m} x_s              large matrix - small vector multiplication (n ≫ m)              Large   Yes
A_s ← M_s^{n×m} N_s^{m×k}      large matrix - small matrix multiplication (n ≫ m > k)          Large   Yes
||T||                          matrix L2 norm, i.e., the largest singular value of the matrix  Tiny    No
EIG(T)                         symmetric eigen decomposition to output QDQ^T                   Tiny    No
vector scale, and vector L2 norm. Performing operations on blocks is faster than doing so on individual elements since both the input size and the key space decrease. This reduces the network traffic and sorting time in the MapReduce Shuffle stage. As we will see in Section 5, the blocking decreases the running time by more than 4×.

Algorithm 2. CBMV (Cache-Based Matrix-Vector Multiplication) for HEigen
Input: Matrix M = {(id_src, (id_dst, mval))}, Vector x = {(id, vval)}
Output: Result vector y
1:  Stage1-Map(key k, value v, Vector x)   // Multiply matrix elements and the vector x
2:    id_src ← k;
3:    (id_dst, mval) ← v;
4:    Output(id_src, (mval × x[id_dst]));  // Multiply and output partial results
5:
6:  Stage1-Reduce(key k, values V[])       // Sum up partial results
7:    sum ← 0;
8:    for v ∈ V do
9:      sum ← sum + v;
10:   end for
11:   Output(k, sum);
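The contrast between the two multiplication strategies can be simulated on a single machine with plain dictionaries, as in the sketch below. This is our own illustration, not Hadoop code: the "map" and "reduce" steps are ordinary functions, and all names are ours.

```python
# Sketch: standard two-stage join vs. CBMV-style one-stage multiply,
# where the vector is assumed to fit in memory and is handed to the mapper.
from collections import defaultdict

M = {(0, 1): 2.0, (1, 0): 2.0, (1, 2): 1.0, (2, 1): 1.0}   # sparse matrix entries
x = {0: 1.0, 1: 0.5, 2: -1.0}                               # vector entries

def cbmv(M, x):
    """One-stage multiply: the vector is available inside the map step."""
    partial = defaultdict(float)
    for (src, dst), mval in M.items():        # "map": emit partial products
        partial[src] += mval * x[dst]
    return dict(partial)                      # "reduce": sum per row

def two_stage(M, x):
    """Two-stage multiply: stage 1 joins matrix and vector entries first."""
    joined = [(src, mval * x[dst]) for (src, dst), mval in M.items()]
    result = defaultdict(float)
    for src, prod in joined:                  # stage 2: group by row and sum
        result[src] += prod
    return dict(result)

print(cbmv(M, x))        # {0: 1.0, 1: 1.0, 2: 0.5}
print(two_stage(M, x))   # identical result, via an explicit join step
```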
4.5 Exploit Skewness: Matrix-Vector Multiplication
HEigen uses an adaptive method for sub-operations based on the size of the data. In this section, we describe how HEigen implements different matrix-vector multiplication algorithms by exploiting the skewness pattern of the data. There are two matrix-vector multiplication operations in Algorithm 1: the one with a large vector (line 3) and the other with a small vector (line 11).
The first matrix-vector operation multiplies a matrix with a large and dense vector, and thus it requires the two-stage standard MapReduce algorithm by Kang et al. [9]. In the first stage, matrix elements and vector elements are joined and multiplied to make partial results, which are added together to get the result vector in the second stage. The other matrix-vector operation, however, multiplies with a small vector. HEigen uses the fact that the small vector can fit in a machine's main memory, and distributes the small vector to all the mappers using the distributed cache functionality of Hadoop. The advantage of the small vector being available in the mappers is that joining edge elements and vector elements can be done inside the mapper, and thus the first stage of the standard two-stage matrix-vector multiplication can be omitted. In this one-stage algorithm the mapper joins matrix elements and vector elements to make partial results, and the reducer adds up the partial results. The pseudo code of this algorithm, which we call CBMV (Cache-Based Matrix-Vector multiplication), is shown in Algorithm 2. We want to emphasize that this operation cannot be performed when the vector is large, as is the case in the first matrix-vector multiplication. CBMV is faster than the standard method by 57× as described in Section 5.
4.6 Exploiting Skewness: Matrix-Matrix Multiplication
Skewness can also be exploited to efficiently perform matrix-matrix multiplication (line 26 of Algorithm 1). In general, matrix-matrix multiplication is very expensive. A standard, yet naive, way of multiplying two matrices A and B in MapReduce is to multiply A[:, i] and B[i, :] for each column i of A and sum the resulting matrices. This algorithm, which we call MM (direct Matrix-Matrix multiplication), is very inefficient since it generates huge matrices and sums them up many times. Fortunately, when one of the matrices is very small, we can utilize the skewness to make an efficient MapReduce algorithm. This is exactly the case in HEigen; the first matrix is very large, and the second is very small. The main idea is to distribute the second matrix by the distributed cache functionality in Hadoop, and multiply each element of the first matrix with the corresponding rows of the second matrix. We call the resulting algorithm Cache-Based Matrix-Matrix multiplication, or CBMM. There are other alternatives to matrix-matrix multiplication: one can decompose the second matrix into column vectors and iteratively multiply the first matrix with each of these vectors. We call these algorithms, introduced in Section 4.5, Iterative matrix-vector multiplications (IMV) and Cache-based iterative matrix-vector multiplications (CBMV). The difference between CBMV and IMV is that CBMV uses cache-based operations while IMV does not. As we will see in Section 5, the best method, CBMM, is faster than naive methods by 76×.
4.7 Analysis
We analyze the time and the space complexity of HEigen. In the lemmas below, m is the number of iterations, |V| and |E| are the numbers of nodes and edges, and M is the number of machines.
Lemma 1 (Time Complexity). HEigen takes O(m · ((|V|+|E|)/M) · log((|V|+|E|)/M)) time.
Proof. (Sketch) The running time of one iteration of HEigen is dominated by the matrix-large-vector multiplication, whose running time is O(((|V|+|E|)/M) · log((|V|+|E|)/M)).
Lemma 2 (Space Complexity). HEigen requires O(|V| + |E|) space.
Proof. (Sketch) The maximum storage is required at the intermediate output of the two-stage matrix-vector multiplication, where O(|V| + |E|) space is needed.
5 Performance
In this section, we present experimental results to answer the following questions:
– Scalability: How well does HEigen scale up?
– Optimizations: Which of our proposed methods give the best performance?
We perform experiments on the Yahoo! M45 Hadoop cluster with a total of 480 hosts, 1.5 petabytes of storage, and 3.5 terabytes of memory. We use Hadoop 0.20.1. The scalability experiments are performed using synthetic Kronecker graphs [12] since realistic graphs of any size can be easily generated.
5.1 Scalability
Figure 4(a,b) shows the scalability of HEigen-BLOCK, an implementation of HEigen that uses blocking, and HEigen-PLAIN, an implementation which does not. Notice that the running time is near-linear in the number of edges and machines. We also note that HEigen-BLOCK performs up to 4× faster when compared to HEigen-PLAIN.
5.2 Optimizations
Figure 4(c) shows the comparison of running times of the skewed matrix-matrix multiplication and the matrix-vector multiplication algorithms. We used 100 machines for the YahooWeb data. For matrix-matrix multiplications, the best method is our proposed CBMM, which is 76× faster than repeated naive matrix-vector multiplications (IMV). The slowest MM algorithm did not even finish, and failed due to the excessive amount of data. For matrix-vector multiplications, our proposed CBMV is faster than the naive method (IMV) by 48×.
6 Related Works
The related works form two groups: eigensolvers and MapReduce/Hadoop.
Large-scale Eigensolvers: There are many parallel eigensolvers for large matrices: the work by Zhao et al. [21], HPEC [7], PLANSO [20], ARPACK [15], ScaLAPACK [4], and PLAPACK [3] are several examples. All of them are based on MPI with message passing, which has difficulty in dealing with billion-scale graphs. The maximum order of matrices analyzed with these tools is less than 1 million [20] [16], which is far from web-scale data. Very recently (March 2010), the Mahout project [2] provided an SVD on
Fig. 4. (a) Running time vs. number of edges in 1 iteration of HEigen with 50 machines. Notice the near-linear running time proportional to the number of edges. (b) Running time vs. number of machines in 1 iteration of HEigen. The running time decreases as the number of machines increases. (c) Comparison of running times between different skewed matrix-matrix and matrix-vector multiplications. For matrix-matrix multiplication, our proposed CBMM outperforms naive methods by at least 76×. The slowest matrix-matrix multiplication algorithm (MM) did not even finish, and the job failed due to excessive data. For matrix-vector multiplication, our proposed CBMV is faster than the naive method by 57×.
top of Hadoop. Due to insufficient documentation, we were not able to find the input format and run a head-to-head comparison. But, reading the source code, we discovered that Mahout suffers from two major issues: (a) it assumes that the vector (b, with n = O(billion) entries) fits in the memory of a single machine, and (b) it implements the full re-orthogonalization, which is inefficient.
MapReduce and Hadoop: MapReduce is a parallel programming framework for processing web-scale data. MapReduce has two major advantages: (a) it handles data distribution, replication, and load balancing automatically, and furthermore (b) it uses familiar concepts from functional programming. The programmer needs to provide only the map and the reduce functions. The general framework is as follows [11]: The map stage processes input and outputs (key, value) pairs. The shuffling stage sorts the map output and distributes it to reducers. Finally, the reduce stage processes the values with the same key and outputs the final result. Hadoop [1] is the open source implementation of MapReduce. It also provides a distributed file system (HDFS) and data processing tools such as PIG [13] and Hive. Due to its extreme scalability and ease of use, Hadoop is widely used for large-scale data mining [9, 8].
7 Conclusion
In this paper we discovered patterns in real-world, billion-scale graphs. This was possible by using HEigen, our proposed eigensolver for the spectral analysis of very large-scale graphs. The main contributions are the following:
– Effectiveness: We analyze spectral properties of real-world graphs, including Twitter and one of the largest public Web graphs. We report patterns that can be used for anomaly detection and find tightly-knit communities.
– Careful Design: We carefully design HEigen to selectively parallelize operations based on how they are most effectively performed.
– Scalability: We implement and evaluate a billion-scale eigensolver. Experimentation shows that HEigen is accurate and scales linearly with the number of edges.
Future research directions include extending the analysis and the algorithms for multidimensional matrices, or tensors [10].
Acknowledgements This material is based upon work supported by the National Science Foundation under Grants No. IIS-0705359, IIS0808661, IIS-0910453, and CCF-1019104, by the Defense Threat Reduction Agency under contract No. HDTRA1-10-1-0120, and by the Army Research Laboratory under Cooperative Agreement Number W911NF-09-2-0053. This work is also partially supported by an IBM Faculty Award, and the Gordon and Betty Moore Foundation, in the eScience project. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Laboratory or the U.S. Government or other funding parties. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation here on. Brendan Meeder is also supported by a NSF Graduate Research Fellowship and funding from the Fine Foundation, Sloan Foundation, and Microsoft.
References
[1] Hadoop information, http://hadoop.apache.org/
[2] Mahout information, http://lucene.apache.org/mahout/
[3] Alpatov, P., Baker, G., Edward, C., Gunnels, J., Morrow, G., Overfelt, J., van de Geijn, R., Wu, Y.-J.: Plapack: Parallel linear algebra package - design overview. In: SC 1997 (1997)
[4] Blackford, L., Choi, J., Cleary, A., D'Azevedo, E., Demmel, J., Dhillon, I.: Scalapack user's guide. In: SIAM (1997)
[5] Lanczos, C.: An iteration method for the solution of the eigenvalue problem of linear differential and integral operators. J. Res. Nat. Bur. Stand. (1950)
[6] Demmel, J.W.: Applied numerical linear algebra. SIAM, Philadelphia (1997)
[7] Guarracino, M.R., Perla, F., Zanetti, P.: A parallel block Lanczos algorithm and its implementation for the evaluation of some eigenvalues of large sparse symmetric matrices on multicomputers. Int. J. Appl. Math. Comput. Sci. (2006)
[8] Kang, U, Chau, D.H., Faloutsos, C.: Mining large graphs: Algorithms, inference, and discoveries. In: IEEE International Conference on Data Engineering (2011)
[9] Kang, U, Tsourakakis, C., Faloutsos, C.: Pegasus: A peta-scale graph mining system - implementation and observations. In: ICDM (2009)
[10] Kolda, T.G., Sun, J.: Scalable tensor decompositions for multi-aspect data mining. In: ICDM (2008)
[11] Lämmel, R.: Google's mapreduce programming model – revisited. Science of Computer Programming 70, 1–30 (2008)
[12] Leskovec, J., Chakrabarti, D., Kleinberg, J.M., Faloutsos, C.: Realistic, mathematically tractable graph generation and evolution, using Kronecker multiplication. In: Jorge, A.M., Torgo, L., Brazdil, P.B., Camacho, R., Gama, J. (eds.) PKDD 2005. LNCS (LNAI), vol. 3721, pp. 133–145. Springer, Heidelberg (2005)
[13] Olston, C., Reed, B., Srivastava, U., Kumar, R., Tomkins, A.: Pig latin: a not-so-foreign language for data processing. In: SIGMOD 2008 (2008)
[14] Prakash, B.A., Sridharan, A., Seshadri, M., Machiraju, S., Faloutsos, C.: EigenSpokes: Surprising patterns and scalable community chipping in large graphs. In: Zaki, M.J., Yu, J.X., Ravindran, B., Pudi, V. (eds.) PAKDD 2010. LNCS, vol. 6119, pp. 435–448. Springer, Heidelberg (2010)
[15] Lampe, J., Lehoucq, R.B., Sorensen, D.C., Yang, C.: Arpack user's guide: Solution of large-scale eigenvalue problems with implicitly restarted Arnoldi methods. SIAM, Philadelphia (1998)
[16] Song, Y., Chen, W., Bai, H., Lin, C., Chang, E.: Parallel spectral clustering. In: ECML (2008)
[17] Trefethen, L.N., Bau III, D.: Numerical linear algebra. SIAM, Philadelphia (1997)
[18] Tsourakakis, C.: Fast counting of triangles in large real networks without counting: Algorithms and laws. In: ICDM (2008)
[19] Tsourakakis, C.E., Kang, U, Miller, G.L., Faloutsos, C.: Doulion: Counting triangles in massive graphs with a coin. In: KDD (2009)
[20] Wu, K., Simon, H.: A parallel Lanczos method for symmetric generalized eigenvalue problems. Computing and Visualization in Science (1999)
[21] Zhao, Y., Chi, X., Cheng, Q.: An implementation of parallel eigenvalue computation using dual-level hybrid parallelism. LNCS (2007)
LGM: Mining Frequent Subgraphs from Linear Graphs Yasuo Tabei1 , Daisuke Okanohara2, Shuichi Hirose3 , and Koji Tsuda1,3 1
ERATO Minato Project, Japan Science and Technology Agency, Sapporo, Japan 2 Preferred Infrastructure, Inc, Tokyo, Japan 3 Computational Biology Research Center, National Institute of Advanced Industrial Science and Technology (AIST), Tokyo, Japan
Abstract. A linear graph is a graph whose vertices are totally ordered. Biological and linguistic sequences with interactions among symbols are naturally represented as linear graphs. Examples include protein contact maps, RNA secondary structures and predicate-argument structures. Our algorithm, linear graph miner (LGM), leverages the vertex order for efficient enumeration of frequent subgraphs. Based on the reverse search principle, the pattern space is systematically traversed without expensive duplication checking. Disconnected subgraph patterns are particularly important in linear graphs due to their sequential nature. Unlike conventional graph mining algorithms detecting connected patterns only, LGM can detect disconnected patterns as well. The utility and efficiency of LGM are demonstrated in experiments on protein contact maps.
1 Introduction
Frequent subgraph mining is an active research area with successful applications in, e.g., chemoinformatics [15], software science [4], and computer vision [13]. The task is to enumerate the complete set of frequently appearing subgraphs in a graph database. Early algorithms include AGM [8], FSG [9] and gSpan [19]. Since then, researchers have paid considerable effort to improve the efficiency, for example, by mining closed patterns only [20], or by early pruning that sacrifices completeness (e.g., leap search [18]). However, graph mining algorithms are still too slow for large graph databases (see, e.g., [17]). The scalability of graph mining algorithms is much worse than that of algorithms for more restricted classes such as trees [1] and sequences [14]. This is due to the fact that, for trees and sequences, it is possible to design a pattern extension rule that does not create duplicate patterns (e.g., rightmost extension) [1]. For general graphs, there are multiple ways to generate the same subgraph pattern, and it is necessary to detect duplicate patterns and prune the search tree whenever duplication is detected. In gSpan [19], a graph pattern is represented as a DFS code, and the duplication check is implemented via minimality checking of the code. It is a very clever
Fig. 1. An example of linear graph (six ordered vertices labeled A, B, A, B, C, A, with labeled edges a, b, c)
mechanism, because one does not need to track back the patterns generated so far. Nevertheless, the complexity of duplication checking is exponential to the pattern size [19]. It harms efficiency substantially, especially when mining large patterns. A linear graph is a graph whose vertices are totally ordered [3,5] (Figure 1). For example, protein contact maps, RNA secondary structures, alternative splicing patterns in molecular biology and predicate-argument structures [11] in natural languages can be represented as linear graphs. Amino acid residues of a protein have natural ordering from N- to C-terminus, and English words in a sentence are ordered as well. Davydov and Batzoglou [3] addressed the problem of aligning several linear graphs for RNA sequences, assessed the computational complexity, and proposed an approximate algorithm. Fertin et al. assessed the complexity of finding a maximum common pattern in a set of linear graphs [5]. In this paper, we develop a novel algorithm, linear graph miner (LGM), for enumerating frequently appearing subgraphs in a large number of linear graphs. The advantage of employing linear graphs is that we can derive a pattern extension rule that does not cause duplication, which makes LGM much more efficient than conventional graph mining algorithms. We design the extension rule based on the reverse search principle [2]. Perhaps confusingly, ’reverse search’ does not refer to a particular search method, but a guideline for designing enumeration algorithms. A pattern extension rule specifies how to generate children from a parent in the search space. In reverse search, one specifies a rule that generates a parent uniquely from a child (i.e., reduction map). The pattern extension rule is obtained by ’reversing’ the reduction map: When generating children from a parent, all possible candidates are prepared and those mapping back to the parent by the reduction map are selected. An advantage of reverse search is that, given a reduction map, the completeness of the resulting pattern extension rule can easily be proved [2]. In data mining, LCM, one of the fastest closed itemset miner, was designed using reverse search [16]. It is applied in the design of a dense module enumeration algorithm [6] and a geometric graph mining algorithm recently [12]. In computational geometry and related fields, there are many successful applications1 . LGM’s reduction map is very simple: remove the largest edge in terms of edge ordering. Fortunately, it is not necessary to take the “candidate preparation and selection” approach in LGM. We can directly reverse the reduction map to an explicit extension rule here. 1
¹ See a list of applications at http://cgm.cs.mcgill.ca/~avis/doc/rs/applications/index.html
Linear graphs can be perceived as the fusion of graphs and sequences. Sequence mining algorithms such as PrefixSpan [14] can usually detect gapped sequence patterns. In applications like motif discovery in protein contact maps [7], it is essential to allow “gaps” in linear graph patterns. More precisely, disconnected graph patterns should be allowed for such applications. Since conventional graph mining algorithms can detect only connected graph patterns, their application to contact maps is difficult. In this paper, we aim to detect connected and disconnected patterns within a unified framework. In experiments, we used a protein 3D-structure dataset from molecular biology. We compared LGM with gSpan in efficiency, and found that LGM is more efficient than gSpan. This is surprising to us, because LGM detects a much larger number of patterns, including disconnected ones. To compare the two methods on the same basis, we added supplementary edges to help gSpan detect a part of the disconnected patterns. Then, the efficiency difference became even more significant.
2 Preliminaries
Let us first define linear graphs and associated concepts.
Definition 1 (Linear graph). Denote by Σ^V and Σ^E the set of vertex and edge labels, respectively. A labeled and undirected linear graph g = (V, E, L_V, L_E) consists of an ordered vertex set V ⊂ N, an edge set E ⊆ V × V, a vertex labeling L_V : V → Σ^V and an edge labeling L_E : E → Σ^E. Let the size of the linear graph |g| be the number of its edges. Let G denote the set of all possible linear graphs and let θ ∈ G denote the empty graph.
The difference from ordinary graphs is that the vertices are defined as a subset of natural numbers, introducing the total order. Notice that we do not impose connectedness here. The order of edges is defined as follows:
Definition 2 (Total order among edges). ∀e1 = (i, j), e2 = (k, l) ∈ E_g, e1 <_e e2 if and only if i) i < k or ii) i = k, j < l.
Namely, one first compares the indices of the left nodes. If they are identical, the right nodes are compared. The subgraph relationship between two linear graphs is defined as follows.
Definition 3 (Subgraph). Given two linear graphs g1 = (V1, E1, L_V1, L_E1), g2 = (V2, E2, L_V2, L_E2), g1 is a subgraph of g2, g1 ⊆ g2, if and only if there exists an injective mapping m : V1 → V2 such that
1. ∀i ∈ V1 : L_V1(i) = L_V2(m(i)), vertex labels are identical,
2. ∀(i, j) ∈ E1 : (m(i), m(j)) ∈ E2, L_E1(i, j) = L_E2(m(i), m(j)), all edges of g1 exist in g2, and
3. ∀(i, j) ∈ E1 : i < j → m(i) < m(j), the order of vertices is conserved.
The difference from the ordinary subgraph relation is that the vertex order is conserved. Finally, frequent subgraph mining is defined as follows.
Definition 4 (Frequent linear subgraph mining). For a set of linear graphs G = {g1 , · · · , g|G|}, gi ∈ G, a minimum support threshold σ > 0 and a maximum pattern size s > 0, find all g ∈ G such that g is frequent enough in G, i.e., |{i = 1, ..., |G| : g ⊆ gi }| ≥ σ, |g| ≤ s
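The definitions above can be made concrete with the brute-force matcher below. It is our own illustrative sketch, not part of LGM: the data structures, function names, and toy graphs are assumptions, and LGM never enumerates mappings exhaustively this way.

```python
# Sketch: a linear graph as ordered labeled vertices plus labeled edges,
# and an order-preserving subgraph test in the sense of Definition 3.
from itertools import combinations

def is_subgraph(g1, g2):
    """Each graph is (vlabels, edges): vlabels is a tuple of vertex labels
    in vertex order, edges maps (i, j) with i < j to an edge label."""
    v1, e1 = g1
    v2, e2 = g2
    # Try every order-preserving injective mapping of g1's vertices into g2's.
    for mapped in combinations(range(len(v2)), len(v1)):
        if any(v1[i] != v2[mapped[i]] for i in range(len(v1))):
            continue                       # vertex labels must agree
        if all(e2.get((mapped[i], mapped[j])) == lab
               for (i, j), lab in e1.items()):
            return True                    # all pattern edges found with same labels
    return False

pattern = (("A", "B"), {(0, 1): "a"})
graph   = (("A", "B", "A", "B", "C", "A"),
           {(0, 1): "a", (0, 2): "b", (2, 5): "a", (3, 4): "c"})
print(is_subgraph(pattern, graph))   # True: map pattern vertices to 0 and 1
```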
3 Enumeration of Linear Subgraphs
Before addressing the frequent pattern mining problem, let us design an algorithm for enumerating all subgraphs of a linear graph. For simplicity, we do not consider vertex and edge labels in this section, but inclusion of the labels is straightforward.
3.1 Reduction Map
Suppose we would like to enumerate all subgraphs of the linear graph shown at the bottom of Figure 2, left. All linear subgraphs form a natural graph-shaped search space, where one can traverse upwards or downwards by deleting or adding an edge (Figure 2, left). For enumeration, however, one has to choose edges in the search graph to form a search tree (Figure 2, right). Once a search tree is defined, the enumeration can be done either by depth-first or breadth-first traversal. To this aim, we specify a reduction map f : G → G which transforms a child to its parent uniquely. The mapping is chosen such that when it is applied repeatedly, we eventually reduce any graph to an element of the solution set S ⊂ G. Formally, we write ∀x ∈ G : ∃k ≥ 0 : f^k(x) ∈ S. In our case, the reduction map is defined as removing the “largest” edge from the child graph. The largest edge is defined via the total order introduced in Definition 2. By evaluating the mapping repeatedly the graph is shrunk to the empty graph. Thus, here we have S = {θ}. By applying f(g) for all possible g ∈ G, we can induce a search tree with θ ∈ G being the root node, shown in Figure 2, right. A question is whether we can always define a unique search tree for any linear graph. The reverse search theorem [2] says that the proposition is true iff any node in the graph-shaped search space converges to the root node (i.e., the empty graph) by applying the map a finite number of times. For our reduction map, it is true, because each possible linear graph g ∈ G is reduced to the empty graph by successively applying f to g. A characteristic point of reverse search is that the search tree is implicitly defined by the reduction map. In actual traversal, the search tree is created on demand: when a traverser is at a node with graph g and would like to move down, a set of children nodes are generated by extending g. More precisely, one enumerates all linear graphs by inverting the reduction mapping such that the tree is explored from the root node towards the leaves. The inverse mapping f^{-1} generates, for a given linear graph g ∈ G, a set of extended graphs X = {g′ | f(g′) = g}. There are three types of extension patterns according to the number of added nodes in the reduction mapping: (A) no-node-addition, (B) one-node-addition, and (C) two-nodes-addition.
Fig. 2. (Left) Graph-shaped search space. (Right) Search tree induced by the reduction map.
Fig. 3. Example of children patterns. There are three types of extension with respect to the number of nodes: (A) no-node-addition, (B) one-node-addition, (C) two-nodes-addition.
Let us define the largest edge of g as (i, j), i < j. Then the enumeration for case A is done by adding an edge that is larger than (i, j). For case B, a node is inserted at the position after i, and this node is connected to every other node; if the new edge is smaller than (i, j), this extension is canceled. For case C, two nodes are inserted at the position after i; in that case, the two added nodes must be connected by a new edge. All patterns of valid extensions are shown in Figure 3. This example does not include node labels, but for actual applications, node labels need to be enumerated as well.
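As a minimal illustration of the reduction map f, the sketch below (our own code, with labels omitted as in this section) removes the largest edge under the order of Definition 2; applying it repeatedly shrinks any linear graph to the empty graph θ, the root of the search tree.

```python
def reduce_once(edges):
    """Reduction map f: remove the largest edge under the order of Definition 2.
    `edges` is a set of (i, j) pairs with i < j; labels are ignored here."""
    if not edges:
        return edges                  # already the empty graph (the root)
    largest = max(edges)              # tuple comparison == the order of Definition 2
    return edges - {largest}

# Repeated application reaches the empty graph, the root of the search tree.
g = {(1, 2), (1, 4), (2, 3)}
while g:
    g = reduce_once(g)
print(g)   # set()
```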
4 Frequent Pattern Mining
In frequent pattern mining, we employ the same search tree described above, but the occurrences of a pattern in all linear graphs are tracked in an occurrence list L_G(g) [19], defined as follows: L_G(g) = {(i, m) : g_i ∈ G, g ⊆ g_i with node correspondence m}.
Algorithm 1. Linear Graph Miner (LGM)
Input: A set of linear graphs G = {g_1, ..., g_|G|}; minimum support σ ≥ 0; maximum pattern size s ≥ 0
1: function LGM(G, σ, s)              // the main function
2:   Mine(G, θ, σ, s)
3: end function
4: function Mine(G, g, σ, s)
5:   sup ← support(L_G(g))
6:   if sup < σ then                  // check support condition
7:     return
8:   end if
9:   Report occurrence of subgraph g
10:  if |g| = s then                  // check pattern size
11:    return
12:  end if
13:  scan G once using L_G(g) and find all extensions f⁻¹(g)
14:  for each g' ∈ f⁻¹(g) do
15:    Mine(G, g', σ, s)              // call Mine for every extended pattern g'
16:  end for
17: end function
When a pattern g is extended, its occurrence list L_G(g) is updated as well. Based on the occurrence list, the support of each pattern g, i.e., the number of linear graphs that contain the pattern, is calculated. Whenever the support is smaller than the threshold σ, the search tree is pruned at this node. This pruning is possible because of the anti-monotonicity of the support, namely that the support of a graph is never larger than that of its subgraphs. Algorithm 1 describes the recursive algorithm for frequent mining. In line 13, each pattern g is extended to larger graphs g' ∈ f⁻¹(g) by the inverse reduction mapping f⁻¹. The possible extensions f⁻¹(g) of each pattern g are found using the location list L_G(g). The function Mine is recursively called for each extended pattern g' ∈ f⁻¹(g) in line 15. The search tree is pruned in line 7 if the support of the pattern g is smaller than the minimum support threshold σ, or in line 11 if the pattern size |g| equals the maximum pattern size s.
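The recursion of Algorithm 1 can be paraphrased as the following Python skeleton. The helpers support, extensions, and report stand in for the occurrence-list machinery (L_G(g) and f⁻¹); they are assumptions of this sketch, not part of the paper's implementation.

```python
def mine(g, G, sigma, s, support, extensions, report):
    """Reverse-search skeleton of Algorithm 1 (LGM).
    g: current pattern; G: graph database; sigma: minimum support; s: max pattern size.
    `support`, `extensions`, `report` are caller-supplied placeholder callables."""
    if support(g, G) < sigma:        # anti-monotone pruning (lines 6-8)
        return
    report(g)                        # line 9
    if len(g) == s:                  # maximum pattern size reached (lines 10-12)
        return
    for child in extensions(g, G):   # children g' with f(g') = g (line 13)
        mine(child, G, sigma, s, support, extensions, report)
```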
5 Complexity Analysis
The computational time of frequent pattern mining depends on the minimum support and maximum pattern size thresholds [19]. It also depends on the "density" of the database: if all graphs are almost identical (i.e., a dense database), the mining would take a prohibitive amount of time. Thus, conventional worst-case analysis is not well suited to mining algorithms. Instead, the delay, i.e., the interval of time
between two consecutive solutions, is often used to describe the complexity. General graph mining algorithms, including gSpan, are exponential-delay algorithms, i.e., the delay is exponential in the size of the patterns [19]. The delay of our algorithm is only polynomial, because no duplication checks are necessary thanks to the vertex order.

Theorem 1 (Polynomial delay). For N linear graphs G, a minimum support σ > 0, and a maximum pattern size s > 0, the time between two successive calls to Report in line 9 is bounded by a polynomial in the size of the input data.

Proof. Let M := max_i |V_{g_i}| and F := max_i |E_{g_i}|. The number of matching locations in the linear graphs G can only decrease when g is enlarged, because only the largest edge is added. Considering the number of variations, it is easy to see that the location list always satisfies |L_G(g)| ≤ M²N. Therefore, the set of extensions f⁻¹(g) can be produced in O(M²N) time, because the procedure in line 13 scans the location list. The time between two successive calls to Report can now be bounded by considering two cases after Report has been called once.
– Case 1. There is an extension g' fulfilling the minimum support condition, or the size of g is s. Then Report is called within O(M²N) time.
– Case 2. There is no extension g' fulfilling the minimum support condition. Then no recursion happens, and Mine returns in O(M²N) time to its parent node in the search tree. The maximum number of times this can happen successively is bounded by the depth of the reverse search tree, which is bounded by O(F), because each level of the search tree adds one edge. Therefore, in O(M²NF) time the algorithm either calls Report again or finishes.
Thus, the total time between two successive calls to Report is bounded by O(M²NF).
6 Experiments
We performed a motif extraction experiment from protein 3D structures. Frequent and characteristic patterns are often called "motifs" in molecular biology, and we adopt that terminology here. All experiments were performed on a Linux machine with an AMD Opteron processor (2 GHz and 4GB RAM).
Fig. 4. Examples of gap linear graphs: a 1-gap linear graph (left) and a 2-gap linear graph (right). Edges corresponding to gaps are drawn as bold lines.
Fig. 5. Execution time for the protein data. The line labeled gSpan+g1 shows the execution time of gSpan on the 1-gap linear graph dataset. gSpan does not work on the 2-gap linear graph dataset even when the minimum support threshold is 50.
6.1 Motif Extraction from Protein 3D Structures
We adopted the dataset of Glyakina et al. [7], which consists of pairs of homologous proteins: one protein of each pair is derived from a thermophilic organism and the other from a mesophilic organism. This dataset was constructed for understanding the structural properties of proteins which are responsible for the higher thermostability of proteins from thermophilic organisms compared to those from mesophilic organisms. In constructing a linear graph from a 3D structure, each amino acid is represented as a vertex. Vertex labels are chosen from {1, ..., 6}, which represent the following six classes: aliphatic {AVLIMC}, aromatic {FWYH}, polar {STNQ}, positive {KR}, negative {DE}, and special (reflecting their special conformational properties) {GP} [10]. An edge is drawn between each pair of amino acid residues whose distance is within 5 angstroms. No edge labels are assigned. In total, 754 graphs were made. The average numbers of vertices and edges are 371 and 498, respectively, and the number of labels is 6. To detect the motifs characterizing the difference between the two kinds of organisms, we take the following two-step approach. First, we employ LGM to find frequent patterns from all proteins of both kinds of organisms; in this setting, we did not use the (C-6) patterns in Figure 3. Second, the patterns significantly associated with the organism difference are selected via statistical tests. We assess the execution time of our algorithm in comparison with gSpan. The linear graphs from 3D protein structures are not always connected, and gSpan cannot be applied to such disconnected graphs. Hence, we made two kinds of gapped linear graphs: 1-gap linear graphs and 2-gap linear graphs. A 1-gap linear graph is a linear graph in which contiguous vertices in the protein sequence are connected by an edge; a 2-gap linear graph is a 1-gap linear graph in which, additionally, vertices that are two apart in the protein sequence are connected by an edge (Figure 4). We ran gSpan on two datasets: one consisting of 1-gap linear graphs and the other of 2-gap linear graphs. We ran LGM on the original linear graphs. We set the maximum execution time to 12 hours for both programs. Figure 5 shows the execution times obtained by varying the minimum support threshold. gSpan does not work on the 2-gap linear graph dataset even when the minimum support threshold is set to 50.
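As a rough illustration of the graph construction just described, the sketch below maps residues to the six classes and connects residues whose coordinates are within 5 angstroms. The input format (one-letter residue codes plus coordinates, e.g. C-alpha positions) and all names are assumptions of this sketch, not details taken from the paper.

```python
import numpy as np

# Six amino acid classes used as vertex labels (Section 6.1).
CLASS = {**{a: 1 for a in 'AVLIMC'},   # aliphatic
         **{a: 2 for a in 'FWYH'},     # aromatic
         **{a: 3 for a in 'STNQ'},     # polar
         **{a: 4 for a in 'KR'},       # positive
         **{a: 5 for a in 'DE'},       # negative
         **{a: 6 for a in 'GP'}}       # special

def protein_to_linear_graph(residues, coords, cutoff=5.0):
    """Vertices = residues (labeled by the six classes above); an edge joins every
    pair of residues whose distance is within `cutoff` angstroms. The kind of
    coordinates used (e.g. C-alpha atoms) is an assumption of this sketch."""
    coords = np.asarray(coords, dtype=float)
    labels = {i + 1: CLASS[r] for i, r in enumerate(residues)}
    edges = set()
    n = len(residues)
    for i in range(n):
        for j in range(i + 1, n):
            if np.linalg.norm(coords[i] - coords[j]) <= cutoff:
                edges.add((i + 1, j + 1))
    return labels, edges
```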
[Figure 6 panels: significant subgraphs in the human pol II promoter protein and in the TATA-binding protein, drawn over amino acid residue positions and annotated with node labels and p-values.]
Fig. 6. Significant subgraphs detected by LGM. The p-value calculated by Fisher's exact test is attached to each linear graph. The node labels 1, 2, 3, 4 and 5 represent the aliphatic, aromatic, polar, positive and negative amino acid classes, respectively.
Fig. 7. 3D structures of the TATA-binding protein (left) and the human pol II promoter protein (right). The spheres represent the amino acid residues corresponding to vertices forming the subgraphs in Figure 6.
Our algorithm is faster than gSpan on the 1-gap linear graph dataset, and its execution time is reasonable. Next, we assess the motif extraction ability of our algorithm. To choose significant subgraphs from the enumerated subgraphs, we use Fisher's exact test. In this case, a significant subgraph should distinguish thermophilic proteins from mesophilic proteins. Thus, for each frequent subgraph, we count the number of proteins containing the subgraph among the thermophilic and the mesophilic proteins, and generate a 2×2 contingency table, which includes the number of thermophilic proteins that contain subgraph g, n_TP; the number of thermophilic proteins that do not contain g, n_FP; the number of mesophilic proteins
that do not contain g, n_FN; and the number of mesophilic proteins that contain g, n_TN. The probability of the contingency table under independence is calculated as follows:

    Pr = \binom{n_g}{n_{TP}} \binom{n_{\bar{g}}}{n_{FN}} \bigg/ \binom{n}{n_P} = \frac{n_g!\, n_{\bar{g}}!\, n_P!\, n_N!}{n!\, n_{TP}!\, n_{FP}!\, n_{FN}!\, n_{TN}!},

where n_P is the number of thermophilic proteins, n_N the number of mesophilic proteins, n_g the number of proteins with subgraph g, and n_{\bar{g}} the number of proteins without subgraph g. The p-value of the two-sided Fisher's exact test on a table is computed as the sum of the probabilities of all tables that are as extreme as or more extreme than this table. We ranked the frequent subgraphs according to their p-values and obtained 103 subgraphs whose p-values are no more than 0.001. Here, we focused on a pair of proteins, the TATA-binding protein and the human pol II promoter protein, where the TATA-binding protein is derived from a thermophilic organism and the human pol II promoter protein from a mesophilic organism. The reason we chose these two proteins is that they include a large number of statistically significant motifs which are mutually exclusive between the two organisms. The two proteins share the same function as DNA-binding proteins, but their thermostabilities differ. Figure 6 shows the three most significant subgraphs. Figure 7 shows the 3D structures of the TATA-binding protein (left) and the human pol II promoter protein (right); the amino acid residues forming the top-3 subgraphs are represented by spheres.
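For reference, the two-sided Fisher's exact test on such a 2×2 table can be reproduced with a standard statistics library; the counts below are made-up placeholders rather than values from the experiment.

```python
from scipy.stats import fisher_exact

# Placeholder counts for one subgraph g: rows = contains g / does not contain g,
# columns = thermophilic / mesophilic proteins.
n_TP, n_TN = 40, 10    # proteins that contain g
n_FP, n_FN = 20, 50    # proteins that do not contain g
table = [[n_TP, n_TN],
         [n_FP, n_FN]]

odds_ratio, p_value = fisher_exact(table, alternative='two-sided')
print(p_value)   # two-sided p-value used to rank candidate motifs
```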
7 Conclusion
We proposed an efficient algorithm for mining frequent subgraphs from linear graphs. A key point is that the vertices of a linear graph are totally ordered, and we designed a fast enumeration algorithm for linear graphs based on this property. For an efficient enumeration without duplication, we defined a search tree based on the reverse search technique. Unlike gSpan, our algorithm enumerates frequent subgraphs, including disconnected ones, by traversing this search tree. Many kinds of data that can be represented as linear graphs, such as protein 3D structures and alternative splicing forms, include disconnected subgraphs as important patterns. The computational time of our algorithm is polynomial-delay. We performed a motif extraction experiment on a protein 3D-structure dataset from molecular biology. In the experiment, our algorithm could extract important subgraphs as frequent patterns. By comparing our algorithm to gSpan with respect to execution time, we showed that our algorithm is fast enough for real-world datasets. Data that can be represented as linear graphs occur in many fields, for instance bioinformatics and natural language processing. Our mining algorithm for linear graphs provides a new way to analyze such data.
Acknowledgements. This work was partly supported by a research fellowship from JSPS for young scientists, MEXT Kakenhi 21680025, and the FIRST program. We would like to thank M. Gromiha for providing the protein 3D-structure dataset, and T. Uno and H. Kashima for fruitful discussions.
References
1. Abe, K., Kawasoe, S., Asai, T., Arimura, H., Arikawa, S.: Optimized substructure discovery for semi-structured data. In: Elomaa, T., Mannila, H., Toivonen, H. (eds.) PKDD 2002. LNCS (LNAI), vol. 2431, pp. 1–14. Springer, Heidelberg (2002)
2. Avis, D., Fukuda, K.: Reverse search for enumeration. Discrete Appl. Math. 65, 21–46 (1996)
3. Davydov, E., Batzoglou, S.: A computational model for RNA multiple sequence alignment. Theoretical Computer Science 368, 205–216 (2006)
4. Eichinger, F., Böhm, K., Huber, M.: Mining edge-weighted call graphs to localise software bugs. In: Daelemans, W., Goethals, B., Morik, K. (eds.) ECML PKDD 2008, Part I. LNCS (LNAI), vol. 5211, pp. 333–348. Springer, Heidelberg (2008)
5. Fertin, G., Hermelin, D., Rizzi, R., Vialette, S.: Common structured patterns in linear graphs: Approximation and combinatorics. In: Ma, B., Zhang, K. (eds.) CPM 2007. LNCS, vol. 4580, pp. 241–252. Springer, Heidelberg (2007)
6. Georgii, E., Dietmann, S., Uno, T., Pagel, P., Tsuda, K.: Enumeration of condition-dependent dense modules in protein interaction networks. Bioinformatics 25(7), 933–940 (2009)
7. Glyakina, A.V., Garbuzynskiy, S.O., Lobanov, M.Y., Galzitskaya, O.V.: Different packing of external residues can explain differences in the thermostability of proteins from thermophilic and mesophilic organisms. Bioinformatics 23, 2231–2238 (2007)
8. Inokuchi, A., Washio, T., Motoda, H.: An apriori-based algorithm for mining frequent substructures from graph data. In: Zighed, D.A., Komorowski, J., Żytkow, J.M. (eds.) PKDD 2000. LNCS (LNAI), vol. 1910, pp. 13–23. Springer, Heidelberg (2000)
9. Kuramochi, M., Karypis, G.: Frequent subgraph discovery. In: Proceedings of the 2001 IEEE International Conference on Data Mining (ICDM 2001), pp. 313–320 (2001)
10. Mirny, L.A., Shakhnovich, E.I.: Universally conserved positions in protein folds: Reading evolutionary signals about stability, folding kinetics and function. Journal of Molecular Biology 291, 177–196 (1999)
11. Miyao, Y., Sætre, R., Sagae, K., Matsuzaki, T., Tsujii, J.: Task-oriented evaluation of syntactic parsers and their representations. In: 46th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 46–54 (2008)
12. Nowozin, S., Tsuda, K.: Frequent subgraph retrieval in geometric graph databases. In: Perner, P. (ed.) ICDM 2008. LNCS (LNAI), vol. 5077, pp. 953–958. Springer, Heidelberg (2008)
13. Nowozin, S., Tsuda, K., Uno, T., Kudo, T., Bakir, G.: Weighted substructure mining for image analysis. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR). IEEE Computer Society, Los Alamitos (2007)
14. Pei, J., Han, J., Mortazavi-Asl, B., Wang, J., Pinto, H., Chen, Q., Dayal, U., Hsu, M.: Mining sequential patterns by pattern-growth: The PrefixSpan approach. IEEE Transactions on Knowledge and Data Engineering 16(11), 1424–1440 (2004)
15. Saigo, H., Nowozin, S., Kadowaki, T., Kudo, T., Tsuda, K.: gBoost: a mathematical programming approach to graph classification and regression. Machine Learning 75, 69–89 (2008)
16. Uno, T., Kiyomi, M., Arimura, H.: LCM ver.3: collaboration of array, bitmap and prefix tree for frequent itemset mining. In: Proceedings of the 1st International Workshop on Open Source Data Mining: Frequent Pattern Mining Implementations, pp. 77–86 (2005)
17. Wale, N., Karypis, G.: Comparison of descriptor spaces for chemical compound retrieval and classification. In: Proceedings of the 2006 IEEE International Conference on Data Mining, pp. 678–689 (2006)
18. Yan, X., Cheng, H., Han, J., Yu, P.S.: Mining significant graph patterns by leap search. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 433–444 (2008)
19. Yan, X., Han, J.: gSpan: Graph-based substructure pattern mining. In: Proceedings of the 2002 IEEE International Conference on Data Mining (ICDM 2002), pp. 721–724 (2002)
20. Yan, X., Han, J.: CloseGraph: mining closed frequent graph patterns. In: Proceedings of 2003 International Conference on Knowledge Discovery and Data Mining (SIGKDD 2003), pp. 286–295 (2003)
Efficient Centrality Monitoring for Time-Evolving Graphs
Yasuhiro Fujiwara¹, Makoto Onizuka¹, and Masaru Kitsuregawa²
¹ NTT Cyber Space Laboratories, Japan {fujiwara.yasuhiro,onizuka.makoto}@lab.ntt.co.jp
² The University of Tokyo, Japan [email protected]
Abstract. The goal of this work is to identify the nodes that have the smallest sum of distances to other nodes (the lowest closeness centrality nodes) in graphs that evolve over time. Previous approaches to this problem find the lowest centrality nodes efficiently at the expense of exactness. The main motivation of this paper is to answer, in the affirmative, the question, ‘Is it possible to improve the search time without sacrificing the exactness?’. Our solution is Sniper, a fast search method for time-evolving graphs. Sniper is based on two ideas: (1) It computes approximate centrality by reducing the original graph size while guaranteeing the answer exactness, and (2) It terminates unnecessary distance computations early when pruning unlikely nodes. The experimental results show that Sniper can find the lowest centrality nodes significantly faster than the previous approaches while it guarantees answer exactness. Keywords: Centrality, Graph mining, Time-evolving.
1 Introduction
In graph theory, the facility location problem is quite important since it involves finding good locations for one or more facilities in a given environment. Solving this problem starts by finding the nodes whose total distance to the other nodes is the shortest in the graph, since the cost of reaching all other nodes from these nodes is expected to be low. In graph analysis, the centrality based on this concept is closeness centrality. In this paper, the closeness centrality of node u, C_u, is defined as the sum of the distances from the node to all other nodes. The naive approach, the exact computation of centrality, is impractical; it needs the distances of all node pairs. This led to the introduction of approximate approaches, such as the annotation approach [13] and the embedding approach [12,11], to estimate centralities. These approaches have the advantage of speed at the expense of exactness. However, approximate algorithms are not adopted by many practitioners. This is because the optimality of the solution is not guaranteed; it is hard for approximate algorithms to identify the lowest centrality node exactly. Furthermore, the focus of traditional graph theory has been limited to just 'static' graphs; the implicit assumption is that nodes and edges never
change. Recent years have witnessed a dramatic increase in the availability of graph datasets that comprise many thousands and sometimes even millions of time-evolving nodes, a consequence of the widespread availability of electronic databases and the Internet. Recent studies on large-scale graphs are discovering several important principles of time-evolving graphs [10,8]. Thus, demands for the efficient analysis of time-evolving graphs are increasing. We address the following problem in this paper:

Given: a graph G[t] = (V[t], E[t]) at time t, where V[t] is the set of nodes and E[t] is the set of edges at time t.
Find: the nodes that have the lowest closeness centrality in graph G[t].

We propose a novel method called Sniper that can efficiently identify the lowest centrality nodes in time-evolving graphs. To the best of our knowledge, our approach is the first solution to achieve both exactness and efficiency at the same time in identifying the lowest centrality nodes in time-evolving graphs.
1.1 Problem Motivation
The problem tackled in this paper must be overcome to develop the following important applications. Networks of interaction have been studied for a long time by social science researchers, where nodes correspond to people or organizations and edges represent some type of social interaction. The question of 'which is the most important node in a network?' is being avidly pursued by scientific researchers. An important example is the network obtained by considering scientific publications. Nodes in this case are researchers, papers, books, or entire journals, and edges correspond to co-authorship or citations. This kind of network generally grows very rapidly over time. For example, the collaboration network of scientists in the database area contains several tens of thousands of authors, and its rate of growth is increasing year by year; there are several thousand new authors each year [5]. The systematic construction of such networks was introduced by Garfield, who later proposed a measure of standing for journals that is still in use. This measure, called the impact factor, is defined as the number of citations per published item [6]. Basically, the impact factor is a very simple measure, since it corresponds to the degree in the citation network. However, the degree is a local measure, because its value is determined only by the number of adjacent nodes. That is, if a high-degree node lies in an isolated community of the network, the influence of the node is very limited. Closeness centrality is a global centrality measure, since it is computed by summing the distances to all other nodes in a graph. Therefore, it is an effective measure of influence on other nodes. The most influential node can be effectively detected as the lowest closeness centrality node by monitoring time-evolving graphs. Nascimento et al. analyzed SIGMOD's co-authorship graph [9]. They successfully discovered that L. A. Rowe, M. Stonebraker, and M. J. Carey were
the most influential researchers from 1986 to 1988, 1989 to 1992, and 1993 to 2002, respectively. All three are very famous and important researchers in the database community. The remainder of this paper is organized as follows. Section 2 describes related work. Section 3 overviews some of the background of this work. Section 4 introduces the main ideas of Sniper. Section 5 discusses some of the topics related to Sniper. Section 6 gives theoretical analyses of Sniper. Section 7 reviews the results of our experiments. Section 8 provides our brief conclusion.
2 Related Work
Many papers have been published on approximations of node-to-node distances. The previous distance approximation schemes fall into two types: annotation schemes and embedding schemes. Rattigan et al. studied two annotation schemes [13]. They randomly select nodes in a graph and divide the graph into regions that are connected, mutually exclusive, and collectively exhaustive. They give a set of annotations to every node from the regions, and distances are computed from the annotations. They demonstrated that their method can compute node distances more accurately than the embedding approaches. However, this method can require O(n²) space and O(n³) time to estimate the lowest centrality nodes, as described in their paper. The Landmark technique is an embedding approach [7,12], and estimates node-to-node distances from selected nodes in O(n) time. The minimum distance via a landmark node is utilized as the node distance in this method. Another embedding technique is Global Network Positioning, which was studied by Ng et al. [11]. Node distances are estimated from the Lp norm between node pairs. These embedding techniques require O(n²) space, since all n nodes hold distances to O(n) selected nodes. Moreover, they require O(n³) time to identify the lowest centrality node. This is because they take O(n) time to estimate a node-pair distance and need the distances of n² node pairs to compute the centralities of all nodes.
3 Preliminary
In this section, we introduce the background to this paper. Social networks and other datasets can be described as a graph G = (V, E), where V is the set of nodes and E is the set of edges. We use n and m to denote the number of nodes and edges, respectively; that is, n = |V| and m = |E|. A path from node u to v is a sequence of nodes linked by edges, beginning with node u and ending at node v. A path from node u to v is a shortest path if and only if the number of nodes in the path is the smallest possible among all paths from node u to v. The distance between nodes u and v, d(u, v), is the number of edges in the shortest path connecting them in the graph. Therefore, d(u, u) = 0 for every u ∈ V, and d(u, v) = d(v, u) for u, v ∈ V. The closeness centrality of node u, C_u, is the sum of the distances from the node to every other node, computed as C_u = Σ_{v∈V} d(u, v).
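As a concrete baseline, the exact closeness centrality C_u of the definition above can be computed with one BFS per node; the sketch below is our own illustration and assumes an unweighted, connected graph stored as an adjacency-list dictionary.

```python
from collections import deque

def closeness_centrality(adj, u):
    """C_u = sum of BFS distances from u to every other node.
    `adj` maps each node to a list of neighbors; the graph is assumed connected."""
    dist = {u: 0}
    queue = deque([u])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return sum(dist.values())

# Example: on the path 1-2-3-4, nodes 2 and 3 have the lowest closeness centrality.
adj = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(min(adj, key=lambda u: closeness_centrality(adj, u)))  # one lowest-centrality node
```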
4 Centrality Monitoring
In this section, we explain the two main ideas underlying Sniper. The main advantage of Sniper is that it exactly and efficiently identifies the lowest closeness centrality nodes in time-evolving graphs. While we focus on undirected and unweighted graphs in this section, our approach can be applied to weighted or directed graphs as described in Section 5.1. Moreover, we can handle range queries (find the nodes whose centralities are less than a given threshold) and K-best queries (find the K lowest centrality nodes) as described in Section 5.2. For ease of explanation, we assume that no two nodes have exactly the same centrality value and that one node is added to a time-evolving graph at each time tick; these assumptions can be eliminated easily. All proofs in this section are omitted due to space limitations.
4.1 Ideas Behind Sniper
Our solution is based on the two ideas described below.

Node aggregation. We introduce approximations to reduce the high cost of the existing approaches. Instead of computing the exact centrality of every node, we approximate the centrality and use the result to efficiently prune high-centrality nodes. For a given graph with n nodes and m edges, we create an approximate graph of n' nodes and m' edges (n' < n, m' < m) by aggregating 'similar' nodes in the graph. For the approximate graph, O(n' + m') time is required for Sniper to compute the approximate centralities, while the existing approximate algorithms require O(n²) time as described in Section 2. We exploit the Jaccard coefficient to find similar nodes, and then aggregate the original nodes to create node groups. We refer to such groupings as aggregate nodes. This new idea has the following two major advantages. First, we can find the answer node exactly; the node that has the lowest centrality is never missed by this approach. This is because our approximate graphs guarantee lower bounding distances. This means that we can safely discard unpromising nodes at low CPU cost. The second advantage is that this idea reduces the number of nodes that must be processed to compute centralities, as well as the computation cost for each node. That is, we can identify the lowest centrality node among a large number of nodes efficiently.

Tree estimation. Although our approximation technique is able to discard most of the unlikely nodes, we still rely on exact centrality computation to guarantee the correctness of the search results. Here we focus on reducing the cost of this computation. To compute the exact centrality of a node, the distances to all other nodes from the node have to be computed by breadth-first search (BFS). But clearly the exhaustive exploration of nodes in a graph is not computationally feasible, especially for large graphs. Our proposal exploits the following idea: if a node cannot
be the lowest centrality node, we terminate subsequent distance computations as unnecessary. Our search algorithm first holds a candidate node, which is expected to have low centrality. We then estimate the distances of unexplored nodes in the distance computation from a single BFS-tree to obtain the lower centrality bound. In the search process, if the lower centrality bound of a node is larger than the exact centrality of the candidate node, the node cannot be the lowest centrality node in the original graph. Accordingly, unnecessary distance computations can be terminated early.
4.2 Node Aggregation
Our first idea involves aggregating nodes of the original graph, which enables us to compute the lower centrality bound and thus realize reliable node pruning.

Graph Approximation. We reduce the original graph size in order to compute approximate centralities at low computation cost. To realize an efficient search, given the original graph G with n nodes and m edges, we compute the n' nodes and m' edges of the approximate graph G'. That is, the original graph G = (V, E) is collapsed to yield the approximate graph G' = (V', E'). We first describe how to compute the edges of the approximate graph, and then show our approach to aggregating the original nodes. For aggregate nodes u' and v', there is an edge {u', v'} ∈ E' if and only if there is at least one edge between the original nodes aggregated in u' and v'. This definition is important in computing the lower centrality bound. Formally, we obtain the edges between aggregate nodes u' and v' as follows:

Definition 1 (Node aggregation). In the approximate graph G', nodes u' and v' have an edge if and only if:

    (1) u' ≠ v', and (2) ∃{u, v} ∈ E s.t. u ∈ u' and v ∈ v',     (1)

where u ∈ u' indicates that aggregate node u' contains original node u.

To reduce the approximation error, we aggregate similar nodes. As described above, the aggregate nodes share an edge if and only if there is at least one edge between the original nodes that have been aggregated. Therefore, the approximation error decreases as the number of neighbors shared by the aggregated nodes increases. For this reason, we utilize the Jaccard coefficient, since it is a simple and natural measure of similarity between sets [4]. Let N_u and N_v be the sets of neighbors (adjacent nodes) of nodes u and v, respectively; the Jaccard coefficient is defined as |N_u ∩ N_v| / |N_u ∪ N_v|, i.e., the size of the intersection of the sets divided by the size of their union. We aggregate nodes u and v if the most similar node of u is node v; this yields good approximations. Note that we do not aggregate nodes u and v if the size of their intersection is less than one half the size of their union, to avoid aggregating dissimilar nodes.
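A minimal sketch of the similarity computation just described: the Jaccard coefficient of the neighbor sets and the rule that a node is only aggregated with its most similar node when their intersection is at least half their union. This is our own illustration, not the authors' code.

```python
def jaccard(adj, u, v):
    """Jaccard coefficient of the neighbor sets of u and v."""
    Nu, Nv = set(adj[u]), set(adj[v])
    union = Nu | Nv
    return len(Nu & Nv) / len(union) if union else 0.0

def most_similar(adj, u):
    """Most similar node of u; returns None if the best similarity is below 1/2,
    so that dissimilar nodes are never aggregated."""
    best, best_sim = None, 0.0
    for v in adj:
        if v == u:
            continue
        sim = jaccard(adj, u, v)
        if sim > best_sim:
            best, best_sim = v, sim
    return best if best_sim >= 0.5 else None
```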
If one node is added to a time-evolving graph, we compute its most similar node to update the approximate graph. The naive approach to computing the most similar node of the added node is to compute the similarities to all nodes. We, on the other hand, utilize the following lemma to efficiently update the most similar node:

Lemma 1 (Update of the most similar nodes). For the added node, the most similar node is at most two hops apart.

By using the above lemma, we first obtain the nodes which are one and two hops away from the added node in the search process. We then compute the similarities for the added node, update its most similar node, and link the aggregate nodes according to Definition 1. Even though we assume that a single node is added to the time-evolving graph at each time tick, Lemma 1 can also be applied to the case of single node deletion; that is, we can efficiently update the most similar node with Lemma 1 for node deletion as well. We iterate the above procedure for each node if several nodes are added. If one edge is added or deleted, we delete one of the connected nodes and then add it back.

Lower Bounding Centrality. Given an approximate graph, we compute the approximate centrality of node u' as follows:

Definition 2 (Approximate closeness centrality). For the approximate graph, the approximate closeness centrality of node u', C'_{u'}, is computed as

    C'_{u'} = \sum_{v' \in V'} d(u', v') \cdot |v'|,     (2)

where d(u', v') is the node distance in the approximate graph (i.e., the number of hops from node u' to v') and |v'| is the number of original nodes aggregated within node v'. We can provide the following lemma about the centrality approximation:

Lemma 2 (Approximate closeness centrality). For any node in the approximate graph, C'_{u'} ≤ C_u holds.

Lemma 2 provides Sniper with the property of finding the exact answer, as described in Section 6.
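The lower bound of Definition 2 is cheap to evaluate on the aggregate graph; the sketch below (our own code, with the aggregate graph given as an adjacency list plus a map of aggregate-node sizes |v'|) computes it with a single BFS.

```python
from collections import deque

def approximate_centrality(adj_agg, sizes, u_agg):
    """Lower bound C'_{u'} of Definition 2: sum over aggregate nodes v' of
    d(u', v') * |v'|, where d is the hop distance in the approximate graph.
    `adj_agg` is the aggregate graph's adjacency list; `sizes[v']` = |v'|."""
    dist = {u_agg: 0}
    queue = deque([u_agg])
    while queue:
        x = queue.popleft()
        for y in adj_agg[x]:
            if y not in dist:
                dist[y] = dist[x] + 1
                queue.append(y)
    return sum(d * sizes[v] for v, d in dist.items())
```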
4.3 Tree Estimation
We introduce an algorithm for computing original centralities efficiently. We terminate subsequent distance computations from a node if the estimate centrality of the node is larger than the exact centrality of the candidate node. In this approach, we compute lower bounding distances of unexplored nodes via BFS to estimate the lower centrality bound of a node. Estimations are obtained from a single BFS-tree.
Notation. We first give some notation for the estimation. In the search process, we construct the BFS-tree rooted at a selected node. As a result, the selected node forms layer 0, the direct neighbors of the node form layer 1, and all nodes that are i hops apart from the selected node form layer i. We later describe our approach to selecting the node. Next, we check by BFS whether the exact centralities of other nodes in the tree are lower than the exact centrality of the candidate node. We define the set of nodes explored by BFS as V_ex, and the set of unexplored nodes as V_un (= V − V_ex). d_max(u) is the maximum distance of an explored node from node u, that is, d_max(u) = max{d(u, v) : v ∈ V_ex}. Moreover, we define a layer of the tree as explored, L_ex, if and only if there exists at least one explored node in the layer. Similarly, we define a layer as unexplored, L_un, if and only if there exists no explored node in the layer. The layer number of node u is denoted by l(u).

Centrality Estimation. We now define how to estimate the centrality of a node. We estimate the closeness centrality of node u via BFS as follows:

Definition 3 (Estimate closeness centrality). For the original graph, we define the following centrality estimate of node u, \hat{C}_u, to terminate distance computations in BFS:

    \hat{C}_u = \sum_{v \in V_{ex}} d(u, v) + \sum_{v \in V_{un}} e(u, v),     (3)

    e(u, v) = d_{max}(u)                                          if v ∈ V_un ∩ L_ex,
    e(u, v) = d_{max}(u) + \min_{w \in L_{ex}} |l(v) − l(w)| − 1   if v ∈ L_un.
The estimate is the same as the exact centrality if all nodes are explored (i.e., V_ex = V) in Equation (3). To show the property of the centrality estimate, we introduce the following lemma:

Lemma 3 (Estimate closeness centrality). For the original graph, \hat{C}_u ≤ C_u holds during BFS.

This property enables Sniper to identify the lowest centrality node exactly. The selection of the root node of the tree is important for efficient pruning. We select the lowest centrality node of the previous time tick as the root node. There are two reasons for this choice. The first is that this node and nearby nodes are expected to have the lowest centrality values, and thus are likely to be the answer node after node addition; in the case of time-evolving graphs, small numbers of nodes are continually being added to a large number of already existing nodes, so there is little difference between the graphs before and after node addition. The second reason is that we can more accurately estimate the centrality value of a node if the node is close to the root node, because our estimation scheme is based on distances from the root node.
Algorithm 1. Sniper
Input: G[t] = (V, E), a time-evolving graph at time t; u_add, the node added at time t; u_low[t−1], the previous lowest centrality node
Output: u_low[t], the lowest centrality node
1: // Update the approximate graph
2: update the approximate graph by the update algorithm;
3: // Search for the lowest centrality node
4: V_exact ← empty set;
5: compute the BFS-tree of node u_low[t−1];
6: compute θ, the exact centrality of node u_low[t−1];
7: for each node v' ∈ V' do
8:   compute C'_{v'} in the approximate graph by the estimation algorithm;
9:   if C'_{v'} ≤ θ then
10:    for each node v ∈ v' do
11:      append node v → V_exact;
12:    end for
13:  end if
14: end for
15: for each node v ∈ V_exact do
16:   compute C_v in the original graph by the estimation algorithm;
17:   if C_v < θ then
18:     θ ← C_v;
19:     u_low[t] ← v;
20:   end if
21: end for
22: return u_low[t];
4.4 Search Algorithm
Our main approach to finding the lowest centrality node is to prune unlikely nodes by using our approximation, and then to confirm by exact centrality computations whether the viable nodes are the answer. An important question, however, is which node should be selected as the candidate in time-evolving graphs. We select the previous lowest centrality node as the candidate. This node is likely to have the lowest centrality, as described in Section 4.3. After we construct the BFS-tree, the exact centrality of the candidate node can be obtained directly with this approach. Algorithm 1 shows the search algorithm that targets the lowest closeness centrality node. In this algorithm, u_low[t], u_low[t−1] and u_add indicate the lowest centrality node, the previous lowest centrality node, and the added node, respectively. V_exact represents the set of nodes for which we compute exact centralities. The algorithm can be divided into two phases: update and search. In the update phase, Sniper computes the approximate graph by the update algorithm (line 2). In the search phase, Sniper first computes the BFS-tree of the answer node of the last time tick (line 5) and θ (line 6). If the approximate centrality of a node is larger than θ, we prune the node since it cannot be the lowest centrality
node. Otherwise, Sniper appends aggregated original nodes to Vexact (lines 9-13), and then computes exact centralities to identify the lowest centrality node (lines 15-21).
5 Extension
In this section, we discuss possible extensions to Sniper.
5.1 Directed or Weighted Graphs
We focus on undirected and unweighted graphs in this paper, but Sniper can also handle directed or weighted graphs effectively. As described in Section 4.2, approximate graphs have an edge if and only if there is at least one edge between the aggregated nodes in an undirected and unweighted graph. However, we must modify how approximate graphs are constructed if we are to handle other kinds of graphs. For directed graphs, we apply Definition 1 to each direction to handle the directed edges of approximate graphs. For weighted graphs, we choose the lowest value among the weights of the original edges as the weight of the aggregated edge, in order to compute a lower bound of the exact centralities. To estimate centrality values, we can directly apply Definition 2; for weighted graphs, however, the distance estimation needs a small modification: we estimate the distance from node u to v as d_max(u) + min{ω(v, w) : w ∈ V \ {v}}, where ω(v, w) is the weight of edge {v, w}.
5.2 Other Types of Queries
Although the search algorithm described here identifies the node that has the lowest centrality, the proposed approach can be applied to range queries and K-best queries. Range queries find the nodes whose centralities are less than a given threshold, while K-best queries find the K lowest centrality nodes. For range queries, we utilize the given search threshold as θ, instead of the exact centrality of the previous time tick (i.e., we do not use the candidate). We compute approximate centralities of all nodes and prune unlikely nodes using the given θ; we confirm the answer nodes by calculating exact centralities. For K-best queries, we first compute the exact centralities at time t of all K answer nodes in the last time tick. Next, we select the K-th lowest exact centrality as θ. Subsequent procedures are the same as for the case of identifying the lowest centrality node.
6 Theoretical Analysis
This section provides theoretical analyses that confirm the accuracy and complexity of Sniper. Let n be the number of nodes and m the number of edges. We prove that Sniper finds the lowest centrality node accurately (without fail) as follows:
Theorem 1 (Find the lowest centrality node). Sniper guarantees the exact answer when identifying the node whose centrality is the lowest.

Proof. Let u_low be the lowest centrality node in the original graph, and θ_low be the exact centrality of u_low (i.e., θ_low is the lowest centrality). Also let θ be the candidate centrality in the search process. In the approximate graph, since θ_low ≤ θ, the approximate centrality of node u_low is never greater than θ (Lemma 2). Similarly, in the original graph, the estimated centrality of node u_low is never greater than θ (Lemma 3). The algorithm discards a node if (and only if) its approximate or estimated centrality is greater than θ. Therefore, the lowest centrality node u_low can never be pruned during the search process.

We now turn to the complexity of Sniper. Note that the previous approaches need O(n²) space and O(n³) time to compute the lowest centrality node.

Theorem 2 (Complexity of Sniper). Sniper requires O(n + m) space and O(n² + nm) time to compute the lowest centrality node.

Proof. We first prove that Sniper requires O(n + m) space. Sniper keeps the approximate graph and the original graph. In the approximate graph, since the numbers of nodes and edges are at most n and m, respectively, Sniper needs O(n + m) space for the approximate graph; O(n + m) space is required for keeping the original graph. Therefore, the space complexity of Sniper is O(n + m). Next, we prove that Sniper requires O(n² + nm) time. To identify the lowest centrality node, Sniper first updates the approximate graph and then computes the approximate and exact centralities. Sniper needs O(nm) time to update the approximate graph, since it requires O(m) time to compute the similarity of the added node against each node in the original graph. It requires O(n² + nm) time to compute the approximate and exact centralities, since the numbers of nodes and edges are at most n and m, respectively. Therefore, Sniper requires O(n² + nm) time.

Theorem 2 shows, theoretically, that the space and time complexities of Sniper are lower in order than those of the previous approximate approaches. In practice, the search cost depends on the effectiveness of the approximation and estimation techniques used by Sniper. In the next section, we show their effectiveness by presenting the results of extensive experiments.
7 Experimental Evaluation
We performed experiments to demonstrate Sniper's effectiveness in comparison to two annotation approaches: the Zone annotation scheme and the Distance-to-zone annotation scheme (abbreviated DTZ). These were selected since they outperform the embedding schemes on our datasets; the same result is reported in [13]. Zone and DTZ annotation have two parameters:
Fig. 1. Efficiency of Sniper.
Fig. 2. Scalability of Sniper.
Fig. 3. The results of the annotation approaches: (1) error ratio, (2) wall clock time.
zones and dimensions. Zones are divided regions of the entire graph, and dimensions are sets of zones.¹ Note that these approaches can compute centrality quickly at the expense of exactness. We used the following three public datasets in the experiments: P2P [1], Social [2], and WWW [3]. They are a campus P2P network for file sharing, a free online social network, and the web pages within the 'nd.edu' domain, respectively. We extracted the largest connected component from the real data, and we added single nodes one by one in the experiments. We evaluated the search performance through wall clock time. All experiments were conducted on a Linux quad 3.33 GHz Intel Xeon server with 32GB of main memory. We implemented our algorithms using GCC.
7.1 Efficiency of Sniper
We assessed the search time needed by Sniper and the annotation approaches. Figure 1 shows the efficiency of Sniper where the number of nodes is 500,000 for P2P and Social, and 100,000 for WWW. We also show the scalability of our approach in Figure 2; this figure shows the wall clock time as a function of the number of nodes. We show only the result for P2P in Figure 2 due to space limitations. These figures indicate Sniper's total processing time (both update and search time are included). We set the number of zones and the dimension parameter to 2 and 1, respectively. Note that these parameter values allow the annotation approaches to estimate the lowest centrality node most efficiently. These figures show that our method is much faster than the annotation approaches under all conditions examined. Specifically, Sniper is more than 110 times faster. The annotation approaches require O(n²) time for computing centralities, while Sniper requires O(n' + m') time for computing approximate centralities. Even if Sniper computes the approximate centralities of all aggregate nodes to prune the nodes, this cost does not alter the search cost since approximate computations are effectively terminated. Sniper requires O(n + m) time to compute
¹ To compute the centralities of all nodes by the annotation approaches, we sampled half of the pairs from all nodes, which is the same setting used in [13].
exact centralities for nodes that cannot be pruned through approximation. This cost, however, has no effect on the experimental results, because a significant number of nodes are pruned by the approximation.
7.2 Exactness of the Search Results
One major advantage of Sniper is that it guarantees the exact answer, but this raises the following simple question: 'How successful are the previous approaches in providing the exact answer even though they sacrifice exactness?'. To answer this question, we conducted comparative experiments on the annotation approaches. As the metric of accuracy, we used the error ratio, which is the error in the centrality value of the estimated lowest centrality node divided by the centrality value of the exact answer node. Figure 3 shows the error ratio and the wall clock time of the annotation approaches with various parameter settings. The number of nodes is 10,000 and the dataset used is P2P in these figures. As we can see from Figure 3, the error ratio of Sniper is 0 because it identifies the lowest centrality node without fail. The annotation approaches, on the other hand, have much higher error ratios. Therefore, it is not practical to use the annotation approaches to identify the lowest centrality node. Figure 3-(2) shows that Sniper greatly reduces the computation time even though it guarantees the exact answer. The efficiency of the annotation approaches depends on the parameters used. Furthermore, the results show that the annotation approaches force a trade-off between speed and accuracy. That is, as the number of zones and dimensions decreases, the wall clock time decreases but the error ratio increases. The annotation approaches are approximation techniques and so can miss the lowest centrality node. Sniper also computes approximate centralities, but unlike the annotation approaches, Sniper does not discard the lowest centrality node in the search process. As a result, Sniper is superior to the annotation approaches not only in accuracy, but also in speed.
8 Conclusions
This paper addressed the problem of finding the lowest closeness centrality node from time-evolving graphs efficiently and exactly. Our proposal, Sniper, is based on two ideas: (1) It approximates the original graph by aggregating original nodes to compute approximate centralities efficiently, and (2) It terminates unnecessary distance computations early in finding the answer nodes, which greatly improves overall efficiency. Our experiments show that Sniper works as expected; it can find the lowest centrality node at high speed; specifically, it is significantly (more than 110 times) faster than existing approximate methods.
References
1. http://kdl.cs.umass.edu/data/canosleep/canosleep-info.html
2. http://snap.stanford.edu/data/soc-LiveJournal1.html
3. http://vlado.fmf.uni-lj.si/pub/networks/data/ND/NDnets.htm
4. Broder, A.Z., Glassman, S.C., Manasse, M.S., Zweig, G.: Syntactic clustering of the web. Computer Networks 29(8-13), 1157–1166 (1997)
5. Elmacioglu, E., Lee, D.: On six degrees of separation in dblp-db and more. SIGMOD Record 34(2), 33–40 (2005)
6. Garfield, E.: Citation analysis as a tool in journal evaluation. Science 178, 471–479 (1972)
7. Goldberg, A.V., Harrelson, C.: Computing the shortest path: search meets graph theory. In: SODA, pp. 156–165 (2005)
8. Leskovec, J., Kleinberg, J.M., Faloutsos, C.: Graph evolution: Densification and shrinking diameters. TKDD 1(1) (2007)
9. Nascimento, M.A., Sander, J., Pound, J.: Analysis of SIGMOD's co-authorship graph. SIGMOD Record 32(3), 8–10 (2003)
10. Newman, M.E.J.: The structure and function of complex networks. SIAM Review 45 (2003)
11. Ng, T.S.E., Zhang, H.: Predicting internet network distance with coordinates-based approaches. In: INFOCOM (2002)
12. Potamias, M., Bonchi, F., Castillo, C., Gionis, A.: Fast shortest path distance estimation in large networks. In: CIKM, pp. 867–876 (2009)
13. Rattigan, M.J., Maier, M., Jensen, D.: Using structure indices for efficient approximation of network properties. In: KDD, pp. 357–366 (2006)
Graph-Based Clustering with Constraints
Rajul Anand and Chandan K. Reddy
Department of Computer Science, Wayne State University, Detroit, MI, USA
[email protected],
[email protected]
Abstract. A common way to add background knowledge to clustering algorithms is by adding constraints. Though there have been some algorithms that incorporate constraints into the clustering process, not much focus has been given to the topic of graph-based clustering with constraints. In this paper, we propose a constrained graph-based clustering method and argue that adding constraints to the distance function before graph partitioning leads to better results. We also specify a novel approach for adding constraints by introducing a distance limit criterion, and we examine how this new distance limit approach performs in comparison to earlier approaches that use a fixed distance measure for constraints. The proposed approach and its variants are evaluated on UCI datasets and compared with other constrained-clustering algorithms which embed constraints in a similar fashion. Keywords: Clustering, constrained clustering, graph-based clustering.
1 Introduction

One of the primary forms of adding background knowledge for clustering data is to provide constraints during the clustering process [1]. Recently, data clustering using constraints has received a lot of attention. Several works in the literature have demonstrated improved results by incorporating external knowledge into clustering in different applications such as document clustering and text classification. The addition of some background knowledge can sometimes significantly improve the quality of the final results obtained. Final clusters that do not obey the initial constraints are often inadequate for the end-user. Hence, adding constraints and respecting these constraints during the clustering process plays a vital role in obtaining desired results in many practical domains. Several methods have been proposed in the literature for adding instance-level and cluster-level constraints. Constrained versions of partitional [19,1,7], hierarchical [5,13] and, more recently, density-based [17,15] clustering algorithms have been studied thoroughly. However, there has been little work on utilizing constraints in graph-based clustering methods [14].

1.1 Our Contributions

We propose an algorithm to systematically add instance-level constraints to a graph-based clustering algorithm. In this work, we primarily focus our attention on one such popular algorithm, CHAMELEON, an overview of which is provided in Section 3.2. Our contributions can be outlined as follows:
– Investigate the appropriate way of embedding constraints into the graph-based clustering algorithm for obtaining better results.
– Propose a novel distance limit criterion for must-links and cannot-links while embedding constraints.
– Study the effects of adding different types of constraints to graph-based clustering.

The remainder of the paper is organized as follows: we briefly review the current approaches for using constraints in different methods in Section 2. In Section 3, we describe the various notations used throughout the paper and also give an overview of a graph-based clustering method, namely CHAMELEON. Next, we propose our algorithm and discuss our approach regarding how and where to embed constraints in Section 4. We present several empirical results on different UCI datasets and comparisons to the state-of-the-art methods in Section 5. Finally, Section 6 concludes our discussion.
2 Relevant Background

Constraint-based clustering has received a lot of attention in the data mining community in recent years [3]. In particular, instance-based constraints have been successfully used to guide the mining process. Instance-based constraints enforce constraints on data points, as opposed to ε and δ constraints, which work on complete clusters. The ε-constraint says that for a cluster X having more than two points, for each point x ∈ X there must be another point y ∈ X such that the distance between x and y is at most ε. The δ-constraint requires the distance between any two points in different clusters to be at least δ. This methodology has also been termed semi-supervised clustering [9] when the cluster memberships are available for some of the data. As pointed out in the literature [19,5], even adding a small number of constraints can help in improving the quality of the results. Embedding instance-level constraints into a clustering method can be done in several ways. A popular method of incorporating constraints is to compute a new distance metric and then perform clustering. Other methods directly embed constraints into the optimization criterion of the clustering algorithm [19,1,5,17]. Hybrid methods combining these two basic approaches have also been studied in the literature [2,10]. Adding instance-level constraints to density-based clustering methods has also recently received some attention [17,15]. In spite of the popularity of graph-based clustering methods, not much attention has been given to the problem of adding constraints to these methods.
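A small sketch of the two cluster-level constraints just defined, using Euclidean distances; the function names and the NumPy-based representation of clusters as coordinate arrays are our own choices for illustration.

```python
import numpy as np

def satisfies_epsilon(cluster, eps):
    """ε-constraint: every point of a cluster with more than two points has some
    other point of the same cluster within distance ε."""
    pts = np.asarray(cluster, dtype=float)
    if len(pts) <= 2:
        return True
    for i in range(len(pts)):
        others = np.delete(pts, i, axis=0)
        if np.linalg.norm(others - pts[i], axis=1).min() > eps:
            return False
    return True

def satisfies_delta(cluster_a, cluster_b, delta):
    """δ-constraint: any two points in different clusters are at least δ apart."""
    a = np.asarray(cluster_a, dtype=float)
    b = np.asarray(cluster_b, dtype=float)
    dists = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    return dists.min() >= delta
```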
3 Preliminaries

Let us consider a dataset D, whose cardinality is denoted |D|. The total number of classes in the dataset is K. A proximity graph is constructed from this dataset by computing the pairwise Euclidean distances between the instances. A user-defined parameter k is used to define the number of neighbors considered for each data point. The hyper-graph partitioning algorithm generates intermediate subgraphs (or sub-clusters), which are represented by κ.
3.1 Definitions

Given a dataset D with each point denoted as (x, y), where x represents the point and y represents the corresponding label, we define constraints as follows:

Definition 1: Must-Link Constraints (ML): Two instances (x_1, y_1) and (x_2, y_2) are said to be must-link constraints if and only if y_1 = y_2, where y_1, y_2 ∈ K.

Definition 2: Cannot-Link Constraints (CL): Two instances (x_1, y_1) and (x_2, y_2) are said to be cannot-link constraints if and only if y_1 ≠ y_2, where y_1, y_2 ∈ K.

Definition 3: Transitivity of ML-constraints: Let X, Y be two components formed using ML-constraints. Then, a new ML-constraint (x_1 →_{must-link} x_2), where x_1 ∈ X and x_2 ∈ Y, introduces the following new constraints: (x_i →_{must-link} x_j) ∀ x_i, x_j, where x_i ∈ X and x_j ∈ Y, i ≠ j, X ≠ Y.

Definition 4: Entailment of CL-constraints: Let X, Y be two components formed using ML-constraints. Then, a new CL-constraint (x_1 →_{cannot-link} x_2), where x_1 ∈ X and x_2 ∈ Y, introduces the following new CL-constraints: (x_i →_{cannot-link} x_j) ∀ x_i, x_j, where x_i ∈ X and x_j ∈ Y, i ≠ j, X ≠ Y.
3.2 Graph-Based Hierarchical Clustering

We chose to demonstrate the performance of adding constraints to the popularly studied and practically successful CHAMELEON clustering algorithm. Karypis et al. [11] proposed the CHAMELEON algorithm, which can find arbitrarily shaped, varying-density clusters. It operates on sparse graphs containing similarities or dissimilarities between data points. Compared to various graph-based clustering methods [18] such as Minimum Spanning Tree clustering, OPOSSUM, ROCK and SLINK, CHAMELEON is superior because it incorporates the best features of graph-based clustering (such as the similarity measure on vertices used in ROCK) with a hierarchical clustering phase that is comparable to or better than SLINK. These features make CHAMELEON attractive for adding constraints to obtain better results. Moreover, CHAMELEON outperforms other algorithms like CURE [11] and also density-based methods like DBSCAN [18]. Thus, we believe that adding constraints to CHAMELEON will not only give better results but also provide some insights into the performance of other similar algorithms in the presence of constraints. Unlike other algorithms, which use given static modeling parameters to find clusters, CHAMELEON finds clusters by dynamic modeling. CHAMELEON uses both closeness and interconnectivity while identifying the most similar pair of clusters to be merged. CHAMELEON works in two phases. In the first phase, it finds the k-nearest neighbors based on the similarity between the data points. Then, using an efficient multi-level graph partitioning algorithm (such as METIS [12]), sub-clusters are created in such a way that similar data points are merged together. In the second phase, these sub-clusters are combined using a novel agglomerative hierarchical algorithm. Clusters are merged using the RI and RC metrics defined below. Let X, Y be two clusters. Mathematically, Relative Interconnectivity (RI) is defined as follows:

RI = \frac{EC(X, Y)}{\frac{1}{2}\,(EC(X) + EC(Y))}    (1)
where EC(X, Y) is the sum of the edges that connect clusters X and Y in the k-nearest neighbor graph, EC(X) is the minimum sum of the cut-edges if we bisect cluster X, and EC(Y) is the minimum sum of the cut-edges if we bisect cluster Y. Let l_x and l_y represent the sizes of clusters X and Y, respectively. Mathematically, Relative Closeness (RC) is defined as follows:

RC = \frac{S_{EC}(X, Y)}{\frac{l_x}{l_x + l_y}\,S_{EC}(X) + \frac{l_y}{l_x + l_y}\,S_{EC}(Y)}    (2)
where S_{EC}(X, Y) is the average weight of the edges connecting clusters X and Y in the k-nearest neighbor graph, and S_{EC}(X), S_{EC}(Y) represent the average weights of the edges if clusters X and Y are bisected, respectively. There are many schemes to account for both of these measures. The function used to combine them is RI × RC^α. Here, another parameter α is included so as to give preference to one of the two measures. Thus, we maximize the function:

argmax_{α ∈ (0, ∞)} (RI × RC^α)    (3)
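To make the merging criterion concrete, the following is a small illustrative sketch (not taken from the paper) of how RI, RC and the combined score could be computed for a pair of sub-clusters, assuming the edge-cut statistics of the k-nearest-neighbor graph have already been collected; all function and variable names here are ours.

```python
def relative_interconnectivity(ec_xy, ec_x, ec_y):
    """RI = EC(X,Y) / (0.5 * (EC(X) + EC(Y))), Eq. (1)."""
    return ec_xy / (0.5 * (ec_x + ec_y))

def relative_closeness(s_xy, s_x, s_y, lx, ly):
    """RC = S_EC(X,Y) / (lx/(lx+ly)*S_EC(X) + ly/(lx+ly)*S_EC(Y)), Eq. (2)."""
    return s_xy / ((lx / (lx + ly)) * s_x + (ly / (lx + ly)) * s_y)

def merge_score(ec_xy, ec_x, ec_y, s_xy, s_x, s_y, lx, ly, alpha=1.0):
    """Combined criterion RI * RC^alpha, Eq. (3); the cluster pair with the
    largest score is merged next."""
    ri = relative_interconnectivity(ec_xy, ec_x, ec_y)
    rc = relative_closeness(s_xy, s_x, s_y, lx, ly)
    return ri * rc ** alpha
```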
4 Constrained Graph-Based Clustering

CHAMELEON, like other graph-based algorithms, is sensitive to its parameters, as a slight change in similarity values can dramatically increase or decrease the quality of the final outcome. For CHAMELEON, changes in similarity measures might result in different k-nearest neighbors. Overlapping clusters, or clusters with very small inter-cluster distance, sometimes lead to different class members in the same cluster. In this work, the primary emphasis is to demonstrate that adding constraints to graph-based clustering can potentially avoid this problem at least sometimes, if not always. Our basis for this assumption is the transitivity of ML constraints and the entailment property of CL constraints (Section 3.1).

4.1 Embedding Constraints

Using the distance (or dissimilarity) metric to enforce constraints [13] was claimed to be effective in practice, despite having some drawbacks. The main problem is caused by setting the distance to zero between all must-linked pairs of constraints, i.e., if (p_i, p_j) are two instances in a must-link constraint, then distance(p_i, p_j) = 0. To compensate for the distorted metric, we run an all-pairs-shortest-path algorithm so that the new distance measure stays similar to the original space. If, instead, we bring any two such points much closer to each other, i.e.,

\lim_{n \to distance(p_i, p_j)} \left( distance(p_i, p_j) - n \right) = \eta    (4)
At first glance, it may seem that this minute change will not affect the results significantly. However, after running the all-pairs-shortest-path algorithm, the updated distance matrix in this case will respect the original distance measures better than setting the distance to zero. Similarly, for cannot-link constraints, let (q_i, q_j) be a pair of cannot-link constraints; then the points q_i and q_j are pushed as far apart as possible, i.e.,

\lim_{n \to \infty} \left( distance(q_i, q_j) + n \right) = \lambda    (5)
Thus, by varying the values of η and λ, we can reasonably pull points together and push them apart. It may seem that this creates a problem of finding optimal values of η and λ. However, our preliminary experiments show that the basic limiting values for these parameters are enough in most cases. This addition of constraints (and thus the manipulation of the distance matrix) can be performed in the CHAMELEON algorithm in either of its two phases. We can add these constraints before (or after) the graph partitioning step; after the graph partitioning, constraints could be added during the agglomerative clustering. However, we prefer to add constraints before graph partitioning, primarily for the following reasons:
– When the data points are already in sub-clusters, enforcing constraints through distances will not be beneficial unless we ensure that all such constraints are satisfied during the agglomerative clustering. However, constraint satisfaction might not lead to convergence every time. Especially with CL constraints, even determining whether satisfying assignments exist is NP-complete.
– Intermediate clusters are formed on the basis of the original distance metric. Hence, RI and RC computed on the original distance metric would be undermined by the new distance updates introduced through constraints.

4.2 The Proposed Algorithm

Our approach for embedding constraints into the clustering algorithm is through learning a new distance (or dissimilarity) function. This measure is also adjusted to ensure that the new distance (or dissimilarity) function respects the original distance values to the maximum extent for unlabeled data points. We used the Euclidean distance for calculating dissimilarity. For embedding constraints, an important and intuitive question is: where should these constraints be embedded to achieve the best possible results? As outlined in the previous section, we choose to embed constraints in the first phase of CHAMELEON. We now present a step-by-step discussion of our algorithm.

Using Constraints. Our algorithm begins by using the constraints to modify the distance matrix. To utilize the properties of the constraints (Section 3.1) and to restore the metricity of the distances, we propagate the constraints. The must-links are propagated by running a fast version of the all-pairs-shortest-path algorithm. If u, v represent the source and destination, respectively, then the shortest path between u and v involves only the points u, v and x, where x must belong to some pair of ML constraints. Using this modification, the algorithm runs in O(n²m) (here m is the number of unique points in ML). The complete-link clustering inherently propagates the cannot-link constraints. Thus, there is no need to propagate CL constraints during Step 1.
Algorithm 1. Constrained CHAMELEON (CC)
Input: Dataset D; set of must-link constraints ML = {ml_1, ..., ml_n}; set of cannot-link constraints CL = {cl_1, ..., cl_n}; number of desired clusters K; number of nearest neighbors k; number of intermediate clusters κ; significance factor α for RI vs. RC.
Output: Set of K clusters

Step 1: Embed Constraints
  for all (p_1, p_2) ∈ ML do
    set distance(p_1, p_2) to η, i.e., lim_{n → distance(p_1, p_2)} (distance(p_1, p_2) − n) = η
  end for
  fastAllPairShortestPaths(DistanceMatrix)
  for all (q_1, q_2) ∈ CL do
    set distance(q_1, q_2) to λ, i.e., lim_{n → ∞} (distance(q_1, q_2) + n) = λ
  end for
Step 2: Build the k-nearest-neighbor graph
Step 3: Partition the k-nn graph into κ clusters using edge-cut minimization
Step 4: Merge the κ clusters until K clusters remain, using maximization of RI × RC^α as the criterion for merging clusters
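As a rough illustration of Step 1, the sketch below modifies a pairwise distance matrix according to our reading of the embedding procedure: must-link distances are shrunk to η, metricity is repaired by a shortest-path pass whose intermediate vertices are restricted to must-link endpoints, and cannot-link distances are stretched to λ. The function and variable names are ours, and the exact fast all-pairs-shortest-path implementation used by the authors may differ.

```python
import numpy as np

def embed_constraints(dist, ml_pairs, cl_pairs, eta, lam):
    """Sketch of Step 1 of Constrained CHAMELEON: embed ML/CL constraints
    into a symmetric (n x n) distance matrix `dist`."""
    d = dist.copy()
    # Pull must-linked points together.
    for i, j in ml_pairs:
        d[i, j] = d[j, i] = eta
    # Propagate must-links: allow shortcuts only through points that occur
    # in some ML pair, which keeps the pass at O(n^2 m) for m such points.
    ml_points = {p for pair in ml_pairs for p in pair}
    for x in ml_points:
        d = np.minimum(d, d[:, [x]] + d[[x], :])
    # Push cannot-linked points far apart (CL propagation is left to the
    # complete-link agglomerative phase).
    for i, j in cl_pairs:
        d[i, j] = d[j, i] = lam
    return d
```

For instance, with η = D_min × 10^{-p} and λ = D_max × 10^{p} (the parameterization used later in Section 5), the call `embed_constraints(D, ML, CL, eta, lam)` would produce the constrained distance matrix passed to the k-nearest-neighbor graph construction of Step 2.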
Importance of Constraint Positioning. Imposing CL constraints just before Step 4, rather than in Step 1, might seem reasonable. We used CL constraints in Step 1 due to our experimental observations stated below:
1. Hyper-graph partitioning with constraints is often better than constrained agglomerative clustering, when we are not trying to satisfy constraints in either of them.
2. Clusters induced by graph partitioning have a stronger impact on the final clustering solution.
After Step 1, we create the k-nearest-neighbor graph (Step 2) and partition it using a graph partitioning algorithm (METIS). The κ clusters are then merged using agglomerative clustering, where the aim is to maximize the product RI × RC^α. Complete-link agglomerative clustering is used to propagate the CL constraints embedded earlier. The cut-off point in the dendrogram of the clustering is decided by the parameter K (see Algorithm 1). The time complexity of the unconstrained version of our algorithm is O(nκ + n log n + κ² log κ) [11]. The time complexity of Step 1 consists of adding the constraints, which is O(l) (l = |ML| + |CL|), and O(n²m) for the propagation of ML constraints, giving an overall complexity of O(n²m) for Step 1. Therefore, the time complexity of our algorithm is O(nκ + n log n + n²m + κ² log κ).
5 Experimental Results

We will now present the experimental results obtained using the proposed method on benchmark datasets from the UCI Machine Learning Repository [8]. Our results on the various versions of Constrained CHAMELEON (CC) were obtained with the same parameter settings for each dataset. These parameters were not tuned particularly for CHAMELEON; however, we did follow some specific guidelines for each dataset to obtain them. We used the same default settings for all the internal parameters of the METIS
hyper-graph partitioning package except κ, which is dataset dependent. We did not compare our results directly with constrained hierarchical clustering, since CC itself contains hierarchical clustering, which will be similar to or better than stand-alone hierarchical clustering algorithms. Instead, we compared with algorithms that embed constraints into the distance function in the same manner as our approach. Our CC with fixed values of (0, ∞) for (η, λ) is similar to [13], except that we have a graph-partitioning step on the nearest-neighbor graph before agglomerative clustering. So, we ruled out this algorithm and instead compared our results with MPCK-means [4], as this algorithm also embeds constraints in the distance function. MPCK-means incorporates learning of the distance function on labeled data and on the data affected by constraints in each iteration. Thus, it learns a different distance function for each cluster. For the performance measure, we used the Rand Index statistic [16], which measures the agreement between two sets of clusters X and Y for the same set of n objects as follows: R = (a + b) / \binom{n}{2}, where a is the number of pairs of objects assigned to the same cluster in both X and Y, and b is the number of pairs of objects assigned to different clusters in both X and Y. All parameter selection is done systematically. For all the clustering results, K is set to the true number of classes in the dataset. The value of α is chosen between 1 and 2 in increments of 0.1. We ran some basic experiments with CHAMELEON on each dataset to figure out the effect of α on the results, and chose the particular value of α for each dataset that can potentially produce better results. We used a similar procedure for k and κ. It is important to note that κ is the dataset-dependent parameter among all the other parameters. We assume that at least 10 data points should be present in a single cluster; thus K ≤ κ ≤ |D|/10. We used the class labels of the data points to generate constraints. We randomly select a pair of data points and check their labels: if the labels are the same, they are denoted as must-link, else they are denoted as cannot-link. To assess the impact of the constraints on the quality of the results, we varied the number of constraints. We generated results for ML only, CL only, and ML and CL combined. The complete dataset is used to randomly select data points for constraints, thus removing any bias towards the generated constraints.

Table 1. Average Rand Index values for 100 ML + CL constraints on UCI datasets

Datasets     Instances  Attributes  Classes  MPCK-means  CC(p=1)  CC(fixed)
Ionosphere   351        34          2        0.5122      0.5245   0.5355
Iris         150        4           3        0.6739      0.7403   0.7419
Liver        345        6           2        0.5013      0.5034   0.5097
Sonar        208        60          2        0.5166      0.5195   0.5213
Wine         178        13          3        0.6665      0.6611   0.7162
Average                                      0.5741      0.5898   0.6049
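For reference, the Rand Index defined above can be computed directly from two labelings; the short sketch below is our own illustration of that formula, not code from the paper.

```python
from itertools import combinations

def rand_index(labels_x, labels_y):
    """R = (a + b) / C(n, 2): a counts pairs placed in the same cluster by
    both clusterings, b counts pairs separated by both."""
    n = len(labels_x)
    a = b = 0
    for i, j in combinations(range(n), 2):
        same_x = labels_x[i] == labels_x[j]
        same_y = labels_y[i] == labels_y[j]
        if same_x and same_y:
            a += 1
        elif not same_x and not same_y:
            b += 1
    return (a + b) / (n * (n - 1) / 2)
```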
We used five UCI datasets in our experiments, as shown in Table 1. The average Rand Index values for 100 must-link and cannot-link constraints clearly show that, on most occasions, MPCK-means [4] is outperformed by both variants of CC. CC(fixed) performed marginally better than CC(p=1). Also, we only show results for CC(p=5) and CC(p=15), since the results of CC(p=1) and CC(p=10) are similar to the other two.
For each dataset, we randomly select constraints and run the algorithm once per constraint set. This is done 10 times, and we report the average Rand Index value over the 10 runs. We used this experimental setup for all the variants of CC and for MPCK-means. The results are depicted in Figs. 1-3. We note that the distance values for must-links and cannot-links can be varied instead of fixed at 0 and ∞, respectively. CC(fixed) uses (0, ∞) as the distance values. Ideally, the values of (η, λ) could be anything close to the extreme values of 0 and ∞, yet they have to be quantifiable. In order to quantify them in our experiments, we defined them as follows:

λ = D_max × 10^p    (6)
η = D_min × 10^{-p}    (7)
where D_max and D_min represent the maximum and minimum distance values in the data matrix, respectively. In order to study the effect of p, we varied its values: p = 1, 5, 10 and 15. Thus, we have CC(p=1), CC(p=5), CC(p=10) and CC(p=15), referring to different values of (η, λ). It is interesting to note that, for different values of p, the distance values for constraints are different for each dataset, due to the different minimum and maximum distance values. In this manner, we respect the metricity of the original distances and vary our constraint values accordingly. We tried various parameter settings and found that only a few selected ones made a significant difference in the quality of the final results. It is also important to note that these settings were found by running the basic CHAMELEON algorithm rather than CC. This is because finding optimal parameters for CC using various constraints would be constraint-specific and would not truly represent the algorithmic aspect. We then ran CC with a few selected settings for all the CC variants and all constraint-set sizes, and finally report the average values for the single set of parameters showing better performance on average across all CC variants. The individual settings of the parameters (k, κ, α) for each dataset shown in the results are as follows: Ion(19,10,1.2), Iris(9,3,1), Liver(10,5,2), Sonar(6,3,1) and Wine(16,3,1). In summary, we selected the best results obtained by the basic version of the CHAMELEON algorithm, and have shown that these best results can be improved by adding constraints. We observed consistently, across all the variants of CC and MPCK-means on all datasets, that the performance decreases as the number of constraints increases, except in some prominent cases (Figs. 1(d), 2(a), 2(b) and 3(d)). This observation is consistent with the results outlined in the recent literature [6]. We stated earlier that we did not attempt to satisfy constraints implicitly or explicitly. However, we observed that during Step 3 of Algorithm 1, for fewer constraints, the constraint violation is most of the time zero in the intermediate clusters, which is often reflected in the final partitions. As the number of constraints increases, the number of constraint violations also increases. However, on average, violations are roughly between 10%-15% for must-link constraints, 20%-25% for cannot-link constraints, and about 15%-20% for must-links and cannot-links combined. We also observed that, a few times, the constraint violations are reduced after Step 4, i.e., after the final agglomerative clustering. Thus, we can conclude that the effect of constraints is significant in Step 3, and we re-iterate that embedding constraints earlier is always better for CC.
Fig. 1. Different versions of Constrained CHAMELEON (CC) compared with MPCK-means using Rand Index values (y-axis) averaged over 10 runs, versus the number of ML constraints (x-axis), on different UCI datasets: (a) Ionosphere, (b) Iris, (c) Liver and (d) Sonar.
Overall, different variants of our algorithm CC outperformed MPCK-means. The Iris and Liver datasets are examples where, for all combinations of constraints, the CC results are clearly much better than those of MPCK-means. On the remaining two datasets, CC performed nearly the same as MPCK-means. Only in some cases did MPCK-means perform slightly better than the CC variants, as shown in Figs. 1(a), 2(a) and 2(d). Even in these particular scenarios, at least one variant of CC outperformed (or nearly matched) the result of MPCK-means. Surprisingly, CC(fixed) was only slightly better or worse than the other variants of CC. A direct comparison of CC(fixed) with MPCK-means reveals that only in two cases (Figs. 2(a) and 2(d)) did MPCK-means outperform CC(fixed); in the rest of the scenarios, CC(fixed) performed better. The primary reason for the wavering performance on the Ionosphere and Sonar datasets could be attributed to the large number of attributes in these datasets (Table 1). Due to the curse of dimensionality, the distance function loses its meaning, which directly affects the nearest neighbors. Adding constraints does provide some contrast so as to group similar objects, but the overall discernibility is still low. It is important to note that we did not search or tune for optimal values of (η, λ) for any particular dataset. During our initial investigation, we found that, for some changes in these values, the results improved. We did some experiments on the Iris dataset and were able to achieve an average Rand Index value of 0.99, and quite often achieved perfect clustering
Fig. 2. Different versions of Constrained CHAMELEON (CC) compared with MPCK-means using Rand Index values (y-axis) averaged over 10 runs, versus the number of CL constraints (x-axis), on different UCI datasets: (a) Ionosphere, (b) Iris, (c) Liver and (d) Sonar.
(Rand Index = 1) during some of the runs for 190 constraints, with the same settings as used in all the other experiments shown. However, based on these initial results, it would be too early to conclude that finding tuned values for (η, λ) will always increase performance; this will need further experimental evidence. Based on our findings, we observed that changing the values of (η, λ) did sometimes increase performance, but not consistently, and it can also sometimes lead to a decrease in performance. We were also surprised by this phenomenon, demonstrated by both algorithms. In our case, carrying out more experiments with additional constraints revealed that this decrease in performance holds up to a particular number of constraints. After that, we again see a rise in performance, and with a sufficient number of constraints (1% to 5% of constraints in our case with these datasets), we are able to recover the original clustering or come close to it (Rand Index close to 1.0). CC(fixed), compared to the other variants of CC, was only slightly different on average. CC(fixed) performed reasonably well against MPCK-means across all the datasets under nearly all settings, and the other variants of CC were also better on average than MPCK-means. Thus, our algorithm handled the decrease in performance with increasing numbers of constraints better than MPCK-means. Most importantly, our algorithm performed well despite not trying to satisfy constraints implicitly or explicitly.
Fig. 3. Different versions of Constrained CHAMELEON (CC) compared with MPCK-means using Rand Index values (y-axis) averaged over 10 runs, versus the number of ML and CL constraints (x-axis), on different UCI datasets: (a) Ionosphere, (b) Iris, (c) Liver and (d) Sonar.
6 Conclusion

In this work, we presented a novel constrained graph-based clustering method based on the CHAMELEON algorithm. We proposed a new framework for embedding constraints into the graph-based clustering algorithm to obtain promising results. Specifically, we thoroughly investigated the "how and when to add constraints" aspect of the problem. We also proposed a novel method for the distance-limit criterion used while embedding constraints into the distance function. Our algorithm outperformed the popular MPCK-means method on several real-world datasets under various constraint settings.
References

1. Basu, S., Banerjee, A., Mooney, R.J.: Semi-supervised clustering by seeding. In: Proceedings of the Nineteenth International Conference on Machine Learning (ICML 2002), pp. 27–34 (2002)
2. Basu, S., Bilenko, M., Mooney, R.J.: A probabilistic framework for semi-supervised clustering. In: Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 59–68 (2004)
3. Basu, S., Davidson, I., Wagstaff, K.: Constrained Clustering: Advances in Algorithms, Theory, and Applications. Chapman & Hall/CRC (2008)
4. Bilenko, M., Basu, S., Mooney, R.J.: Integrating constraints and metric learning in semi-supervised clustering. In: Proceedings of the Twenty-first International Conference on Machine Learning, ICML 2004 (2004)
5. Davidson, I., Ravi, S.S.: Agglomerative Hierarchical Clustering with Constraints: Theoretical and Empirical Results. In: Jorge, A.M., Torgo, L., Brazdil, P.B., Camacho, R., Gama, J. (eds.) PKDD 2005. LNCS (LNAI), vol. 3721, pp. 59–70. Springer, Heidelberg (2005)
6. Davidson, I., Ravi, S.S., Shamis, L.: A SAT-based framework for efficient constrained clustering. In: Jonker, W., Petković, M. (eds.) SDM 2010. LNCS, vol. 6358, pp. 94–105. Springer, Heidelberg (2010)
7. Davidson, I., Wagstaff, K.L., Basu, S.: Measuring Constraint-Set Utility for Partitional Clustering Algorithms. In: Fürnkranz, J., Scheffer, T., Spiliopoulou, M. (eds.) PKDD 2006. LNCS (LNAI), vol. 4213, pp. 115–126. Springer, Heidelberg (2006)
8. Frank, A., Asuncion, A.: UCI machine learning repository (2010), http://archive.ics.uci.edu/ml
9. Gunopulos, D., Vazirgiannis, M., Halkidi, M.: From unsupervised to semi-supervised learning: Algorithms and evaluation approaches. In: SIAM International Conference on Data Mining: Tutorial (2006)
10. Halkidi, M., Gunopulos, D., Kumar, N., Vazirgiannis, M., Domeniconi, C.: A framework for semi-supervised learning based on subjective and objective clustering criteria. In: Proceedings of the 5th IEEE International Conference on Data Mining (ICDM 2005), pp. 637–640 (2005)
11. Karypis, G., Han, E.-H., Kumar, V.: Chameleon: Hierarchical clustering using dynamic modeling. IEEE Computer 32(8), 68–75 (1999)
12. Karypis, G., Kumar, V.: Metis 4.0: Unstructured graph partitioning and sparse matrix ordering system. Tech. Report, Dept. of Computer Science, Univ. of Minnesota (1998)
13. Klein, D., Kamvar, S.D., Manning, C.D.: From instance-level constraints to space-level constraints: Making the most of prior knowledge in data clustering. In: Proceedings of the Nineteenth International Conference on Machine Learning (ICML 2002), pp. 307–314 (2002)
14. Kulis, B., Basu, S., Dhillon, I.S., Mooney, R.J.: Semi-supervised graph clustering: a kernel approach. In: Proceedings of the Twenty-Second International Conference on Machine Learning (ICML 2005), pp. 457–464 (2005)
15. Lelis, L., Sander, J.: Semi-supervised density-based clustering. In: Perner, P. (ed.) ICDM 2009. LNCS, vol. 5633, pp. 842–847. Springer, Heidelberg (2009)
16. Rand, W.M.: Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association 66(336), 846–850 (1971)
17. Ruiz, C., Spiliopoulou, M., Menasalvas, E.: Density based semi-supervised clustering. Data Mining and Knowledge Discovery 21(3), 345–370 (2009)
18. Tan, P.-N., Steinbach, M., Kumar, V.: Introduction to Data Mining, US edition. Addison Wesley, Reading (2005)
19. Wagstaff, K., Cardie, C., Rogers, S., Schrödl, S.: Constrained k-means clustering with background knowledge. In: Proceedings of the Eighteenth International Conference on Machine Learning (ICML 2001), pp. 577–584 (2001)
A Partial Correlation-Based Bayesian Network Structure Learning Algorithm under SEM

Jing Yang and Lian Li

Department of Computer Science and Technology, Hefei University of Technology, Hefei 230009, China
[email protected]

Abstract. A new algorithm, the PCB (Partial Correlation-Based) algorithm, is presented for Bayesian network structure learning. The algorithm combines ideas from local learning with partial correlation techniques in an effective way. It reconstructs the skeleton of a Bayesian network based on partial correlation and then performs a greedy hill-climbing search to orient the edges. Specifically, we make three contributions. First, we give a proof that in a SEM (simultaneous equation model) with uncorrelated errors, when datasets are generated by the SEM, partial correlation can be used as the criterion for CI tests no matter what distribution the disturbances follow. Second, we have done a series of experiments to find the best threshold value of partial correlation. Finally, we show how partial correlation can be used in Bayesian network structure learning under SEM. The effectiveness of the method is compared with current state-of-the-art methods on 8 networks. Simulation shows that the PCB algorithm outperforms existing algorithms in both accuracy and run time.

Keywords: partial correlation; Bayesian network structure learning; SEM (simultaneous equation model).
1 Introduction
Learning the structure of a Bayesian network from a dataset D is useful; unfortunately, it is an NP-hard problem [2]. Consequently, many heuristic techniques have been proposed. One of the most basic search algorithms is a local greedy hill-climbing search over all DAG structures. The size of the search space of greedy search is super-exponential in the number of variables. One class of approaches places constraints on the search to improve its efficiency; examples include the K2 algorithm [3], the SC algorithm [4], the MMHC algorithm [15] and the L1MB algorithm [7]. One drawback of the K2 algorithm is that it requires a total variable ordering. The SC algorithm first introduced the local learning idea and proposed a two-phase framework consisting of a Restrict step and a Search step. In the Restrict step, the SC algorithm uses mutual information to find a set of potential neighbors for each node and achieves fast learning by restricting the search space. One drawback of the SC algorithm is that it only allows a variable to have a maximum of
k parents. However, a common parameter k for all nodes will sacrifice either efficiency or quality of reconstruction [15]. The MMHC algorithm uses the max-min parents-children (MMPC) algorithm to identify a set of potential neighbors [15]. Experiments show that the MMHC algorithm has quite high accuracy; one drawback is that it needs conditional independence tests on exponentially large conditioning sets. The L1MB algorithm introduces L1 techniques to learn the DAG structure and uses the LARS algorithm to find a set of potential neighbors [7]. The L1MB algorithm has good time performance. However, the L1MB algorithm can only describe the correlation between a set of variables and a variable, not the correlation between two variables, and experiments show that it has low accuracy. In fact, many algorithms, such as K2, SC, PC [13], TPDA [1] and MMHC, can be implemented efficiently for discrete variables, but are not straightforwardly applicable to continuous variables. The L1MB algorithm has been designed for continuous variables; however, its accuracy is not very high. The partial correlation method can reveal the true correlation between two variables by eliminating the influences of other correlative variables [16]. It has been successfully applied to many fields such as medicine [8], economics [14], and geology [16]. In causal discovery, it has been used (as transformed by Fisher's z [12]) as a continuous replacement for CI tests in the PC algorithm. Pellet et al. introduced partial-correlation-based CI tests into causal discovery under the assumption that the data follow a multivariate Gaussian distribution for continuous variables [9]. However, when the data do not follow a multivariate Gaussian distribution, can partial correlation still serve as a CI test? Our first contribution is a proof that partial correlation can be used as the criterion for CI tests under the linear simultaneous equation model (SEM), which includes the multivariate Gaussian distribution as a special case. Our second contribution is an effective algorithm, called PCB (Partial Correlation-Based), which combines ideas from local learning with partial correlation techniques in an effective way. The PCB algorithm works in the continuous-variable setting under the assumption that the data are generated by a SEM. The computational complexity of PCB is O(3mn² + n³) (n is the number of variables and m is the number of cases). Advantages of PCB are its quite good time performance and quite high accuracy; its time complexity is polynomially bounded by the number of variables. A third advantage is that the PCB algorithm uses a relevance threshold to evaluate the correlation, which alleviates the drawback of the SC algorithm (a common parameter k for all nodes), and we also find the best relevance threshold through a series of extensive experiments. Empirical results show that PCB outperforms the above existing algorithms in both accuracy and time performance. The remainder of the paper is structured as follows. In Section 2, we present the PCB algorithm and give a computational complexity analysis. Some empirical results are shown and discussed in Section 3. Finally, we conclude our work and address some issues for future work in Section 4.
2 PCB Algorithm
PCB (Partial Correlation-Based) includes two steps: the Restrict step and the Search step.

2.1 Restrict Step
The Restrict step is analogous to the pruning step of the SC algorithm, the MMHC algorithm and the L1MB algorithm. In this paper, partial correlation is used to identify the candidate neighbors. To a certain extent, there is a correlation between every two variables, but this correlation is affected by the other correlative variables. The simple correlation method does not consider these influences, so it cannot reveal the true correlation between two variables. Partial correlation can eliminate the influences of other correlative variables and reveal the true correlation between two variables. A larger magnitude of the partial correlation coefficient means a closer correlation [16]. So partial correlation is used to select the potential neighbors. Before we give our algorithm, we give some definitions and theorems.

Definition 2.1 [9] (SEM). A SEM (structural equation model) is a set of equations describing the value of each variable X_i in X as a function f_{X_i} of its parents Pa(X_i) and a random disturbance term u_{X_i}:

x_i = f_{X_i}(Pa(X_i), u_{X_i})    (1)
In our paper, without loss of generality, we simplify the function to be linear, so we multiply a weight vector W_{X_i} with the parent set Pa(X_i), one weight per parent. Here W_{X_i} and Pa(X_i) are vectors, and W^T_{X_i} is the transpose of W_{X_i}. We obtain equation (2):

x_i = W^T_{X_i} Pa(X_i) + u_{X_i}    (2)
Equation (2) is a special case of the general SEM described by equation (1). The disturbances u_{X_i} are continuous random variables. In particular, when all u_{X_i} are Gaussian random variables, X follows a multivariate Gaussian distribution, and partial correlation is then a valid CI measure [9]. However, we want to deal with a more general case, when the u_{X_i} are continuous but come from an arbitrary distribution. Can partial correlation be a valid CI measure?

Definition 2.2 [9] (Conditional independence). In a variable set X, two random variables X_i, X_j ∈ X are conditionally independent given Z ⊆ X \ {X_i, X_j} if and only if P(X_i | X_j, Z) = P(X_i | Z), denoted as Ind(X_i, X_j | Z).

Definition 2.3 [9] (d-separation). In a DAG G, two nodes X_i, X_j are d-separated by Z ⊆ X \ {X_i, X_j} if and only if every path from X_i to X_j is blocked by Z, denoted as Dsep(X_i, X_j | Z). A path is blocked if at least one diverging or serially connected node is in Z, or if at least one converging node and
all its descendants are not in Z. If X_i and X_j are not d-separated by Z, they are d-connected, denoted as Dcon(X_i, X_j | Z).

Theorem 2.1 [12]. In a SEM with uncorrelated errors (that is, for any two random variables X_i, X_j ∈ X, u_{X_i} and u_{X_j} are uncorrelated), and for Z ⊆ X \ {X_i, X_j}, the partial correlation ρ(X_i, X_j | Z) is entailed to be zero if and only if X_i and X_j are d-separated given Z.

Definition 2.4 [10] (Perfect map). If the Causal Markov and Faithfulness conditions hold together, a DAG G is a directed perfect map of a joint probability distribution P(X), and there is a bijection between d-separation in G and conditional independence in P:

∀ X_i, X_j ∈ X, ∀ Z ⊆ X \ {X_i, X_j}: Ind(X_i, X_j | Z) ⇔ Dsep(X_i, X_j | Z)    (3)
Definition 2.5 [5] (Linear correlation). In a variable set X, the linear correlation coefficient γ_{X_i X_j} between two random variables X_i, X_j ∈ X, which provides the most commonly used measure of the strength of the linear relationship between X_i and X_j, is defined by

γ_{X_i X_j} = σ_{X_i X_j} / (σ_{X_i} σ_{X_j})    (4)
where σ_{X_i X_j} denotes the covariance between X_i and X_j, and σ_{X_i} and σ_{X_j} denote the standard deviations of X_i and X_j, respectively. γ_{X_i X_j} is estimated by

\hat{γ}_{X_i X_j} = \frac{\sum_{k=1}^{m} (x_{ki} - \bar{x}_i)(x_{kj} - \bar{x}_j)}{\sqrt{\sum_{k=1}^{m} (x_{ki} - \bar{x}_i)^2} \, \sqrt{\sum_{k=1}^{m} (x_{kj} - \bar{x}_j)^2}}    (5)
Here, m is the number of instances, x_{ki} denotes the k-th realization (or case) of X_i, and \bar{x}_i is the mean of X_i; similarly, x_{kj} denotes the k-th case of X_j, and \bar{x}_j is the mean of X_j.

Definition 2.6 [10] (Partial correlation). In a variable set X, the partial correlation between two random variables X_i, X_j ∈ X given Z ⊆ X \ {X_i, X_j}, denoted ρ(X_i, X_j | Z), is the correlation of the residuals R_{X_i} and R_{X_j} resulting from the least-squares linear regression of X_i on Z and of X_j on Z, respectively. Partial correlation can be computed efficiently, without having to solve the regression problems, by inverting the correlation matrix R of X. With R^{-1} = (r_{ij}), where R^{-1} is the inverse matrix of R, we have:

ρ(X_i, X_j | X \ {X_i, X_j}) = -r_{ij} / \sqrt{r_{ii} r_{jj}}    (6)

In this case, we can compute all partial correlations with a single matrix inversion. This is the approach we use in our algorithm.

Theorem 2.2. In a SEM with uncorrelated errors, when data is generated by the SEM, we can use partial correlation as the criterion for CI tests no matter what distribution the disturbances follow.

Proof: From Theorem 2.1, ∀ X_i, X_j ∈ X, ∀ Z ⊆ X \ {X_i, X_j}, the partial correlation ρ(X_i, X_j | Z) is entailed to be zero if and only if X_i and X_j are d-separated
given Z. From Definition 2.4, there is a bijection between d-separation in G and conditional independence in P, Ind(X_i, X_j | Z) ⇔ Dsep(X_i, X_j | Z); thus the partial correlation ρ(X_i, X_j | Z) is entailed to be zero if and only if X_i and X_j are conditionally independent given Z. So we can use partial correlation as the criterion for CI tests in a SEM with uncorrelated errors.

Definition 2.7 (Strong relevance). ∀ X_i, X_j ∈ X, ∀ Z ⊆ X \ {X_i, X_j}, X_i is strongly relevant to X_j if the partial correlation ρ(X_i, X_j | Z) >= threshold.

Definition 2.8 (Weak relevance). ∀ X_i, X_j ∈ X, ∀ Z ⊆ X \ {X_i, X_j}, X_i is weakly relevant to X_j if the partial correlation ρ(X_i, X_j | Z) <= threshold.

The outline of the Restrict step is shown in Fig. 1. The input of the step is a threshold k and a dataset D = {x^1, ..., x^m} of instances of X, where each x^i is a complete assignment to the variables X_1, ..., X_n in Val(X_1, ..., X_n). Each column of the dataset represents one variable. The output of the step is a set of potential neighbors PN(X_j) for each X_j and the matrix of potential neighbors PNM. If PNM(i, j) is 1, X_i is X_j's potential neighbor; otherwise, if PNM(i, j) is 0, X_i is not X_j's potential neighbor. Initially, PN(X_j) (the set of potential neighbors of each variable X_j) is empty, and all elements of PNM are set to 0 (step 1). Then we select a set of potential neighbors for each variable and obtain the final matrix of potential neighbors (steps 2 to 9). For each pair of variables X_i and X_j (X_i, X_j ∈ X, j = 1 to n, i = 1, ..., j, i ≠ j), with Z = X \ {X_i, X_j}, we calculate ρ(X_i, X_j | Z); if the absolute value of ρ(X_i, X_j | Z) is greater than k, we choose X_i as X_j's potential neighbor and set PNM(i, j) to 1, otherwise we set PNM(i, j) to 0. In fact, ρ(X_i, X_j | Z) (i < j) equals ρ(X_j, X_i | Z); however, if there is a strong correlation between them, we only set PNM(i, j) to 1. PNM is thus an upper triangular matrix with zeros on the diagonal. Because the Search step includes a reverse-edge operator, the greedy hill-climbing search in that step can orient the edges properly.
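As a concrete illustration of this step, the following NumPy sketch computes the potential-neighbor matrix from the single matrix inversion of Definition 2.6; the function and variable names are ours, and it assumes the sample correlation matrix is invertible.

```python
import numpy as np

def restrict_step(data, k=0.1):
    """Sketch of the Restrict step: select potential neighbors whose partial
    correlation (given all remaining variables) has magnitude >= k.
    `data` is an (m, n) array with one column per variable."""
    n = data.shape[1]
    R = np.corrcoef(data, rowvar=False)       # n x n correlation matrix
    Rinv = np.linalg.inv(R)                   # elements r_ij of R^{-1}
    # rho(Xi, Xj | X \ {Xi, Xj}) = -r_ij / sqrt(r_ii * r_jj), Eq. (6)
    denom = np.sqrt(np.outer(np.diag(Rinv), np.diag(Rinv)))
    rho = -Rinv / denom
    PNM = np.zeros((n, n), dtype=int)         # upper triangular, zero diagonal
    for j in range(n):
        for i in range(j):
            if abs(rho[i, j]) >= k:
                PNM[i, j] = 1
    return PNM
```

The threshold default of 0.1 reflects the value that performed best on average in the experiments of Section 3; any other relevance threshold can be passed in its place.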
Search Step
After Restrict step, we find the candidate neighbors of each variable, then perform a greedy hill-climbing search. We assume that we have fully observed (complete) data, which are continuous, that our goal is to find the DAG G that minimizes the MDL cost, MDL is defined as M DL(G) =
n
|θˆmle | (N LL(i, Pa(Xi ), θˆimle ) + i log m) 2 i=1
N LL(i, Pa(Xi ), θ) = −
m
log(P (Xj,i |Xj,Pa(Xi ) , θ))
(7)
(8)
j=1
The method is used in [7]. Where m is the number of data cases, n is the number of variables, Pa(Xi ) are the parents of node i in G, N LL(i, Pa(Xi ), θ) is the negative loglikelihood of node i with parents Pa(Xi ) and parameters θ,
68
J. Yang and L. Li
1
m
Input: a dataset D={x ,…, x }, Output: 1.
PN( Xj )= ( Xj X,
2.
for Xj X, j=1 to n,
3. 4.
threshold:
for Xi X,
j=1 to n ) ,
Calculate partial correlation U (Xi , Xj |Z) if abs( U(Xi, Xj |Z) )> =k else PNM(i, j)=0
9.
j=1 to n )
do
6. 8.
PNM( i, j )=0 ( i=1 to n,
i=1 to j , izj , Z = X \ { Xi, Xj }, do
5. 7.
k
a set of potential neighbors PN(Xj) of each variable Xj and potential neighbors matrix PNM
then PN(Xj)= PN(Xj) Xi , PNM(i, j)=1,
end for end for return PN and PNM
Fig. 1. Outline of the Restrict step
and \hat{θ}_i^{mle} = argmin_θ NLL(i, Pa(X_i), θ) is the maximum likelihood estimate of node i's parameters. The term |\hat{θ}_i| is the number of free parameters in the CPD (conditional probability distribution) of node i; for linear regression, |\hat{θ}_i| = |Pa(X_i)|, the number of parents. The Search step performs a greedy hill-climbing search to obtain the final DAG. We follow the L1MB implementation (also to allow for a fair comparison). The search begins with an empty graph. The basic heuristic search procedure we use is a greedy hill-climbing that considers local moves in the form of edge addition, edge deletion, and edge reversal. At each iteration, the procedure examines the change in the score for each possible move, and applies the one that leads to the biggest decrease in the MDL score. These iterations are repeated until convergence. The important difference from standard greedy search is that the search is constrained to only consider adding an edge if it was discovered by PCB in the Restrict step. The remove-edge operator can only be used to remove edges that have actually been added to the graph. If the orientation of some edge is not right and reversing it leads to a decrease in the MDL score, the reverse-edge operator is used to reverse it. We terminate the procedure after some fixed number of changes fails to result in an improvement over the best score so far. After termination, the procedure returns the best scoring structure it encountered.
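To illustrate how the score in equations (7)-(8) can be instantiated, the sketch below computes the MDL cost of a candidate DAG for linear-Gaussian CPDs; this is our own reading of the scoring setup (the DAGLearn implementation used by the authors may parameterize the CPDs differently), and the function and variable names are ours.

```python
import numpy as np

def mdl_score(data, parents):
    """Sketch of MDL(G): sum over nodes of the negative log-likelihood of
    node i given its parents (least-squares linear fit, Gaussian noise with
    ML variance) plus (|Pa(X_i)|/2) * log m.  `parents[i]` is the list of
    parent column indices of node i; `data` is an (m, n) array."""
    m, n = data.shape
    total = 0.0
    for i in range(n):
        pa = parents[i]
        if pa:
            X = data[:, pa]
            w, *_ = np.linalg.lstsq(X, data[:, i], rcond=None)
            resid = data[:, i] - X @ w
        else:
            resid = data[:, i]
        sigma2 = max(float(resid @ resid) / m, 1e-12)
        nll = 0.5 * m * (np.log(2 * np.pi * sigma2) + 1.0)
        total += nll + 0.5 * len(pa) * np.log(m)
    return total
```

In a greedy search, each candidate move (adding an edge allowed by PNM, deleting an edge, or reversing one) would be scored by re-evaluating only the affected nodes' terms of this sum.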
2.3 Time Complexity of PCB Algorithm

A dataset with n variables and m cases is considered. For comparison with L1MB, we only consider the time complexity of the Restrict step, which is O(3mn² + n³). The computations of the PCB algorithm mainly consist of calculating the correlation coefficient matrix R and calculating the inverse of R. Multiplying two n × n matrices needs n² vector inner products, and the computational complexity of an inner product of vectors of length n is O(n), so the computational complexity of matrix multiplication is at most O(n³). From Definition 2.5, we know
that calculating the correlation coefficient of two variables needs 3 vector inner products; since the correlation coefficient matrix has n² elements, calculating it requires 3n² inner products. For m cases, the computational complexity of an inner product of vectors of length m is O(m); thus the computational complexity of calculating the correlation coefficient matrix R is O(3mn²). Since computing the inverse of a matrix has the same complexity as matrix multiplication, the computational complexity of calculating the inverse of R (n × n) is at most O(n³). We conclude that the total time complexity of the Restrict step is O(3mn² + n³).
3 Experimental Results

3.1 Networks, Datasets and Measures of Performance
The experiments were conducted on a computer with Windows XP, an Intel(R) 2.6 GHz CPU and 2 GB of memory. Altogether, 8 networks are used, all selected from the Bayes net repository (BNR, http://www.cs.huji.ac.il/labs/compbio/Repository) except the factors network, which is synthetic. The networks, with their numbers of nodes and edges, are as follows: 1. alarm (37/46), 2. barley (48/84), 3. carpo (61/74), 4. factors (27/68), 5. hailfinder (56/66), 6. insurance (27/52), 7. mildew (35/46), 8. water (32/66). The datasets used in our experiments are generated by SEMs. We adopt the following two kinds of SEMs:

(1) x_i = W^T_{X_i} Pa(X_i) + N(0, 1)
(2) x_i = W^T_{X_i} Pa(X_i) + rand(0, 1)
The weights are generated randomly, typically distributed uniformly [9] or normally [7]; we sampled the weights from ±1 + N(0, 1)/4. Datasets sampled from SEM (1) follow a multivariate Gaussian distribution and are continuous; datasets sampled from SEM (2) do not follow a multivariate Gaussian distribution and are also continuous. We employ two metrics to compare the algorithms: run time and structural errors. Structural errors include all error types: missing edges, extra edges, missing orientations and wrong orientations. The number of structural errors is the number of incorrect edges in the estimated model compared to the true model [7].
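To make the data-generation setup concrete, the following is a minimal sketch of sampling a dataset from such a linear SEM over a known DAG, under our reading of SEMs (1) and (2); the function names are ours, and the variables are assumed to be topologically ordered.

```python
import numpy as np

def sample_weights(parents, rng):
    """Edge weights drawn from +/-1 + N(0,1)/4, as in the paper."""
    return [rng.choice([-1.0, 1.0], size=len(p)) + rng.standard_normal(len(p)) / 4
            for p in parents]

def sample_sem(parents, weights, m, gaussian=True, rng=None):
    """Generate m cases: x_i = W_i^T Pa(X_i) + u_i, with u_i ~ N(0,1) for
    SEM (1) or Uniform(0,1) for SEM (2).  `parents[i]` lists the parent
    indices of variable i (topological order assumed)."""
    rng = rng or np.random.default_rng()
    n = len(parents)
    X = np.zeros((m, n))
    for i in range(n):
        u = rng.standard_normal(m) if gaussian else rng.random(m)
        X[:, i] = X[:, parents[i]] @ np.asarray(weights[i]) + u
    return X
```

For example, with `rng = np.random.default_rng()`, the call `sample_sem(parents, sample_weights(parents, rng), 5000, gaussian=False, rng=rng)` would produce a 5000-case dataset under SEM (2) for a given network structure.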
3.2 Experimental Results and Analyses
We first evaluate the performance of the PCB algorithm under the above two cases, with different sample sizes, thresholds and networks. Fig. 2 shows the results under SEM (2); the x-axis denotes the networks and the y-axis denotes the number of structural errors. The results for SEM (1) are omitted for lack of space. From Fig. 2, we can see that the threshold has a great effect on the performance of the PCB algorithm, and the results for the different SEMs are similar. When the dataset size is
Fig. 2. Structural errors of the PCB algorithm under SEM (2). The x-axis denotes the networks: 1. alarm, 2. barley, 3. carpo, 4. factors, 5. hailfinder, 6. insurance, 7. mildew, 8. water; the y-axis denotes the number of structural errors. The PCB algorithm was tested with different sample sizes (1000, 5000, 10000, 20000, 100000), thresholds (0, 0.1, 0.3, m, where m is the mean) and the above 8 networks.
small (1000, 5000), PCB(0.1) has the fewest structural errors on average; as the dataset size gets larger, PCB(m) and PCB(0.1) have similar performance. So, when the threshold is 0.1, the PCB algorithm achieves the best performance and has the fewest structural errors on average on almost all the networks. A zero partial correlation threshold is not the best choice for the CI test: zero partial correlation means independence, but relevance comes in different degrees, such as strong relevance and weak relevance. The threshold is hard to select and may depend on the adopted networks; we have done a series of extensive experiments and found the best threshold on average. The second experiment compares existing structure learning methods with the PCB algorithm. We adopt DAG, SC(5), SC(10), L1MB, PC(0.05), TPDA(0.05) and PCB(0.1). PCB(0.1) means running DAG-Search after PCB pruning, DAG means running DAG-Search without pruning, SC(5) and SC(10) mean running DAG-Search after SC pruning (where we set the fan-in bound to 5 and 10, respectively), and L1MB means running DAG-Search after L1MB pruning. For DAG, SC(5), SC(10) and L1MB, we use Murphy's DAGsearch implementation in the DAGLearn software (http://people.cs.ubc.ca/~murphyk/). For PC(0.05) and TPDA(0.05), we used "Causal Explorer" (http://www.dsl-lab.org/). Fig. 3 shows the structural errors and time performance of the seven algorithms on the above networks under SEM (2); the results for SEM (1) are omitted for lack of space. We give detailed analyses as follows. (1) PCB(0.1) vs. the DAG algorithm. DAG has worse performance on all the networks. The PCB algorithm achieves higher accuracy on all the networks under all
Fig. 3. Structural errors and run times under SEM (2). The seven algorithms (DAG, SC(5), SC(10), L1MB, PCB(0.1), PC(0.05), TPDA(0.05)) were tested with different sample sizes (1000, 5000, 10000, 20000, 100000) and networks (1. alarm, 2. barley, 3. carpo, 4. factors, 5. hailfinder, 6. insurance, 7. mildew, 8. water). (a) Structural errors; (b) run time.
the SEMs. For time performance, PCB(0.1) wins 5, ties 2, and loses 1 under SEM (1), and wins 5, ties 3, and loses 0 under SEM (2); the results under the two SEMs are similar. For DAG, the potential neighbors of each variable are all the other variables. In the Search step, because we set the maximum number of iterations to 2500, which may be too small, the search may terminate before finding the best DAG, so the structural errors are higher. Without the pruning step, the time performance of the DAG algorithm is also worse than that of PCB(0.1), for the following reason: the Search step examines the change in the score for each possible move, and without pruning, the number of potential neighbors for each variable is large, so the number of moves under consideration is also large and the cost of the Search step is higher. (2) PCB(0.1) vs. SC(5) and SC(10). The PCB algorithm achieves both better time performance and higher accuracy on almost all the networks under all the SEMs. The SC algorithm needs the maximum fan-in to be specified in advance; however, some nodes in the true structure may have much higher connectivity than others, so a common parameter for all nodes is not reasonable. In addition, the SC implementation in the DAGLearn software selects the top k (maximum fan-in) candidate neighbors based on the correlation coefficient and does not consider the symmetry of the correlation coefficient, which leads to redundant information among the potential neighbors and sacrifices either efficiency or performance. The PCB algorithm does not have these problems. From Section 2.3, we know that the computational complexity of calculating the correlation coefficient matrix is O(3mn²); in order to select the top k candidate neighbors, each row of the correlation coefficient matrix must be sorted, with complexity n³, so the total complexity is O(3mn² + n³), which is equal to that of the PCB Restrict step. However, the total time performance of the SC algorithm is worse than that of PCB(0.1): due to the unreasonable selection of potential neighbors and the redundant information among them, the cost of the Search step is increased. So the SC algorithm has worse time performance and accuracy. (3) PCB(0.1) vs. L1MB. The PCB algorithm achieves both better time performance and higher accuracy on all the networks under all the SEMs. The L1MB algorithm adopts the LARS algorithm to select potential neighbors. For a variable, L1MB selects the set of variables that has the best predictive accuracy as a whole; that is, L1MB evaluates the effect of a set of variables, not of a single variable. Using this method to select potential neighbors has some shortcomings: it can describe the correlation between a set of variables and a variable, not the correlation between two variables, and there may exist variables that do not belong to the selected set of potential neighbors but have strong relevance to the target variable. The partial correlation method, in contrast, can reveal the true correlation between two variables by eliminating the influences of other correlative variables. The PCB algorithm selects potential neighbors based on partial correlation and evaluates the effect of a single variable, so it is more reasonable, and the experimental results also indicate that it has fewer structural errors. PCB(0.1) also has better time performance than L1MB. From Section 2.3, we know that the time complexity of PCB is O(3mn² + n³) (n is the number of variables and
m is the number of cases). For L1MB, the time complexity of computing the L1-regularization path is O(mn²) in the Gaussian case (SEM (1) and SEM (2)) [7]. In addition, L1MB also includes computing the maximum likelihood parameters for all non-zero sets of variables encountered along this path and selecting the set of variables that achieves the highest MDL score. So L1MB has worse time performance than PCB(0.1) under all the SEMs. (4) PCB(0.1) vs. PC(0.05). PCB(0.1) achieves both better time performance and higher accuracy on all the networks under all the SEMs. The PC(0.05) algorithm has been designed for discrete variables, or imposes restrictions on which variables may be continuous. PC first identifies the skeleton of a Bayesian network and then orients the edges; however, the PC algorithm may fail to orient some edges, and in our experiments we count such edges as wrong, so the PC algorithm has more structural errors. The PC algorithm needs O(n^{k+2}) CI tests, where k is the maximum degree of any node in the true structure [13]. The time complexity of a CI test is at least O(m), so the time complexity of the PC algorithm is O(mn^{k+2}), which is exponential in the worst case, whereas the time complexity of PCB(0.1) is O(3mn² + n³). Obviously, the PC algorithm has worse time performance. (5) PCB(0.1) vs. TPDA(0.05). The PCB algorithm achieves both better time performance and higher accuracy on all the networks. TPDA has been designed for discrete variables, or imposes restrictions on which variables may be continuous, so the TPDA algorithm has more structural errors. TPDA requires at most O(n⁴) CI tests to discover the edges; in some special cases, TPDA requires only O(n²) CI tests [1]. The time complexity of a CI test is at least O(m), so the time complexity of TPDA is O(mn⁴) or O(mn²). Compared with the PC algorithm, the TPDA algorithm has better time performance; however, compared with the PCB algorithm, the time complexity of the TPDA algorithm is still high. So PCB(0.1) has better time performance than TPDA(0.05).
4 Conclusions and Future Work
The contributions of this paper are two-fold. (1) We prove that partial correlation can be used as a CI test under SEM, which includes the multivariate Gaussian distribution as a special case; we redefine strong relevance and weak relevance, and, based on a series of experiments, we find the best relevance threshold. (2) We propose the PCB algorithm; theoretical analysis and empirical results show that the PCB algorithm performs better than the other existing algorithms in both accuracy and run time. In future work, we are seeking a way of automatically determining the best threshold, and we will also extend our algorithm to higher dimensions and larger datasets.
Acknowledgement. The research has been supported by the 973 Program of China under award 2009CB326203, and the National Natural Science Foundation of China under awards 61073193
and 61070131. The authors are very grateful to the anonymous reviewers for their constructive comments and suggestions that have led to an improved version of this paper.
References

1. Cheng, J., Greiner, R., Kelly, J., Bell, D.A., Liu, W.: Learning Bayesian networks from data: An information-theory based approach. Doctoral Dissertation. Department of Computing Science, University of Alberta and Faculty of Informatics, University of Ulster, November 1 (2001)
2. Chickering, D.: Learning Bayesian networks is NP-Complete. In: AI/Stats V (1996)
3. Cooper, G.F., Herskovits, E.: A Bayesian method for the induction of probabilistic networks from data. Machine Learning 9(4), 309–347 (1992)
4. Friedman, N., Nachman, I., Peer, D.: Learning Bayesian network structure from massive datasets: The "sparse candidate" algorithm. In: UAI (1999)
5. Kleijnen, J.P.C., Helton, J.C.: Statistical analyses of scatterplots to identify important factors in large-scale simulations, 1: Review and comparison of techniques. Reliability Engineering and System Safety 65, 147–185 (1999)
6. Lam, W., Bacchus, F.: Learning Bayesian belief networks: An approach based on the MDL principle. Comp. Int. 10, 269–293 (1994)
7. Schmidt, M., Niculescu-Mizil, A., Murphy, K.: Learning Graphical Model Structure Using L1-Regularization Paths. In: Proceedings of the Association for the Advancement of Artificial Intelligence (AAAI), pp. 1278–1283 (2007)
8. Ogawa, T., Shimada, M., Ishida, H.: Relation of stiffness parameter b to carotid arteriosclerosis and silent cerebral infarction in patients on chronic hemodialysis. Int. Urol. Nephrol. 41, 739–745 (2009)
9. Pellet, J.P., Elisseeff, A.: Partial Correlation and Regression-Based Approaches to Causal Structure Learning. IBM Research Technical Report (2007)
10. Pellet, J.P., Elisseeff, A.: Using Markov Blankets for Causal Structure Learning. Journal of Machine Learning Research 9, 1295–1342 (2008)
11. Rissanen, J.: Stochastic complexity. Journal of the Royal Statistical Society, Series B 49, 223–239 (1987)
12. Scheines, R., Spirtes, P., Glymour, C., Meek, C., Richardson, T.: The TETRAD project: Constraint based aids to causal model specification. Technical report, Carnegie Mellon University, Dept. of Philosophy (1995)
13. Spirtes, P., Glymour, C., Scheines, R.: Causation, Prediction, and Search, 2nd edn. The MIT Press, Cambridge (2000)
14. Sun, Y., Negishi, M.: Measuring the relationships among university, industry and other sectors in Japan's national innovation system: a comparison of new approaches with mutual information indicators. Scientometrics 82, 677–685 (2010)
15. Tsamardinos, I., Brown, L., Aliferis, C.: The max-min hill-climbing Bayesian network structure learning algorithm. Machine Learning 65, 31–78 (2006)
16. Xu, G.R., Wan, W.X., Ning, B.Q.: Applying partial correlation method to analyzing the correlation between ionospheric NmF2 and height of isobaric level in the lower atmosphere. Chinese Science Bulletin 52(17), 2413–2419 (2007)
Predicting Friendship Links in Social Networks Using a Topic Modeling Approach Rohit Parimi and Doina Caragea Computing and Information Sciences, Kansas State University, Manhattan, KS, USA 66506 {rohitp,dcaragea}@ksu.edu
Abstract. In recent years, the number of social network users has increased dramatically. The resulting amount of data associated with users of social networks has created great opportunities for data mining problems. One data mining problem of interest for social networks is the friendship link prediction problem. Intuitively, a friendship link between two users can be predicted based on their common friends and interests. However, using user interests directly can be challenging, given the large number of possible interests. In the past, approaches that make use of an explicit user interest ontology have been proposed to tackle this problem, but the construction of the ontology proved to be computationally expensive and the resulting ontology was not very useful. As an alternative, we propose a topic modeling approach to the problem of predicting new friendships based on interests and existing friendships. Specifically, we use Latent Dirichlet Allocation (LDA) to model user interests and, thus, we create an implicit interest ontology. We construct features for the link prediction problem based on the resulting topic distributions. Experimental results on several LiveJournal data sets of varying sizes show the usefulness of the LDA features for predicting friendships. Keywords: Link Mining, Topic Modeling, Social Networks, Learning.
1 Introduction
Social networks such as MySpace, Facebook, Orkut, LiveJournal and Bebo have attracted millions of users [1], with some of these networks growing at a rate of more than 50 percent during the past year [2]. Recent statistics suggest that social networks have overtaken search engines in terms of usage [3], which shows how Internet users have integrated social networks into their daily practices. Many social networks, including the LiveJournal online service [4], are focused on user interactions. Users in LiveJournal can tag other users as their friends. In addition to tagging friends, users can also specify their demographics and interests in this social network. We can see LiveJournal as a graph structure with users (along with their specific information, e.g., user interests) corresponding to nodes in the graph and edges corresponding to friendship links between the users. In general, the graph corresponding to a social network is undirected. However,
in LiveJournal, the edges are directed, i.e., if a user 'A' specifies another user 'B' as a friend, user 'B' does not necessarily specify user 'A' as a friend. One desirable feature of an online social network is to be able to suggest potential friends to its users [8]. This task is known as the link prediction problem, where the goal is to predict the existence of a friendship link from user 'A' to user 'B'. The large amounts of social network data accumulated in recent years have made the link prediction problem possible, although very challenging. In this work, we aim at using the ability of machine learning algorithms to take advantage of the content (data from user profiles) and graph structure of social network sites, e.g., LiveJournal, to predict friendship links. User profiles in such social networks consist of data that can be processed into useful information. For example, interests specified by users of LiveJournal act as good indicators of whether two users can be friends or not. Thus, if two users 'A' and 'B' have similar interests, then there is a good chance that they can be friends. However, the number of interests specified by users can be very large, and similar interests need to be grouped semantically. To achieve this, we use a topic modeling approach. Topic models provide an easy and efficient way of capturing the semantics of user interests by grouping them into categories, also known as topics, thus reducing the dimensionality of the problem. In addition to using user interests, we also take advantage of the graph structure of the LiveJournal network and extract graph information (e.g., mutual friends of two users) that is helpful for predicting friendship links [9]. The contributions of this paper are as follows: (i) an approach for applying topic modeling techniques, specifically LDA, on user profile data in a social network; and (ii) experimental results on LiveJournal datasets showing that a) the best performance results are obtained when information from interest topic modeling is combined with information from the network graph of the social network, and b) the performance of the proposed approach improves as the number of users in the social network increases. The rest of the paper is organized as follows: We discuss related work in Section 2. In Section 3, we review topic modeling techniques and Latent Dirichlet Allocation (LDA). We provide a detailed description of our system's architecture in Section 4 and present the experimental design and results in Section 5. We conclude the paper with a summary and discussion in Section 6.
2 Related Work
Over the past decade, social network sites have attracted many researchers as sources of interesting data mining problems. Among such problems, the link prediction problem has received a lot of attention in the social network domain and also in other graph-structured domains. Hsu et al. [9] have considered the problems of predicting, classifying, and annotating friendship relations in a social network, based on the network structure and user profile data. Their experimental results suggest that features constructed from the network graph and user profiles of LiveJournal can be effectively used for predicting friendships. However, the interest features proposed in [9] (specifically, counts of individual interests and the common interests
of two users) do not capture the semantics of the interests. As opposed to that, in this work, we create an implicit interest ontology to identify the similarity between interests specified by users and use this information to predict unknown links. A framework for modeling link distributions, taking into account object features and link features is also proposed in [5]. Link distributions describe the neighborhood of links around an object and can capture correlations among links. In this context, the authors have proposed an Iterative Classification Algorithm (ICA) for link-based classification. This algorithm uses logistic regression models over both links and content to capture the joint distributions of the links. The authors have applied this approach on web and citation collections and reported that using link distribution improved accuracy in both cases. Taskar et al. [8] have studied the use of a relational Markov network (RMN) framework for the task of link prediction. The RMN framework is used to define a joint probabilistic model over the entire link graph, which includes the attributes of the entities in the network as well as the links. This method is applied to two relational datasets, one involving university web pages, and the other a social network. The authors have reported that the RMN approach significantly improves the accuracy of the classification task as compared to a flat model. Castillo et al. [7] have also shown the importance of combining features computed using the content of web documents and features extracted from the corresponding hyperlink graph, for web spam detection. In their approach, several link-based features (such as degree related measures) and various ranking schemes are used together with content-based features such as corpus precision and recall, query precision, etc. Experimental results on large public datasets of web pages have shown that the system was accurate in detecting spam pages. Caragea et al. [10], [11] have studied the usefulness of a user interest ontology for predicting friendships, under the assumption that ontologies can provide a crisp semantic organization of the user information available in social networks. The authors have proposed several approaches to construct interest ontologies over interests of LiveJournal users. They have reported that organizing user interests in a hierarchy is indeed helpful for predicting links, but computationally expensive in terms of both time and memory. Furthermore, the resulting ontologies are large, making it difficult to use concepts directly to construct features. With the growth of data on the web, as new articles, web documents, social networking sites and users are added daily, there is an increased need to accurately process this data for extracting hidden patterns. Topic modeling techniques are generative probabilistic models that have been successfully used to identify inherent topics in collections of data. They have shown good performance when used to predict word associations, or the effects of semantic associations on a variety of language-processing tasks [12], [13]. Latent Dirichlet Allocation (LDA) [15] is one such generative probabilistic model used over discrete data such as text corpora. LDA has been applied to many tasks such as word sense disambiguation [16], named entity recognition [17], tag recommendation [18], community recommendation [19], etc. In this work, we apply LDA
on user profile data with the goal of producing a reduced set of features that capture user interests and improve the accuracy of the link prediction task in social networks. To the best of our knowledge, LDA has not been used for this problem before.
3 Topic Modeling and Latent Dirichlet Allocation (LDA)
Topic models [12], [13] provide a simple way to analyze and organize large volumes of unlabeled text. They express semantic properties of words and documents in terms of probabilistic topics, which can be seen as latent structures that capture semantic associations among words/documents in a corpus. Topic models treat each document in a corpus as a distribution over topics and each topic as a distribution over words. A topic model, in general, is a generative model, i.e., it specifies a probabilistic way in which documents can be generated. One such generative model is Latent Dirichlet Allocation, introduced by Blei et al. [15]. LDA models a collection of discrete data such as text corpora. Figure 1 (adapted from [15]) illustrates a simplified graphical model representing LDA. We assume that the corpus consists of M documents denoted by D = {d1, d2, ..., dM}. Each document di in the corpus is defined as a sequence of Ni words denoted by di = (wi1, wi2, ..., wiNi), where each word wij belongs to a vocabulary V. A word in a document di is generated by first choosing a topic zij according to a multinomial distribution and then choosing a word wij according to another multinomial distribution, conditioned on the topic zij. Formally, the generative process of the LDA model can be described as follows [15]:
1. Choose the topic distribution θi ∼ Dirichlet(α).
2. For each of the Ni words wij:
   (a) Choose a topic zij ∼ Multinomial(θi).
   (b) Choose a word wij from p(wij | zij, β) (a multinomial conditioned on zij).
From Figure 1, we can see that the LDA model has a three-level representation. The parameters α and β are corpus-level parameters, in the sense that they are assumed to be sampled once in the process of generating a corpus. The variables θi are document-level variables sampled once per document and the
Fig. 1. Graphical representation of the LDA model
variables zij and wij are at the word level. These variables will be sampled once for each word in each document. For the work in this paper, we have used the LDA implementation available in MALLET, A Machine Learning for Language Toolkit [20]. MALLET uses Gibbs sampling for parameter estimation.
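The paper's pipeline relies on MALLET's Gibbs-sampling LDA; purely as an illustration of the same idea, the sketch below builds one bag-of-interests "document" per user and estimates per-user topic distributions with the gensim library instead (a substitution, not the authors' toolchain). The user names and interest tokens are made up.

```python
from gensim import corpora, models

# Hypothetical "user documents": one list of preprocessed interest tokens per user.
user_docs = {
    "userA": ["ArtificialNeuralNetworks", "DataMining", "Hiking"],
    "userB": ["DataMining", "MachineLearning", "Photography"],
    "userC": ["Photography", "Travel", "Hiking"],
}

docs = list(user_docs.values())
dictionary = corpora.Dictionary(docs)            # vocabulary of interests
corpus = [dictionary.doc2bow(d) for d in docs]   # bag-of-interests per user

num_topics = 20                                  # the paper varies this from 20 to 200
lda = models.LdaModel(corpus, id2word=dictionary,
                      num_topics=num_topics, random_state=0)

def topic_vector(user):
    """Dense topic distribution of length num_topics for one user."""
    bow = dictionary.doc2bow(user_docs[user])
    dist = lda.get_document_topics(bow, minimum_probability=0.0)
    return [prob for _, prob in sorted(dist)]

theta_A = topic_vector("userA")   # used later to build pairwise link features
```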
4 System Architecture
As can be seen in Figure 2, the architecture of the system that we have designed is divided into two modules. The first module of the system is focused on identifying and extracting features from the interests expressed by each user of LiveJournal. These features are referred to as interest based features. The second module uses the graph network (formed as a result of users tagging other users in the network as 'friends') to compute features which have been shown to be helpful at the task of predicting friendship links in LiveJournal [9]. We call these features graph based features. We use both types of features as input to learning algorithms (as shown in Section 5). Sections 4.1 and 4.2 describe in detail the construction of interest based and graph based features, respectively.
Fig. 2. Architecture of the system used for link prediction
4.1 Interest Based Features
Each user in a social network has a profile that contains information characteristic of himself or herself. Users most often describe themselves, their likes, dislikes, and interests/hobbies in their profiles. For example, users of LiveJournal can specify their demographics and interests, along with tagging other users of the social network as friends. Data from the user profiles can be processed into
useful information for predicting/recommending potential friends to the users. In this work, we use a topic modeling technique to capture semantic information associated with the user profiles, in particular, with interests of LiveJournal users. Interests of the users act as good indicators of whether they can be friends or not. The intuition behind interest based features is that two users 'A' and 'B' might be friends if 'A' and 'B' have some similar interests. We try to capture this intuition through the feature set that we construct using the user interests. Our goal is to organize interests into "topics". To do that, we model user interests in LiveJournal using LDA by treating LiveJournal as a document corpus, with each user in the social network representing a "document". Thus, interests specified by each user form the content of the "user document". We then run the MALLET implementation of LDA on the collection of such user documents. LDA allows us to input the number of inherent topics to be identified in the collection used. In this work, we vary the number of topics from 20 to 200. In general, the smaller the number of topics, the more abstract the identified topics will be. Similarly, the larger the number of topics, the more specific the identified topics will be. Thus, by varying the number of topics, we are implicitly simulating a hierarchical ontology: a particular number of topics can be seen as a cut through the ontology. The topic probabilities obtained as a result of modeling user interests with LDA provide an explicit representation of each user and are used to construct the interest based features for the friendship prediction task, as described in what follows: suppose that A[1..n] represents the topic distribution for user 'A' and B[1..n] represents the topic distribution for user 'B' at a particular topic level n. The feature vector F(A, B) for the user pair (A, B) is constructed as: F(A, B) = (|A[1]−B[1]|, |A[2]−B[2]|, ..., |A[n]−B[n]|). This feature vector is meant to capture the intuition that the smaller the difference between the topic distributions, the more semantically related the interests are.
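A minimal sketch of this feature construction, assuming the per-user topic distributions have already been computed (for instance by the LDA step sketched above):

```python
def interest_features(theta_a, theta_b):
    """F(A, B) = (|A[1]-B[1]|, ..., |A[n]-B[n]|) at a fixed topic level n."""
    assert len(theta_a) == len(theta_b)
    return [abs(a - b) for a, b in zip(theta_a, theta_b)]

# Hypothetical 4-topic distributions for two users; small element-wise
# differences suggest semantically related interests.
print(interest_features([0.7, 0.1, 0.1, 0.1], [0.6, 0.2, 0.1, 0.1]))
```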
4.2 Graph Based Features
Previous work by Hsu et al. [9] and Caragea et al. [10], [11], among others, has shown that the graph structure of the LiveJournal social network acts as a good source of information for predicting friendship links. In this work, we follow the method described in [9] to construct graph-based features. For each user pair (A, B) in the network graph, we calculate the in-degree of 'A', the in-degree of 'B', the out-degree of 'A', the out-degree of 'B', the number of mutual friends of 'A' and 'B', and the backward deleted distance from 'B' to 'A' (see [9] for detailed descriptions of these features).
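As a rough sketch of the degree-based part of these features (the backward deleted distance of [9] is omitted, and "mutual friends" is read here as common out-neighbors, which is only one possible reading of [9]):

```python
import networkx as nx

# Hypothetical directed friendship graph: an edge (u, v) means u tagged v as a friend.
G = nx.DiGraph([("A", "B"), ("B", "C"), ("C", "A"), ("A", "C"), ("D", "A")])

def graph_features(G, a, b):
    """Degree and mutual-friend features for the candidate link a -> b."""
    mutual = len(set(G.successors(a)) & set(G.successors(b)))
    return {
        "in_deg_a": G.in_degree(a),
        "in_deg_b": G.in_degree(b),
        "out_deg_a": G.out_degree(a),
        "out_deg_b": G.out_degree(b),
        "mutual_friends": mutual,
    }

print(graph_features(G, "A", "B"))
```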
5 Experimental Design and Results
This section describes the dataset used in this work and the experiments designed to evaluate our approach of using LDA for the link prediction task. We have conducted various experiments with several classifiers to investigate their performance at predicting friendship links between the users of LiveJournal.
5.1 Dataset Description and Preprocessing
We used three subsets of the LiveJournal dataset with 1000, 5000 and 10,000 users, respectively, to test the performance and scalability of our approach. As part of the preprocessing step, we clean the interest set to remove symbols, numbers, and foreign-language terms. Interests with frequency less than 5 in the dataset are also removed. Strings of words in a multi-word interest are concatenated into a single "word," so that MALLET treats them as a single entity. For example, the interest 'artificial neural networks' is transformed into 'ArtificialNeuralNetworks' after preprocessing. Users whose in-degree and out-degree are zero, as well as users who have not declared any interests, are removed from the dataset. We are left with 801, 4026 and 8107 users in the three datasets, respectively, and approximately 14,000, 32,000 and 39,700 interests for each dataset after preprocessing. Furthermore, there are around 4,400, 40,000 and 49,700 declared friendship links in the three datasets. We generate topic distributions for the users in the dataset using LDA; the hyper-parameters (α, β) are set to the default values. We make the assumption that the graph is complete, i.e., all declared friendship links are positive examples and all non-declared friendships are negative examples [10], although this assumption does not hold in the real world. The user network graph is partitioned into two subsets, with 2/3 of the users in the first set and 1/3 of the users in the second set (this process is repeated five times for cross-validation purposes). We used the subset with 2/3 of the users for training and the subset with 1/3 of the users for testing. We ensure that the training and the test datasets are independent by removing the links that go across the two datasets. We also balance the data in the training set, as the original distribution is highly skewed towards the negative class.
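A sketch of this preprocessing, under simplifying assumptions (the exact cleaning rules and the input format are not specified beyond the description above, so both are hypothetical here):

```python
import re
from collections import Counter

def preprocess_interests(raw_interests_per_user, min_freq=5):
    """Clean raw interest strings and drop rare interests.

    `raw_interests_per_user` maps a user id to a list of raw interest strings
    (a hypothetical input format)."""
    cleaned = {}
    for user, interests in raw_interests_per_user.items():
        tokens = []
        for interest in interests:
            words = re.findall(r"[A-Za-z]+", interest)   # drop symbols, numbers, non-Latin text
            if words:
                # 'artificial neural networks' -> 'ArtificialNeuralNetworks'
                tokens.append("".join(w.capitalize() for w in words))
        cleaned[user] = tokens

    counts = Counter(t for tokens in cleaned.values() for t in tokens)
    cleaned = {u: [t for t in tokens if counts[t] >= min_freq]
               for u, tokens in cleaned.items()}
    # Users left without interests would also be dropped, as described above.
    return {u: tokens for u, tokens in cleaned.items() if tokens}
```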
5.2 Experiments
The following experiments have been performed in this work.
1. Experiment 1: In the first experiment, we test the performance of several predictive models trained on interest features constructed from topic distributions. The number of topics to be modeled is varied from 20 to 200. The 1000 user dataset described above is used in this experiment.
2. Experiment 2: In the second experiment, we test several predictive models that are trained on graph features, for the 1000 user dataset. To be able to construct the graph features for test data, we assume that a certain percentage of links is known [8] (note that this is a realistic assumption, as it is expected that some friends are already known for each user). Specifically, we explore scenarios where 10%, 25% and 50% of the links are known, respectively. Thus, we construct features for the unknown links using the known links.
3. Experiment 3: In the third experiment, graph based features are used in combination with interest-based features to see if they can improve the performance of the models trained with graph features only on the 1000 user dataset. For the test set, graph features constructed by assuming 10%, 25% and 50% known links, respectively, are combined with interest features.
We repeat the above-mentioned experiments for the 5000 user dataset. The corresponding experiments are referred to as Experiment 4, Experiment 5 and Experiment 6, respectively. For the 10,000 user dataset, we build predictive models using just interest based features (constructing graph features for the 10,000 user dataset was computationally infeasible, given our resources). This experiment is referred to as Experiment 7. We use results from Experiments 1, 4 and 7 to study the performance and the scalability of the LDA approach to link prediction based on interests, as the number of users increases. For all the experiments, we used WEKA implementations of the Logistic Regression, Random Forest and Support Vector Machine (SVM) algorithms.
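The paper uses the WEKA implementations of these classifiers; the following sketch substitutes scikit-learn only to illustrate the training and AUC evaluation loop. The feature matrices are random placeholders standing in for the interest/graph features built earlier.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X_train, y_train = rng.random((200, 26)), rng.integers(0, 2, 200)   # placeholder features/labels
X_test, y_test = rng.random((100, 26)), rng.integers(0, 2, 100)

classifiers = {
    "LR": LogisticRegression(max_iter=1000),
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
    "SVM": SVC(probability=True, random_state=0),
}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
    print(name, round(auc, 3))
```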
5.3 Results
Importance of the Interest Features for Predicting Friendship Links. As mentioned above, several experiments have been conducted to test the usefulness of the topic modeling approach on user interests for the link prediction problem in LiveJournal. As expected, interest features (i.e., topic distributions obtained by modeling user interests) combined with graph features produced the most accurate models for the prediction task. This can be seen from Tables 1 and 2. In both tables, we can see that interest+graph features with 50% known links outperform interest or graph features alone in terms of AUC values (all AUC values reported are averaged over five different train and test datasets), for all three classifiers used. Interesting results can be seen in Table 2, where interest features alone are better than graph features alone when only 10% of the links are known, and sometimes better also than interest+graph features with 10% links known, thus showing the importance of the user profile data, captured by LDA, for link prediction in social networks. Furthermore, a comparison between our results and the results presented in [21], which uses an ontology-based approach to construct interest features, shows that the LDA features are better than the ontology features on the 1,000 user dataset. As another drawback, the ontology based approach is not scalable (no more than 4,000 users could be used) [21]. Figure 3 depicts the AUC values obtained using interest, graph and interest+graph features with Logistic Regression and SVM classifiers across all numbers of topics modeled, for the 1,000 and 5,000 user datasets, respectively. We can see that the AUC value obtained using interest+graph features is better than the corresponding value obtained using graph features alone across all numbers of topics, for all scenarios of known links, in the case of the 5000 user dataset. This shows that the contribution of interest features increases with the number of users. Also based on Figure 3, it is worth noting that the graphs do not show significant variation with the number of topics used. Performance of the Proposed Approach with the Number of Users. In addition to studying the importance of the LDA interest features for the link prediction task, we also study the performance and scalability of the approaches considered in this work (i.e., graph-based versus LDA interest based, and combinations) as the number of users increases. We are interested in both a) the
Table 1. AUC values for Logistic Regression (LR), Random Forests (RF) and Support Vector Machines (SVM) classifiers with interest, graph and interest+graph based features for the 1,000 user dataset. k% links are known in the test set, where k is 10, 25 and 50, respectively. The known links are used to construct graph features.

Exp#     Features               LR              RF              SVM
1        Interest               0.625 ± 0.03    0.5782 ± 0.04   0.6198 ± 0.04
2 (10%)  Graph 10%              0.74 ± 0.08     0.578 ± 0.04    0.7738 ± 0.05
3 (10%)  Interest+Graph 10%     0.6226 ± 0.05   0.6664 ± 0.04   0.6606 ± 0.02
2 (25%)  Graph 25%              0.7684 ± 0.07   0.7106 ± 0.05   0.8104 ± 0.05
3 (25%)  Interest+Graph 25%     0.7406 ± 0.04   0.8188 ± 0.03   0.7983 ± 0.03
2 (50%)  Graph 50%              0.8526 ± 0.03   0.8008 ± 0.03   0.8692 ± 0.03
3 (50%)  Interest+Graph 50%     0.8648 ± 0.03   0.877 ± 0.04    0.8918 ± 0.03
Table 2. AUC values similar to those in Table 1, for the 5,000 user dataset.

Exp#     Features               LR              RF              SVM
4        Interest               0.6954 ± 0.01   0.6276 ± 0.01   0.7008 ± 0.01
5 (10%)  Graph 10%              0.649 ± 0.03    0.5936 ± 0.02   0.692 ± 0.02
6 (10%)  Interest+Graph 10%     0.6718 ± 0.02   0.6566 ± 0.01   0.6998 ± 0.01
5 (25%)  Graph 25%              0.7022 ± 0.05   0.6716 ± 0.02   0.7896 ± 0.03
6 (25%)  Interest+Graph 25%     0.7384 ± 0.03   0.7846 ± 0.03   0.7986 ± 0.03
5 (50%)  Graph 50%              0.8456 ± 0.02   0.7086 ± 0.02   0.883 ± 0.02
6 (50%)  Interest+Graph 50%     0.8696 ± 0.02   0.8908 ± 0.02   0.9046 ± 0.01
quality of the predictions that we get for the LiveJournal data as the number of users increases; and b) the time and memory requirements for each approach. From Figure 4, we can see that the prediction performance (expressed in terms of AUC values) is improved in the 5,000 user dataset as compared to the 1,000 user dataset, across all numbers of topics modeled. Similarly, the prediction performance for the 10,000 user dataset is better than the performance for the 5,000 user dataset, for all topics from 20 to 200. One reason for better predictions with more users in the dataset is that, when we add more users, we also add the interests specified by the newly added users to the interest set on which topics are modeled using LDA. Thus, we get better LDA probability estimates for the topics associated with each user in the dataset, as compared to the estimates that we had for a smaller set of data, and hence better prediction results. However, as expected, both the amount of time it takes to compute features for the larger dataset and the memory required increase with the number of users in the data set. The amount of time it took to construct features for the 10,000 user dataset, for all numbers of topics modeled in the experiments, is around 14 hours on a system with an Intel Core 2 Duo processor running at 3.16GHz and 20GB of RAM. This time requirement is due to our complete graph assumption (which results in feature construction for 10,000*10,000 user pairs in the case of a 10,000 user dataset) and can be reduced if we relax the completeness assumption. Still, the LDA feature construction is more efficient than the construction of graph features, which was not possible for the 10,000 user dataset used in our study.
Fig. 3. Graph of reported AUC values versus number of topics used for modeling, using Logistic Regression and SVM classifiers, for the 1,000 user dataset (top-left and top-right, respectively) and 5,000 user dataset (bottom-left and bottom-right, respectively)
Fig. 4. AUC values versus number of topics for LR (left) and SVM (right) classifiers for the 1,000, 5,000 and 10,000 user datasets using interest-based features
6 Summary and Discussion
We have proposed an architecture, which takes advantage of both user profile data and network structure to predict friendship links in a social network. We have shown how one can model topics from user profiles in social networks using
LDA. Experimental results suggest that the usefulness of the interest features constructed using the LDA approach increases with an increase in the number of users. Furthermore, the results suggest that the LDA based interest features can help improve the prediction performance when used in combination with graph features, in the case of the LiveJournal dataset. Although in some cases the improvement in performance due to interest features is not very significant compared with the performance when graph features alone are used, the fact that computation of graph features becomes intractable for 10,000 users or beyond emphasizes the importance of the LDA based approach. However, while the proposed approach is effective and shows improvement in performance as the number of users increases, it also suffers from some limitations. First, adding more users to the dataset increases the memory and time requirements. Thus, as part of the future work, we plan to take advantage of the MapReduce framework to support distributed computing for large datasets. Secondly, our approach takes into account the static image of the LiveJournal social network. Obviously, this assumption does not hold in the real world. Based on user interactions in the social network, the graph might change rapidly due to the addition of more users as well as friendship links. Also, users may change their demographics and interests regularly. Our approach does not take into account such changes. Hence, the architecture of the proposed approach has to be changed to accommodate the dynamic nature of a social network. We also speculate that the approach of modeling user profile data using LDA will be effective for tasks such as citation recommendation in scientific document networks, identifying groups in online scientific communities based on their research/tasks, and recommending partners in Internet dating; these ideas are left as future work.
References
1. Boyd, M.D., Ellison, B.N.: Social Network Sites: Definition, History, and Scholarship. Journal of Computer-Mediated Communication 13 (2007)
2. comScore Press Release, http://www.comscore.com/Press Events/Press Releases/2007/07/Social Networking Goes Globa
3. TechCrunch Report, http://eu.techcrunch.com/2010/06/08/report-socialnetworks-overtake-search-engines-in-uk-should-google-be-worried
4. Fitzpatrick, B.: LiveJournal: Online Service, http://www.livejournal.com
5. Getoor, L., Lu, Q.: Link-based Classification. In: Twelfth International Conference on Machine Learning (ICML 2003), Washington DC (2003)
6. Na, J.C., Thet, T.T.: Effectiveness of web search results for genre and sentiment classification. Journal of Information Science 35(6), 709–726 (2009)
7. Castillo, C., Donato, D., Gionis, A., Murdock, V., Silvestri, F.: Know your Neighbors: Web Spam Detection using the web Topology. In: Proceedings of SIGIR 2007, Amsterdam, Netherlands (2007)
8. Taskar, B., Wong, M., Abbeel, P., Koller, D.: Link Prediction in Relational Data. In: Proc. of 17th Neural Information Processing Systems, NIPS (2003)
9. Hsu, H.W., Weninger, T., Paradesi, R.S.M., Lancaster, J.: Structural link analysis from user profiles and friends networks: a feature construction approach. In: Proceedings of International Conference on Weblogs and Social Media (ICWSM), Boulder, CO, USA (2007)
10. Caragea, D., Bahirwani, V., Aljandal, W., Hsu, H.W.: Link Mining: Ontology-Based Link Prediction in the LiveJournal Social Network. In: Proceedings of Association of the Advancement of Artificial Intelligence, pp. 192–196 (2009)
11. Haridas, M., Caragea, D.: Link Mining: Exploring Wikipedia and DMoz as Knowledge Bases for Engineering a User Interests Hierarchy for Social Network Applications. In: Proceedings of the Confederated International Conferences on On the Move to Meaningful Internet Systems: Part II, Portugal, pp. 1238–1245 (2009)
12. Steyvers, M., Griffiths, T.: Probabilistic Topic Models. In: Landauer, T., Mcnamara, D., Dennis, S., Kintsch, W. (eds.) Handbook of Latent Semantic Analysis. Lawrence Erlbaum Associates, Mahwah (2007)
13. Steyvers, M., Griffiths, T., Tenenbaum, J.B.: Topics in Semantic Representation. American Psychological Association 114(2), 211–244 (2007)
14. Steyvers, M., Griffiths, T.: Finding Scientific Topics. Proceedings of National Academy of Sciences, U.S.A, 5228–5235 (2004)
15. Blei, D., Ng, A.Y., Jordan, M.I.: Latent Dirichlet Allocation. Journal of Machine Learning Research 3, 993–1022 (2003)
16. Blei, D., Boyd-Graber, J., Zhu, X.: A Topic Model for Word Sense Disambiguation. In: Proc. of the 2007 Joint Conf. on Empirical Methods in Natural Language Processing and Comp. Natural Language Learning, pp. 1024–1033 (2007)
17. Guo, J., Xu, G., Cheng, X., Li, H.: Named Entity Recognition in Query. In: Proceedings of SIGIR 2009, Boston, USA (2009)
18. Krestel, R., Fankhauser, P., Nejdl, W.: Latent Dirichlet Allocation for Tag Recommendation. In: Proceedings of RecSys 2009, New York, USA (2009)
19. Chen, W., Chu, J., Luan, J., Bai, H., Wang, Y., Chang, Y.E.: Collaborative Filtering for Orkut Communities: Discovery of User Latent Behavior. In: Proceedings of International World Wide Web Conference (2009)
20. McCallum, A.K.: Mallet: A Machine Learning for Language Toolkit (2002), http://mallet.cs.umass.edu
21. Phanse, S.: Study on the Performance of Ontology Based Approaches to Link Prediction in Social Networks as the Number of Users Increases. M.S. Thesis (2010)
Info-Cluster Based Regional Influence Analysis in Social Networks
Chao Li1,2,3, Zhongying Zhao1,2,3, Jun Luo1, and Jianping Fan1
1 Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China
2 Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100080, China
3 Graduate School of Chinese Academy of Sciences, Beijing 100080, China
{chao.li1,zy.zhao,jun.luo,jp.fan}@siat.ac.cn
Abstract. Influence analysis and expert finding have received a great deal of attention in social networks. Most existing works, however, aim to maximize influence based on community structure in social networks. They ignore location information, which often carries abundant information about individuals or communities. In this paper, we propose Info-Cluster, an innovative concept to describe how information originating from a location cluster propagates within or between communities. According to this concept, we propose a framework for identifying Info-Clusters in social networks, which uses both location information and community structure. Taking the location information into consideration, we first adopt the K-Means algorithm to find location clusters. Next, we identify the communities for the whole network data set. Given the location clusters and communities, we present the information propagation based Info-Cluster detection algorithm. Experiments on Renren networks show that our method can reveal many meaningful results for regional influence analysis.
1 Introduction
Web-based social networks have attracted more and more research efforts in recent years. In particular, community detection is one of the major directions in social network analysis, where a community can be simply defined as a group of objects sharing some common properties. Nowadays, with the rapid development of positioning techniques (e.g., GPS), one can easily collect and share his/her positions. Furthermore, with a large amount of shared positions or trajectories, individuals expect to form their social networks based on positions. On the other hand, a social network, the graph of relationships and interactions within a group of individuals, plays a fundamental role as a medium for disseminating information, ideas, and influence among its members. Most people consider the problem of how to maximize influence propagation in social
networks, by targeting certain influential individuals that have the potential to influence many others. This problem has attracted some recent attention due to potential applications in viral marketing, which is based on the idea of leveraging existing social structures for word-of-mouth advertising of products [4,9]. However, here we consider a related problem of maximizing influence propagation in networks. We propose Info-Cluster, an innovative concept to describe how effectively information originating from a location cluster propagates within or between communities. According to this concept, we propose a framework for identifying Info-Clusters in social networks, which uses both location information and community structure. Taking the location information into consideration, we first adopt the K-Means algorithm [10] to find location clusters. Next, we identify the communities for the whole network data set. Given the location clusters and communities [3], we present the information propagation based Info-Cluster detection algorithm (IPBICD). The paper is organized as follows. We review related work in Section 2. In Section 3, we first present the data model with locations taken into consideration; then we formulate the Info-Cluster detection problem and propose our framework. Section 4 details the main algorithms. Experiments on the Renren data set are shown in Section 5. Finally, we conclude the paper in Section 6.
2 Related Work
The success of large-scale online social network sites, such as Facebook and Twitter, has attracted a large number of researchers. Many of them focus on modeling the information diffusion patterns within social networks. Domingos and Richardson [4] were the first to study information influence in social networks; they used probabilistic theory to maximize influence in a social network. Kempe, Kleinberg and Tardos [8] were the first group to formulate the problem as a discrete optimization problem. They proposed the independent cascade model, the weighted cascade model, and the linear threshold model. Chen et al. [2] collected a blog dataset to identify five features (namely the number of friends, popularity of participants, number of participants, time elapsed since the genesis of the cascade, and citing factor of the blog) that may play an important role in predicting blog cascade affinity, so as to identify the most easily influenced bloggers. However, since the influence cascade models are different, they do not directly address the efficiency issue of the greedy algorithms for the cascade models studied in [1]. With the growth of the web and social networks, community mining (community detection) has become of great importance. In a social network graph, a community has high concentrations of edges within particular groups of vertices and low concentrations between these groups [6]. Galstyan and Musoyan [5] show that simple strategies that work well for homogeneous networks can be overly sub-optimal, and suggest a simple modification for improving the performance by taking into account the community structure. Spatial clustering is the process of grouping a set of objects into classes or clusters so that objects within a cluster are close to each other, but are far away
to objects in other clusters. As a branch of statistics, spatial clustering algorithms have been studied extensively for many years. According to Han et al. [7], those algorithms can be categorized into four categories: density-based algorithms, algorithms for hierarchical clustering, grid-based algorithms, and partitioning algorithms. As we all know, individuals in social networks have spatial locations. Therefore, we present an innovative concept to describe information propagation in social networks, taking into account individual locations and the community structure of the social network.
3 Frameworks
A large volume of work has been done on community discovery, as discussed above. Most of it, however, ignores the location information of individuals. Location information often plays a very important role in community formation and evolution and therefore deserves attention. In this paper, we take the locations of individuals into consideration to guide the detection of Info-Clusters. In this section, we first give our model for social networks with location information, and then we present the problem formulation. The framework of our solution is described in Section 3.3.
3.1 Modeling the Social Network Data
Taking the location into consideration, we model the social network data as an undirected graph (see Fig. 1), denoted by G = (V, Ev, L, El), where:
– V is the set of individuals, V = {v1, v2, ..., vm}. We use a circle to represent each individual.
– Ev is the set of edges that represent the interactions between individuals. We use solid lines to represent edges.
– L is the set of individuals' locations or positions, L = {l1, l2, ..., ln}. We use a square to represent each location.
– El is the set of links which refer to the associations between individuals and their locations. One location often corresponds to many individuals, which means these individuals belong to the same location. We use dotted lines to represent these links.
3.2 Problem Formulation
In this part, we first give some definitions to formulate the problem, and then we present four key steps for regional influence analysis.
Definition 1. Location-Cluster-Set (Slc): Slc = {LC1, LC2, ..., LCk}, where LCi is a cluster of locations resulting from a spatial clustering algorithm, and LCi ∩ LCj = ∅ (i, j = 1, 2, ..., k, i ≠ j).
Definition 2. Communities-Set (Scom): Scom = {Com1, Com2, ..., Comp}, where Comi is a community identified by a community detection method, and Comi ∩ Comj = ∅ (i, j = 1, 2, ..., p, i ≠ j).
Fig. 1. The model of social networks with location information. In this example, there are 12 individuals that belong to 5 locations. The individuals are connected with each other through 17 edges.
Definition 3. Capital of Info-Cluster (Scapital): Scapital = {Scapital^1, Scapital^2, ..., Scapital^k}, where k denotes the number of location clusters and Scapital^i represents the set of individuals whose locations belong to LCi: Scapital^i = {vj | L(vj) ∈ LCi}, where L(vj) denotes the location of vj.
Definition 4. Influence of Info-Cluster (Sinfluence): Sinfluence = {Sinfluence^1, Sinfluence^2, ..., Sinfluence^k}, where k denotes the number of location clusters and Sinfluence^i represents the set of individuals who learn the information from active individuals and are activated. The information is created or obtained by the individuals of Scapital^i, who are initially active.
Definition 5. Know of Info-Cluster (Sknow): Sknow = {Sknow^1, Sknow^2, ..., Sknow^k}, where k denotes the number of location clusters and Sknow^i represents the set of individuals who learn the information from active individuals in Sinfluence^i but remain inactive.
Definition 6. Info-Cluster (SInfo-Cluster): SInfo-Cluster = {SIC^1, SIC^2, ..., SIC^k}, where k denotes the number of location clusters and SIC^i = Scapital^i ∪ Sinfluence^i ∪ Sknow^i.
Definition 7. Covering Rate (CR) and Average Covering Rate (ACR): CR(LCi) = |SIC^i| / |V|, and ACR = (1/K) Σ_{i=1}^{K} CR(LCi).
Definition 8. Influence Power (IP): IP(LCi) = |SInfluence^i| / |SCapital^i|.
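Definitions 7 and 8 amount to simple set-size ratios; a small sketch with made-up sets:

```python
def covering_rate(info_cluster, all_individuals):
    """CR(LC_i) = |S_IC^i| / |V|."""
    return len(info_cluster) / len(all_individuals)

def average_covering_rate(info_clusters, all_individuals):
    """ACR = (1/K) * sum over i of CR(LC_i)."""
    return sum(covering_rate(ic, all_individuals) for ic in info_clusters) / len(info_clusters)

def influence_power(influence_set, capital_set):
    """IP(LC_i) = |S_influence^i| / |S_capital^i|."""
    return len(influence_set) / len(capital_set)

# Hypothetical sets for one location cluster in a 12-individual network:
V = set(range(12))
capital, influence, know = {0, 1}, {2, 3, 4}, {5}
info_cluster = capital | influence | know
print(covering_rate(info_cluster, V), influence_power(influence, capital))   # 0.5 1.5
```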
According to the above definitions, we present the main tasks of regional influence analysis as follows:
– Location clustering: It aims to find the clusters (Slc) of locations through some spatial clustering algorithm.
– Community detection: Community detection in a social network aims to find groups (Scom) of vertices within which connections are dense, but between which connections are sparse.
– Info-Cluster identification: This process focuses on the identification of Info-Clusters based on influence propagation within and between communities. With the location clusters (Slc) we can get the corresponding Scapital, but how to use influence propagation to find Sinfluence and Sknow, and then discover the Info-Cluster (SInfo-Cluster), is the third major task.
– Regional influence analysis: Analyzing the Covering Rate (CR), Average Covering Rate (ACR) and Influence Power (IP) based on SInfo-Cluster for each LCi.
3.3 Framework of Our Solutions
In this part we present the framework of our method. The whole process of Info-Cluster detection includes four steps:
1. Data Preparation: We store the data to be processed in appropriate databases, such as a spatial location database and a social network database.
2. Preprocessing: With a proper pre-processing approach, it is possible to improve performance and speed. In our framework, we use two modules, Data Conversion and Data Fusion, to process the location data and the social network. The main function of Data Conversion is to rewrite the spatial data, while Data Fusion is used to merge the location data and social network data together. At last, a new data file resulting from preprocessing is passed to step 3.
3. Algorithm Design: This is the main part of the framework, and it consists of three key components: location clustering, community detection and Info-Cluster detection (details in Section 4).
4. Result Visualization: This step aims to view and analyze the results. It contains the visualization platform and the results analysis, which are detailed in Section 5.
4 Algorithms
In this section, we describe two main algorithms, which solve the clustering and the influence propagation based Info-Cluster detection, respectively. First, the K-Means clustering algorithm is used to cluster locations, and a community detection algorithm based on modularity maximization is used to cluster individuals. Second, the influence propagation based Info-Cluster detection (IPBICD) algorithm is presented in Section 4.2.
4.1 Clustering
K-Means [10] is one of the simplest unsupervised learning algorithms and is often employed to solve the clustering problem. In this paper, we adopt the K-Means method to cluster the locations of the social network. Note that other clustering methods can also be used for location clustering; we adopt the K-Means method here only to show the feasibility of our
algorithms. The K-Means algorithm requires us to specify the parameter K, which means the number of clusters. As to the location clustering, K often reflects the scale of the locations: a larger K means a finer scale. That is, if we set K = 10, each location cluster may roughly correspond to a province or city; with increasing K, each cluster may represent a street.
Modularity [3] is used to evaluate the quality of a particular partition dividing a network into communities. It motivates various kinds of methods for detecting communities, which aim to maximize the modularity function. The modularity is defined as follows:

Q = (1/2m) Σ_vw Σ_i (A_vw − k_v k_w / 2m) δ(c_v, i) δ(c_w, i) = Σ_i (e_ii − a_i^2)    (1)

where
– v, w are vertices within V;
– i represents the i-th community;
– c_v is the community to which vertex v is assigned;
– A_vw is an element of the adjacency matrix corresponding to G = (V, Ev);
– m = (1/2) Σ_vw A_vw;
– k_v = Σ_u A_vu, where u is a vertex;
– e_ij = (1/2m) Σ_vw A_vw δ(c_v, i) δ(c_w, j);
– a_i = (1/2m) Σ_v k_v δ(c_v, i);
– δ(x, y) = 1 if x = y, and 0 otherwise.
v, w are vertices within V ; i represent the ith community; cv is the community to which vertex v is assigned; Avw is an element of the adjacency matrix corresponding to the G = (V, Ev ); m = 12 Avw ; vw kv = Avu , where u is a vertex; u 1 eij = 2m Avw δ(cv , i)δ(cw , j); vw 1 ai = 2m kv δ(cv , i); v 1 x=y δ(x, y) = 0 otherwise
We start off with each vertex being a community that contains only one member. The process then repeatedly computes the changes in Q, chooses the largest of them, and performs the corresponding merge of communities.
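A compact sketch of the two clustering steps of this section, using scikit-learn's K-Means on made-up coordinates and networkx's greedy modularity-maximization communities (the Clauset–Newman–Moore approach of [3]) on a toy friendship graph:

```python
import numpy as np
import networkx as nx
from sklearn.cluster import KMeans
from networkx.algorithms.community import greedy_modularity_communities

# Hypothetical individual locations (longitude, latitude) and friendship edges.
locations = {1: (114.05, 22.55), 2: (114.06, 22.54), 3: (116.40, 39.90), 4: (116.41, 39.91)}
G = nx.Graph([(1, 2), (2, 3), (3, 4), (1, 4)])

# Step 1: K-Means over the location coordinates; K controls the spatial scale.
ids = sorted(locations)
coords = np.array([locations[i] for i in ids])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(coords)
location_cluster = dict(zip(ids, labels))          # LC(v_i) for each individual

# Step 2: communities by greedy modularity maximization.
communities = list(greedy_modularity_communities(G))
print(location_cluster, communities)
```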
4.2 Info-Cluster Detection
Before our Info-Cluster detection, we show the data after being processed by the K-Means and community detection steps. We use L(vi) to represent the location of the vertex vi, and LC(vi) to represent the location cluster of the vertex vi; LC(vi) is assigned from the result of the K-Means algorithm. Similarly, we use Com(vi) to denote the community of the vertex vi; Com(vi) is assigned from the result of the community detection algorithm. Table 1 shows an example of location clusters and communities computed by the K-Means and community detection methods. For the social network G = (V, Ev, L, El), represented by an undirected graph, an idea or innovation can be spread from some source individuals to others. Here we consider that each individual has two states: active and inactive. If the individual accepts or adopts the idea/innovation, we say this individual is activated; otherwise, it is inactive. According to Kempe [8], each individual's tendency to become active increases monotonically as more of its neighbors become active.
Table 1. An example of location clusters and communities computed by K-Means and community detection methods
vi      L(vi)   LC(vi)   Com(vi)
v1      l1      1        1
v2      l2      2        1
v3      l3      1        2
v4      l4      2        2
...     ...     ...      ...
vn      ln      k        m
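In code, this bookkeeping is just three lookups per vertex; a toy sketch mirroring Table 1 (the values are the hypothetical ones shown in the table):

```python
# Hypothetical per-vertex assignments, mirroring Table 1.
L   = {"v1": "l1", "v2": "l2", "v3": "l3", "v4": "l4"}   # L(v_i): location of each vertex
LC  = {"v1": 1, "v2": 2, "v3": 1, "v4": 2}               # LC(v_i): location cluster from K-Means
Com = {"v1": 1, "v2": 1, "v3": 2, "v4": 2}               # Com(v_i): community from modularity maximization

# Example: the capital set of location cluster 1 is every vertex assigned to it.
capital_1 = {v for v, c in LC.items() if c == 1}
print(capital_1)   # {'v1', 'v3'}
```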
During the activation process, a passive (inactive) individual will be activated depending on the comparison between the threshold and a probability that depends on the states of its neighboring individuals. If the probability is larger than the threshold, the individual will be activated (in our paper, we put it into the set SInfluence). If the probability is smaller than the threshold, the individual will not be activated (in our paper, if the probability is larger than zero, we put it into Sknow; otherwise, we put it into Snothing). We first define the threshold θ, which is related to α1, α2, β1 and β2. The equation is as follows:
e(α1 +α2 ) lg(2+α1 +α2 )+(β1 +β2 ) ) 3 + e(α1 +α2 ) lg(2+α1 +α2 )+(β1 +β2 )
where
(1) α1 is the influence probability from vi to vj, where vi ∈ Scapital^p and Com(vi) = Com(vj); Scapital^p are the initially active individuals;
(2) α2 is the influence probability from vi to vj, where vi ∉ Scapital^p and Com(vi) = Com(vj);
(3) β1 is the influence probability from vi to vj, where vi ∈ Scapital^p and Com(vi) ≠ Com(vj);
(4) β2 is the influence probability from vi to vj, where vi ∉ Scapital^p and Com(vi) ≠ Com(vj);
(5) λ is a regulation parameter. Generally λ = 1. If the network is super-interactive between the individuals, we can set λ < 1; otherwise, we set λ > 1. In this paper, we fix λ = 1.
Fig. 2 shows examples of α1, α2, β1 and β2. Table 2 shows the range of α1, α2, β1 and β2 and gives three examples of θ. In this paper, the activated individuals are known at first based on the active location cluster, which differs from other papers.
Table 2. The range of α1, α2, β1 and β2 and three examples

α1 (0 < α1 < 1)   α2 (0 < α2 < α1)   β1 (0 < β1 < α1)   β2 (0 < β2 < β1)   θ (0 < θ < 1)
0.9               0.8                0.7                0.6                0.7670
0.8               0.6                0.6                0.4                0.6560
0.3               0.2                0.2                0.1                0.3544
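A small check of Equation (2), reading "lg" as the base-10 logarithm and taking λ = 1 (both assumptions); with these readings the second and third θ values of Table 2 are reproduced exactly, while the first comes out slightly lower (about 0.763).

```python
import math

def theta(a1, a2, b1, b2, lam=1.0):
    """Threshold of Eq. (2); lg is taken as the base-10 logarithm (an assumption)."""
    expo = (a1 + a2) * math.log10(2 + a1 + a2) + (b1 + b2)
    return lam * math.exp(expo) / (3 + math.exp(expo))

for params in [(0.9, 0.8, 0.7, 0.6), (0.8, 0.6, 0.6, 0.4), (0.3, 0.2, 0.2, 0.1)]:
    print(params, round(theta(*params), 4))
# -> roughly 0.7627, 0.6560, 0.3544
```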
Fig. 2. Illustration of α1 , α2 , β1 and β2
Second, we show the probability that an individual becomes activated, which depends on the states of its neighboring individuals. In Watts's original model [11] this probability is defined by the number of active neighbors, the total number of neighboring individuals and the activation threshold. However, in our framework we have four types of probability (α1, α2, β1 and β2) between individuals. We define the function Y(z) to be the z-th individual's probability of being activated, which depends on its neighboring individuals. The function Y(z) is defined as follows:
Y(z) = Σ_{N(z)} (α1 + β1) + (Activenum(N(z)) / Num(N(z))) · Σ_{N(z)} (α2 + β2)    (3)
where:
– N(z): the set of the z-th individual's neighboring individuals;
– Num(N(z)): the number of the z-th individual's neighboring individuals;
– Activenum(N(z)): the number of active individuals among the z-th individual's neighbors.
Finally, according to Y(z) and θ, we can easily generate the Info-Clusters in our social network graph. Specifically, for each location cluster, we first set all individuals from that location cluster to be active and add them to the Capital group (Scapital^i). Then, for each inactive individual, we calculate its Y(z) and compare it with θ. If Y(z) > θ, we add it to the Influence group Sinfluence^i. If 0 < Y(z) ≤ θ, we add the z-th node to the Know group Sknow^i. However, if Y(z) = 0, we add it to the Nothing group. At last, we merge the Capital group (Scapital^i), the Influence group (Sinfluence^i) and the Know group (Sknow^i) individuals into one Info-Cluster and repeat the process for the next location cluster. The process of the Info-Cluster detection is shown in Algorithm 1.
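A simplified, single-pass sketch of the assignment just described (Algorithm 1 itself propagates the computation over the graph by BFS; here the scoring function Y is passed in as a parameter, so this is only an outline of the labeling step, not the authors' full procedure):

```python
def info_cluster_for_location_cluster(G, capital_nodes, activation_score, theta):
    """Label nodes as Capital / Influence / Know / Nothing and return the Info-Cluster.

    `G` is the friendship graph (e.g., a networkx Graph), `capital_nodes` the
    individuals of one location cluster (initially active), `activation_score`
    a function implementing Y(z), and `theta` the threshold of Eq. (2)."""
    capital = set(capital_nodes)
    influence, know = set(), set()
    for z in G.nodes():
        if z in capital:
            continue
        y = activation_score(G, z, capital)
        if y > theta:
            influence.add(z)
        elif y > 0:
            know.add(z)
        # nodes with y == 0 fall into the Nothing group and are left out
    return capital | influence | know
```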
5 Experiments
In order to test the performance of our algorithm, we conduct an experiment on real social networks. We first obtain the information of 5000 individuals from the
Algorithm 1. Influence Propagation Based Info-Cluster Detection (IPBICD)
Input: G = (V, Ev, L, El);
Output: Info-Clusters: SInfo-Cluster;
1: Calculate θ according to Equation 2, and Slc, Scom according to Section 4.1;
2: for each C from Slc do
3:   Activate vi if LC(vi) == C, and set vi.label = Capital, vi.value = 1;
4:   Compute all the node values according to Equation 3 for graph G based on Breadth-First Search (BFS);
5:   for each node v from G do
6:     if (v.value > θ) then
7:       v.label = Influence;
8:     end if
9:     if (0 < v.value < θ) then
10:      v.label = Know;
11:    end if
12:    if (v.value == 0) then
13:      v.label = Nothing;
14:    end if
15:  end for
16:  Build Info-Cluster (SInfo-Cluster^i) from all v with v.label == Capital or v.label == Influence or v.label == Know;
17:  return Info-Cluster SInfo-Cluster^i;
18: end for
Renren friend network by crawling the Renren online web site (www.renren.com, which is similar to the web site of Facebook). After preprocessing, there are 2314 circle vertices, 1400 square vertices and 56456 edges in the final Renren data set. Each circle vertex denotes an individual registered on the Renren web site, while each square vertex represents the location of the corresponding individual. Each edge between circle vertices denotes the friendship of two individuals. We then conduct experiments on the Renren data set. In the experiments, we set three kinds of influence probabilities, which are shown in Table 2. Fig. 3 illustrates the change of the Average Covering Rate (ACR) with different K. From this figure, we can see that, macroscopically, the Average Covering Rate decreases as K increases. One main reason may be that a higher K value often leads to fewer people in each location cluster; that is, the information propagates from fewer sources. In order to study the relation between the covering rate and the number of sources, we randomly select some K values and then analyze the experimental results microscopically. Suppose that all the locations are grouped into 50 clusters (K = 50). Then we can get 50 capitals denoted by Scapital^i, i = 1, 2, ..., K, as described in Definition 3. After the experiment, we get 50 Info-Clusters, each of which is composed of capital individuals, influence individuals and know individuals. Fig. 4 shows the change of Covering Rate with the number of individuals
Fig. 3. The change of the Average Covering Rate (ACR) as K increases. Here, K ∈ [1, 1400] since only 1400 locations are involved in the Renren data set.
of 50 capitals. According to Fig. 4, we find that more people as sources may result in a higher covering rate as a whole, but this is not absolutely true. With the third parameter setting, the covering rate of 30 individuals is higher than that of 100 individuals. For the other two parameter settings, the covering rate of 50 individuals is nearly equal to that of 150 individuals. That implies the former individuals have stronger influential power than the latter ones. Even for the same number of individuals, the covering rates are often different. One such example is when the number of individuals is 20. From Fig. 4, we can see that there are two capitals composed of 20 individuals, and each reaches a different covering rate even with the same parameter settings.
Fig. 4. The change of Covering Rate with the number of individuals of 50 capitals
To better understand the scope of information propagation from different clusters, we depict the covering rate of each Info-Cluster on Chinese map, which is shown in Fig. 5. The sub-figures (Fig. 5(a),Fig. 5(b) and Fig. 5(c)) correspond to three kinds of parameter settings respectively. For each of them, we adopt five representative colors to differentiate different covering rates. The corresponding values are labeled at the right bottom of each sub-figure. For the Info-Cluster whose covering rate is between two of our selected values, we paint it with the transitional color. And the darkness of the transitional color is determined by
(a) (0.9, 0.8, 0.7, 0.6)  (b) (0.8, 0.6, 0.6, 0.4)  (c) (0.3, 0.2, 0.2, 0.1)
Fig. 5. The distribution of Covering Rate on the map of China
Fig. 6. The Influential Power of each capital set
value of the covering rate. According to Fig. 5, we find that the information of the eastern Info-Clusters spreads more widely than that of the western ones. In particular, the region of Beijing has the highest covering rate, which may be attributed to its higher population density. The influential power of each capital set is shown in Fig. 6. From this figure, we find that the last cluster (clusterid = 50) achieves the highest influential power under the parameter settings (0.9, 0.8, 0.7, 0.6) and (0.8, 0.6, 0.6, 0.4). Therefore, the region containing those individuals is an influential region. On the contrary, the 15th cluster (clusterid = 15) shows the lowest influential power, which means the region containing those individuals is a weakly influential region.
6 Conclusion
In this paper, we propose an innovative concept, the Info-Cluster. Based on information propagation, we then present a framework for identifying Info-Clusters, which uses both community and location information. Given a social network data set, we first adopt the K-Means algorithm to find location clusters. Next, we identify the communities of the whole network. Given the location clusters and communities, we present the information propagation based Info-Cluster detection algorithm (IPBICD). Experiments on the Renren data set show that the identified Info-Clusters have many characteristics and many potential applications, such as analyzing and predicting the influential range of information or advertisements originating from a certain location.
Utilizing Past Relations and User Similarities in a Social Matching System

Richi Nayak

Faculty of Science and Technology, Queensland University of Technology
GPO Box 2434, Brisbane Qld 4001, Australia
[email protected]
Abstract. Due to rising expectations, more and more online matching companies adopt recommender systems based on content-based, collaborative filtering or hybrid techniques. However, these techniques focus on users' explicit contact behaviors and ignore the implicit relationships among users in the network. This paper proposes a personalized social matching system for generating recommendations of potential partners that exploits not only users' explicit information but also the implicit relationships among users. The proposed system is evaluated on a dataset collected from an online dating network. Empirical analysis shows that the recommendation success rate increases to 31% compared to the baseline success rate of 19%.
1 Introduction
With improved Web technology and increased Web popularity, users commonly use online social networks to contact new friends or 'alike' users. Similarly, people from various demographics have increased the customer base of online dating networks [9]. It is reported [1] that there are around 8 million singles in Australia and that 54.32% of them use online dating services. Users of online dating services are overwhelmed by the number of choices returned by these services. The process of selecting the right partner among a vast number of candidates becomes tedious and nearly ineffective without an automatic selection process. Therefore, a matching system that utilizes data mining to predict the behaviors and attributes that could lead to successful matches becomes a necessity. Recommendation systems have long existed to suggest products to users according to their web visit histories or the product selections of other similar users [2],[7]. In most cases, the recommendation is an item recommendation, and items are inanimate. In contrast, the recommendation in dating networks is about people, who are animate. Different from item recommendation, people recommendation is a form of two-way matching: a person can refuse an invitation, whereas a product cannot refuse to be sold. In other words, a product does not choose its buyer, but dating service users can choose the dating
candidates. The goal of an e-commerce recommendation system is to find the products most likely to interest a user, whereas the goal of a social matching system is to find users who are likely to respond favorably to each other. Current recommendation systems cannot handle this well [6]. There are few published examples of recommendation systems applied explicitly to online dating. The authors in [4] use traditional user-user and item-item algorithms on user rating data for online dating recommendation, but fail to use many factors, such as age, job, ethnicity and education, that play important roles in match making. The authors in [6] propose a theoretical generic recommendation algorithm for social networks that can easily be applied to an online dating context. Their system is based on a concept of social capital which combines direct similarity from static attributes, complementary relationship(s), general activity and the strength of relationship(s). However, this work remains at a theoretical level and no experiments have been carried out to prove the effectiveness of the theory. There are many weight factors in the proposed algorithm which may negatively influence its effectiveness. Efficiency is another problem for these pairwise algorithms, which have a very high computational complexity. This paper proposes a social matching system that combines social network knowledge with content-based and collaborative filtering techniques [7] by utilizing users' past relations and user similarities to improve recommendation quality. The system includes a nearest neighbour algorithm, which provides an add-on layer that groups similar users to deal with the cold-start problem (i.e., handling new users). It also includes a relationship-based user similarity prediction algorithm, which is applied to calculate similarity scores and generate candidates. Finally, a support vector machine [3] based algorithm is employed to find the compatibility between the matching pairs. The similarity scores and the compatibility scores are combined to propose a ranked list of potential partners to the network users. The proposed system is evaluated on a dataset collected from a popular dating network. Empirical analysis shows that the proposed system is able to recommend the top-N users with high accuracy. The recommendation success rate increases to 31% compared to the baseline success rate of 19%. The baseline recall of the underlying dating network also increases from 0.3% to 9.2%.
2 The Proposed Social Matching Method
Data required by a dating network for recommending potential partners can be divided into the following features: (1) a personal profile for each user, which includes self-reported demographic details; fixed-choice responses on Physical, Identity, Lifestyle, Career and Education, Politics and Religion and other attributes; free-text responses about various interests such as sport, music, etc.; and, optionally, one or more photographs; (2) an ideal partner profile for each user, which includes information about what the user prefers in an ideal partner, usually multiple choices on the attributes discussed before; (3) user activities on the network, such as viewing the profiles of other members and sending pre-typed messages to other users;
sending emails or chat invitations; and (4) measures of relationships with other users, such as willingness to initialize relationships and respond to invitations, and the frequency and intensity with which relationships are maintained. A relationship is called successful, for the purpose of match making, when a user initiates a pre-typed message as a token of interest and the target user sends back a positive reply. Let U be the set of m users in the network, U = {u1, . . . , um}. Let X be a user personal profile consisting of a list of personal profile attributes, X = {x1, . . . , xn}, where each attribute xi is an item such as body type, dietary preferences, political persuasion and so on. Consider the list of the user's ideal partner profile attributes as a set Y = {y1, . . . , yn}, where each attribute yi is an item such as body type, dietary preferences, political persuasion and so on. For a user uj, the value of xi is unary; however, the values of yi can be multiple. Let P = X + Y denote a user profile containing both the personal profile attributes and the partner preference attributes. The profile vector of a user is denoted P(uj). There can be many types of user activities in a network that can be used in the matching process. Some of the main activities are "viewing profiles", "initiating and/or responding to pre-defined messages" (or kisses¹), "sending and/or receiving emails" and "buying stamps". Profile viewing is a one-sided interaction from the viewer's perspective; therefore it is hard to infer the viewer's interests from it. The "kiss" interactions are more promising as an effective way to show distinct interest between two potential matches. A user is able to show his/her interest by sending a "kiss". The receiver is able to ignore the "kiss" received or return a positive or negative reply. When a receiver replies to a kiss with a positive predefined message, it is considered a "successful" or "positive" kiss. Otherwise, it is judged an "unsuccessful" or "negative" kiss. Generation of Small Social Networks. A number of small social networks, which describe the past relations between users and their previously contacted users, are derived. Let user ub be a user who has successfully interacted with more than a certain number of previous partners during a particular period. Let GrA be the set of users, GrA ⊆ U, with whom user ub has positively interacted. Let user ua be the user with whom user ub has positively interacted last, ua ∈ GrA. Let GrB be the set of users who are ex-partners of user ua, ub ∈ GrB. Note that gender(ub) = gender(GrB) and gender(ua) = gender(GrA). Users ub and ua are called seed users as they provide us with the network. Users in GrA and GrB are called relationship-based users. The relationship between user ub and a user in GrB, and the relationship between user ua and a user in GrA, reflect the personal profile similarity between two same-gender users. This similarity value is evaluated by using an instance-based learning algorithm. Figure 1 summarizes the proposed method. The process starts with selecting a number of seed pairs. A network of relationship-based users (GrA, GrB) is generated by locating the ex-partners of the seed users. The size of GrA and GrB is increased by applying clustering to include new users and to overcome
¹ We call a pre-defined message a "kiss" in this paper.
the lack of relationship-based users for a seed pair. The similarity between ub and users in GrB, and the similarity between ua and users in GrA, are calculated to find "closer" members in terms of profile attributes. This step determines the users whose profiles match, since they are same-gender users. Each pair in (GrA, GrB) is also checked for compatibility using a two-way matching. These three similarity scores are combined using a weighted linear strategy. Finally, a ranked list of potential partner matches from GrB is formed for each user in GrA.
The Personalised Social Matching Approach
Input: Network Users: U = {u1, ..., um}; User Profiles: {P(u1), ..., P(um)}; User Clusters (Female or Male): C = {(u1..ui), (uj..uk), ..., (uk..um)}; Users' communication
Output: Matching pairs: {(ui, uj), ..., (ul, um)}
Begin
a. Select good seed pairs of users based on communication between users U;
b. For each unique good seed pair (ub, ua):
   a. Form GrA by finding ex-partners of ub;
   b. Form GrB by finding ex-partners of ua;
   c. Extend GrA and GrB by similar users of the corresponding gender from the clusters C using the k-means algorithm;
   d. For each user in GrA:
      i. Find ex-partner GrAi whose profile is similar to ua using the instance-based learning algorithm;
      ii. Assign SimScore(GrAi, ua);
   e. For each user in GrB:
      i. Find ex-partner GrBi whose profile is similar to ub using the instance-based learning algorithm;
      ii. Assign SimScore(GrBi, ub);
   f. For each pair (GrAi, GrBj) in (GrA, GrB):
      i. Apply the user compatibility algorithm to compute the compatibility score;
   g. Combine the three scores and rank the matching pairs;
End

Fig. 1. High level definition of the proposed method
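As a rough illustration of steps (a)-(b) of Fig. 1, the following Python sketch forms GrA and GrB for one candidate seed user from a log of kiss interactions. It is not the authors' implementation: the log format, the `min_partners` threshold (the thirty-previous-partners criterion mentioned later in the paper) and all function names are assumptions made for this example.

```python
from collections import defaultdict

def build_positive_partners(kiss_log):
    """kiss_log: time-ordered iterable of (sender, receiver, reply) tuples,
    where reply is 'positive', 'negative' or None (a null kiss).
    Returns, for every user, the time-ordered list of users with whom a
    successful (positive) kiss exchange occurred."""
    partners = defaultdict(list)
    for sender, receiver, reply in kiss_log:
        if reply == 'positive':
            partners[sender].append(receiver)
            partners[receiver].append(sender)
    return partners

def seed_groups(kiss_log, u_b, min_partners=30):
    """Form (u_a, GrA, GrB) for a candidate seed user u_b:
    GrA = users with whom u_b has positively interacted,
    u_a = the last of them,
    GrB = ex-partners of u_a (which, by construction, include u_b)."""
    partners = build_positive_partners(kiss_log)
    history_b = partners.get(u_b, [])
    if len(set(history_b)) < min_partners:
        return None                         # u_b is not a good seed user
    u_a = history_b[-1]                     # last successful partner of u_b
    gr_a = set(history_b)
    gr_b = set(partners.get(u_a, []))
    return u_a, gr_a, gr_b

# toy usage
log = [('u1', 'u2', 'positive'), ('u1', 'u3', None), ('u4', 'u1', 'positive')]
print(seed_groups(log, 'u1', min_partners=2))   # ('u4', {'u2', 'u4'}, {'u1'})
```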
Personal Profile Similarity. An instance-based learning algorithm is developed to calculate the similarity between a seed user and the relationship-based users. Attributes in the personal profile are of categorical domain, so an overlap function is used that determines how close two users are in terms of attribute xi:

$$S_{x_i}(u_1, u_2) = \begin{cases} 1, & x_i(u_1) = x_i(u_2) \\ 0, & \text{otherwise} \end{cases} \quad (1)$$
where u1 is a seed user and u2 ∈ GrB or u2 ∈ GrA. This matching process is conducted between a seed user ub and GrB users, as well as between the corresponding partner seed user ua and GrA users. Not all attributes are equally important when selecting a potential match [5],[8]. For example, analysis of the dataset of a popular dating network² shows that attributes such as height, body type, and having children are specified more frequently in user personal profiles than attributes such as nationality, industry and having pets. Therefore, each attribute score is assigned a weight when the scores are combined. The weight is set according to the percentage of all members in the network who have indicated that attribute in their personal profiles. Including weight values based on this network statistic allows us to reflect the user interest in the network.

$$SimScore(u_1, u_2) = \sum_{i=1}^{n} S_{x_i}(u_1, u_2) \times weight_{x_i} \quad (2)$$
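A minimal Python sketch of Eqs. (1)-(2) follows; the attribute names and weight values are illustrative, and the weights are assumed to be precomputed from the fraction of members who specified each attribute.

```python
def overlap(u1_profile, u2_profile, attribute):
    """Eq. (1): 1 if both users share the same categorical value, else 0."""
    return 1 if u1_profile.get(attribute) == u2_profile.get(attribute) else 0

def sim_score(u1_profile, u2_profile, weights):
    """Eq. (2): weighted sum of per-attribute overlaps.
    `weights` maps each attribute to the fraction of members who specified it
    in their personal profile (an assumption made for this sketch)."""
    return sum(overlap(u1_profile, u2_profile, a) * w for a, w in weights.items())

# toy usage
weights = {'body_type': 0.8, 'diet': 0.3, 'politics': 0.4}
p1 = {'body_type': 'slim', 'diet': 'vegetarian', 'politics': 'left'}
p2 = {'body_type': 'slim', 'diet': 'omnivore', 'politics': 'left'}
print(sim_score(p1, p2, weights))   # 0.8 + 0.4 = 1.2
```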
Solving the cold-start problem in the network. A recommendation system can suffer from the cold-start problem [2] when the number of relationship-based users in a network is very low or when new users are to be included in the matching process. This research utilizes the k-means clustering algorithm to increase the size of GrA and GrB by finding users similar to the seed users ua and ub respectively. Users active in the network for a specified duration are grouped according to their personal profiles. Let $C^m = \{C_1, \ldots, C_c\}^m$ be the clustering of the male members of the network, where $c_k$ is the centroid vector of cluster $C_l^m$. Let $C^f = \{C_1, \ldots, C_c\}^f$ be the clustering of the female members of the network, where $c_k$ is the centroid vector of cluster $C_l^f$. The user personal profile and preference attributes P = X + Y, where X = {x1, . . . , xn} and Y = {y1, . . . , yn}, are used in the clustering process. The clusters corresponding to the gender of the seed users ub and ua are used to find which cluster matches a seed user best, i.e., $\max_{\forall c_k \in C^{(m|f)}} S^k(P(u_b), c_k)$, where $S^k$ denotes the similarity between a centroid vector and a user profile vector; cosine similarity is employed in the process. Members of the matched cluster are used to extend GrA or GrB. The User Compatibility Algorithm. A recommendation system for social matching should consider two-way matching. For each personal attribute, the user's preference is compared to the potential match's stated value for the attribute. The result of the comparison is a single match score for that attribute that incorporates (1) the user's preference for the match, (2) the potential match's preference for the user, and (3) the importance of the attribute to both. The attribute cross-match score can be thought of as a distance measure between two users along a single dimension. The attribute cross-match scores for all attributes are combined into a vector that indicates how closely a user and their potential match agree. The first step is to calculate the attribute cross-match score CSxi(ub, ua), which quantifies how closely a potential
² For privacy reasons, the details of this network are not given.
match fits the preferences of user ub based on profile attribute xi. That is, does a potential match's stated value for an attribute fit the user's preference yi for that attribute? If the user has explicitly stated a preference for the attribute, then the measure is trivial. If the user has not explicitly stated a preference, then a preference can be inferred from the preferences of other members in the same age and gender group: although a user may not explicitly state their preference, it can be assumed to be similar to that of others in the same age and gender group. The score then becomes the likelihood that a potential match ua meets the preferences of user ub for the attribute xi:

$$CS_{x_i}(u_b, u_a) = \begin{cases} 1, & x_i(u_a) \in y_i(u_b) \\ \dfrac{N(x_i(u_b)=x,\; x_i(u_a) \in y_i(u_b) \mid Gender(u_b), Age(u_b))}{N(x_i(u_b)=x \mid Gender(u_b), Age(u_b)) - N(x_i(u_b)=x,\; y_i(u_b)=\text{``Not Specified''} \mid Gender(u_b), Age(u_b))}, & y_i(u_b) = \phi \\ 0, & \text{otherwise} \end{cases} \quad (3)$$
where xi(ub) is user ub's profile value for attribute xi and yi(ub) is user ub's preferred match value for attribute xi. By the definition in the above equation, scores range from 0 to 1. The attribute cross-match score is moderated by a comparative measure of how important the attribute xi is to users within the same age and gender demographics as user ub. This measure, called the importance, is estimated from the frequency with which users of the same age band and gender specify a preference for the attribute. This is done to ensure that not all attributes are treated as equally important when selecting a potential match. Attributes such as height, body type, and having children are specified more frequently than attributes such as nationality, industry and having pets. If a user explicitly specifies a preference, then it is assumed the attribute is highly important to them (e.g., when a user makes an explicit religious preference). When it is not specified, a good proxy for the importance of the attribute is the complement of the proportion of users in the same age and gender group who did specify a preference for the attribute. Mathematically, this is defined by

$$I_{x_i}(u_b) = \begin{cases} 1, & x_i(u_a) \in y_i(u_b) \\ 1 - \dfrac{N(x_i(u_b)=x,\; y_i(u_b)=\text{``Not Specified''} \mid Gender(u_b), Age(u_b))}{N(x_i(u_b)=x \mid Gender(u_b), Age(u_b))}, & y_i(u_b) = \phi \\ 0, & \text{otherwise} \end{cases} \quad (4)$$
By the definition in this equation, scores range from 0 to 1. The attribute score for xi between potential partners is calculated as follows:

$$A_{x_i}(u_b, u_a) = I_{x_i}(u_b) \times CS_{x_i}(u_b, u_a) \quad (5)$$
Including the importance information upfront may simplify the task of training an optimisation model to map the attribute scores to a target variable. By reducing the complexity of the model, accuracy may be improved. An alternative to the importance measure would be to leave the weightings for an optimisation model to estimate. It is assumed that including the importance measure as part of the score calculations will assist the training of the optimisation model.
Both the attribute match score and the importance are also calculated from the perspective of the potential match ua's preference towards user ub. Finally, a single attribute match score Mxi between the two users for attribute xi is obtained as follows:

$$M_{x_i}(u_b, u_a) = M_{x_i}(u_a, u_b) = \tfrac{1}{2}\left(A_{x_i}(u_b, u_a) + A_{x_i}(u_a, u_b)\right) \quad (6)$$
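The per-attribute scoring of Eqs. (3)-(6) can be sketched as below. The demographic counts of Eqs. (3)-(4) are replaced here by two precomputed lookup tables (`pref_likelihood` and `not_specified_fraction`), an assumption made to keep the example self-contained; the function and field names are likewise illustrative.

```python
def cross_match_score(u_b, u_a, attr, pref_likelihood):
    """Eq. (3), sketched: 1 if u_a's value is in u_b's explicit preference set;
    otherwise a demographic estimate pref_likelihood[(attr, value)] assumed to be
    precomputed from members of u_b's age/gender group; 0 if nothing applies."""
    stated = u_b['prefs'].get(attr)           # y_i(u_b), a set of acceptable values
    value = u_a['profile'].get(attr)          # x_i(u_a)
    if stated is not None:
        return 1.0 if value in stated else 0.0
    return pref_likelihood.get((attr, value), 0.0)

def importance(u_b, attr, not_specified_fraction):
    """Eq. (4), sketched: 1 for an explicit preference, otherwise the complement
    of the fraction of demographic peers who left the attribute unspecified."""
    if u_b['prefs'].get(attr) is not None:
        return 1.0
    return 1.0 - not_specified_fraction.get(attr, 1.0)

def attribute_match(u_b, u_a, attr, pref_likelihood, not_specified_fraction):
    """Eqs. (5)-(6): importance-weighted cross scores from both sides, averaged."""
    a_ba = importance(u_b, attr, not_specified_fraction) * \
           cross_match_score(u_b, u_a, attr, pref_likelihood)
    a_ab = importance(u_a, attr, not_specified_fraction) * \
           cross_match_score(u_a, u_b, attr, pref_likelihood)
    return 0.5 * (a_ba + a_ab)

# toy usage
likelihood = {('body_type', 'slim'): 0.6}
unspecified = {'body_type': 0.3}
u_b = {'profile': {'body_type': 'slim'}, 'prefs': {'body_type': None}}
u_a = {'profile': {'body_type': 'slim'}, 'prefs': {'body_type': {'slim'}}}
print(attribute_match(u_b, u_a, 'body_type', likelihood, unspecified))  # 0.71
```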
By combining the four measures per attribute into one cross-match score per attribute, the search space is reduced by three quarters. Finding user compatibility requires a measurement that allows different potential matches to be compared. The measure should allow a user's list of potential matches to be ranked in order of "closest" matches. This is achieved by combining all attribute cross-match scores into a single match vector:

$$M(u_b, u_a) = [M_{x_1}(u_b, u_a), \ldots, M_{x_n}(u_b, u_a)] \quad (7)$$
The goal then becomes to intelligently summarise the vector M(ub, ua) in a way that increases the score for matches that are likely to lead to a relationship. Technically, this becomes a search for an optimal mapping from a user ub and a potential match ua, based on their shared attribute cross-match vector M(ub, ua), to a target variable that represents a good match. We call this target variable the compatibility score Comp(ub, ua), such that:

$$Comp(u_b, u_a) = f(M(u_b, u_a)) \quad (8)$$
An optimal mapping can be found by training a predictive data mining algorithm to learn the mapping function f, provided a suitable target variable can be identified. Ideally, the target relationship score would be a variable based on user activities that (1) identifies successful relationships, and (2) increases the company revenue through more contacts. In this research we have used the "kiss" communication between users as the target for learning successful relationships. The calculation of the mapping function f, from a potential match's attribute cross-match vector to the compatibility score, is performed using a support vector machine (SVM) algorithm [3]. Each input to the SVM is a real value ranging from -1 to 1. The trained SVM has a single real-valued output that becomes the compatibility score. Putting it all together. Once the three similarity scores are obtained, namely SimScore(ub, GrBj), identifying the profile similarity between the seed user and a potential match, SimScore(ua, GrAi), identifying the profile similarity between the seed partner and a recommendation object, and the compatibility score Comp(GrAi, GrBj) between a potential match pair (GrAi, GrBj), these scores are combined using a weighted linear strategy:

$$Match(GrA_i, GrB_j) = w_1 \times SimScore(u_b, GrB_j) + w_2 \times SimScore(u_a, GrA_i) + w_3 \times Comp(GrA_i, GrB_j) \quad (9)$$
To determine these weight settings, a decision tree model was built using 300 unique seed users, 20 profile attributes and about 300,711 recommendations generated from the developed social matching system, along with an indicator of their success. The resulting decision tree showed that a higher percentage of positive kisses is produced when w1 ≥ 0.5, w2 ≥ 0.3 and w3 ≥ 0.2. Therefore, w1, w2 and w3 are set to 0.5, 0.3 and 0.2 respectively. It is interesting to note the lower value of w3: it means that when two members are interested in each other, there is a high probability that both of them are similar to their respective ex-partners. For each recommendation object GrAi, matching partners are ranked according to their Match(GrAi, GrBj) score, and the top-n partners from GrB become the potential matches of GrAi.
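Putting Eqs. (8)-(9) together, the sketch below trains a stand-in compatibility model and ranks the top-n partners with the weights w1 = 0.5, w2 = 0.3 and w3 = 0.2 reported above. The use of scikit-learn's SVR, the random placeholder training data, and all variable names are our assumptions for illustration; the authors do not specify their SVM implementation.

```python
import numpy as np
from sklearn.svm import SVR

# --- compatibility model, Eq. (8): map a cross-match vector M to a score ---
rng = np.random.default_rng(0)
X_train = rng.uniform(-1, 1, size=(500, 20))    # 20 attribute cross-match scores
y_train = np.sign(X_train.mean(axis=1))         # toy stand-in for kiss outcomes
svm = SVR(kernel='rbf', gamma=0.5).fit(X_train, y_train)

def comp(m_vector):
    """Comp(u_b, u_a) = f(M(u_b, u_a))."""
    return float(svm.predict(np.asarray(m_vector).reshape(1, -1))[0])

# --- Eq. (9): weighted combination of the three scores, then top-n ranking ---
W1, W2, W3 = 0.5, 0.3, 0.2

def rank_matches(gr_a, gr_b, sim_to_ub, sim_to_ua, m_vectors, top_n=20):
    """sim_to_ub[j]    = SimScore(u_b, GrB_j),
       sim_to_ua[i]    = SimScore(u_a, GrA_i),
       m_vectors[i][j] = cross-match vector of the pair (GrA_i, GrB_j)."""
    ranked = {}
    for i in gr_a:
        scored = sorted(((W1 * sim_to_ub[j] + W2 * sim_to_ua[i]
                          + W3 * comp(m_vectors[i][j]), j) for j in gr_b),
                        reverse=True)
        ranked[i] = [j for _, j in scored[:top_n]]
    return ranked
```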
3 Empirical Analysis

3.1 Dataset: The Online Dating Network
The proposed method is tested on a dataset collected from a real-life online dating network with about 2 million users. We used three months of data to generate and test the networks of relationship-based users and the recommendations. The activity and measure of relationship between two users in this research is the "kiss"; the number of positive kisses is used to test the proposed social matching system. Figure 2 lists the details of the users and kisses in the network. A user who has logged on to the website during the chosen three-month period is called an "active" user. The seed users and relationship-based users come from this set of users. A kiss sender is called "successful" when the target user sends back a positive kiss reply. There are about 50 predefined messages (short texts of up to 150 characters) used in the dating network. These kiss messages are manually labeled as positive or negative according to the user interest they show towards another member. A large number of kisses in the network have never been replied to by the target users; these are called "null kisses".
3 Months Data                                                       Value
# of distinct active users (female + male)                          163,050 (82,500 + 80,550)
# unique kiss senders                                               122,396
# unique successful senders                                         91,487
# unique kiss recipients in the network                             198,293
# unique kiss recipients who are active during the chosen period    83,865
# unique kisses                                                     886,396
# unique successful kisses                                          171,158
# unique negative kisses                                            346,193
# unique null kisses                                                369,045
Fig. 2. User and Kiss statistics for the chosen three-month period
It can be noted that each kiss sender receives about 4 kiss replies (both successful and negative) on average. It can also be seen that about 75% of kiss senders have received at least one positive kiss reply. The number of successful kisses is less than one fourth of the sum of negative and null kisses. A further kiss analysis shows a strong tendency of male members in the network to initiate the first activities, such as sending kisses (78.9% vs. 21.1%); they are defined as proactive-behavior users in this paper, while female members, who are reactive-behavior users, usually wait to receive kisses.

3.2 Evaluation Criteria
Let U be the set of the network's active users. Let GrA be the group of users who are going to receive potential partner recommendations and GrB be the group of users who become the potential partners, with U = GrA ∪ GrB and GrA ∩ GrB = φ. The recommendation performance is tested by whether the user has made initial contact with the users in the recommendation list.

$$\text{SuccessRate (SR)} = \frac{\text{Number of unique successful kisses GrA to GrB}}{\text{Number of unique kisses GrA to GrB}} \quad (10)$$

$$\text{BaselineSuccessRate (BSR)} = \frac{\text{Number of unique successful kisses GrA to U}}{\text{Number of unique kisses GrA to U}} \quad (11)$$

$$\text{Success Rate Improvement (SRI)} = \frac{\text{Success Rate (SR)}}{\text{Baseline Success Rate (BSR)}} \quad (12)$$

$$\text{Recall} = \frac{\text{Number of (Kissed Partners} \cap \text{Recommended Partners)}}{\text{Number of Kissed Partners}} \quad (13)$$

Kernel     Kernel  Standard   Correctly Predicted (%)      Correctly Predicted (%)
Function   Size    Deviation  Training Dataset             Test Dataset
                              (Mismatch)    (Match)        (Mismatch)    (Match)
Linear     -       -          44.4          62.4           10            90
Gaussian   40      0.5        79.6          63.9           73.3          60.1
Gaussian   40      1          79.5          63.9           67.2          59.8
Gaussian   50      0.5        70.5          62.1           58.4          53.7
Gaussian   50      1          62.7          60.4           58.5          56.6
Gaussian   70      0.5        68.9          64.5           64.6          59.9
Gaussian   70      1          59.7          51.0           56.9          48.2

Fig. 3. The SVM Model Performance
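The evaluation measures of Eqs. (10)-(13) translate directly into code. The following sketch assumes the kiss counts and the partner lists have already been extracted from the network; the function names are our own.

```python
def success_rate(kisses_a_to_b, successful_a_to_b):
    """Eq. (10): SR over kisses sent from GrA members to recommended GrB members."""
    return successful_a_to_b / kisses_a_to_b if kisses_a_to_b else 0.0

def baseline_success_rate(kisses_a_to_all, successful_a_to_all):
    """Eq. (11): BSR over all kisses sent by GrA members into the whole network."""
    return successful_a_to_all / kisses_a_to_all if kisses_a_to_all else 0.0

def success_rate_improvement(sr, bsr):
    """Eq. (12): SRI = SR / BSR."""
    return sr / bsr if bsr else float('inf')

def recall(kissed_partners, recommended_partners):
    """Eq. (13): fraction of actually kissed partners that were recommended."""
    kissed, recommended = set(kissed_partners), set(recommended_partners)
    return len(kissed & recommended) / len(kissed) if kissed else 0.0

# toy usage with the paper's headline numbers: SR = 0.31, BSR = 0.19
print(success_rate_improvement(0.31, 0.19))   # about 1.63
```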
3.3 Results and Discussion
The SVM models were trained by varying different parameters on the datasets, as shown in Figure 3. In the dataset, only a very small set of samples are positive matches. To avoid the model being overwhelmed by negative samples, a stratified training set was created. A set of at least 3500 unique users, with about 20
positive and 20 negative kiss responses per user, was chosen. This created a training set of about 144,430 records, which was used to train the SVM models. The test dataset was created with 24,498 records by randomly choosing users. 10-fold cross-validation experiments were performed and the average performance is shown in Figure 3. The best-performing SVM model was used in the proposed matching system. Figure 4 shows that the Success Rate (SR) decreases as the number of potential matches (GrB) recommended to a user in GrA is increased. This result confirms that the higher the total score generated by the proposed matching system, Match(GrAi, GrBj), the more relevant and accurate the matches are. For example, users with a higher total score in the top-5 recommendation list received the highest percentage of positive kiss replies. There are a number of null kiss replies in the dataset. A null kiss reply may later turn into a positive or a negative kiss reply. If all null kiss replies were to turn into positive kiss replies, a success rate (SR) of 66% could be obtained for the top-20 users. The BSR of the underlying online dating network is 19%. Figure 2 shows that the proposed system is always better. This result indicates that the potential matches offered by the system interest the user (as shown by Figure 5) and that the receivers also show high interest towards these users by sending positive kiss messages back, as shown by the SR in Figure 4 and by the increased recall (Figure 6). However, it can be seen that the value of SRI decreases with an increased number of recommendations, as shown in Figure 5. We conclude that more matching recommendations will attract user attention and trigger more kisses to be sent, but more recommendations will also lead to lower-quality recommendations. When recommending potential matches, the user is more interested in examining a small set of recommendations than a long list of candidates. Based on all results, a high-quality top-20 recommendation maximizes SRI without letting recall drop unsatisfactorily. Experiments have also been performed to determine which kind of users are more important for generating high-quality matches for the dating network: the similar users from the clusters, or the relationship-based users? Two sets of experiments were performed.
– In the first setting, the size of GrA and GrB is fixed at 200. The usual size of GrA is about 30 to 50, populated with ex-partners. More similar users obtained from the respective clusters are added into these two groups, compared to relationship-based users.
– In the second setting, the difference between the two groups, Diff(#GrA, #GrB), is covered by adding members from the respective cluster. In addition, only 10% of the size of GrA and GrB is added through clustering, to include new members and to increase the user coverage.
The results show that when more similar users rather than relationship-based users are added, the success rate improvement (SRI) is lower than when more relationship-based users are added to the current pairs. The SR and SRI obtained from the first setting are 0.19 and 1.0 respectively, whereas in the second setting the SR and SRI are 0.29 and 1.4 respectively, considering all suggested matching pairs.
Fig. 4. Top-n success rate and success rate improvement
Fig. 5. Sender’s Interests Prediction Accuracy
Fig. 6. Top-n recall performance
Empirical analysis ascertains that by utilising clustering to increase the size of GrA and GrB by a small amount and to equalise the two groups, the recommendation quality is improved and new users are also considered in the matching. Due to the use of small networks of relationship-based users, the proposed personalized social matching system is able to generate recommendations in an acceptable time frame: it takes about 2 hours to generate recommendations for 100,000 users, excluding offline activities such as the clustering of users, the training of the SVM model and the calculation of the importance table for members of a common gender and age to be used in the SVM. The proposed social matching system is able to generate high-quality recommendations for users. The quality of the recommendations is enhanced by the following techniques: 1) All recommendations are generated from good seed users who have more than thirty previous partners in a defined period. 2) The recommendations are among relationship-based users, who are identified by utilizing the social network's background knowledge. 3) To solve the cold-start issue while still ensuring recommendation quality, the system's add-on layer only groups users who are similar to the seed pairs; this method avoids introducing random users and preserves the relationships among users. 4) Three similarity scores are utilised to determine
the quality of a matching pair, by measuring the similarity level against the seed pairs and relationship-based users, and the compatibility between the matching pair. A decision tree model is used to produce the weights for these similarity scores.
4 Conclusion
The proposed system gathers relationship-based users, forms relationship-based user networks, explores the similarity level between relationship-based users and seed users, explores the compatibility between potential partners and then makes partner recommendations in order to increase the likelihood of a successful reply. This innovative system combines the following three algorithms to generate the potential partners: (1) an instance-based similarity algorithm for predicting the similarity between the seed users and relationship-based users, which yields potential high-quality recommendations and reduces the number of users that the matching system needs to consider; (2) a K-means based similar-user checking algorithm that helps to overcome problems that standard recommender techniques usually suffer from, including the absence of knowledge, the cold-start problem and sparse user data; and (3) a user compatibility algorithm that conducts two-way matching between users by utilising the SVM predictive data mining algorithm. Empirical analysis shows that the success rate has improved from the baseline result of 19% to 31% by using the proposed system. Acknowledgment: We would like to acknowledge the industry partners and the Cooperative Research Centre for Smart Services.
References
1. 2006 census quickstats. Number March 2010 (2006)
2. Anand, S.S., Mobasher, B.: Intelligent techniques for web personalization. Online Information Review (2005)
3. Bennett, K.P., Campbel, C.: Support vector machines: Hype or hallelujah? SIGKDD Explorations 2, 1–13 (2000)
4. Brozovsky, L., Petricek, V.: Recommender system for online dating service (2005)
5. Fiore, A., Shaw Taylor, L., Zhong, X., Mendelsohn, G., Cheshire, C.: Who's right and who writes: People, profiles, contacts, and replies in online dating. In: Hawai'i International Conference on System Sciences 43, Persistent Conversation Minitrack (2010)
6. Kazienko, P., Musial, K.: Recommendation framework for online social networks. In: 4th Atlantic Web Intelligence Conference (AWIC 2006). IEEE Internet Computing (2006)
7. Linden, G., Smith, B., York, J.: Amazon.com recommendations: Item-to-item collaborative filtering. IEEE Internet Computing 7 (2003)
8. Markey, P.M., Markey, C.N.: Romantic ideals, romantic obtainment, and relationship experiences: The complementarity of interpersonal traits among romantic partners. Journal of Social and Personal Relationships 24, 517–534 (2007)
9. Smith, A.: Exploring online dating and customer relationship management. Online Information Review 29, 18–33 (2005)
On Sampling Type Distribution from Heterogeneous Social Networks

Jhao-Yin Li and Mi-Yen Yeh

Institute of Information Science, Academia Sinica
128 Academia Road, Section 2, Nankang, Taipei 115, Taiwan
{louisjyli,miyen}@iis.sinica.edu.tw
Abstract. Social network analysis has drawn the attention of many researchers recently. With the advance of communication technologies, the scale of social networks grows rapidly. To capture the characteristics of very large social networks, graph sampling is an important approach that does not require visiting the entire network. Prior studies on graph sampling focused on preserving properties such as the degree distribution and clustering coefficient of a homogeneous graph, where each node and edge is treated equally. However, a node in a social network usually has its own attribute indicating a specific group membership or type. For example, people are of different races or nationalities. The link between individuals of the same or different types can thus be classified into intra- and inter-connections. Therefore, it is important to ask whether a sampling method can preserve the node and link type distribution of a heterogeneous social network. In this paper, we formally address this issue. Moreover, we apply five algorithms to real Twitter data sets to evaluate their performance. The results show that respondent-driven sampling works well even when the sample sizes are small, while random node sampling works best only under large sample sizes.
1 Introduction

Social network analysis has drawn more and more attention from the data mining community in recent years. By modeling the social network as a graph structure, where a node is an individual and an edge represents the relationship between individuals, many studies have addressed graph mining techniques to discover interesting knowledge from social networks. With the advance of communication technologies and the explosion of social web applications such as Facebook and Twitter, the scale of the generated network data is usually very large. Apparently, it is hardly feasible to explore and store the entire large network before extracting the characteristics of these social networks. Therefore, it is critical to develop an efficient and systematic approach to gathering data of an appropriate size while keeping the properties of the original network. To scale down the network data to be processed, there are two possible strategies: graph summarization and graph sampling. Graph summarization [1,2,3,4,5,6] aims to condense the original graph into a more compact form. There are lossless methods, where the original graph can be recovered from the summary graph, and loss-tolerant methods, where some information may be lost during the summarization. To obtain the summary graph, these methods usually need to examine the entire network first. On the other
hand, sampling is a way of data collection that selects a subset of the original data. By following some rules for sampling nodes and edges, a subgraph can be constructed in which the characteristics of the original graph are preserved. In contrast to graph summarization, a big advantage of sampling is that only a controlled number of nodes, instead of the entire network, needs to be visited. In this work, as a result, we focus on sampling from large social networks. Prior studies on graph sampling [7,8], however, focused only on preserving statistics such as degree distribution, hop-plot, and clustering coefficient on homogeneous graphs, where each node and link is treated equally. In reality, a social network is heterogeneous: each individual has its own attribute indicating a specific group membership or type. For example, people are of different races or nationalities. The link between individuals of the same or different types can thus be classified into intra-connections and inter-connections. The type distribution of nodes and the proportion of intra/inter-connection links is also key information that should be preserved to understand a heterogeneous social network, which, to the best of our knowledge, has not yet been addressed in previous graph sampling work in the data mining community. To this end, we propose two goals for sampling a heterogeneous social network. The first is the type distribution preserving goal: given a desired number of nodes as the sample size, a subgraph Gs is generated by some sampling method, and the type distribution of Gs, Dist(Gs), is expected to be the same as that of the original graph G. The second goal is the intra-relationship preserving goal: we expect the ratio of the number of intra-connections to the total number of edges in Gs to be preserved. In search of a better solution, we adopt five possible methods, Random Node Sampling (RNS), Random Edge Sampling (RES), Ego-Centric Exploration Sampling (ECE) [9], Multiple Ego-Centric Sampling (MES) and Respondent-Driven Sampling (RDS) [10], to see their effects on sampling the type distribution of heterogeneous social networks. RNS and RES are two methods that select nodes and edges randomly until some criteria are met. ECE is a chain-referral-based sampling method proposed in [9]. Chain-referral sampling usually starts from a node called the ego and selects neighbor nodes uniformly at random, wave by wave [9]. MES is a variation of ECE that we design, in which the sampling starts from multiple initial egos. Finally, we adopt RDS, a sampling method used in social science for studying hidden populations [10]. Many works on social network analysis focus on the majority, i.e., the greatest or second greatest connected components, of the network. However, sometimes a small or hidden group of a network carries more interesting messages. For example, the population of drug users or patients with rare diseases is usually hidden and relatively small. Essentially, RDS is a method combining snowball sampling, in which the recruiting of future samples is from acquaintances of the current subject, with a Markov Chain model to generate unbiased samples. In our implementation, we adopt RDS to simulate the human recruiting process and indicate how the Markov Chain is computed from the collected samples. To evaluate the sampling quality of the above five methods, we conduct experiments on the Twitter data sets provided in [11].
We measure the difference of the type distribution between the sampling results and the original network by two indexes: error ratio and D-statistic of Kolmogorov-Smirnov Test. In addition, we measure the
difference of the intra-connection percentage between the samples and the original network. The results show that RDS works best in terms of preserving the type distribution and the intra-connection percentage when the sample size is small. MES and ECE perform next best, with MES showing a small improvement over ECE in the node type distribution. Finally, the sampling quality of RNS and RES is less stable; RNS outperforms the other methods only when the sample size is large. The remainder of the paper is organized as follows. Related work is discussed in Section 2. The problem statement is formally defined in Section 3. The detailed implementation of the five sampling algorithms is described in Section 4. In Section 5, we show the experimental results. Finally, the paper is concluded in Section 6.
2 Related Work

As the scale of social network data grows very large, graph sampling is a useful technique for collecting a smaller subgraph, without visiting the entire original network, while preserving some properties of the original network. Krishnamurthy et al. [12] found that simple random node selection at a 30% sampling size is already able to preserve some properties of an undirected graph. Leskovec and Faloutsos [7] provided a survey of three major kinds of sampling methods on graphs: sampling by random node selection, sampling by random edge selection and sampling by exploration. The sampling quality in terms of preserving many graph properties, such as degree distribution, hop-plot, and clustering coefficient, is examined. Moreover, they proposed a Forest Fire sampling algorithm, which preserves the power-law property of a network very well during the sampling process. They concluded that there is no perfect sampling method for preserving every property under all conditions; the sampling performance depends on the criteria and graph structures. Hübler et al. [8] further proposed Metropolis algorithms to obtain representative subgraphs. However, none of these sampling works considered the heterogeneous network, where each node may have its own attribute indicating a specific group membership or type, and a link (edge) may connect two nodes of the same or different types. The type distribution of nodes and the proportion of intra/inter-connection links is also key information for understanding a heterogeneous social network, which, to the best of our knowledge, has not yet been addressed in previous graph sampling works in the data mining community. To sample the type distribution in a heterogeneous network, we further introduce the Respondent-Driven Sampling (RDS) proposed by Heckathorn [10]. RDS is a well-known sampling approach for studying hidden populations, which combines snowball sampling with a Markov Chain process to produce unbiased sampling results for the hidden population. Furthermore, a newer estimator for the sampling results was designed [13] based on the reciprocity assumption, i.e., that the number of edges from group A to group B is equal to that from group B to group A in a directed graph with two groups (an undirected graph naturally complies with this assumption). In the real case of a directed heterogeneous network, however, this assumption does not usually hold. For example, the Twitter data sets we use in the experiments do not have the reciprocity property between the tweets among users. In our study, as a result, we apply and simulate the original RDS [10] to sample the type distribution of a large heterogeneous social network.
3 Problem Statement

Given a graph G = <V, E>, V denotes a set of n vertexes (nodes, individuals) vi and E is a set of m directed edges (links, relationships) ei. First, we define the heterogeneous graph, which models the heterogeneous social network we are interested in.

Definition 1. A heterogeneous graph G with k types is a graph where each node belongs to only one specific type out of k types. More specifically, given a finite set L = {L1, ..., Lk} denoting k types, the type of each node vi is T(vi) = Li, where Li ∈ L. If the number of vertexes of G is n, and the number of nodes belonging to type Li is Ni, then the condition $\sum_{i=1}^{k} N_i = n$ must hold. In other words, (nodes ∈ Li) ∩ (nodes ∈ Lj) = ∅, where i ≠ j.

The edges between nodes of different types are defined as follows.

Definition 2. An edge ei connecting two nodes vi and vj is an intra-connection edge if T(vi) = T(vj). Otherwise, it is an inter-connection edge.

With the above two definitions, our problem statements are presented as follows.

Problem 1. Type distribution preserving goal. Given a desired number of nodes, i.e., the sample size, a subgraph Gs is generated by some sampling method. The type distribution of Gs, Dist(Gs), is expected to be the same as that of the original graph G. That is, d(Dist(Gs), Dist(G)) = 0, where d() denotes the difference between two distributions. In other words, the percentage of each Ni in Gs is expected to be the same as that in G.

Problem 2. Intra-relationship preserving goal. Given a desired number of nodes, i.e., the sample size, a subgraph Gs is generated by some sampling method. The ratio of the number of intra-connections to the total number of edges should be preserved. That is, d(IR(Gs), IR(G)) = 0. Correspondingly, the inter-relationship ratio, which equals 1 − IR(Gs), is also preserved.

An example illustrates these two problems. Consider a social network with 180 nodes (n = 180) and 320 edges (m = 320). Suppose there are 3 groups in total (k = 3), containing 20, 100, and 60 people respectively. Thus, the type distribution of the network, Dist(G), is (0.11, 0.56, 0.33). Also suppose there are 200 intra-connection edges; the intra-connection ratio is thus 0.625. Our goal is to find a sampling method that best preserves the type distribution and the intra-connection ratio. Suppose that a subgraph Gs is sampled at the given 10% sampling rate, i.e., 18 nodes. If the numbers of nodes of groups 1, 2 and 3 are 5, 8 and 5, then the type distribution is (0.28, 0.44, 0.28). In addition, if there are 30 intra-connection edges out of 50 sampled edges, then the intra-connection ratio is 0.6. In the experiment section, we provide several indexes to compute the difference between these distributions and ratios.
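The two quantities to be preserved, Dist(G) and IR(G), can be computed as in the following sketch; the data structures chosen here, a node-to-type map and an edge list, are assumptions made for the example.

```python
from collections import Counter

def type_distribution(node_types):
    """Dist(G): proportion of nodes of each type.  node_types maps node -> type."""
    counts = Counter(node_types.values())
    n = len(node_types)
    return {t: c / n for t, c in counts.items()}

def intra_connection_ratio(edges, node_types):
    """IR(G): fraction of edges whose endpoints share the same type."""
    intra = sum(1 for u, v in edges if node_types[u] == node_types[v])
    return intra / len(edges) if edges else 0.0

# toy usage
types = {1: 'A', 2: 'A', 3: 'B'}
edges = [(1, 2), (2, 3), (1, 3)]
print(type_distribution(types))              # {'A': 0.667, 'B': 0.333}
print(intra_connection_ratio(edges, types))  # 0.333
```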
4 Sampling Algorithms

The five algorithms for sampling large heterogeneous networks described below fall into three categories: random-based sampling, chain-referral sampling and indirect inference sampling. For random-based sampling, we use Random Node Sampling and Random Edge Sampling. Chain-referral sampling includes Ego-Centric Exploration Sampling and Multiple Ego-Centric Exploration Sampling. Finally, we adopt Respondent-Driven Sampling, an indirect sampling method that originated in the social sciences.

4.1 Random-Based Sampling

Random Node Sampling (RNS) is an intuitive procedure that selects the desired number of nodes uniformly at random from the given graph. First, RNS picks a set of nodes into a list. Then it constructs the vertex-induced subgraph by checking whether there are edges between the selected nodes in the original graph. The logic of Random Edge Sampling (RES) is also intuitive and similar to RNS: edges are selected uniformly at random. Once an edge is selected during the sampling process, the two nodes it connects, head and tail, are also included. Note that if the node number exceeds the desired one when the latest edge is selected, one of the nodes, say the head node, is excluded.

4.2 Chain-Referral Sampling

Chain-referral sampling is also known as exploration-based sampling. First, we describe the Ego-Centric Exploration Sampling (ECE) method proposed by Ma et al. [9]; then we propose a variation that improves on ECE. Essentially, ECE is based on Random Walk (RW) methods [14]. Starting with a random node, RW chooses exactly one neighbor of that start node as the next stop. Following the same step, RW visits as many nodes as the desired sample size; the visited nodes and the edges along the walking path are collected. Similar to RW, ECE first randomly chooses a starting node called the ego. Then each of its neighbor nodes, in contrast to RW where only one neighbor is considered, is chosen or not by ECE according to a probability p. The number of nodes chosen is expected to be p times the degree of that ego. Next, each newly selected node is itself a new ego, and the algorithm repeats the same step iteratively until the desired sample size is reached. Whenever the sampling process cannot move to the next wave, we select a new ego and restart this procedure to continue the sampling. Consider the case where the start ego of ECE falls in a strongly connected component where individuals tend to be connected to those of the same type, e.g., the same race or nationality. This may trap ECE into sampling only the same or very few types of nodes. To deal with this issue, we further propose the Multiple Ego-Centric Exploration Sampling (MES) method, which allows multiple egos at the start of the sampling. In this way, we have a better chance of reaching nodes of different types and can avoid the bias of over-sampling nodes of a particular type.
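A minimal sketch of RNS and of ECE with selection probability p, as described above, is shown below; the adjacency-list representation and all parameter names are assumptions made for illustration.

```python
import random

def random_node_sampling(nodes, edges, sample_size, seed=None):
    """RNS: pick nodes uniformly at random and keep the vertex-induced subgraph."""
    rng = random.Random(seed)
    chosen = set(rng.sample(list(nodes), sample_size))
    induced = [(u, v) for u, v in edges if u in chosen and v in chosen]
    return chosen, induced

def ego_centric_exploration(adj, sample_size, p=0.8, seed=None):
    """ECE: starting from a random ego, include each neighbour with probability p,
    wave by wave; restart from a fresh ego whenever the exploration stalls."""
    rng = random.Random(seed)
    nodes = list(adj)
    sampled, frontier = set(), []
    while len(sampled) < sample_size:
        if not frontier:                      # stalled: pick a new ego
            frontier = [rng.choice(nodes)]
        ego = frontier.pop()
        sampled.add(ego)
        for nb in adj.get(ego, ()):
            if nb not in sampled and rng.random() < p:
                frontier.append(nb)
    return sampled

# toy usage
adj = {1: [2, 3], 2: [1, 3], 3: [1, 2, 4], 4: [3]}
edges = [(1, 2), (1, 3), (2, 3), (3, 4)]
print(random_node_sampling(adj.keys(), edges, 2, seed=0))
print(ego_centric_exploration(adj, 3, seed=0))
```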
Although the chain-referral sampling algorithms can both produce a reasonably connected subgraph and preserve community structure, a rich-get-richer flavor is inherent in this family of sampling techniques.

4.3 Respondent-Driven Sampling

To study hidden populations in social science, Respondent-Driven Sampling (RDS) [10], a non-probability method, has been proposed. Generally, RDS contains two phases: a snowball sampling phase and a Markov Chain process. Snowball sampling works similarly to ECE/MES, in that the recruiting of future samples is from acquaintances of the current subject. To compensate for collecting the data in a non-random way, the second phase of RDS, the Markov Chain process, helps to generate unbiased samples. As opposed to conventional sampling methods, the statistics are not obtained directly from the samples, but indirectly inferred from the social network information constructed through them. We simulate the snowball sampling phase of RDS as follows. First, the initial seeds, or individuals, must be chosen from a limited number of convenience samples; we simply select these initial seeds at random. In the original RDS, each chosen seed is rewarded to encourage further recruiting; here, we simply make all recruited nodes continue to recruit their peers. In addition, we set a coupon limit, which is the number of peers an individual can recruit, to prevent the sampling from favoring individuals who have many acquaintances. Then, we simulate the Markov Chain process. Suppose there are k types of people in total in the network we study. From the collected samples we can organize a k × k recruitment matrix M, where the element S_{i,j} of M represents the percentage of type-j people among those recruited by people of type i. An example is illustrated in the following sample matrix:

$$M = \begin{pmatrix} S_{1,1} & \cdots & S_{1,j} \\ \vdots & \ddots & \vdots \\ S_{i,1} & \cdots & S_{i,j} \end{pmatrix}$$

Suppose that the recruiting would reach an equilibrium state if more samples were recruited than we currently have; that is, the type distribution would stabilize at E = (E_1, ..., E_i, ..., E_k), where E_i is the proportion of type i at equilibrium. The law of large numbers for regular Markov Chain processes provides a way of computing that equilibrium state of M. It is computed by solving the following linear equations:

$$E_1 + E_2 + \cdots + E_k = 1$$
$$S_{1,1} E_1 + S_{2,1} E_2 + \cdots + S_{k,1} E_k = E_1$$
$$S_{1,2} E_1 + S_{2,2} E_2 + \cdots + S_{k,2} E_k = E_2$$
$$\vdots$$
$$S_{1,k-1} E_1 + S_{2,k-1} E_2 + \cdots + S_{k,k-1} E_k = E_{k-1}$$

For instance, if there are only two groups in a social network, Male (m) and Female (f), the solution is $E_m = \frac{S_{fm}}{1 - S_{mm} + S_{fm}}$ and $E_f = 1 - E_m$, which thus provides the information about the
type distribution of the social network. According to [15], if M is a regular transition matrix, there is a unique E. Also, M to the power of N, $M^N$, approaches a probability matrix in which each row is identical and equal to E. Therefore, we can alternatively estimate E by finding a large enough N that makes $M^N$ converge and taking a row of it.
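A small sketch of this alternative estimation, repeatedly multiplying the (row-normalised) recruitment matrix by itself until its rows converge to E, follows; the example matrix values are invented for illustration.

```python
import numpy as np

def equilibrium(recruitment_matrix, tol=1e-10, max_iter=1000):
    """Estimate the equilibrium type distribution E of the recruitment matrix M
    by repeated squaring until the entries of M^N stop changing."""
    m = np.asarray(recruitment_matrix, dtype=float)
    m = m / m.sum(axis=1, keepdims=True)        # make each row a distribution
    for _ in range(max_iter):
        m_next = m @ m
        if np.max(np.abs(m_next - m)) < tol:
            break
        m = m_next
    return m[0]                                  # any row of the converged matrix

# two-group example from the text: E_m = S_fm / (1 - S_mm + S_fm)
M = [[0.7, 0.3],     # recruits of male recruiters: 70% male, 30% female
     [0.4, 0.6]]     # recruits of female recruiters: 40% male, 60% female
print(equilibrium(M))          # approx [0.571, 0.429]
```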
5 Evaluation

In this section, we introduce the data sets we used and present our experimental results for all five sampling methods, for both the type distribution preserving goal and the intra-relationship preserving goal. We also discuss the effects on the statistics when the number of types and the sampling size are varied. The sampling probability p was set to 0.8 for ECE and MES based on the suggestion in [12]. The coupon limit for RDS was set to 5. We implemented all algorithms in VC++ and ran them on a PC equipped with 2.66 GHz dual CPUs and 2 GB of memory. Moreover, we ran each experiment 200 times for each setting and computed the average to get a stable and valid result.

5.1 Twitter Data Sets

We conducted our experiments on the Twitter data sets provided in [11]. This data set contains about 10.5 million tweets from 200,000 users, along with information such as time zone and location, between 2006 and 2009. The full following/follower information constitutes the directed heterogeneous graph we used; however, we kept only those users with a location label. Starting from those 200,000 user IDs, we included all their neighbors that had valid location labels. We thus constructed a social network containing more than 200,000 nodes. The resulting social network data consists of n = 403874 accounts and e = 689541 relationships (tweet behavior of followers and followees) among those accounts. According to the location label that records the city/country of each user, we can divide the users into several types according to geographical areas. The division includes three settings: 3, 5, and 7 types (groups); the terms group and type are used interchangeably in the following. The characteristics of these settings are listed in Table 1. The 7 groups are: East Asia, West Asia, Europe, North America, South America, Australia and Africa. The overall intra-connection ratio is 0.303, which means that about 30% of users contact those of the same region type. In the second setting, the 5 region groups are Asia, Europe, America, Australia and Africa, and the overall intra-connection ratio is 0.466. In the third setting, the 3 region groups are Asia, Europe and America, and the overall intra-connection ratio is 0.486, meaning that almost half of the users contact those in the same area. Detailed information for the different groups is presented in Table 1.

5.2 Evaluation Index

For the type distribution preserving goal, we used two statistics to measure the type distribution difference between the subgraph Gs extracted by the five algorithms and
Table 1. Summary of the Twitter data sets
7 groups:
  group                    1        2        3        4        5       6       7
  group ratio              0.24     0.246    0.149    0.196    0.142   0.023   0.004
  node count               97053    99177    60318    79290    57357   9206    1473
  intra-connection ratio   0.324    0.332    0.335    0.265    0.258   0.209   0.02
  intra-edge count         55943    53170    33132    38360    21558   2798    70
  edge count               185242   160094   98819    144920   83531   13378   3530

5 groups:
  group                    1        2        3        4        5
  group ratio              0.486    0.149    0.338    0.023    0.004
  node count               196230   60318    136647   9206     1473
  intra-connection ratio   0.574    0.335    0.381    0.209    0.02
  intra-edge count         198306   33132    86943    2798     70
  edge count               345336   98819    228451   13378    3530

3 groups:
  group                    1        2        3
  group ratio              0.509    0.153    0.381
  node count               205436   61791    136647
  intra-connection ratio   0.598    0.334    0.381
  intra-edge count         214351   34149    86943
  edge count               358714   102349   228451
the original graph G. First, the Error Ratio (ER) sums up the proportion differences over all types. It is defined as
$$\mathrm{ER} = \frac{\sum_{i=1}^{k} |O(i) - E(i)|}{2 \cdot SN},$$
where O(i) is the number of nodes of the i-th group in Gs, E(i) is the theoretical number of nodes that group should have in the sampled graph according to type i's real proportion in G, and SN is the sample size. The other evaluation statistic is the D-statistic of the Kolmogorov-Smirnov test; we simply used it as an index rather than conducting a hypothesis test. The D-statistic, which measures the agreement between two distributions, is defined as $D = \sup_x |F'(x) - F(x)|$, where F'(x) is the type distribution of Gs and F(x) is that of G. ER provides a percentage-like form of the total error between the type distributions of Gs and G, whereas the D-statistic provides information about the cumulative error within the structures of Gs and G. For the intra-relationship preserving goal, we used the Intra-Relation Error (IRE) to measure the difference in the intra-relationship ratio between Gs and G. It is defined as $\mathrm{IRE} = |I'/m' - I/m|$, where I' and I denote the numbers of intra-connection edges in Gs and G respectively, and m' and m are the total numbers of edges in Gs and G respectively.

5.3 Results of the Type Distribution Preserving Goal

This goal is to make the type distribution of the sampled graph Gs as similar as possible to that of the original graph G. The sample size varied from 50 to 200000 nodes, i.e., roughly a 0.1% to 50% sampling rate. The experimental results in Fig. 1(a) and (b) show the error in the type distribution for the 7-group Twitter data set. In general, the error decreased as the sample size increased. Fig. 1(a) shows the ER of all five sampling methods; we found that RDS performed best when the sample size was very small, but improved slowly at large sample sizes. Because of the fast convergence of the Markov transition, RDS can provide accurate results even when the information from the samples is limited. However, since the Markov process converged very fast, the result was essentially fixed once enough nodes had been collected, so the subsequently selected entities failed to improve the accuracy. On the other hand, MES outperformed ECE only at small sample sizes.
[Figure 1: six panels, (a)-(f), plotting Error Ratio and D-statistic against sample size (log scale, roughly 10 to 1e+006) for the five sampling methods RNS, RES, ECE, MES and RDS, with the number of groups set to 7, 5 and 3; plots omitted.]
Fig. 1. ER and D-statistic at different sample sizes when the group number is 3, 5 and 7
This was because MES could avoid getting stuck among the members of a particular group and thus provided more accurate results. When the sample size increased, ECE had a higher chance to travel from group to group and thus yielded results similar to MES. Finally, RNS and RES behaved unstably and were sensitive to the sample size: when the sample size was very small, both methods produced poor results, but they improved significantly and obtained the best sampling results once the sample size was large enough. In Fig. 1(b), we found similar behavior patterns for all sampling methods except RDS at small sample sizes. This indicates that RDS relies heavily on the information provided by the recruitment matrix: when the sample size was very small, the recruitment matrix could not provide enough information for the Markov chain process and thus produced a worse result.
For the 5-group Twitter data set, the patterns of all five methods were similar to those of the 7-group data set, as shown in Fig. 1(c) and (d). This is also true for the 3-group setting, as shown in Fig. 1(e) and (f). Only at small sample sizes did the results show that the error decreased as the number of groups became smaller; we discuss this further in Section 5.5. Note that the results were similar for both ER and the D-statistic. This is due to a property of the Twitter data sets: since ER measures the total error, it is sensitive to the performance on the largest or relatively large groups, and the D-statistic measures the cumulative error, which in most cases is also dominated by the larger groups. For these reasons, we observed similar patterns for ER and the D-statistic.

5.4 Results of the Intra-Relationship Preserving Goal

Our second goal is to preserve the relationships among different groups in a network. Fig. 2 presents the experimental results for this goal. We found that RDS produced the best result even at small sample sizes, which indicates that the sampling phase of RDS not only provides the network information to the Markov chain process but also partly preserves the relationship information (different tie types). Still, its improvement slowed down when the sample size became very large. On the other hand, MES had slightly higher errors than ECE at small sample sizes: since the original purpose of MES is to avoid the sampling bias of the chain-referral procedure with respect to the type distribution, it does not consider the relationships among individuals (edges of the graph). However, the advantage of MES becomes visible as the sample size increases. RES outperformed RNS because it is an edge-based random selection and thus has an advantage over node-based random selection for this goal. Finally, RNS failed to describe the relationships among individuals at small sample sizes, since it tends to produce a set of disconnected nodes (especially when the network is sparse), which drives the observed intra-connection ratio toward 0. The situation changed as the sample size increased: because RNS performs a vertex-induced procedure after sampling enough nodes into the sample pool, both in-edges and out-edges between sampled node pairs are included, so more selected edges resulted in better performance on the intra-relationship preserving goal. We omit the results of the 5-group data set due to the space limit; its IRE values lie between those of the 7-group and 3-group settings.

5.5 Analysis of the Effects of the Number of Groups and the Sample Size

Here we provide some remarks on the performance under different numbers of groups. The sample size chosen here was 100. We present only ER in Fig. 3(a) and omit the D-statistic results, since they show similar patterns. We found that both ER and the D-statistic are positively affected by the number of groups (k). This is reasonable: the more groups a social network has, the more error we observe, leading to lower accuracy. On the other hand, as shown in Fig. 3(b), the number of groups k is almost independent of the intra-relationship error. Note that, since RNS cannot sample any edge in the small-sample setting, its IRE there equals the intra-connection ratio of the original graph.
[Figure 2: two panels, (a) and (b), plotting Intra-Relation Error against sample size (log scale) for RNS, RES, ECE, MES and RDS, with the number of groups set to 7 and 3; plots omitted.]
Fig. 2. IRE at different sample sizes when the group number is 7 and 3
[Figure 3: two bar charts, (a) Error Ratio and (b) Intra-Relation Error at group numbers 3, 5 and 7 for RNS, RES, ECE, MES and RDS; plots omitted.]
Fig. 3. Error Ratio and IRE at different group numbers
As a result, the IRE of RNS in Fig. 3(b) differs significantly across the group settings. One of the most important issues in any sampling problem is how large the sample size should be in order to obtain good enough results, in terms of sampling accuracy, on our two preservation goals. Based on all of our experimental results (Fig. 1 and Fig. 2), we conclude that 15% is a good operating point: when the sample size grew beyond 15% of the population (around 60000 nodes), all statistics were below 0.05 no matter which sampling method was used. In other words, the sampling quality improved only marginally when the sample size was made even larger (up to 50% in this study). Although the purpose and research target are different, our finding is similar to that in [7], which reached the same conclusion.
6 Conclusion

In this study, we formulated a novel and meaningful sampling problem on heterogeneous social networks, consisting of the type distribution preserving and the intra-relationship preserving goals, and applied five algorithms to it. For preserving the type distribution, we found that RDS was a good method, especially at small sample
sizes. MES improved on ECE slightly at small sample sizes. In addition, the random-selection-based methods were sensitive to the sample size and failed to provide reasonable results at small sample sizes. For the link relationship preserving goal we reached a similar conclusion, and we discussed the differences. Furthermore, we examined the results under different group numbers. Finally, based on our findings, we provided a rule of thumb: a 15% sample size should be large enough for both the type distribution preserving and the intra-relationship preserving sampling problems.
References
1. Navlakha, S., Rastogi, R., Shrivastava, N.: Graph summarization with bounded error. In: Proc. of ACM SIGMOD Int. Conf. on Management of Data, pp. 419-432 (2008)
2. Gibson, D., Kumar, R., Tomkins, A.: Discovering large dense subgraphs in massive graphs. In: Proc. of Int. Conf. on Very Large Data Bases, p. 732 (2005)
3. Raghavan, S., Garcia-Molina, H.: Representing web graphs. In: Proc. of IEEE Int. Conf. on Data Engineering, pp. 405-416 (2003)
4. Kumar, R., Raghavan, P., Rajagopalan, S., Tomkins, A.: Extracting large-scale knowledge bases from the web. In: Proc. of Int. Conf. on Very Large Data Bases, pp. 639-650 (1999)
5. Li, C.T., Lin, S.D.: Egocentric Information Abstraction for Heterogeneous Social Networks. In: Proc. of Int. Conf. on Advances in Social Network Analysis and Mining, pp. 255-260 (2009)
6. Tian, Y., Hankins, R., Patel, J.: Efficient aggregation for graph summarization. In: Proc. of ACM SIGMOD Int. Conf. on Management of Data, pp. 567-580 (2008)
7. Leskovec, J., Faloutsos, C.: Sampling from large graphs. In: Proc. of ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, p. 636 (2006)
8. Hübler, C., Kriegel, H., Borgwardt, K., Ghahramani, Z.: Metropolis algorithms for representative subgraph sampling. In: Proc. of IEEE Int. Conf. on Data Mining, pp. 283-292 (2008)
9. Ma, H., Gustafson, S., Moitra, A., Bracewell, D.: Ego-centric Network Sampling in Viral Marketing Applications. In: Int. Conf. on Computational Science and Engineering, pp. 777-781 (2009)
10. Heckathorn, D.: Respondent-driven sampling: a new approach to the study of hidden populations. Social Problems 44, 174-199 (1997)
11. Choudhury, M.D.: Social datasets by Munmun De Choudhury (2010), http://www.public.asu.edu/~mdechoud/datasets.html
12. Krishnamurthy, V., Faloutsos, M., Chrobak, M., Lao, L., Cui, J.-H., Percus, A.G.: Reducing large internet topologies for faster simulations. In: Boutaba, R., Almeroth, K.C., Puigjaner, R., Shen, S., Black, J.P. (eds.) NETWORKING 2005. LNCS, vol. 3462, pp. 328-341. Springer, Heidelberg (2005)
13. Heckathorn, D.: Respondent-driven sampling II: deriving valid population estimates from chain-referral samples of hidden populations. Social Problems 49, 11-34 (2002)
14. Lovász, L.: Random walks on graphs: A survey. Combinatorics, Paul Erdős is Eighty 2, 1-46 (1993)
15. Kemeny, J.G., Snell, J.L.: Finite Markov Chains, pp. 69-72. Springer, Heidelberg (1960)
Ant Colony Optimization with Markov Random Walk for Community Detection in Graphs
Di Jin (1,2), Dayou Liu (1), Bo Yang (1), Carlos Baquero (2), and Dongxiao He (1)
(1) College of Computer Science and Technology, Jilin University, Changchun
{jindi.jlu,hedongxiaojlu}@gmail.com, {liudy,ybo}@jlu.edu.cn
(2) CCTD/DI, University of Minho, Braga, Portugal
[email protected]
Abstract. The network clustering problem (NCP) is the problem associated with the detection of network community structures. Building on Markov random walks, we address this problem with a new ant colony optimization strategy, named ACOMRW, which improves prior results on the NCP and does not require knowledge of the number of communities present in a given network. The framework of ant colony optimization is taken as the basic framework of the ACOMRW algorithm. At each iteration, a Markov random walk model is taken as the heuristic rule; all of the ants' local solutions are aggregated into a global one through clustering ensemble, which is then used to update a pheromone matrix. The strategy relies on the progressive strengthening of within-community links and the weakening of between-community links. Gradually this converges to a solution where the underlying community structure of the complex network becomes clearly visible. The performance of the ACOMRW algorithm was tested on a set of benchmark computer-generated networks, as well as on real-world network data sets. Experimental results confirm the validity of and the improvements achieved by this approach.
Keywords: Network Clustering, Community Detection, Ant Colony Optimization, Clustering Ensemble, Markov Random Walk.
1 Introduction
Many complex systems in the real world exist in the form of networks, such as social networks, biological networks, Web networks, etc., which are also often classified as complex networks. Complex network analysis has been one of the most popular research areas in recent years due to its applicability to a wide range of disciplines [1,2,3]. While a considerable body of work addressed basic statistical properties of complex networks such as the existence of "small world effects" and the presence of "power laws" in the link distribution, another property that has attracted particular attention is that of "community structure": nodes in a network are often found to cluster into tightly-knit groups with a high density of within-group edges and a lower density of between-group edges [3]. Thus, the goal of network clustering algorithms is to uncover the underlying community structure in given complex networks.
The research on complex network clustering problems is of fundamental importance. It has both theoretical significance and practical applications in analyzing network topology, understanding network function, unfolding network patterns and forecasting network activities. It has been used in many areas, such as terrorist organization recognition, organization management, biological network analysis, and Web community mining [4]. So far, many network clustering algorithms have been developed. In terms of the basic strategies adopted, they fall into two main categories: optimization-based and heuristic-based methods. The former solve the NCP by transforming it into an optimization problem and trying to find an optimal solution for a predefined objective function, such as the network modularity employed in several algorithms [5,6,7,8]. In contrast, there is no explicit optimization objective in the heuristic-based methods, which solve the NCP based on intuitive assumptions or heuristic rules, as in the Girvan-Newman (GN) algorithm [3], the Clique Percolation Method (CPM) [9], Finding and Extracting Communities (FEC) [10], Community Detection with Propinquity Dynamics (CDPD) [11], and Opinion Dynamics with Decaying Confidence (ODDC) [12]. Although many network clustering algorithms have been proposed, how to further improve clustering accuracy, and especially how to discover a reasonable network community structure without prior knowledge (such as the number of clusters in the network), is still an open problem. To address this problem, a random walk based ant colony optimization for the NCP, inspired by [10], is proposed in this paper. In this algorithm, each ant detects its community by using the transition probability of a random walk as the heuristic rule; in each iteration, all the ants collectively produce the current solution via the concept of clustering ensemble [13] and update their pheromone matrix using this solution; finally, after the algorithm has converged, the pheromone matrix is analyzed to obtain the clustering solution for the target network.
2 Algorithm

2.1 The Main Idea
Let N = (V, E) denote a network, where V is the set of nodes (or vertices) and E is the set of edges (or links). Let a k-way partition of the network be defined as π = {N_1, N_2, ..., N_k}, where N_1, N_2, ..., N_k satisfy $\bigcup_{1 \le i \le k} N_i = N$ and $\bigcap_{1 \le i \le k} N_i = \emptyset$. If partition π has the property that within-community edges are dense and between-community edges are sparse, it is called a well-defined community structure of this network. In a network, let p_ij be the probability that an agent freely walks from any node i to its neighbor node j within one step; this is also called the transition probability of a random walk. In terms of the adjacency matrix of N, A = (a_ij)_{n×n}, p_ij is defined by
$$p_{ij} = \frac{a_{ij}}{\sum_r a_{ir}}. \qquad (1)$$
Let $D = \mathrm{diag}(d_1, \ldots, d_n)$, where $d_i = \sum_j a_{ij}$ denotes the degree of node i. Let P be the transition probability matrix of the random walk; we have
$$P = D^{-1} A. \qquad (2)$$
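As a small illustrative sketch (ours, not the authors' code), Eqs. (1) and (2) amount to row-normalizing the adjacency matrix; the toy network below is a hypothetical example.

    import numpy as np

    def random_walk_transition(A):
        """P = D^{-1} A, i.e., p_ij = a_ij / sum_r a_ir (Eqs. (1) and (2))."""
        degrees = A.sum(axis=1)
        return A / degrees[:, None]

    # Hypothetical toy network: two triangles joined by one edge.
    A = np.array([[0, 1, 1, 0, 0, 0],
                  [1, 0, 1, 0, 0, 0],
                  [1, 1, 0, 1, 0, 0],
                  [0, 0, 1, 0, 1, 1],
                  [0, 0, 0, 1, 0, 1],
                  [0, 0, 0, 1, 1, 0]], dtype=float)
    P = random_walk_transition(A)
    print(P.sum(axis=1))   # every row of P sums to 1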
From the view of a Markov random walk, when a complex network has community structure, a random walk agent should find it difficult to move outside its own community boundary, whereas it should be easy for it to reach other nodes within its community, as link density within a community is, by definition, high. In other words, the probability of remaining in the same community (that is, the probability that an agent starting from any node stays in its own community after freely walking a number of steps) should be greater than that of moving to a different community. For this reason, in this ant colony optimization (ACO) strategy, each ant (different from a plain agent only in that it can consult and update a "pheromone" variable on each link) takes the transition probability of the random walk as its heuristic rule and is directed by pheromone to find its solution. At each iteration, the solution found by each ant expresses only its local view, but one can derive a global solution when all of the ants' solutions are aggregated into one through a clustering ensemble, which is then used to update the pheromone matrix. As the process evolves, the cluster characteristic of the pheromone matrix gradually becomes sharper, and algorithm ACOMRW converges to a solution where the community structure can be accurately detected. In short, the pheromone matrix can be regarded as the final clustering result, aggregating the information of all the ants at all iterations of this algorithm. To further clarify the above idea, an intuitive description is as follows. Given a network N with community structure, one lets some ants crawl freely along the links of the network. The ants have a given life-cycle, and a new ant colony is generated immediately when all of the former ants die. At the beginning of the algorithm, the pheromone has no impact yet on network N. Purely because of the restriction imposed by the community structure, an ant's probability of remaining in its own community is greater than that of moving to other communities, but at this point there is no difference between these ants and plain random walk agents, since the pheromone distribution is still homogeneous. As the ants move, with the accumulation and evaporation of pheromone left by previous ants, the pheromone on within-community links becomes thicker and thicker, and the pheromone on between-community links becomes thinner and thinner. In fact, pheromone is simply a mechanism that records past walks in the network and leads to more informed decisions in subsequent walks. This strengthens the tendency of any ant to remain in its own community. Finally, when the pheromone matrix converges, the clustering result of network N is obtained naturally. In a word, the idea behind ACOMRW is that, by strengthening within-community links and weakening between-community links, the underlying community structure of the network gradually becomes visible.
2.2 Algorithm Description
A Solution by One Ant. Each ant leads to a solution in ant colony optimization, so for the NCP a solution produced by one ant should be a clustering solution of the network. Given network N = (V, E), consider a stochastic process defined on network N with pheromone, in which an ant crawls freely from one node to another along the links between them. After the ant arrives at a node, directed by the pheromone left on the links by previous ants, it rationally selects one of the node's neighbors and moves there. Let X = {X_t, t ≥ 0} denote the ant's positions, and let P{X_t = i}, 1 ≤ i ≤ n, be the probability that the ant arrives at node i after walking t steps. For every i_t ∈ V we have P{X_t = i_t | X_0 = i_0, X_1 = i_1, ..., X_{t-1} = i_{t-1}} = P{X_t = i_t | X_{t-1} = i_{t-1}}; that is, the next state of the ant is decided only by its previous state, which is the Markov property. So this stochastic process is a discrete Markov chain with state space V. Furthermore, X_t is homogeneous because P{X_t = j | X_{t-1} = i} = m_ij, where m_ij is the ant's transition probability from node i to node j. Let the transition probability matrix of the random walk, which is regarded as the heuristic rule, be P = (p_ij)_{n×n}, and let the current pheromone matrix be B = (b_ij)_{n×n}; then the probability m_ij that an ant walks from node i to its neighbor node j within one step is given by Eq. (3), and the transition probability matrix of the ants is M = (m_ij)_{n×n}.
$$m_{ij} = \frac{b_{ij}\, p_{ij}}{\sum_r b_{ir}\, p_{ir}}. \qquad (3)$$
Consider the Markov dynamics of each ant described above. Let the start position of an ant be node s, let the step limit be l, and let V_s^t denote the t-step (t ≤ l) transition probability distribution of the ant, in which V_s^t(j) denotes the probability that this ant walks from node s to node j within t steps. We have V_s^0 = (0, ..., 0, 1, 0, ..., 0), where V_s^0(s) = 1. Taking into account the influence of the power-law degree distribution of complex networks, and directed by matrix M, V_s^t is given by
$$V_s^t = V_s^{t-1} M^T. \qquad (4)$$
In this algorithm, all the ants take the transition probability of the random walk as the heuristic rule and are directed by pheromone at the same time. Thus, as the link density within a community is in general much higher than that between communities, an ant that starts from the source node s should have more paths to choose from to reach the nodes in its own community within l steps, where the value of l cannot be too large. On the contrary, the ant should have a much lower probability of reaching nodes outside its community. In other words, it will be hard for an ant that falls in a community to pass those "bottleneck" links and leave that community. Furthermore, with the evolution of algorithm ACOMRW, the pheromone on within-community links will become thicker and thicker, and the pheromone on between-community links will become thinner
and thinner. This makes the tendency of any ant to remain in its own community more and more pronounced. Here we define Eq. (5), where C_s denotes the community in which node s is situated. More formally, Eq. (5) should be satisfied better and better as the pheromone matrix evolves; when algorithm ACOMRW finally converges, Eq. (5) is completely satisfied, and the underlying community structure becomes visible. A detailed analysis of parameter l is given later.
$$\forall i \in C_s,\ \forall j \notin C_s:\quad V_s^l(i) > V_s^l(j). \qquad (5)$$
The algorithm that each ant adopts to compute its l-step transition probability distribution V_s^l is given below. It is described using Matlab pseudocode.

Procedure Produce_V
/* Each ant has already visited t+1 nodes after any t steps; thus the largest t+1 elements of V_s^t are set to 1 after each step. */
input:  s  /* start position of this ant */
        B  /* current pheromone matrix */
        P  /* transition probability matrix of the random walk */
        l  /* limit on the number of steps */
output: V  /* l-step transition probability distribution of this ant */
begin
  V ← zeros(1, n);
  V(s) ← 1;
  M ← P .* B;
  D ← sum(M, 2);
  D ← diag(D);
  M ← inv(D) * M;            % row-normalize, cf. Eq. (3)
  M ← M';                    % transpose, cf. Eq. (4)
  for i = 1 : l
    V ← V * M;
    if i ~= l
      [sorted_V, ix] ← sort(V, 'descend');
      V(ix(1 : i+1)) ← 1;    % clamp the largest i+1 entries to 1
    end
  end
end

After obtaining V_s^l, the remaining problem is how to find the ant's solution, which should also be a clustering solution of the network. However, each ant can only indicate that it visits the nodes in its own community with high probability; the nodes with low visit probability are not necessarily in one community and may belong to several different communities. Therefore, one ant can only find its own community, from its local view. This algorithm sorts V_s^l in descending order, then calculates the differences between adjacent elements of the sorted V_s^l, finding the point corresponding
to the maximum difference. Obviously, the point corresponding to the largest "valley" of the sorted V_s^l is the most suitable cutoff point for identifying the community of this ant. Finally, we take the points whose visit probability is greater than that of the cutoff point to be in the same community, without deciding which communities the remaining nodes belong to. Clearly, the solution produced by one ant is its own community. Given V_s^l, the algorithm that divides V_s^l and finds this ant's solution is as follows.

Procedure Divide_V
/* As each ant has visited at least l+1 nodes, the index of the cutoff point should not be less than l+1. */
input:  V         /* l-step transition probability distribution of this ant */
output: solution  /* solution of this ant */
begin
  [sorted_V, ix] ← sort(V, 'descend');
  diff_V ← -diff(sorted_V);
  diff_V ← diff_V(l+1 : length(diff_V));
  [max_diff, cut_pos] ← max(diff_V);
  cut_pos ← cut_pos + l;
  cluster ← ix(1 : cut_pos);
  solution ← zeros(n, n);
  solution(cluster, cluster) ← 1;
  I ← eye(cut_pos, cut_pos);
  solution(cluster, cluster) ← solution(cluster, cluster) - I;
end

In the network, let the total number of nodes be n and the total number of edges be m. If the network is represented by its adjacency matrix, the time complexity of Produce_V is O(l·n²). Divide_V needs to sort all nodes according to their probability values; because linear-time sorting algorithms (such as bin sort and counting sort) can be adopted, the time complexity of Divide_V is O(n). Thus, the overall complexity for one ant to produce its solution is O(l·n²). However, if the network is represented by adjacency lists, the time complexity can be reduced to O(l(m + n)). As most complex networks are sparse graphs, this can be very efficient.

Algorithm ACOMRW. There are two main parts in algorithm ACOMRW: the exploration phase and the partition phase. The goal of the first phase is to obtain the pheromone matrix when the algorithm converges; the goal of the second phase is to analyze this pheromone matrix in order to obtain the clustering solution for the network. The exploration phase algorithm is given by:

Procedure Exploration_Phase
input:  A, T, S, ρ  /* A is the adjacency matrix of the network, T is the limit on the number of iterations, S is the size of the ant colony,
        ρ is the updating rate of the ants' pheromone matrix */
output: B  /* the pheromone matrix */
begin
  D ← sum(A, 2);
  D ← diag(D);
  P ← inv(D) * A;          /* produce the transition probability matrix of the random walk */
  B ← ones(n, n)/n;        /* initialize the pheromone matrix */
  for i = 1 : T
    solution ← zeros(n, n);
    for j = 1 : S
      solution ← solution + one_ant(P, B);   /* one_ant returns a solution */
    end                                      /* aggregate the local solutions of all ants into one */
    D ← sum(solution, 2);
    D ← diag(D);
    solution ← inv(D) * solution;            /* normalize the solution */
    B ← (1 - ρ) * B + ρ * solution;          /* update the pheromone matrix */
  end
end

As we can see, at each iteration this algorithm aggregates the local solutions of all ants into a global one and then uses it to update the pheromone matrix B. As the iterations proceed, the pheromone matrix gradually evolves, which makes the ants more and more directed and makes the tendency of any ant to stay in its own community more and more obvious. When the algorithm finally converges, the pheromone matrix B can be regarded as the final clustering result, aggregating the information of all the ants over all iterations. The next step is to analyze the produced pheromone matrix B in order to obtain the clustering solution of the network. Because of the convergence property of ACO, the cluster structure of matrix B is very clear. A simple partition phase algorithm is described as follows.

Procedure Partition_Phase
input:  B  /* pheromone matrix after the algorithm has converged */
output: C  /* final clustering solution, i.e., the community structure */
begin
  1  Compute the cutoff value ε;  /* ε is 1/n, where n is the number of nodes */
  2  Get the first row of B, and take the nodes whose values are greater than ε as a community;
  3  From matrix B, delete all the rows and columns corresponding to the nodes in the above community;
  4  If B is not empty, go to step 2; otherwise return the clustering solution C, which includes all the communities;
end
Because of the convergence properties of the exploration phase algorithm, we can use a simple algorithm for the partition phase. As is known from the convergence property of ACO, the rows of the pheromone matrix B that correspond to nodes in the same community should be equal. Therefore, by choosing any row of B, we can identify its community by using a small positive number ε as the cutoff value to divide this row.
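A minimal Python sketch of this partition step (our own illustration, not the authors' implementation) groups nodes by thresholding rows of the converged pheromone matrix at ε = 1/n:

    import numpy as np

    def partition_from_pheromone(B):
        """Read communities off a converged n-by-n pheromone matrix B.

        Rows belonging to nodes of the same community are (nearly) identical, so we
        repeatedly take the first unassigned node, threshold its row at eps = 1/n,
        and collect every unassigned node whose entry exceeds eps as one community.
        """
        n = B.shape[0]
        eps = 1.0 / n
        unassigned = set(range(n))
        communities = []
        while unassigned:
            seed = min(unassigned)
            members = {j for j in unassigned if j == seed or B[seed, j] > eps}
            communities.append(sorted(members))
            unassigned -= members
        return communities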
2.3 Parameter Setting
There are four parameters in this algorithm, T, S, ρ and l, which denote the iteration limit, the ant colony size, the update rate of the ants' pheromone matrix, and the step limit, respectively. The first three parameters are easy to determine; in general they can be set as T = 20, S = n (where n is the number of nodes in the network) and ρ = 0.5. However, the setting of parameter l is both difficult and important. As most social networks are small-world networks, the average distance between two nodes was shown to be around 6, according to the theory of "six degrees of separation" [14]. For scale-free networks, the average distance is usually small too. The World Wide Web is one of the biggest scale-free networks found so far, yet the average distance of such a huge network is only about 19 [15]; that is, we can get anywhere we want through 19 clicks on average. Thus, based on the above general observations, we propose that in practice good values of l should fall in the range 6 ≤ l ≤ 19. Additionally, the l-value we are considering is actually the average distance between nodes within a community rather than across the whole network, so it can be even smaller. In addition, by varying parameter l, this algorithm can produce a reasonable hierarchical community structure of the network. The experimental section of the paper gives a detailed analysis of parameter l.
3 Experiments
To quantitatively analyze the performance of algorithm ACOMRW, we tested it using both computer-generated and real-world networks; we conclude by analyzing the parameter l defined in the algorithm. In these experiments our algorithm is compared with the GN algorithm [3], the Fast Newman (FN) algorithm [5] and the FEC algorithm [10], which are all well-known and competitive network clustering algorithms. To compare the clustering accuracy of the different algorithms more fairly, we adopt two widely used accuracy measures: the Fraction of Vertices Classified Correctly (FVCC) [5] and Normalized Mutual Information (NMI) [16].

3.1 Computer-Generated Networks
To test the performance of ACOMRW, we adopt random networks with known community structure, which have been used as benchmark datasets for testing complex network clustering algorithms [3]. This kind of random network is
defined as RN(C, s, d, z_out), where C is the number of communities, s is the number of nodes in each community, d is the degree of the nodes in the network, and each node has z_in edges connecting it to members of its own community and z_out edges to members of other communities. Parameter l is set to 6 for algorithm ACOMRW, and the benchmark random network RN(4, 32, 16, z_out) is used in this experiment. Obviously, as z_out is increased from zero, the community structure of the networks becomes more diffuse, and the resulting networks pose greater and greater challenges to network clustering algorithms; in particular, a network no longer has community structure when z_out is greater than 8 [3]. Fig. 1 shows the results, in which the y-axis denotes clustering accuracy and the x-axis denotes z_out. For each z_out and each algorithm, we computed the average accuracy over 50 random networks. As we can see from Fig. 1, our algorithm significantly outperforms the other three algorithms in terms of both accuracy measures. Furthermore, as z_out becomes larger, the superiority of our algorithm becomes more and more significant. In particular, when z_out equals 8, which means the number of within-community and between-community edges per vertex is the same, our algorithm can still classify 100% of the vertices into their correct communities, while the accuracy of the other algorithms is already low at this point.
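A benchmark of this kind can be generated along the following lines; this is our own illustrative sketch, in which edges are drawn independently so that the within- and between-community degrees match z_in and z_out only in expectation, rather than exactly as in the original construction.

    import random

    def benchmark_network(C=4, s=32, d=16, z_out=3, seed=0):
        """RN(C, s, d, z_out)-style benchmark: z_in = d - z_out within-community links
        and z_out between-community links per node, matched here only in expectation."""
        rng = random.Random(seed)
        n = C * s
        community = [v // s for v in range(n)]
        p_in = (d - z_out) / (s - 1)      # expected within-community degree = d - z_out
        p_out = z_out / (n - s)           # expected between-community degree = z_out
        edges = set()
        for u in range(n):
            for v in range(u + 1, n):
                p = p_in if community[u] == community[v] else p_out
                if rng.random() < p:
                    edges.add((u, v))
        return community, edges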
Fig. 1. Compare ACOMRW with GN, FN and FEC against benchmark random networks. (a) take NMI as accuracy measure; (b) take FVCC as accuracy measure.
3.2 Real-World Networks
As a further test of algorithm ACOMRW, we applied it to three widely used real-world networks with known community structure; these real networks may have topological properties different from the artificial ones. They are the well-known karate network [17], dolphin network [18] and football network [3]. In our algorithm, parameter l is set to 7, 3 and 11 for these three networks, respectively. The clustering results (over 50 runs) of algorithms ACOMRW, GN, FN and FEC on the three real-world networks are shown in Table 1. It can be seen that the clustering accuracy of ACOMRW is clearly higher than that of the other algorithms in terms of both accuracy measures.
Table 1. Compare ACOMRW with GN, FN and FEC on three real-world networks

          karate network            dolphin network           football network
  Algs    NMI      FVCC     C Num   NMI      FVCC     C Num   NMI      FVCC     C Num
  GN      57.98%   97.06%   5       44.17%   98.39%   13      87.89%   83.48%   10
  FN      69.25%   97.06%   3       50.89%   96.77%   5       75.71%   63.48%   7
  FEC     69.49%   97.06%   3       52.93%   96.77%   4       80.27%   77.39%   9
  ACO     100%     100%     2       88.88%   98.39%   2       92.69%   93.04%   12

[Figure 2: panel (a) plots the number of clusters found by ACOMRW against the step limit l (6 to 20); panel (b) plots the corresponding Q-values together with the Q-value of the real community structure; panel (c) shows the clustering solutions obtained for different l; plots omitted.]
Fig. 2. Sensitivity analysis of parameter l. (a) Cluster number got by ACOMRW as a function of parameter l. (b) Q-values got by ACOMRW as a function of parameter l. (c) Clustering solution got by ACOMRW varying with parameter l.
3.3 Parameters Analysis
Parameter l is very important in algorithm ACOMRW. In Sec. 2.3, we gave a reasonable indication of the range of parameter l by considering small-world and scale-free networks. Taking the dolphin network as an example, this section gives a more detailed analysis. Here we adopt the network modularity function (Q), which was proposed by Newman and has been widely accepted by the scientific community [19], as a measure of the compactness of communities. Fig. 2 shows how the cluster number, the Q-value, and the clustering result obtained by algorithm ACOMRW vary with parameter l. From Fig. 2(a) and (c), we find that this algorithm divides the network into five small, tight communities when l is small. As parameter l increases, communities with more edges between them begin to merge; finally, the network is divided into two large, tight communities. Note that in the real community structure of the dolphin network, the red nodes form one community and the blue nodes form the other. Furthermore, from Fig. 2(b), the Q-values obtained by ACOMRW for the different l values are all greater than that of the real community structure of this network, so we can also say that algorithm ACOMRW yields well-defined hierarchical community structures of networks as its parameter l changes.
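For reference, a small sketch (our code, assuming an undirected graph given as an adjacency dictionary with integer node labels) of the modularity Q = Σ_c [ l_c/m − (d_c/2m)² ] used above:

    def modularity(adj, communities):
        """Newman-Girvan modularity Q = sum_c [ l_c/m - (d_c/(2m))^2 ].

        adj: dict mapping each node to the set of its neighbours (undirected, no self-loops).
        communities: a hard partition of the nodes, given as a list of node collections.
        """
        m = sum(len(nbrs) for nbrs in adj.values()) / 2.0   # number of edges
        Q = 0.0
        for comm in communities:
            comm = set(comm)
            l_c = sum(1 for u in comm for v in adj[u] if v in comm and u < v)  # intra edges
            d_c = sum(len(adj[u]) for u in comm)                               # degree sum
            Q += l_c / m - (d_c / (2.0 * m)) ** 2
        return Q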
4 Conclusions and Future Work
The main contribution of this paper is a high-accuracy network clustering algorithm, ACOMRW. Because the real community structures of most current large-scale networks are still unknown, we adopted benchmark computer-generated networks and several widely used real-world networks, whose community structures are known, to test its performance. In the future, we wish to apply it to the analysis of real-world large-scale networks and try to uncover and interpret the significant community structures expected to be found in them.
Acknowledgments. This work was supported by the National Natural Science Foundation of China under Grant Nos. 60773099, 60703022, 60873149, 60973088, the National High-Tech Research and Development Plan of China under Grant No. 2006AA10Z245, the Open Project Program of the National Laboratory of Pattern Recognition, the Fundamental Research Funds for the Central Universities under Grant No. 200903177, and the Project BTG of the European Commission.
References
1. Watts, D.J., Strogatz, S.H.: Collective Dynamics of Small-World Networks. Nature 393(6638), 440-442 (1998)
2. Barabási, A.L., Albert, R., Jeong, H., Bianconi, G.: Power-law distribution of the World Wide Web. Science 287(5461), 2115a (2000)
3. Girvan, M., Newman, M.E.J.: Community Structure in Social and Biological Networks. Proceedings of the National Academy of Sciences 99(12), 7821-7826 (2002)
4. Fortunato, S.: Community Detection in Graphs. Physics Reports 486(3-5), 75-174 (2010)
5. Newman, M.E.J.: Fast Algorithm for Detecting Community Structure in Networks. Physical Review E 69(6), 066133 (2004)
6. Guimera, R., Amaral, L.A.N.: Functional cartography of complex metabolic networks. Nature 433(7028), 895-900 (2005)
7. Barber, M.J., Clark, J.W.: Detecting Network Communities by Propagating Labels under Constraints. Phys. Rev. E 80(2), 026129 (2009)
8. Jin, D., He, D., Liu, D., Baquero, C.: Genetic algorithm with local search for community mining in complex networks. In: Proc. of the 22nd IEEE International Conference on Tools with Artificial Intelligence (ICTAI 2010), pp. 105-112. IEEE Press, Arras (2010)
9. Palla, G., Derenyi, I., Farkas, I., Vicsek, T.: Uncovering the overlapping community structures of complex networks in nature and society. Nature 435(7043), 814-818 (2005)
10. Yang, B., Cheung, W.K., Liu, J.: Community Mining from Signed Social Networks. IEEE Trans. on Knowledge and Data Engineering 19(10), 1333-1348 (2007)
11. Zhang, Y., Wang, J., Wang, Y., Zhou, L.: Parallel Community Detection on Large Networks with Propinquity Dynamics. In: Proc. of the 15th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, pp. 997-1005. ACM Press, Paris (2009)
12. Morarescu, C.I., Girard, A.: Opinion Dynamics with Decaying Confidence: Application to Community Detection in Graphs. arXiv:0911.5239v1 (2010)
13. Strehl, A., Ghosh, J.: Cluster ensembles - a knowledge reuse framework for combining partitionings. Journal of Machine Learning Research 3, 583-617 (2002)
14. Milgram, S.: The Small World Problem. Psychology Today 1(1), 60-67 (1967)
15. Albert, R., Jeong, H., Barabási, A.L.: Diameter of the World Wide Web. Nature 401, 130-131 (1999)
16. Danon, L., Duch, J., Diaz-Guilera, A., Arenas, A.: Comparing community structure identification. J. Stat. Mech., P09008 (2005)
17. Zachary, W.W.: An Information Flow Model for Conflict and Fission in Small Groups. J. Anthropological Research 33, 452-473 (1977)
18. Lusseau, D.: The Emergent Properties of a Dolphin Social Network. Proc. Biol. Sci. 270, S186-S188 (2003)
19. Newman, M.E.J., Girvan, M.: Finding and Evaluating Community Structure in Networks. Phys. Rev. E 69(2), 026113 (2004)
Faster and Parameter-Free Discord Search in Quasi-Periodic Time Series
Wei Luo and Marcus Gallagher
The University of Queensland, Australia
{luo,marcusg}@itee.uq.edu.au
Abstract. Time series discord has proven to be a useful concept for time-series anomaly identification. To search for discords, various algorithms have been developed. Most of these algorithms rely on pre-building an index (such as a trie) for subsequences. Users of these algorithms are typically required to choose optimal values for word-length and/or alphabet-size parameters of the index, which are not intuitive. In this paper, we propose an algorithm to directly search for the top-K discords, without the requirement of building an index or tuning external parameters. The algorithm exploits the quasi-periodicity present in many time series. For quasi-periodic time series, the algorithm gains a significant speedup by reducing the number of calls to the distance function.
Keywords: Time Series Discord, Minimax Search, Time Series Data Mining, Anomaly Detection, Periodic Time Series.
1 Introduction
Periodic and quasi-periodic time series appear in many data mining applications, often due to internal closed-loop regulation or external phase-locking forces on the data sources. A time series' temporary deviation from a periodic or quasi-periodic pattern constitutes a major type of anomaly in many applications. For example, an electrocardiography (ECG) recording is nearly periodic, as is one's heartbeat. Figure 1 shows an ECG signal in which a disruption of periodicity is highlighted; this disruption actually indicates a Premature Ventricular Contraction (PVC) arrhythmia [3]. As another example, Figure 4 shows the number of beds occupied in a tertiary hospital. The time series suggests a weekly pattern: busy weekdays followed by quieter weekends. If the weekly pattern is disrupted, chaos often follows, with elective surgeries being canceled and the emergency department becoming over-crowded, greatly impacting patient satisfaction and health care quality. Time Series Discord captures the idea of anomalous subsequences in time series and has proven to be useful in a diverse range of applications (see for example [5,1,11]). Intuitively, a discord of a time series is a subsequence with the largest distance from all other non-overlapping subsequences in the time series. Similarly, the 2nd discord is a subsequence with the second largest distance from all other non-overlapping subsequences. And more generally one can search
Fig. 1. An ECG time series that demonstrates periodicity, baseline shift, and a discord. The time series is the second-lead signal from dataset xmitdb_x108_0 of [6]. According to [3], the ECG was recorded at a frequency of 360 Hz. The unit of measurement is unknown to the authors.
Fig. 2. Illustration of Proposition 1. The blue solid line represents the true d for the time series xmitdb_x108_0 (with subsequence length 360). The red dashed line represents an estimate d̂ for d. Although d̂ is very different from d at many locations, the maximum of d̂ coincides with the maximum of d.
for the top-K discords [1]. Finding the discord for a time series in general requires comparisons among $O(m^2)$ pairwise distances, where m is the length of the time series. Despite past efforts in building heuristics (e.g., [5,1]), searching for the discord still requires expensive computation, making real-time interaction with domain experts difficult. In addition, most existing algorithms are based on the idea of indexing subsequences with a data structure such as a trie. Such data structures often have unintuitive parameters (e.g., word length and alphabet size) to tune, which means time-consuming trial-and-error that compromises the efficiency of the algorithms. Keogh, Lin, and Fu first defined time series discords and proposed a search algorithm named HOT SAX in [5]. A memory-efficient search algorithm was also proposed later [11]. HOT SAX builds on the idea of discretizing and indexing time series subsequences. To select the lengths for index keys, wavelet decomposition can be used ([2,1]). Most recently, adaptive discretization has been proposed to improve the index for efficient discord search ([8]). In this paper, we propose a fast algorithm to find the top-K discords in a time series without pre-building an index or tuning parameters. For periodic or quasi-periodic time series, the algorithm finds the discord with much less computation, compared to results previously reported in the literature (e.g., [5]). After finding the 1st discord, our algorithm finds subsequent discords with even less computation, often 50% less. We tested our algorithm on a collection of datasets from [6] and [4]. The diversity of the collection shows that the definition of "quasi-periodicity" can be very relaxed for our algorithm to achieve search efficiency, and the periodicity of a time series can easily be assessed through visual inspection. The experiments with artificially generated non-periodic random walk time series showed increased running time, but
the algorithm is still hundreds of times faster than the brute-force search, without tuning any parameter. The paper is organized as follows. Section 2 reviews the definition of time-series discord and existing algorithms for discord search. Section 3 introduces our direct search algorithm and explains the ideas behind it. Section 4 presents an empirical evaluation of the new algorithm and a comparison with the results of HOT SAX from [5]. Section 5 concludes the paper.

2 Time Series Discords
This section reviews the definition of time-series discord and major search algorithms.
Notation. In this paper, T = (t_1, ..., t_m) denotes a time series of length m. In addition, T[p; n] denotes the length-n subsequence of T beginning at position p. The distance between two length-n subsequences T[p; n] and T[q; n] is denoted dist_{T,n}(p, q). Following [5], we consider by default the Euclidean distance between two standardized subsequences: all subsequences are standardized to have a mean of 0 and a standard deviation of 1. Nevertheless, the results in this paper apply to other definitions of distance. Given a subsequence T[p; n], the minimum distance between T[p; n] and any non-overlapping subsequence T[q; n] is denoted d_{p,n}, i.e., $d_{p,n} = \min_{q:|p-q|\ge n} dist_{T,n}(p, q)$. As n is a constant, we often write d_p for d_{p,n}. Finally, we use d to denote the vector (d_1, d_2, ..., d_{m-n+1}), and use d̂ and d̂_p to denote estimates for d and d_p respectively. For a time series of length m, there are at most $\frac{1}{2}(m-n-1)(m-n-2)+1$ distinct dist_{T,n}(p, q) values; in particular, dist_{T,n}(p, q) = dist_{T,n}(q, p) and dist_{T,n}(p, p) = 0. Figure 3 shows a heatmap of the distances dist_{T,n}(p, q) for all p and q values of the time series xmitdb_x108_0 (see Figure 1).
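The following Python sketch (our own illustration; it uses 0-based positions rather than the 1-based convention in the text) spells out the standardized distance dist_{T,n}(p, q) and the quantity d_p:

    import numpy as np

    def znorm(x):
        """Standardize a subsequence to mean 0 and standard deviation 1."""
        s = x.std()
        return (x - x.mean()) / s if s > 0 else x - x.mean()

    def dist(T, n, p, q):
        """Euclidean distance between the standardized length-n subsequences at p and q."""
        return float(np.linalg.norm(znorm(np.asarray(T[p:p + n], dtype=float))
                                    - znorm(np.asarray(T[q:q + n], dtype=float))))

    def d_p(T, n, p):
        """d_p = minimum of dist_{T,n}(p, q) over non-overlapping q (|p - q| >= n)."""
        m = len(T)
        return min(dist(T, n, p, q) for q in range(m - n + 1) if abs(p - q) >= n)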
Fig. 3. Distribution of dist_{T,360}(p, q) where T is the time series xmitdb_x108_0
Fig. 4. Hourly bed occupancy in a tertiary hospital for two months
The following definition is reformulated from [5].

Definition 1 (Discord). Let T be a sequence of length m. A subsequence T[p^{(1)}; n] is the first discord (or simply the discord) of length n for T if
$$p^{(1)} = \operatorname*{argmax}_{p} \{ d_p : 1 \le p \le m - n + 1 \}. \qquad (1)$$

Intuitively, a discord is the most "isolated" length-n subsequence in the space $\mathbb{R}^n$. Subsequent discords (the second discord, the third discord, and so on) of a time series are defined inductively as follows.

Definition 2. Let T[p^{(1)}; n], T[p^{(2)}; n], ..., T[p^{(k-1)}; n] be the top k − 1 discords of length n for a time series T. Subsequence T[p^{(k)}; n] is the k-th discord of length n for T if
$$p^{(k)} = \operatorname*{argmax}_{p} \{ d_p : |p - p^{(i)}| \ge n \text{ for all } i < k \}.$$
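A direct, quadratic-time reading of Definition 1, reusing the dist and d_p helpers sketched above (again our own illustrative code):

    def brute_force_discord(T, n):
        """First discord per Definition 1: the position p maximizing d_p (O(m^2) distance calls)."""
        m = len(T)
        return max(range(m - n + 1), key=lambda p: d_p(T, n, p))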
Note that the values of both n and k should be determined by the application; they are independent of the search algorithm. If a user were looking for the three most unusual weeks in the bed occupancy example (Figure 4), k would be 3 and n would be 7 × 24, assuming the time series is sampled hourly. Strictly speaking, the discord is not well defined, as there may be more than one location p that maximizes d_p (i.e., $d_{p_1} = d_{p_2} = \max_p \{ d_p : 1 \le p \le m - n + 1 \}$). But the ambiguity rarely matters in most applications, especially when the top-K discords are searched in a batch. In this paper, we follow the existing literature [5] and assume that all d_p's have distinct values.
The discord has a formulation similar to the minimax problem in game theory. Note that
$$\max_p \min_q \{ dist_{T,n}(p, q) : |p - q| \ge n \} \le \min_q \max_p \{ dist_{T,n}(p, q) : |p - q| \ge n \}.$$
According to Sion's minimax theorem [9], the equality holds if dist_{T,n}(p, ·) is quasi-concave in q for every p and dist_{T,n}(·, q) is quasi-convex in p for every q. Figure 3 indicates, however, that in general neither is dist_{T,n}(p, ·) quasi-concave nor is dist_{T,n}(·, q) quasi-convex, and no global saddle point exists. This suggests that searching for discords requires a strategy different from those used in game theory. In the worst case, searching for the discord has complexity $O(m^2)$, essentially requiring brute-force computation of the pairwise distances of all length-n subsequences of the time series. When $m = 10^4$, that means 100 million calls to the distance function. Nevertheless, the following sufficient condition for the discord suggests a search strategy better than brute-force computation.

Observation 1. Let T be a time series. A subsequence T[p*; n] is the discord of length n if there exists d* such that
$$\forall q : |p^* - q| > n \Rightarrow dist_{T,n}(p^*, q) \ge d^*, \text{ and} \qquad (2)$$
$$\forall p \ne p^*, \exists q : (|p - q| > n) \wedge (dist_{T,n}(p, q) < d^*). \qquad (3)$$
In general, there are infinitely many values of d* that satisfy Clause (2) and Clause (3). Suppose we have a good guess d*. Clause (3) implies that a false candidate for the discord can be refuted, potentially in fewer than m steps. Clause (2) implies that, once all false candidates have been refuted, the true candidate for the discord can be verified in m − n + 1 steps. Hence in the best case, (m − n + 1) + (m − 1) = 2m − n calls to the distance function are sufficient to verify the discord. To estimate d*, we can start with the value of d_p where p is a promising candidate for the discord, and later increase the guess to a larger value d_{p'} if a position p' is not refuted (i.e., dist_{T,n}(p', q) exceeds the current guess for every non-overlapping q) and becomes the next candidate. This hill-climbing process goes on until all but one of the subsequences are refuted with the updated value of d*. This idea forms the basis of most existing discord search algorithms (e.g., HOT SAX in [5] and WAT in [1]); the common structure of these algorithms is shown in Figure 5. With this base algorithm, the efficiency of a search then depends on the
1: Select a p0 and let d* ← d_{p0} and p* ← p0. {Initialization}
2: for all the remaining locations p ordered by certain heuristic Outer do {Outer Loop}
3:   for all locations q ordered by some heuristic Inner such that |p − q| ≥ n do {Inner Loop}
4:     if dist_{T,n}(p, q) < d* then
5:       According to Clause (3) in Observation 1, T[p; n] cannot be the discord; break to next p.
6:     end if
7:   end for
8:   if min_q dist_{T,n}(p, q) > d* then
9:     As Clause (2) in Observation 1 is not met, update d* ← min_q dist_{T,n}(p, q) and p* ← p.
10:  end if
11: end for
12: return p* and d*
Fig. 5. Base algorithm for HOT SAX and WAT
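Under the simplifying assumption that the Outer and Inner heuristics are plain enumerations, the loop structure of Fig. 5 can be sketched as follows (our code, reusing the dist helper from the earlier sketch):

    def base_discord_search(T, n, outer_order=None):
        """Outer/inner-loop discord search with early abandoning (cf. Fig. 5)."""
        m = len(T)
        positions = list(range(m - n + 1))
        outer = list(outer_order) if outer_order is not None else positions
        p0 = outer[0]                                           # line 1: initial candidate
        best_p = p0
        best_d = min(dist(T, n, p0, q) for q in positions if abs(p0 - q) >= n)
        for p in outer[1:]:                                     # line 2: Outer loop
            nearest = float("inf")
            refuted = False
            for q in positions:                                 # line 3: Inner loop (plain order)
                if abs(p - q) < n:
                    continue
                d = dist(T, n, p, q)
                nearest = min(nearest, d)
                if d < best_d:                                  # lines 4-6: Clause (3), abandon p
                    refuted = True
                    break
            if not refuted and nearest > best_d:                # lines 8-10: new best candidate
                best_p, best_d = p, nearest
        return best_p, best_d                                   # line 12

In practice, the Outer order should rank likely discords first and the Inner order should rank close neighbors first; with good orderings the inner loop breaks after very few distance calls.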
order of subsequences in the Outer and Inner loops (see lines 2 and 3). Intuitively, the Outer loop should rank p according to the singularity of subsequence T[p; n]; the Inner loop should rank q according to the proximity between subsequences T[p; n] and T[q; n]. Both HOT SAX and WAT adopt the following strategy. First, all subsequences of length n are discretized and compressed into shorter strings. Then the strings are indexed with a suffix trie; in the ideal situation, subsequences close in distance also share an index key or occupy neighboring index keys in the trie. This is not so different from the idea of hashing to achieve O(1) search time. In the end, all subsequences are indexed into a number of buckets on the terminal nodes. The hope is that, with careful selection of the string length and alphabet size, the discord will fall into a bucket containing very few subsequences while a non-discord subsequence will fall into a bucket shared with similar subsequences. The uneven distribution of subsequences among the buckets can then be exploited to devise efficient orderings for the Outer and Inner loops. This ingenious approach, however, has two drawbacks. Firstly, one needs to select optimal parameters that balance the index size and the bucket size, which are critical to the search efficiency. For example, to use HOT SAX, one needs
to set the alphabet size and the word size for the discretized subsequences [5, Section 4.2]; WAT automates the selection of the word size, but still requires setting the alphabet size [1, Section 3.2]. Such parameters are not always intuitive to a user, and the difficulty of building a usable trie has been discussed in [11, Section 2]. Secondly, the above approach uses a fixed/random order in the outer loop to search for all top-K discords. A dynamic ordering for the outer loop could potentially make better use of the information gained in the previous search steps. Also, it is not clear how knowledge gained in finding the k-th discord can help in finding the (k + 1)-th discord. In [1, Section 3.6], partial information about d̂ is cached so that the inner loop may break quickly. But as caching works in the "easy" part of the search space (where d_p is small), it is not clear how much computation is saved. In the following section, we address the above issues by proposing a direct way to search for multiple discords. In particular, our algorithm requires no ancillary index (and hence no parameters to tune), and the algorithm reuses the knowledge gained in searching for the first k discords to speed up the search for the (k + 1)-th discord.
3 Direct Discord Search
In Definition 1, the formula p(1) = argmaxp {dp : 1 ≤ p ≤ m − n + 1} suggests a direct way to search for the discord with the following two steps:

Step 1: Compute an estimate dˆp of dp for each p.
Step 2: Let p∗ = argmaxp {dˆp : 1 ≤ p ≤ m − n + 1}, and verify that T[p∗; n] is the discord.

Step 2 can be carried out by testing the condition dp∗ ≥ maxp dˆp, as justified by the following proposition.

Proposition 1. Let dˆ be an estimate of d such that dˆ ≥ d. If dp∗ ≥ maxp dˆp, then dp∗ ≥ maxp dp.

Proof. With dˆ ≥ d, we have dp∗ ≥ maxp dˆp ≥ maxp dp.

Proposition 1 gives a sufficient condition for verifying the discord of a time series. It shows that dˆ does not have to be close to d at every location p. To verify the discord, it suffices to have dˆ ≥ d and maxp dp ≥ maxp dˆp. This point is illustrated in Figure 2. To estimate dp = minq dist(p, q) in Step 1, we can use dˆp = minq∈Qp dist(p, q). Here Qp is a subset of {q : |p − q| > n}—hence dˆp ≥ dp. As Qp includes more locations, the error dˆp − dp becomes smaller. If Qp = {q : |p − q| > n}, then dˆp − dp = 0. By controlling the size of Qp, we can control the accuracy of dˆp for different p. Therefore Proposition 1 justifies the search strategy shown in Figure 6. For top-K discord search, the while-loop (Lines 2–10) is repeated K times (with proper bookkeeping to exclude overlapping subsequences). As the dˆ values keep decreasing in the computation, every time we start with a better estimate dˆ in Line 3.
1: For each p, estimate dˆp = minq∈Qp dist(p, q), where Qp is a subset of {q : |p − q| > n}.
2: while the discord has not been found do
3:   p∗ ← argmaxp {dˆp}.
4:   Compute dp∗ = minq dist(p∗, q).
5:   if dp∗ > dˆp for all p ≠ p∗ then
6:     return p∗ as the discord starting location.
7:   else
8:     Decrease dˆ by enlarging the Qp's.
9:   end if
10: end while
Fig. 6. Base algorithm for direct discord search
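Figure 6 can be rendered in Python as follows. This is a hedged sketch under our own simplifications: `estimate_d_hat` is assumed to return upper-bound estimates dˆp (Section 3.1 discusses how to obtain them cheaply), the refinement of Line 8 is replaced by simply recording the exact value for the rejected candidate, and the verification test accepts ties so that the loop is guaranteed to terminate.

```python
import numpy as np

def dist(T, p, q, n):
    return float(np.linalg.norm(T[p:p + n] - T[q:q + n]))

def exact_dp(T, p, n):
    # Exhausting: exact nearest-neighbour distance of T[p; n].
    m = len(T)
    return min(dist(T, p, q, n) for q in range(m - n + 1) if abs(p - q) >= n)

def direct_discord_search(T, n, estimate_d_hat):
    """Figure 6 skeleton: verify the top estimate, otherwise improve the estimates."""
    T = np.asarray(T, dtype=float)
    d_hat = np.asarray(estimate_d_hat(T, n), dtype=float)   # assumed: d_hat[p] >= d_p
    while True:
        p_star = int(np.argmax(d_hat))
        d_p_star = exact_dp(T, p_star, n)                    # Line 4 (Exhausting)
        others = np.delete(d_hat, p_star)
        if d_p_star >= others.max():                         # Line 5 (ties accepted)
            return p_star, d_p_star                          # Line 6
        d_hat[p_star] = d_p_star                             # lazy stand-in for Line 8
```

Because every failed verification replaces one estimate with its exact value, the invariant dˆ ≥ d is preserved and Proposition 1 still applies when the test finally succeeds.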
3.1 Efficient Way to Estimate dˆ
Figure 2 suggests that to find the discord, it is not necessary to have a highly accurate estimate dˆp for every p. Instead, a highly accurate dˆp is needed only when dp is relatively large. To minimize the total computation cost, we should distribute computational resources according to the importance of dˆp. We propose three operations to estimate dp, with increasing levels of computational cost.
1. Traversing: Suppose that dist(p, qp) is known to be small for some qp. For small integers k, let Qp+k contain the one location qp + k. Intuitively, traversing translates to searching along the 45-degree lines in Figure 3.
2. Sampling: Let Qp be a set of locations if dp is likely to be large or knowledge of dp is unavailable. We shall see a way to construct such Qp using the periodicity of the time series.
3. Exhausting: Let Qp be all possible locations if the exact value of dp is required.
Note that the most expensive Exhausting operation is needed only in verifying the discord (Line 4 in Figure 6). The Traversing operation can be justified with the following argument. For a relatively large n, distT,n(p, q) ≈ distT,n(p+1, q+1). If distT,n(p, qp) is small, then distT,n(p+1, qp+1) is likely to be small as well. The argument can be "telescoped" to other k values as long as k/n is small enough. This is demonstrated in Figure 7, where local minima of the sp's tend to cluster around some "sweet spots" (the red circle). Therefore, in Traversing, a good estimate dˆp = distT,n(p, q) suggests a "sweet spot" q around which good estimates dˆp+k for neighboring positions (p+k) can be found. The Sampling operation may be implemented with local search from a set of random starting points. But when the time series is nearly periodic or quasi-periodic, a more efficient implementation exists. This will be discussed in the next section.
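As a small illustration of the Traversing operation, the sketch below propagates one known cheap match (p, qp) along the 45-degree line; the shift range and the dictionary of estimates are our own illustrative choices, not part of the paper's specification.

```python
import numpy as np

def dist(T, p, q, n):
    return float(np.linalg.norm(T[p:p + n] - T[q:q + n]))

def traverse(T, n, p, q_p, d_hat, max_shift=50):
    """Update the upper-bound estimates d_hat by sliding the match (p, q_p) diagonally.

    Because dist(T, p+k, q_p+k, n) >= d_{p+k} for any fixed second location, taking the
    minimum with the stored value preserves the invariant d_hat[p'] >= d_{p'}.
    """
    last = len(T) - n                      # last valid start location
    for k in range(-max_shift, max_shift + 1):
        pp, qq = p + k, q_p + k
        if 0 <= pp <= last and 0 <= qq <= last and abs(pp - qq) >= n:
            d_hat[pp] = min(d_hat.get(pp, float("inf")), dist(T, pp, qq, n))
    return d_hat
```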
3.2 Quasi-Periodic Time Series
Suppose a time series T is nearly periodic with a period l (i.e., tp ≈ tp±k·l). Then distT,n(p, p ± k · l) ≈ 0 implies dp = minq:|p−q|≥n distT,n(p, q) ≈ 0 as long
Fig. 7. Distance profiles dist(p, ·) of time series xmitdb_x018_0. Each line plots the sequence sp = (distT,n(p, 1), . . . , distT,n(p, 1000)) for some p, where n = 360. The 10 lines in the plot correspond to p being 10, 20, . . . , 100 respectively.
Fig. 8. Locations of qp's for time series xmitdb_x108_0. Each location (p, qp) is colored according to the value dp. Dashed lines are a period (360) apart. Hence if a location (p, qp) falls on a dashed line, then qp − p is a multiple of the period 360.
as k · l ≥ n for some k. Small distances associated with multiples of the time-series period can be seen in Figure 7—at locations around p + 360 and p + 2 × 360 for each p in {10, 20, . . . , 100}. With this observation, the following heuristic can be used to implement the Sampling operation for nearly periodic time series: a location q multiple periods away from p is likely to be near a local minimum of {distT,n(p, q) : q}. Figure 8 shows the location qp = argminq {distT,n(p, q)} for all locations p of the time series in Figure 1. It shows that in most cases a minimum location qp is roughly a multiple of the period away from p. There are a number of ways to estimate the period of a time series. For example, the autocorrelation function (see Figure 9) and phase coherence analysis [7] are often used to estimate the period. As suggested in Figure 7, the gaps between local minima of a distance profile {distT,n(p, q) : 1 ≤ q ≤ m − n + 1} approximate the period of the time series, since distT,n(p, p + k · l) ≈ 0. We use this observation to estimate the period in this paper (see Figure 11). Figure 10 shows the collection of gaps {Δk} for local minima of {distT,n(1000, q) : q}, where T is the time series in Figure 1 and n = 360. Taking the median of {Δk} gives the estimate 354 for the period of the time series. Note that the period needs to be estimated only once (with the distance profile {distT,n(p, q) : 1 ≤ q ≤ m − n + 1} for only one location p). Hence it takes only m − n calls to the distance function to estimate the period of a time series of length m. As a by-product, the exact value of dp is also obtained.
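For completeness, a rough autocorrelation-based period estimate (the idea behind Figure 9) can be sketched as follows; picking the highest local maximum of the ACF over positive lags is a simplification of our own, not the procedure used in the paper.

```python
import numpy as np

def acf_period(T, max_lag=None):
    """Rough period estimate from the autocorrelation function (assumes a non-constant series)."""
    x = np.asarray(T, dtype=float)
    x = x - x.mean()
    acf = np.correlate(x, x, mode="full")[len(x) - 1:]   # lags 0, 1, 2, ...
    acf = acf / acf[0]                                   # normalise so acf[0] == 1
    if max_lag is None:
        max_lag = len(x) // 2
    # local maxima of the ACF over positive lags
    peaks = [k for k in range(2, max_lag)
             if acf[k] >= acf[k - 1] and acf[k] >= acf[k + 1]]
    return max(peaks, key=lambda k: acf[k]) if peaks else None
```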
Fig. 9. Autocorrelation function of time series xmitdb_x108_0. The plot shows multiple peaks corresponding to multiples of the period.
Fig. 10. The density plot for gaps between local minima and the estimated period for time series xmitdb_x108_0
1: Randomly pick a location p.
2: Compute dist(p, q) for every q.
3: cp ← the lower 5% quantile of all distances calculated in the previous step.
4: Q ← all local minima q such that dist(p, q) ≤ cp.
5: Sort Q in increasing order (Q1, Q2, . . . , Q|Q|).
6: Δk = Qk+1 − Qk for all k < |Q|.
7: l ← the median of {Δk : k < |Q|}.
8: return l as the estimated period.
Fig. 11. Estimating period with the median gap between two neighboring local minima
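A near-literal Python transcription of Figure 11 is given below; the local-minimum test and the use of NumPy's quantile function are implementation details of ours, and at least two qualifying local minima are assumed to exist (true for reasonably periodic data).

```python
import numpy as np

def estimate_period(T, n, rng=None):
    """Estimate the period as the median gap between neighbouring local minima
    of a single distance profile (the procedure of Figure 11)."""
    T = np.asarray(T, dtype=float)
    rng = rng or np.random.default_rng()
    m = len(T)
    p = int(rng.integers(0, m - n + 1))                       # 1: random location
    profile = np.array([np.linalg.norm(T[p:p + n] - T[q:q + n])
                        for q in range(m - n + 1)])           # 2: dist(p, q) for every q
    c_p = np.quantile(profile, 0.05)                          # 3: lower 5% quantile
    minima = [q for q in range(1, m - n)                      # 4: cheap local minima
              if profile[q] <= c_p
              and profile[q] <= profile[q - 1]
              and profile[q] <= profile[q + 1]]
    gaps = np.diff(np.sort(minima))                           # 5-6: gaps between them
    return float(np.median(gaps))                             # 7-8: median gap
```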
3.3 Implementation of the Search Strategy
With heuristics for both Traversing and Sampling, Figure 12 implements Line 1 in Figure 6. The procedure uses a sequential covering strategy to estimate dˆp for each p. In each iteration (the while loop from Line 2 to Line 10), a Sampling operation is done to find a "sweet spot". Then a Traversing operation exploits that location to cover as many neighboring locations as possible. The verification stage of our algorithm (Lines 2–10 in Figure 6) consists of a while loop which resembles the outer loop in HOT SAX and WAT. But here the order of locations is dynamic, determined by the ever-improving estimate dˆ. Line 8 in Figure 6 further improves dˆ when the initial guess for the discord turns out to be incorrect. The improvement can be achieved by traversing with a better starting location qp∗ produced in Line 4 of Figure 6 (see Figure 13). As suggested by Figure 8, the "best" locations tend to cluster along the 45-degree lines. Moreover, the large value of the initial estimate dˆp∗ suggests that the neighborhood of p∗ is a high-payoff region for further refinement of dˆ. As the traversing is done locally, the improvement step is relatively fast compared to the initial estimation step for d. To sum up, we have described a new algorithm for discord search that consists of an estimation stage followed by a verification stage. The estimation stage
1: traversed[p] ← FALSE and mindist[p] ← ∞, for each p.
2: while traversed[p] = FALSE for some p do
3:   Randomly pick a location p from {p : traversed[p] = FALSE}.
4:   Q ← {q : |p − q| = k · l for some integer k}.
5:   Do local search for the optimal qp with starting points in Q. {Sampling}
6:   cp ← the lower 5% quantile of all distances calculated in the previous step.
7:   Find the largest numbers L and R such that dist(p − i, qp − i) < cp for all i ≤ L and dist(p + i, qp + i) < cp for all i ≤ R. {Traversing}
8:   traversed[p − L : p + R] ← TRUE.
9:   mindist[p − L : p + R] ← {dist(p + i, qp + i) : −L ≤ i ≤ R}.
10: end while

Fig. 12. Implementation of d estimation (Line 1 in Figure 6)

1: Let qp be the minimum location returned with Exhausting.
2: cp ← the lower 5% quantile of all distances calculated in Exhausting.
3: Find the largest numbers L and R such that mindist[p − L : p + R] ≥ dp, dist(p − i, qp − i) < cp for all i ≤ L, and dist(p + i, qp + i) < cp for all i ≤ R.
4: mindist[p − L : p + R] ← {dist(p + i, qp + i) : −L ≤ i ≤ R}.

Fig. 13. Traversing with a better starting point to improve dˆ
achieves efficiency by dynamically differentiating locations p according to their potential influence on maxp dˆp. Further reduction in computation cost comes from the periodicity of a time series. In general, the Traversing heuristic works best when a time series is smooth (or equivalently densely sampled), while the Sampling heuristic works best when the periodicity of the time series is pronounced. The algorithm is guaranteed to halt and to return the discord by Observation 1. The efficiency of the algorithm is evaluated in the following section.
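The estimation stage (Figure 12) can be prototyped as below. This is a simplified sketch under our own assumptions: the period l is taken from the Figure 11 procedure, the local search of Line 5 is reduced to evaluating only the period-multiple candidates, and locations with no such candidate fall back to an exact computation.

```python
import numpy as np

def dist(T, p, q, n):
    return float(np.linalg.norm(T[p:p + n] - T[q:q + n]))

def estimate_d_hat(T, n, l, rng=None):
    """Sequential-covering estimate of the d-hat vector (cf. Figure 12)."""
    T = np.asarray(T, dtype=float)
    rng = rng or np.random.default_rng()
    l = max(int(round(l)), 1)                  # period from the Figure 11 procedure
    num = len(T) - n + 1                       # number of subsequence locations
    d_hat = np.full(num, np.inf)
    traversed = np.zeros(num, dtype=bool)
    while not traversed.all():
        p = int(rng.choice(np.flatnonzero(~traversed)))        # Line 3
        # Sampling: candidates a multiple of the period away from p (Line 4)
        Q = [q for k in range(1, num // l + 2)
             for q in (p - k * l, p + k * l)
             if 0 <= q < num and abs(p - q) >= n]
        if not Q:                               # fallback: no periodic candidate exists
            d_hat[p] = min(dist(T, p, q, n) for q in range(num) if abs(p - q) >= n)
            traversed[p] = True
            continue
        dists = {q: dist(T, p, q, n) for q in Q}
        q_p = min(dists, key=dists.get)                         # Line 5 (sweet spot)
        c_p = np.quantile(list(dists.values()), 0.05)           # Line 6
        for step in (1, -1):                                    # Line 7: Traversing
            i = 0
            while True:
                pp, qq = p + step * i, q_p + step * i
                if not (0 <= pp < num and 0 <= qq < num and abs(pp - qq) >= n):
                    break
                d = dist(T, pp, qq, n)
                if d >= c_p and i > 0:
                    break
                d_hat[pp] = min(d_hat[pp], d)                   # Line 9
                traversed[pp] = True                            # Line 8
                i += 1
    return d_hat
```

Feeding the result into the direct-search sketch given after Figure 6 closes the loop: `estimate_d_hat` supplies the upper bounds, and the exact computation is only needed for the few top-ranked candidates during verification.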
4 Empirical Evaluation
In this section, we first compare the performance of our direct-discord-search algorithm with the results reported for HOT SAX in [5]. We then report the performance of our algorithm on a collection of publicly available time series. Following the tradition established in [5] and [1], the efficiency of our algorithm was measured by the number of calls to the distance function, as opposed to wall-clock or CPU time. Since our algorithm entails no overhead of constructing an index (in contrast to the algorithms in [5] and [1]), the number of calls to the distance function is roughly proportional to the total computation time involved. As shown in [2] and [1], the performance of HOT SAX depends on the parameters selected. Here we assume that the metrics reported in [5] were based on optimal parameter values. To compare to HOT SAX, we use the dataset qtdbsel102 from [6]. Although several datasets were used in [5] to evaluate the performance of HOT SAX, this is the only one readily available to us. The dataset qtdbsel102 contains two time series of length 45,000; we use the first one, as the two are highly correlated.
Fig. 14. Search costs for the direct search algorithm and HOT SAX. For HOT SAX, the mean numbers of distance calls were visually estimated from [5]; interval estimates were used to account for potential estimation error.
Fig. 15. Time series nprs44 and its dˆ vector
Following [5], we created random excerpts of length {1000 × 2^k : 0 ≤ k ≤ 5} from the original time series¹. For each length configuration, 100 random excerpts were created and the top 3 discords of length 128 were searched for. Table 1 shows the mean and the standard error of the numbers of calls to the distance function. The rightmost column of the table contains the mean performance metric visually estimated from Figure 13 of [5]. Similar information is also visualized in Figure 14. The figure plots the numbers of calls to the distance function for 6 × 3 × 100 runs of the direct-discord-search algorithm. Each point corresponds to one run of discord search; horizontal jitter was applied to reduce overlaps among points. The dashed intervals estimate the average number of calls to the distance function by HOT SAX. Loess lines for the costs of searching for the top-3 discords are also plotted. We can see that for the 1st discord, the average number of calls by the direct search algorithm (the red line) is roughly linear in the size of the time series excerpts. Moreover, these numbers are significantly smaller than the numbers reported for HOT SAX (summarized with the dashed intervals). For subsequent discords, the average numbers of calls to the distance function (the blue line and the green line) decrease significantly, due to information gained from prior computation. The metrics for the second and the third discords also show larger variance: some points are significantly higher or lower than the loess lines. A likely cause is that the complete time series contains only a few truly anomalous subsequences (discords): when a random excerpt of the time series includes only one (or two) of these discords, searching for the second (or the third, respectively) discord will be difficult. (Note the plot uses the log scale for the x and y axes.)
¹ Experiments on the length 64,000 were not carried out because qtdbsel102 has only 45,000 points and we chose not to pad the time series with hypothetical values.
Table 1. Numbers of calls to the distance function with random excerpts from qtdbsel102, for the direct-discord-search algorithm and HOT SAX

Time Series Length | 1st discord       | 2nd discord       | 3rd discord       | Aver. Cost for HOT SAX (visual estimates)
1,000              | 4,020 (1,441)     | 1,072 (705)       | 998 (690)         | 16,000 to 40,000
2,000              | 11,159 (4,641)    | 4,120 (2,532)     | 3,493 (2,780)     | 40,000 to 100,000
4,000              | 30,938 (12,473)   | 13,963 (10,633)   | 13,399 (12,473)   | 60,000 to 160,000
8,000              | 77,381 (33,064)   | 29,711 (32,651)   | 38,632 (40,974)   | 100,000 to 160,000
16,000             | 168,277 (70,071)  | 94,855 (107,128)  | 141,038 (143,553) | 250,000 to 400,000
32,000             | 365,900 (184,540) | 198,797 (95,960)  | 105,911 (107,992) | 400,000 to 1×10^6

(Direct search costs are given as mean (standard error).)
In the second set of experiments, we search for the top 3 discords for a collection of time series from [6]² and [4], using the proposed algorithm. For time series from [6], the discord lengths are chosen to be consistent with configurations used in [5]. The results are shown in Table 2. Many of these datasets, in particular 2h_radioactivity, demonstrate little periodicity. The results show that our algorithm has reasonable performance even for such time series.

Table 2. Numbers of calls to the distance function for top-3 discord search

Time Series      | Length | Discord Length | 1st discord      | 2nd discord      | 3rd discord
nprs44           | 24,085 | 320            | 249,283 (12,454) | 231,350 (19,949) | 208,539 (34,640)
nprs43           | 18,012 | 320            | 188,095 (11,820) | 24,588 (2,785)   | 109,147 (29,516)
power data       | 35,000 | 750            | 158,235 (13,546) | 34,680 (874)     | 37,992 (3,460)
chfdbchf15       | 15,000 | 256            | 79,683 (5,606)   | 21,400 (2,224)   | 134,734 (18,967)
2h radioactivity | 4,370  | 128            | 157,495 (8,799)  | 20,286 (5,725)   | 16,463 (4,657)

(Search costs are given as mean (standard error).)
In Table 2, the results for the time series nprs44 are particularly interesting. For nprs44, no significant reduction in computation is observed for computing the 2nd and the 3rd discords. To find out why, we plot the time series and the estimated dˆ vector in Figure 15. The figure shows that the 2nd and the 3rd discords are not noticeably different from other subsequences.

Completely nonperiodic case. Completely nonperiodic time series rarely exist in applications, and they can be easily identified through visual inspection of the time series or their autocorrelation function. In the unlikely situation where our algorithm is blindly applied to a completely nonperiodic time series, a bad estimation of the period will reduce the efficiency of the algorithm. To demonstrate this, we generate two random walk time series T with tp = Σ_{i=1}^{p} Zi, where the Zi are independent normally-distributed random variables with mean 0 and variance 1
² For datasets containing more than one time series, we take the first one in each data file.
(a) Random Walk 1
(b) Random Walk 2
Fig. 16. Random walk time series used in the experiments for completely nonperiodic data

Table 3. Number of calls to the distance function for top-3 discord search (random walk time series)

Time Series   | Length | Discord Length | 1st              | 2nd              | 3rd
random walk 1 | 15,000 | 256            | 136,395 (7,410)  | 54,994 (10,144)  | 34,355 (7,696)
random walk 2 | 30,000 | 128            | 441,685 (35,695) | 329,380 (50,432) | 636,930 (164,842)
(see Figure 16). Random walk time series are interesting in two respects: firstly, a random walk time series is completely nonperiodic; secondly, every subsequence of a random walk can be regarded as equally anomalous. We applied the algorithm to find the top-3 discords in the two random-walk time series. The results are shown in Table 3. Without tuning any parameter, the algorithm is still hundreds of times faster than the brute-force computation of all pairwise distances. To sum up, our experiments show a clear performance improvement on quasi-periodic time series by the proposed direct discord-search algorithm. Our algorithm also demonstrates consistent performance across a broad range of time series with varying degrees of periodicity.
5 Conclusions and Future Work
The paper has introduced a parameter-free algorithm for top-K discord search. When a time series is nearly periodic or quasi-periodic, the algorithm demonstrated significant reduction in computation time. Many applications generate quasi-periodic time series, and the assumption of quasi-periodicity can be assessed by simple visual inspection. Therefore our algorithm has wide applicability. Our results have shown that periodicity is a useful feature in time-series anomaly detection. More theoretical study is needed to better understand the effect of periodicity on the search space of time-series discords. We are also interested in knowing to what extent the results in this paper can be generalized to chaotic time series [10].
One limitation of the proposed algorithm is that the time series needs to fit into the main memory. Hence the algorithm requires O(m) memory. One future direction is to explore disk-aware approximations to the direct-discord-search algorithm. When the time series is too large to fit into the main memory, one needs to minimize the number of disk scans as well as the number of calls to the distance function (see [11]). Another direction is to explore alternative ways of estimating the d vector so that the number of iterations for refining dˆ is minimized. We are also looking for ways to extend the algorithm so that the periodicity assumption can be removed.
Acknowledgment. Support for this work was provided by an Australian Research Council Linkage Grant (LP 0776417). We would like to thank the anonymous reviewers for their helpful comments.
References
1. Bu, Y., Leung, T.W., Fu, A.W.C., Keogh, E., Pei, J., Meshkin, S.: WAT: Finding top-k discords in time series database. In: Proceedings of the 7th SIAM International Conference on Data Mining (2007)
2. Fu, A.W.-c., Leung, O.T.-W., Keogh, E.J., Lin, J.: Finding time series discords based on Haar transform. In: Li, X., Zaïane, O.R., Li, Z.-h. (eds.) ADMA 2006. LNCS (LNAI), vol. 4093, pp. 31–41. Springer, Heidelberg (2006)
3. Goldberger, A.L., Amaral, L.A.N., Glass, L., Hausdorff, J.M., Ivanov, P.C., Mark, R.G., Mietus, J.E., Moody, G.B., Peng, C.-K., Stanley, H.E.: PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation 101(23), e215–e220 (2000), Circulation Electronic Pages: http://circ.ahajournals.org/cgi/content/full/101/23/e215
4. Hyndman, R.J.: Time Series Data Library, http://www.robjhyndman.com/TSDL (accessed on April 15, 2010)
5. Keogh, E., Lin, J., Fu, A.: HOT SAX: Efficiently finding the most unusual time series subsequence. In: Proc. of the 5th IEEE International Conference on Data Mining, pp. 226–233 (2005)
6. Keogh, E., Lin, J., Fu, A.: The UCR Time Series Discords Homepage, http://www.cs.ucr.edu/~eamonn/discords/
7. Lindström, J., Kokko, H., Ranta, E.: Detecting periodicity in short and noisy time series data. Oikos 78(2), 406–410 (1997)
8. Pham, N.D., Le, Q.L., Dang, T.K.: HOT aSAX: A novel adaptive symbolic representation for time series discords discovery. In: Nguyen, N.T., Le, M.T., Świątek, J. (eds.) ACIIDS 2010. LNCS, vol. 5990, pp. 113–121. Springer, Heidelberg (2010)
9. Sion, M.: On general minimax theorems. Pacific J. Math. 8(1), 171–176 (1958)
10. Sprott, J.C.: Chaos and time-series analysis. Oxford Univ. Pr., Oxford (2003)
11. Yankov, D., Keogh, E., Rebbapragada, U.: Disk aware discord discovery: Finding unusual time series in terabyte sized datasets. Knowledge and Information Systems 17(2), 241–262 (2008)
INSIGHT: Efficient and Effective Instance Selection for Time-Series Classification

Krisztian Buza, Alexandros Nanopoulos, and Lars Schmidt-Thieme

Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany
{buza,nanopoulos,schmidt-thieme}@ismll.de
Abstract. Time-series classification is a widely examined data mining task with various scientific and industrial applications. Recent research in this domain has shown that the simple nearest-neighbor classifier using Dynamic Time Warping (DTW) as distance measure performs exceptionally well, in most cases outperforming more advanced classification algorithms. Instance selection is a commonly applied approach for improving the efficiency of the nearest-neighbor classifier with respect to classification time. This approach reduces the size of the training set by selecting the best representative instances and using only them during classification of new instances. In this paper, we introduce a novel instance selection method that exploits the hubness phenomenon in time-series data, which states that a few instances tend to be much more frequently nearest neighbors than the remaining instances. Based on hubness, we propose a framework for score-based instance selection, which is combined with a principled approach of selecting instances that optimize the coverage of the training data. We discuss the theoretical considerations of casting the instance selection problem as a graph-coverage problem and analyze the resulting complexity. We experimentally compare the proposed method, denoted as INSIGHT, against FastAWARD, a state-of-the-art instance selection method for time series. Our results indicate substantial improvements in terms of classification accuracy and drastic reduction (orders of magnitude) in execution times.
1 Introduction
Time-series classification is a widely examined data mining task with applications in various domains, including finance, networking, medicine, astronomy, robotics, biometrics, chemistry and industry [11]. Recent research in this domain has shown that the simple nearest-neighbor (1-NN) classifier using Dynamic Time Warping (DTW) [18] as distance measure is "exceptionally hard to beat" [6]. Furthermore, the 1-NN classifier is easy to implement and delivers a simple model together with a human-understandable explanation in the form of an intuitive justification by the most similar training instances. The efficiency of nearest-neighbor classification can be improved with several methods, such as indexing [6]. However, for very large time-series data sets, the execution time for classifying new (unlabeled) time-series can still be affected by the significant computational requirements posed by the need to calculate the DTW distance between the new time-series and several time-series in the training data set (O(n) in the worst case, where n is the size of the training set). Instance selection is a commonly applied approach for speeding up nearest-neighbor classification. This approach reduces the size
of the training set by selecting the best representative instances and using only them during classification of new instances. Due to its advantages, instance selection has been explored for time-series classification [20]. In this paper, we propose a novel instance-selection method that exploits the recently explored concept of hubness [16], which states that a few instances tend to be much more frequently nearest neighbors than the remaining ones. Based on hubness, we propose a framework for score-based instance selection, which is combined with a principled approach of selecting instances that optimize the coverage of the training data, in the sense that a time series x covers another time series y if y can be classified correctly using x. The proposed framework not only allows a better understanding of the instance selection problem, but also helps to analyze the properties of the proposed approach from the point of view of coverage maximization. For the above reasons, the proposed approach is denoted as Instance Selection based on Graph-coverage and Hubness for Time-series (INSIGHT). INSIGHT is evaluated experimentally with a collection of 37 publicly available time series classification data sets and is compared against FastAWARD [20], a state-of-the-art instance selection method for time series classification. We show that INSIGHT substantially outperforms FastAWARD both in terms of classification accuracy and execution time for performing the selection of instances. The paper is organized as follows. We begin by reviewing related work in Section 2. Section 3 introduces score-based instance selection and the implications of hubness for score-based instance selection. In Section 4, we discuss the complexity of the instance selection problem and the properties of our approach. Section 5 presents our experiments, followed by our concluding remarks in Section 6.
2 Related Work
Attempts to speed up DTW-based nearest neighbor (NN) classification [3] fall into four major categories: i) speeding up the calculation of the distance between two time series, ii) reducing the length of the time series, iii) indexing, and iv) instance selection. Regarding the calculation of the DTW distance, the major issue is that, implemented in the classic way [18], the comparison of two time series of length l requires the calculation of the entries of an l × l matrix using dynamic programming, and therefore each comparison has a complexity of O(l²). A simple idea is to limit the warping window size, which eliminates the calculation of most of the entries of the DTW matrix: only a small fraction around the diagonal remains. Ratanamahatana and Keogh [17] showed that such a reduction does not negatively influence classification accuracy; instead, it leads to more accurate classification. More advanced scaling techniques include lower bounding, like LB_Keogh [10]. Another way to speed up time series classification is to reduce the length of the time series by aggregating consecutive values into a single number [13], which reduces the overall length of the time series and thus makes their processing faster. Indexing [4], [7] aims at quickly finding the most similar training time series to a given time series. Despite the "filtering" step that is performed by indexing, the execution time for classifying new time series can still be considerable for large time-series data sets, since it can be affected by the significant computational requirements posed by the need to calculate the DTW distance between the new time-series and several time-series in the
training data set (O(n) in the worst case, where n is the size of the training set). For this reason, indexing can be considered complementary to instance selection, since both techniques can be applied to improve execution time. Instance selection (also known as numerosity reduction or prototype selection) aims at discarding most of the training time series while keeping only the most informative ones, which are then used to classify unlabeled instances. While instance selection is well explored for general nearest-neighbor classification, see e.g. [1], [2], [8], [9], [14], there are just a few works for the case of time series. Xi et al. [20] present the FastAWARD approach and show that it outperforms state-of-the-art, general-purpose instance selection techniques applied to time series. FastAWARD follows an iterative procedure for discarding time series: in each iteration, the rank of all the time series is calculated and the one with the lowest rank is discarded. Thus, each iteration corresponds to a particular number of kept time series. Xi et al. argue that the optimal warping window size depends on the number of kept time series. Therefore, FastAWARD calculates the optimal warping window size for each number of kept time series. FastAWARD makes some decisions whose nature can be considered ad hoc (such as the application of an iterative procedure or the use of tie-breaking criteria [20]). Conversely, INSIGHT follows a more principled approach. In particular, INSIGHT generalizes FastAWARD by being able to use several formulae for scoring instances. We will explain that the suitability of such formulae is based on the hubness property that holds in most time-series data sets. Moreover, we provide insights into the fact that the iterative procedure of FastAWARD is not a well-formed decision, since its large computation time can be saved by ranking instances only once. Furthermore, we observed the warping window size to be less crucial, and therefore we simply use a fixed window size for INSIGHT (which outperforms FastAWARD with its adaptive window size).
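As background for the DTW-based distances used throughout, a minimal constrained-DTW implementation (a Sakoe–Chiba-style warping window, as discussed above) might look like this; it is an illustrative sketch, not the implementation used by INSIGHT or FastAWARD.

```python
import numpy as np

def dtw_distance(x, y, window):
    """DTW distance between equal-length series x and y, restricted to a
    warping window of +/- `window` positions around the diagonal."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    l = len(x)
    D = np.full((l + 1, l + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, l + 1):
        lo, hi = max(1, i - window), min(l, i + window)
        for j in range(lo, hi + 1):
            cost = (x[i - 1] - y[j - 1]) ** 2
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(np.sqrt(D[l, l]))
```

Limiting the window to roughly w cells per row reduces the cost of one comparison from O(l²) to O(l·w), which is what makes small warping windows attractive in practice.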
3 Score Functions in INSIGHT
INSIGHT performs instance selection by assigning a score to each instance and selecting the instances with the highest scores (see Alg. 1). In this section, we examine how to develop appropriate score functions by exploiting the property of hubness.

3.1 The Hubness Property
In order to develop a score function that selects representative instances for nearest-neighbor time-series classification, we have to take into account the recently explored property of hubness [15]. This property states that for data with high (intrinsic) dimensionality, as is the case for most time-series data¹, some objects tend to become nearest neighbors much more frequently than others. In order to express hubness in a more precise way, for a data set D we define the k-occurrence of an instance x ∈ D, denoted fNk(x), as the number of instances of D having x among their k nearest neighbors. With the term hubness we refer to the phenomenon that the distribution of fNk(x) becomes
¹ In the case of time series, consecutive values are strongly interdependent; thus, instead of the length of the time series, we have to consider the intrinsic dimensionality [16].
Fig. 1. Distribution of fG1(x) for some time series datasets. The horizontal axis corresponds to the values of fG1(x), while the vertical axis shows how many instances have that value.
significantly skewed to the right. We can measure this skewness, denoted by S_{fNk(x)}, with the standardized third moment of fNk(x):

S_{fNk(x)} = E[(fNk(x) − μ_{fNk(x)})³] / σ³_{fNk(x)}    (1)

where μ_{fNk(x)} and σ_{fNk(x)} are the mean and standard deviation of fNk(x). When S_{fNk(x)} is higher than zero, the corresponding distribution is skewed to the right and starts presenting a long tail. In the presence of labeled data, we distinguish between good hubness and bad hubness: we say that the instance y is a good (bad) k-nearest neighbor of the instance x if (i) y is one of the k-nearest neighbors of x, and (ii) both have the same (different) class labels. This allows us to define the good (bad) k-occurrence of a time series x, fGk(x) (and fBk(x) respectively), which is the number of other time series that have x as one of their good (bad) k-nearest neighbors. For time series, both distributions fGk(x) and fBk(x) are usually skewed, as is exemplified in Figure 1, which depicts the distribution of fG1(x) for some time series data sets (from the collection used in Table 1). As shown, the distributions have long tails, in which the good hubs occur. We say that a time series x is a good (bad) hub if fGk(x) (and fBk(x) respectively) is exceptionally large for x. For the nearest neighbor classification of time series, the skewness of good occurrence is of major importance, because a few time series (i.e., the good hubs) are able to correctly classify most of the other time series. Therefore, it is evident that instance selection should pay special attention to good hubs.

3.2 Score Functions Based on Hubness
Good 1-occurrence score — In the light of the previous discussion, INSIGHT can use scores that take the good 1-occurrence of an instance x into account. Thus, a simple score function that follows directly is the good 1-occurrence score fG(x):

fG(x) = fG1(x)    (2)

Henceforth, when there is no ambiguity, we omit the upper index 1.
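To make the quantities above concrete, the following sketch computes fN1(x), fG1(x) and fB1(x) from a precomputed pairwise distance matrix; the dense matrix, the NumPy/SciPy calls and the function name are our own choices, not part of the INSIGHT description.

```python
import numpy as np
from scipy.stats import skew

def one_occurrence_counts(D, labels):
    """Return (f_N1, f_G1, f_B1) for every instance, given a pairwise distance
    matrix D (D[i, j] = distance between training instances i and j)."""
    labels = np.asarray(labels)
    D = np.asarray(D, dtype=float).copy()
    np.fill_diagonal(D, np.inf)              # an instance is not its own neighbour
    nn = D.argmin(axis=1)                    # index of each instance's 1-NN
    n = len(labels)
    f_N = np.zeros(n, dtype=int)
    f_G = np.zeros(n, dtype=int)
    f_B = np.zeros(n, dtype=int)
    for i, j in enumerate(nn):
        f_N[j] += 1                          # j occurs as someone's nearest neighbour
        if labels[i] == labels[j]:
            f_G[j] += 1                      # good occurrence: labels match
        else:
            f_B[j] += 1                      # bad occurrence: labels differ
    return f_N, f_G, f_B

# The hubness property of Eq. (1) can then be checked directly, e.g.:
#   f_N, f_G, f_B = one_occurrence_counts(D, labels)
#   print(skew(f_N), skew(f_G))   # markedly positive skewness indicates pronounced hubs
```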
While x is a good hub, it may at the same time appear as a bad neighbor of several other instances. Thus, INSIGHT can also consider scores that take bad occurrences into account. This leads to scores that relate the good occurrence of an instance x to either its total occurrence or its bad occurrence. For simplicity, we focus on the following relative score; however, other variations can be used too. The relative score fR(x) of a time series x is the fraction of good 1-occurrences and total occurrences plus one (the plus one in the denominator avoids division by zero):

fR(x) = fG1(x) / (fN1(x) + 1)    (3)
Xi's score — Interestingly, fGk(x) and fBk(x) allow us to interpret the ranking criterion of Xi et al. [20], by expressing it as another form of score for relative hubness:

fXi(x) = fG1(x) − 2 fB1(x)    (4)
4 Coverage and Instance Selection
Based on scoring functions, such as those described in the previous section, INSIGHT selects top-ranked instances (see Alg. 1). However, while ranking the instances, it is also important to examine the interactions between them. For example, suppose that the 1st top-ranked instance allows correct 1-NN classification of almost the same instances as the 2nd top-ranked instance. The contribution of the 2nd top-ranked instance is, therefore, not important with respect to the overall classification. In this section we describe the concept of coverage graphs, which helps to examine the aforementioned aspect of interactions between the selected instances. In Section 4.1 we examine the general relation between coverage graphs and instance-based learning methods, whereas in Section 4.2 we focus on the case of 1-NN time-series classification.

4.1 Coverage Graphs for Instance-Based Learning Methods
We first define coverage graphs, which in the sequel allow us to cast the instance-selection problem as a graph-coverage problem:

Definition 1 (Coverage graph). A coverage graph Gc = (V, E) is a directed graph, where each vertex v ∈ VGc corresponds to a time series of the (labeled) training set. A directed edge from vertex vx to vertex vy, denoted as (vx, vy) ∈ EGc, states that instance x contributes to the correct classification of instance y.

Algorithm 1. INSIGHT
Require: Time-series dataset D, score function f, number of selected instances N
Ensure: Set of selected instances (time series) D′
1: Calculate the score function f(x) for all x ∈ D
2: Sort all the time series in D according to their scores f(x)
3: Select the top-ranked N time series and return the set containing them
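Algorithm 1 itself is only a few lines once a score function is available; the sketch below wires the scores of Eqs. (2)–(4) into the selection step, reusing the `one_occurrence_counts` helper from the previous sketch (again an illustration, with the DTW distance matrix assumed to be precomputed).

```python
import numpy as np

def insight_select(D, labels, num_selected, score="fG"):
    """Score-based instance selection (cf. Algorithm 1).

    D            : pairwise (DTW) distance matrix of the training set
    labels       : class labels of the training instances
    num_selected : number N of instances to keep
    score        : 'fG' (Eq. 2), 'fR' (Eq. 3) or 'fXi' (Eq. 4)
    """
    f_N, f_G, f_B = one_occurrence_counts(D, labels)   # helper from the sketch above
    if score == "fG":
        s = f_G.astype(float)
    elif score == "fR":
        s = f_G / (f_N + 1.0)
    elif score == "fXi":
        s = f_G - 2.0 * f_B
    else:
        raise ValueError("unknown score function")
    order = np.argsort(-s)                   # rank instances by decreasing score
    return order[:num_selected]              # indices of the selected time series
```

The selected indices are then the only training instances consulted by the 1-NN classifier at prediction time.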
We first examine coverage graphs for the general case of instance-based learning methods, which include the k-NN (k ≥ 1) classifier and its generalizations, such as adaptive k-NN classification where the number of nearest neighbors k is chosen adaptively for each object to be classified [12], [19].² In this context, the contribution of an instance x to the correct classification of an instance y refers to the case when x is among the nearest neighbors of y and they have the same label. Based on the definition of the coverage graph, we can next define the coverage of a specific vertex and of a set of vertices:

Definition 2 (Coverage of a vertex and of a vertex-set). A vertex v covers another vertex v′ if there is an edge from v′ to v; C(v) is the set of all vertices covered by v: C(v) = {v′ | v′ ≠ v ∧ (v′, v) ∈ EGc}. Moreover, a set of vertices S0 covers all the vertices that are covered by at least one vertex v ∈ S0: C(S0) = ⋃_{v∈S0} C(v).

Following the common assumption that the distribution of the test (unlabeled) data is similar to the distribution of the training (labeled) data, the more vertices are covered, the better prediction for new (unlabeled) data is expected. Therefore, the objective of an instance-selection algorithm is to have the selected vertex-set S (i.e., selected instances) cover the entire set of vertices (i.e., the entire training set), i.e., C(S) = VGc. This, however, may not always be possible, such as when there exist vertices that are not covered by any other vertex. If a vertex v is not covered by any other vertex, this means that the out-degree of v is zero (there are no edges going from v to other vertices). Denote the set of such vertices by V⁰Gc. Then, an ideal instance selection algorithm should cover all coverable vertices, i.e., for the selected vertices S an ideal instance selection algorithm should fulfill:

⋃_{v∈S} C(v) = VGc \ V⁰Gc    (5)
In order to achieve the aforementioned objective, the trivial solution is to select all the instances of the training set, i.e., choose S = VGc. This, however, is not an effective instance selection algorithm, as the major aim of discarding less important instances is not achieved at all. Therefore, the natural requirement regarding the ideal instance selection algorithm is that it selects the minimal number of instances that together cover all coverable vertices. This way we can cast the instance selection task as a coverage problem:

Instance selection problem (ISP) — We are given a coverage graph Gc = (V, E). We aim at finding a set of vertices S ⊆ VGc so that: i) all the coverable vertices are covered (see Eq. 5), and ii) the size of S is minimal among all those sets that cover all coverable vertices.

Next we will show that this problem is NP-complete, because it is equivalent to the set-covering problem (SCP), which is NP-complete [5]. We proceed with recalling the set-covering problem.
² Please notice that in the general case the resulting coverage graph has no regularity regarding both the in- and out-degrees of the vertices (e.g., in the case of the k-NN classifier with adaptive k).
Set-covering problem (SCP) — "An instance (X, F) of the set-covering problem consists of a finite set X and a family F of subsets of X, such that every element of X belongs to at least one subset in F. (...) We say that a subset F ∈ F covers its elements. The problem is to find a minimum-size subset C ⊆ F whose members cover all of X" [5]. Formally: the task is to find C ⊆ F, so that |C| is minimal and X = ⋃_{F∈C} F.
Theorem 1. ISP and SCP are equivalent. (See Appendix for the proof.)

4.2 1-NN Coverage Graphs
In this section, we introduce 1-nearest neighbor (1-NN) coverage graphs, which are motivated by the good performance of the 1-NN classifier for time series classification. We show the optimality of INSIGHT for the case of 1-NN coverage graphs and how the NP-completeness of the general case (Section 4.1) is alleviated for this special case. We first define the specialization of the coverage graph based on the 1-NN relation:

Definition 3 (1-NN coverage graph). A 1-NN coverage graph, denoted by G1NN, is a coverage graph where (vx, vy) ∈ EG1NN if and only if time series y is the first nearest neighbor of time series x and the class labels of x and y are equal.

This definition states that an edge points from each vertex v to the nearest neighbor of v, only if this is a good nearest neighbor (i.e., their labels match). Thus, vertices are not connected with their bad nearest neighbors. From the practical point of view, to account for the fact that the number of selected instances is defined a priori (e.g., a user-defined parameter), a slightly different version of the Instance Selection Problem (ISP) is the following:

m-limited Instance Selection Problem (m-ISP) — If we wish to select exactly m labeled time series from the training set, then, instead of selecting the minimal number of time series that ensure total coverage, we select those m time series that maximize the coverage. We call this variant the m-limited Instance Selection Problem (m-ISP).

The following proposition shows the relation between 1-NN coverage graphs and m-ISP:

Proposition 1. In 1-NN coverage graphs, selecting m vertices v1, ..., vm that have the largest covered sets C(v1), ..., C(vm) leads to the optimal solution of m-ISP.

The validity of this proposition stems from the fact that, in 1-NN coverage graphs, the out-degree of all vertices is 1. This implies that each vertex is covered by at most one other vertex, i.e., the covered sets C(v) are mutually disjoint for each v ∈ VG1NN. Proposition 1 describes the optimality of INSIGHT when the good 1-occurrence score (Equation 2) is used, since the size of the set C(vi) is the number of vertices having vi as their first good nearest neighbor. It has to be noted that the described framework of coverage graphs can be extended to other scores too, such as the relative scores (Equations 3 or 4). In such cases, we can additionally model bad neighbors and introduce weights on the edges of the graph. For example, for the score of Equation 4, the weight of an edge e is +1 if e denotes a good neighbor, whereas it is −2 if e denotes a bad neighbor. We can define the coverage score of a vertex v as the sum of the weights of the incoming edges to v and aim to maximize this coverage score. The detailed examination of this generalization is addressed as future work.
Fig. 2. Accuracy as a function of the number of selected instances (in % of the entire training data) for some datasets for FastAWARD and INSIGHT
5 Experiments
We experimentally examine the performance of INSIGHT with respect to effectiveness, i.e., classification accuracy, and efficiency, i.e., execution time required by instance selection. As a baseline we use FastAWARD [20]. We used 37 publicly available time series datasets³ [6]. We performed 10-fold cross-validation. INSIGHT uses fG(x) (Eq. 2) as the default score function; however, fR(x) (Eq. 3) and fXi(x) (Eq. 4) are also examined. The resulting combinations are denoted as INS-fG(x), INS-fR(x) and INS-fXi(x), respectively. The distance function for the 1-NN classifier is DTW using warping windows [17]. In contrast to FastAWARD, which determines the optimal warping window size ropt, INSIGHT sets the warping-window size to a constant of 5%. (This selection is justified by the results presented in [17], which show that relatively small window sizes lead to higher accuracy.) In order to speed up the calculations, we used the LB_Keogh lower bounding technique [10] for both INSIGHT and FastAWARD.

Results on Effectiveness — We first compare INSIGHT and FastAWARD in terms of the classification accuracy that results when using the instances selected by these two methods. Table 1 presents the average accuracy and corresponding standard deviation for each data set, for the case when the number of selected instances is equal to 10% of the size of the training set (for INSIGHT, the INS-fG(x) variation is used). In the vast majority of cases, INSIGHT substantially outperforms FastAWARD. In the few remaining cases, their differences are remarkably small (in the order of the second or third decimal digit, which is not significant in terms of standard deviations). We also compared INSIGHT and FastAWARD in terms of the resulting classification accuracy for a varying number of selected instances. Figure 2 illustrates that INSIGHT compares favorably to FastAWARD. Due to space constraints, we cannot present such results for all data sets, but an analogous conclusion is drawn for all cases of Table 1 for which INSIGHT outperforms FastAWARD. Besides the comparison between INSIGHT and FastAWARD, it is also interesting to examine their relative performance compared to using the entire training data (i.e., no instance selection is applied). Indicatively, for 17 data sets from Table 1 the accuracy
³ For StarLightCurves, the calculations for FastAWARD had not been completed by the time of submission; therefore we omit this dataset.
Table 1. Accuracy ± standard deviation for INSIGHT and FastAWARD (bold font: winner)

Dataset            | FastAWARD     | INS-fG(x)
50words            | 0.526±0.041   | 0.642±0.046
Adiac              | 0.348±0.058   | 0.469±0.049
Beef               | 0.350±0.174   | 0.333±0.105
Car                | 0.450±0.119   | 0.608±0.145
CBF                | 0.972±0.034   | 0.998±0.006
Chlorine^a         | 0.537±0.023   | 0.734±0.030
CinC               | 0.406±0.089   | 0.966±0.014
Coffee             | 0.560±0.309   | 0.603±0.213
Diatom^b           | 0.972±0.026   | 0.966±0.058
ECG200             | 0.755±0.113   | 0.835±0.090
ECGFiveDays        | 0.937±0.027   | 0.945±0.020
FaceFour           | 0.714±0.141   | 0.894±0.128
FacesUCR           | 0.892±0.019   | 0.934±0.021
FISH               | 0.591±0.082   | 0.666±0.085
GunPoint           | 0.800±0.124   | 0.935±0.059
Haptics            | 0.303±0.068   | 0.435±0.060
InlineSkate        | 0.197±0.056   | 0.434±0.077
Italy^c            | 0.960±0.020   | 0.957±0.028
Lighting2          | 0.694±0.134   | 0.670±0.096
Lighting7          | 0.447±0.126   | 0.510±0.082
MALLAT             | 0.551±0.098   | 0.969±0.013
MedicalImages      | 0.642±0.033   | 0.693±0.049
Motes              | 0.867±0.042   | 0.908±0.027
OliveOil           | 0.633±0.100   | 0.717±0.130
OSULeaf            | 0.419±0.053   | 0.538±0.057
Plane              | 0.876±0.155   | 0.981±0.032
Sony^d             | 0.924±0.032   | 0.976±0.017
SonyII^e           | 0.919±0.015   | 0.912±0.033
SwedishLeaf        | 0.683±0.046   | 0.756±0.048
Symbols            | 0.957±0.018   | 0.966±0.016
SyntheticControl   | 0.923±0.068   | 0.978±0.026
Trace              | 0.780±0.117   | 0.895±0.072
TwoPatterns        | 0.407±0.027   | 0.987±0.007
TwoLeadECG         | 0.978±0.013   | 0.989±0.012
Wafer              | 0.921±0.012   | 0.991±0.002
WordsSynonyms      | 0.544±0.058   | 0.637±0.066
Yoga               | 0.550±0.017   | 0.877±0.021

a ChlorineConcentration, b DiatomSizeReduction, c ItalyPowerDemand, d SonyAIBORobotSurface, e SonyAIBORobotSurfaceII
resulting from INSIGHT (INS-fG(x)) is worse by less than 0.05 compared to using the entire training data. For FastAWARD this number is 4, which clearly shows that INSIGHT selects more representative instances of the training set than FastAWARD. Next, we investigate the reasons for the presented difference between INSIGHT and FastAWARD. In Section 3.1, we identified the skewness of the good k-occurrence, fGk(x), as a crucial property for instance selection to work properly, since skewness makes good hubs become representative instances. In our examination, we found that under the iterative procedure applied by FastAWARD, this skewness has a decreasing trend from iteration to iteration. Figure 3 exemplifies this by illustrating the skewness of fG1(x) for two data sets as a function of the iterations performed in FastAWARD. (In order to quantitatively measure skewness we use the standardized third moment, see Equation 1.) The reduction in the skewness of fG1(x) means that FastAWARD is not able to identify representative instances in the end, since there are no pronounced good hubs remaining. To further establish that the reduced effectiveness of FastAWARD stems from its iterative procedure and not from its score function fXi(x) (Eq. 4), we compare the accuracy of all variations of INSIGHT including INS-fXi(x); see Tab. 2. Remarkably, INS-fXi(x) clearly outperforms FastAWARD for the majority of cases, which verifies our previous statement. Moreover, the differences between the three variations are not large, indicating the robustness of INSIGHT with respect to the scoring function.
Fig. 3. Skewness of the distribution of fG1(x) as a function of the number of iterations performed in FastAWARD. On the trend, the skewness decreases from iteration to iteration.

Table 2. Number of datasets where different versions of INSIGHT win/lose against FastAWARD

       | INS-fG(x) | INS-fR(x) | INS-fXi(x)
Wins   | 32        | 33        | 33
Loses  | 5         | 4         | 4
Table 3. Execution times (in seconds, averaged over 10 folds) of instance selection using INSIGHT and FastAWARD for some datasets

Dataset               | FastAWARD | INS-fG(x)
50words               | 94,464    | 203
Adiac                 | 32,935    | 75
Beef                  | 1,273     | 3
Car                   | 11,420    | 18
CBF                   | 37,370    | 67
ChlorineConcentration | 16,920    | 1,974
CinC                  | 3,604,930 | 16,196
Coffee                | 499       | 1
DiatomSizeReduction   | 18,236    | 44
ECG200                | 634       | 2
ECGFiveDays           | 20,455    | 60
FaceFour              | 4,029     | 6
FacesUCR              | 150,764   | 403
FISH                  | 59,305    | 93
GunPoint              | 1,107     | 4
Haptics               | 152,617   | 869
InlineSkate           | 906,472   | 4,574
ItalyPowerDemand      | 1,855     | 6
Lighting2             | 15,593    | 23
Lighting7             | 5,511     | 8
Mallat                | 4,562,881 | 19,041
MedicalImages         | 13,495    | 55
Motes                 | 17,937    | 55
OliveOil              | 3,233     | 5
OSULeaf               | 80,316    | 118
Plane                 | 1,527     | 4
SonyAIBORobotS.       | 4,608     | 11
SonyAIBORobotS.II     | 10,349    | 23
SwedishLeaf           | 37,323    | 89
Symbols               | 165,875   | 514
SyntheticControl      | 3,017     | 8
Trace                 | 3,606     | 11
TwoPatterns           | 360,719   | 1,693
TwoLeadECG            | 12,946    | 45
Wafer                 | 923,915   | 4,485
WordsSynonyms         | 101,643   | 203
Yoga                  | 1,774,772 | 6,114
Results on Efficiency — The computational complexity of INSIGHT depends on the calculation of the scores of the instances of the training set and on the selection of the top-ranked instances. Thus, for the examined score functions, the computational complexity is O(n²), n being the number of training instances, since it is determined by the calculation of the distance between each pair of training instances. For FastAWARD, its
first step (leave-one-out nearest neighbor classification of the training instances) already requires O(n²) execution time. However, FastAWARD performs additional computationally expensive steps, such as determining the best warping-window size and the iterative procedure for excluding instances. For this reason, INSIGHT is expected to require reduced execution time compared to FastAWARD. This is verified by the results presented in Table 3, which show the execution time needed to perform instance selection with INSIGHT and FastAWARD. As expected, INSIGHT outperforms FastAWARD drastically. (Regarding the time for classifying new instances, please notice that both methods perform 1-NN using the same number of selected instances; therefore the classification times are equal.)
6 Conclusion and Outlook
We examined the problem of instance selection for speeding up time-series classification. We introduced a principled framework for instance selection based on coverage graphs and hubness. We proposed INSIGHT, a novel instance selection method for time series. In our experiments we showed that INSIGHT outperforms FastAWARD, a state-of-the-art instance selection algorithm for time series. In our future work, we aim at examining the generalization of coverage graphs that considers weights on edges. We also plan to extend our approach to other instance-based learning methods besides the 1-NN classifier.

Acknowledgements. Research partially supported by the Hungarian National Research Fund (Grant Number OTKA 100238).
References
1. Aha, D.W., Kibler, D., Albert, M.K.: Instance-based learning algorithms. Machine Learning 6(1), 37–66 (1991)
2. Brighton, H., Mellish, C.: Advances in Instance Selection for Instance-Based Learning Algorithms. Data Mining and Knowledge Discovery 6, 153–172 (2002)
3. Buza, K., Nanopoulos, A., Schmidt-Thieme, L.: Time-Series Classification based on Individualised Error Prediction. In: IEEE CSE 2010 (2010)
4. Chakrabarti, K., Keogh, E., Sharad, M., Pazzani, M.: Locally adaptive dimensionality reduction for indexing large time series databases. ACM Transactions on Database Systems 27, 188–228 (2002)
5. Cormen, T.H., Leiserson, C.E., Rivest, R.L., Stein, C.: Introduction to Algorithms. MIT Press, Cambridge (2001)
6. Ding, H., Trajcevski, G., Scheuermann, P., Wang, X., Keogh, E.: Querying and Mining of Time Series Data: Experimental Comparison of Representations and Distance Measures. In: VLDB 2008 (2008)
7. Gunopulos, D., Das, G.: Time series similarity measures and time series indexing. ACM SIGMOD Record 30, 624 (2001)
8. Jankowski, N., Grochowski, M.: Comparison of instances seletion algorithms I. Algorithms survey. In: Rutkowski, L., Siekmann, J.H., Tadeusiewicz, R., Zadeh, L.A. (eds.) ICAISC 2004. LNCS (LNAI), vol. 3070, pp. 598–603. Springer, Heidelberg (2004)
9. Jankowski, N., Grochowski, M.: Comparison of instance selection algorithms II. Results and Comments. In: Rutkowski, L., Siekmann, J.H., Tadeusiewicz, R., Zadeh, L.A. (eds.) ICAISC 2004. LNCS (LNAI), vol. 3070, pp. 580–585. Springer, Heidelberg (2004)
10. Keogh, E.: Exact indexing of dynamic time warping. In: VLDB 2002 (2002)
11. Keogh, E., Kasetty, S.: On the Need for Time Series Data Mining Benchmarks: A Survey and Empirical Demonstration. In: SIGKDD (2002)
12. Ougiaroglou, S., Nanopoulos, A., Papadopoulos, A.N., Manolopoulos, Y., Welzer-Druzovec, T.: Adaptive k-Nearest-Neighbor Classification Using a Dynamic Number of Nearest Neighbors. In: Ioannidis, Y., Novikov, B., Rachev, B. (eds.) ADBIS 2007. LNCS, vol. 4690, pp. 66–82. Springer, Heidelberg (2007)
13. Lin, J., Keogh, E., Lonardi, S., Chiu, B.: A Symbolic Representation of Time Series, with Implications for Streaming Algorithms. In: Proceedings of the 8th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery (2003)
14. Liu, H., Motoda, H.: On Issues of Instance Selection. Data Mining and Knowledge Discovery 6, 115–130 (2002)
15. Radovanovic, M., Nanopoulos, A., Ivanovic, M.: Nearest Neighbors in High-Dimensional Data: The Emergence and Influence of Hubs. In: ICML 2009 (2009)
16. Radovanovic, M., Nanopoulos, A., Ivanovic, M.: Time-Series Classification in Many Intrinsic Dimensions. In: 10th SIAM International Conference on Data Mining (2010)
17. Ratanamahatana, C.A., Keogh, E.: Three myths about Dynamic Time Warping. In: SDM (2005)
18. Sakoe, H., Chiba, S.: Dynamic programming algorithm optimization for spoken word recognition. IEEE Trans. Acoustics, Speech and Signal Proc. 26, 43–49 (1978)
19. Wettschereck, D., Dietterich, T.: Locally Adaptive Nearest Neighbor Algorithms. Advances in Neural Information Processing Systems 6 (1994)
20. Xi, X., Keogh, E., Shelton, C., Wei, L., Ratanamahatana, C.A.: Fast Time Series Classification Using Numerosity Reduction. In: Airoldi, E.M., Blei, D.M., Fienberg, S.E., Goldenberg, A., Xing, E.P., Zheng, A.X. (eds.) ICML 2006. LNCS, vol. 4503. Springer, Heidelberg (2007)
Appendix: Proof of Theorem 1
We show the equivalence in two steps. First we show that ISP is a subproblem of SCP, i.e., for each instance of ISP a corresponding instance of SCP can be constructed (and the solution of the SCP-instance directly gives the solution of the ISP-instance). In the second step we show that SCP is a subproblem of ISP. Both together imply equivalence.
For each ISP-instance we construct a corresponding SCP-instance: X := V_{Gc} \ V^0_{Gc} and F := {C(v) | v ∈ V_{Gc}}. We say that vertex v is the seed of the set C(v). The solution of SCP is a set F' ⊆ F; the set of seeds of the subsets in F' constitutes the solution of ISP: S = {v | C(v) ∈ F'}.
While constructing an ISP-instance for an SCP-instance, we have to be careful, because the number of subsets in SCP is not limited. Therefore, in the coverage graph Gc there are two types of vertices. Each first-type vertex v_x corresponds to one element x ∈ X, and each second-type vertex v_F corresponds to a subset F ∈ F. Edges go only from first-type vertices to second-type vertices, thus only first-type vertices are coverable. There is an edge (v_x, v_F) from a first-type vertex v_x to a second-type vertex v_F if and only if the corresponding element of X is included in the corresponding subset F, i.e., x ∈ F. When the ISP is solved, all coverable vertices (first-type vertices) are covered by a minimal set of vertices S. In this case, S obviously consists only of second-type vertices. The solution of the SCP consists of the subsets corresponding to the vertices included in S: C = {F | F ∈ F ∧ v_F ∈ S}.
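To make the first direction of the reduction concrete, the following Python sketch builds a Set-Cover instance from a coverage graph and maps a cover back to its seed vertices. The graph representation (a dictionary from each vertex v to the set C(v) of vertices it covers), the greedy solver used for illustration, and all function names are assumptions made for this sketch, not part of the original paper.

```python
def isp_to_scp(coverage, coverable):
    """Build a Set-Cover instance from a coverage graph.

    coverage  : dict mapping each vertex v to C(v), the set of vertices covered by v.
    coverable : set of vertices that must be covered.
    Returns the universe X and the family F, keyed by the seed vertex of each subset.
    """
    X = set(coverable)
    F = {v: set(C) & X for v, C in coverage.items()}
    return X, F

# Illustrative usage with a toy coverage graph.
coverage = {'a': {1, 2}, 'b': {2, 3}, 'c': {3, 4}, 'd': {4}}
X, F = isp_to_scp(coverage, coverable={1, 2, 3, 4})

# Any set-cover solver can be applied to (X, F); a greedy cover is used here.
S, covered = set(), set()
while covered != X:
    best = max(F, key=lambda v: len(F[v] - covered))  # subset adding most new elements
    if not F[best] - covered:
        break                                         # remaining elements are uncoverable
    S.add(best)
    covered |= F[best]

print(S)  # the seed vertices of the chosen subsets form the instance-selection solution
```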
Multiple Time-Series Prediction through Multiple Time-Series Relationships Profiling and Clustered Recurring Trends Harya Widiputra (corresponding author), Russel Pears, and Nikola Kasabov The Knowledge Engineering and Discovery Research Institute, Auckland University of Technology, New Zealand {harya.widiputra,rpears,nkasabov}@aut.ac.nz http://kedri.info
Abstract. Time-series prediction has been very well researched by both the Statistical and Data Mining communities. However, the multiple time-series problem of predicting simultaneous movement of a collection of time-sensitive variables which are related to each other has received much less attention. Strong relationships between variables suggest that predictions of the trajectories of the variables involved in these relationships can be improved by including the nature and strength of the relationships in a prediction model. The key challenge is to capture the dynamics of the relationships to reflect changes that take place continuously over time. In this research we propose a novel algorithm for extracting profiles of relationships through an evolving clustering method. We use a form of non-parametric regression analysis to generate predictions based on the profiles extracted and historical information from the past. Experimental results on real-world climatic data reveal that the proposed algorithm outperforms well-established methods of time-series prediction. Keywords: time-series inter-relationships, multiple time-series prediction, evolving clustering method, recurring trends.
1 Introduction
Previous studies have found that in multiple time-series data relating to real-world phenomena in the Biological and Economic domains, dynamic relationships between series exist and, governed by these relationships, the series move together through time. For instance, it is well known that the movement of a stock market index in a specific country is affected by the movements of other stock market indexes across the world or in that particular region [1],[2],[3]. Likewise, in a Gene Regulatory Network (GRN) the expression level of a Gene is determined by its time-varying interactions with other Genes [4],[5]. However, even though time-series prediction has been extensively researched, and some prominent methods from the machine learning and data mining arenas
such as the Multi-Layer Perceptron and Support Vector Machines have been developed, there has been no research so far into developing a method that can predict multiple time-series simultaneously based on interactions between the series. The closest research that takes multiple time-series variables into account is that of [6],[7],[8],[9], which generally uses historical values of some independent variables as inputs to a model that estimates future values of a dependent variable. Consequently, these methods do not have the capability to capture and model the dynamics of relationships in a multiple time-series dataset and to predict their future values simultaneously. This research proposes a new method for modeling the dynamics of relationships in multiple time-series and for simultaneously predicting their future values without the need to generate multiple models. The work thus focuses on the discovery of profiles of relationships in a multiple time-series dataset and of the recurring trends of movement that occur when a specific form of relationship holds, in order to construct a knowledge repository of the system under evaluation. The identification and exploitation of these profiles and recurring trends is expected to provide the knowledge needed to perform simultaneous multiple time-series prediction and is also expected to significantly improve the accuracy of time-series prediction. The rest of the paper is organized as follows: in the next section we briefly review the issues involved in time-series modeling and cover the use of both global and localized models. Section 3 describes and explains the method proposed in this paper to discover profiles of relationships in a multiple time-series dataset and their recurring trends of movement. In Section 4 we present our experimental findings and, finally, in Section 5 we conclude the paper, summarizing the main achievements of the research and briefly outlining some directions for future research.
2 Global and Local Model of Time-Series Prediction
In the last few decades, the use of a single global model to forecast future events based on known past events has been a very popular approach in the data mining and knowledge discovery area [10]. Global models are built using all available historical data and thus can be used to predict future trends. However, the trajectories that global models produce often fail to track localized changes that take place at discrete points in time. This is due to the fact that trajectories tend to smooth localized deviations by averaging the effects of such deviations over a long period of time. In reality localized disturbances may be of great significance as they capture the conditions under which a time-series deviates from the norm. Furthermore, it is of interest to capture similar deviations from a global trajectory that take place repeatedly over time, in other words to capture recurring deviations from the norm that are similar in shape and magnitude. Such localized phenomena can only be captured accurately by localized models that are built only on data that define the phenomenon under consideration and are not contaminated by data outside the underlying localized phenomenon. Local models can be built by grouping together data that has similar behavior. Different types of phenomena will define their own clusters. Models can then be
developed for each cluster (i.e. local regressions) that will yield better accuracy over the local problem space covered by the model in contrast to a global model. Having a set of local models also offers greater flexibility as predictions can be made either on the basis of a single model or, if needed, on a global level by combining the predictions made by the individual local models [11].
3 Local Modeling of Multiple Time-Series Data
In this research we propose a method to construct local models of multiple time-series that contain profiles of relationships between the series, utilizing a non-parametric regression model in combination with the Evolving Clustering Method (ECM) [12]. The construction of local models in the proposed methodology consists of two main steps: the extraction of profiles of relationships between time-series, and the detection and clustering of recurring trends of movement in the time-series when a particular profile emerges. The principal objective of the methodology is to construct a repository of profiles and recurring trends as the knowledge-base and key data resource to perform multiple time-series prediction. To attain this objective, a two-level local modeling process is used in the proposed methodology. The first level of local modeling, which deals with the extraction of profiles of relationships between series in a sub-space of the given multiple time-series data, is outlined in Section 3.1. The second level of local modeling, which is the procedure to detect and cluster recurring trends that take place in the time-series when a particular profile is emerging, is described in Section 3.2. The constructed local models are expected to capture the underlying behavior of the multiple time-series under examination, in terms of their tendency to move collectively in a similar fashion over time. Knowledge about such underlying behavior is expected to be helpful in predicting their upcoming movements as new data becomes available. This premise has been experimentally verified in the work presented in this paper.
3.1 Extracting Profiles of Relationship of Multiple Time-Series
Most of the work in clustering time-series data has concentrated on sample clustering rather than variable clustering [13]. However, one of the key tasks in our work is to group together series that are highly correlated and have similar shapes of movement (variable clustering), as we believe that these local models representing clusters of similar profiles will provide a better basis than a single global model for predicting future movement of the multiple time-series. The first step in extracting profiles of relationships between multiple time-series is the computation of cross-correlation coefficients between the observed time-series using Pearson's correlation analysis. Additionally, only those statistically significant correlation coefficients identified by the t-test with a confidence level of 95% are taken into account. The following step of the algorithm is to measure dissimilarity between time-series from the Pearson's correlation
Fig. 1. The Pearson’s correlation coefficient matrix is calculated from a given multiple time-series data (TS-1,TS-2,TS-3,TS-4), and then converted to normalized correlation [Equation 1] before the profiles are finally extracted
coefficient (line 1, Algorithm 1) by calculating the Rooted Normalized One-Minus Correlation [13] (referred to hereafter as the normalized correlation), given by

RNOMC(a, b) = \sqrt{\frac{1 - \mathrm{corr}(a, b)}{2}}   (1)

where a and b are the time-series being analyzed. The normalized correlation coefficient ranges from 0 to 1, in which 0 denotes high similarity and 1 signifies the opposite condition. Thereafter, the last stage of the algorithm is to extract profiles of relationships from the normalized correlation matrix. The methodology used in this step is outlined in lines 3 to 24 of Algorithm 1. The whole process of extracting profiles of relationships is illustrated in Figure 1. The fundamental concept of this algorithm is to group multiple time-series with a comparable fashion of movement whilst validating that all time-series belonging to the same cluster are correlated and hold a significant level of similarity. The underlying concept of Algorithm 1 is closely comparable to the CAST (Clustering Affinity Search Technique) clustering algorithm [14]. However, Algorithm 1 works by dynamically creating new clusters, and deleting and merging existing clusters, as it evaluates the coefficient of similarity between time-series or observed variables. Therefore, Algorithm 1 is considerably different from CAST, which creates a single cluster at a time and performs updates by adding new elements to the cluster from a pool of elements, or by removing elements from the cluster and returning them to the pool, as it evaluates the affinity factor of the cluster to which the elements belong. After the profiles have been extracted, the next step of the methodology is to mine and cluster the series' trends of movement from each profile. This process is outlined and explained in the next section. Additionally, as the time-complexity of Algorithm 1 is O(½(n² − n)), to avoid expensive re-computation
Algorithm 1. Extracting profiles of relationship of multiple time-series
Require: X, where X1, X2, ..., Xn are observed time-series
Ensure: profiles of relationships between multiple time-series
1: calculate the normalized correlation coefficient [Equation 1] of X
2: for each time-series X1, X2, ..., Xn do
3:   //pre-condition: Xi, Xj do not belong to any cluster
4:   if (Xi, Xj are correlated) AND (Xi, Xj do not belong to any cluster) then
5:     allocate Xi, Xj together in a new cluster
6:   end if
7:   //pre-condition: Xi belongs to a cluster; Xj does not belong to any cluster
8:   if (Xi, Xj are correlated) AND (Xi belongs to a cluster) then
9:     if (Xj is correlated with all Xi cluster members) then
10:      allocate Xj to cluster of Xi
11:    else if (Xi, Xj correlation > max(correlation) of Xi with its cluster members) AND (Xj is not correlated with any of Xi cluster members) then
12:      remove Xi from its cluster; allocate Xi, Xj together in a new cluster
13:    end if
14:  end if
15:  //pre-condition: Xi and Xj belong to different clusters
16:  if (Xi, Xj are correlated) AND (Xi, Xj belong to different clusters) then
17:    if (Xi is correlated with all Xj cluster members) AND (Xj is correlated with all Xi cluster members) then
18:      merge clusters of Xi, Xj together
19:    else if (Xi, Xj correlation > max(correlation) of Xj with its cluster members) AND (Xj is correlated with all Xi cluster members) then
20:      remove Xj from its cluster; allocate Xj to cluster of Xi
21:    else if (Xi, Xj correlation > max(correlation) of both Xi, Xj with their cluster members) AND (Xi is not correlated with one of Xj cluster members) AND (Xj is not correlated with any of Xi cluster members) then
22:      remove Xi, Xj from their clusters; allocate Xi, Xj together in a new cluster
23:    end if
24:  end if
25: end for
26: return clusters of multiple time-series
and extraction of profiles, extracted profiles of relationships are stored and updated dynamically instead of being computed on the fly.
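As a concrete illustration of this first step, the following Python sketch computes the pairwise normalized correlation of Equation (1) together with a significance mask for the correlations; the use of SciPy's pearsonr for the correlation test, the 95% level, and the function name are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np
from scipy import stats

def normalized_correlation(X, alpha=0.05):
    """Pairwise RNOMC (Equation 1) for a (T x n) array X holding n observed time-series.

    Returns the normalized-correlation matrix and a boolean mask marking the pairs
    whose Pearson correlation is statistically significant at the given level.
    """
    T, n = X.shape
    rnomc = np.zeros((n, n))
    significant = np.zeros((n, n), dtype=bool)
    for i in range(n):
        for j in range(i + 1, n):
            r, p = stats.pearsonr(X[:, i], X[:, j])  # correlation and its p-value
            rnomc[i, j] = rnomc[j, i] = np.sqrt((1.0 - r) / 2.0)
            significant[i, j] = significant[j, i] = p < alpha
    return rnomc, significant
```

The normalized-correlation matrix, restricted to the significant pairs, is the input from which Algorithm 1 forms the clusters of series that constitute the profiles.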
3.2 Clustering Recurring Trends of a Time-Series
To detect and cluster recurring trends of movement from localized sets of time-series data, an algorithm that extracts patterns of movement in the form of polynomial regression functions and groups them on the basis of similarity in the regression coefficients was proposed in a previous study [15]. In the algorithm proposed here, however, to eliminate the assumption that the data are drawn from a Gaussian distribution when estimating the regression function, non-parametric regression analysis is used in place of polynomial regression.
The process to cluster recurring trends of a time-series by using kernel regression as the non-parametric regression method is outlined in Algorithm 2 as follows,
– Step 1, perform the autocorrelation analysis on the time-series dataset from which trends of movement will be extracted and clustered. The lag (with lag > 0) that yields the highest autocorrelation coefficient is then taken as the size of the snapshot window n.
– Step 2, create the first cluster C1 by simply taking the trend of movement of the first snapshot X(1) = (X1(1), X2(1), ..., Xn(1)) from the input stream as the first cluster centre Cc1 and setting the cluster radius Ru1 to 0. In this methodology, the i-th trend of movement, represented by the kernel weight vector wi = (wi1, wi2, ..., win) as the outcome of the non-parametric regression analysis, is calculated using the Nadaraya-Watson kernel weighted average formula defined as follows (a code sketch of Steps 2-6 is given after this list),

\hat{X}^{(i)}_j = f(x^{(i)}_j, w_i) = \frac{\sum_{k=1}^{n} w_{ik}\, x_{jk}}{\sum_{k=1}^{n} x_{jk}}.   (2)

Here x^{(i)}_j = (x_{ij1}, ..., x_{ijk}) is the extended smaller value of the original data X(i) at domain j and a certain small step dx, where j = 1, 2, ..., (n/dx + 1). x_j = (x_{j1}, ..., x_{jk}) is calculated using the Gaussian MF equation as follows,

x_{jk} = K(x_j, k) = \exp\left(-\frac{(x_j - k)^2}{2\alpha^2}\right),   (3)

where x_j = dx × (j − 1), k = 1, 2, ..., n and α is a pre-defined kernel bandwidth. The kernel weight w_i is estimated using common OLS (ordinary least squares) such that the following objective function is minimized,

SSR = \sum_{k=1}^{n}\left(X^{(i)}_k - \hat{X}^{(i)}_j\right)^2, \quad \forall \hat{X}^{(i)}_j \text{ where } x_j = X_k.   (4)

To gain knowledge about the upcoming trend of movement when a particular trend emerges in a locality of time, the algorithm also models the next trajectories of a data snapshot, defined by

\hat{X}^{(i)(u)}_j = f(x^{(u)}_j, w^{(u)}_i) = \frac{\sum_{k=1}^{n+1} w^{(u)}_{ik}\, K(x^{(u)}_j, k)}{\sum_{k=1}^{n+1} K(x^{(u)}_j, k)},   (5)

where x^{(u)}_j = dx × (j^{(u)} − 1); j^{(u)} = 1, 2, ..., ((n+1)/dx + 1); k = 1, 2, ..., n+1 and the kernel weights are w^{(u)}_i = (w^{(u)}_{i1}, w^{(u)}_{i2}, ..., w^{(u)}_{i(n+1)}).
– Step 3, if there is no more data snapshot, the process stops (go to Step 7); else the next snapshot, X(i), is taken. Extract the trend of movement from X(i) as in Step 2, and calculate the distances between the current trend and all m already created cluster centres, defined as

D_{i,l} = 1 − CorrelationCoefficient(w_i, Cc_l),   (6)

where l = 1, 2, ..., m. If a cluster centre Cc_l is found such that D_{i,l} ≤ Ru_l, then the current trend joins cluster C_l and the step is repeated; else continue to the next step.
– Step 4, find a cluster C_a (with centre Cc_a and cluster radius Ru_a) from all m existing cluster centres by calculating the values of S_{i,a} given by

S_{i,a} = D_{i,a} + Ru_a = min(S_{i,l}),   (7)

where S_{i,l} = D_{i,l} + Ru_l and l = 1, 2, ..., m.
– Step 5, if S_{i,a} > 2 × Dthr, where Dthr is a clustering parameter that limits the maximum size of a cluster radius, then the current trend of X(i), w_i, does not belong to any existing cluster. A new cluster is then created in the same way as described in Step 2, and the algorithm returns to Step 3.
– Step 6, if S_{i,a} ≤ 2 × Dthr, the current trend of X(i), w_i, joins cluster C_a. Cluster C_a is updated by moving its centre, Cc_a, and increasing the value of its radius, Ru_a. The updated radius Ru_a^{new} is set to S_{i,a}/2 and the new centre Cc_a^{new} is the mean value of all trends of movement belonging to cluster C_a. The distance from the new centre Cc_a^{new} to the current trend w_i is equal to Ru_a^{new}. The algorithm then returns to Step 3.
– Step 7, end of the procedure.
In the procedure of clustering trends the following indexes are used,
– number of data snapshots: i = 1, 2, ...;
– number of clusters: l = 1, 2, ..., m;
– number of input and output variables: k = 1, 2, ..., n.
Clusters of trends of movement are then stored in each extracted profile of relationship. This information about the profiles and the trends of movement inside them will then be exploited as the knowledge repository to perform simultaneous multiple time-series prediction.
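The sketch below, referred to in Step 2, illustrates the kernel-regression trend extraction of Equations (2)-(3) and the ECM-style cluster assignment of Steps 3-6 in Python. The discretisation with step dx is simplified to the snapshot's own time indices, the default parameters follow the values reported in Section 4.2, and all function names are illustrative; this is a sketch under those assumptions, not the authors' implementation.

```python
import numpy as np

def extract_trend(snapshot, alpha=0.1):
    """Represent one snapshot by a kernel weight vector w (Equations 2-3).

    A Gaussian kernel is centred on each time index k = 1..n; w is fitted by
    ordinary least squares so that the kernel-weighted average of Equation (2)
    reproduces the snapshot values at their own time indices.
    """
    x = np.asarray(snapshot, dtype=float)
    n = len(x)
    t = np.arange(1, n + 1, dtype=float)
    K = np.exp(-(t[:, None] - t[None, :]) ** 2 / (2 * alpha ** 2))  # K[j, k] = K(t_j, k)
    A = K / K.sum(axis=1, keepdims=True)                            # rows of Equation (2)
    w, *_ = np.linalg.lstsq(A, x, rcond=None)
    return w

def trend_distance(w, centre):
    """Distance between a trend and a cluster centre (Equation 6)."""
    return 1.0 - np.corrcoef(w, centre)[0, 1]

def assign_to_cluster(w, centres, radii, dthr=0.5):
    """Steps 3-6: index of the cluster that w joins, or None if a new cluster is needed.

    centres : list of cluster-centre trend vectors; radii : numpy array of cluster radii.
    """
    if len(centres) == 0:
        return None
    d = np.array([trend_distance(w, c) for c in centres])
    inside = np.where(d <= radii)[0]
    if inside.size:                                # Step 3: within an existing radius
        return int(inside[np.argmin(d[inside])])
    s = d + radii                                  # Step 4
    a = int(np.argmin(s))
    return a if s[a] <= 2 * dthr else None         # Steps 5-6
```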
3.3 Knowledge Repository and Multiple Time-Series Prediction
To visualize how Algorithms 1 and 2 extract profiles of relationships and cluster recurring trends from multiple time-series to construct a knowledge repository, a simple pre-created synthetic dataset is presented to the algorithms, as shown in Figure 2. The most suitable snapshot window size is determined through the autocorrelation analysis explained in the previous section and indicated in the first step of Algorithm 2.
Fig. 2. Creation of knowledge repository (profiles of relationships and recurring trends)
Fig. 3. Multiple time-series prediction using profiles of relationships and recurring trends
Figure 2 illustrates how a repository of profiles of relationships and recurring trends is built from the first snapshot of the observed multiple time-series and how it is updated dynamically after the algorithm has processed the third snapshot. This knowledge repository of profiles and recurring trends acts as the knowledge-base and is the key data resource for performing multiple time-series prediction.
After the repository has been built, there are two further steps that need to be performed before prediction can take place. The first step is to extract the current profiles of relationships between the multiple series. Thereafter, matches are found between the current trajectory and previously stored profiles from the past. Predictions are then made by implementing a weighting scheme that gives more importance to pairs of series that belong to the same profile and retain comparable trends of movement. The weight w_ij for a given pair (i, j) of series is given by the similarity distance between them.
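The paper does not spell out the weighting scheme in full detail; the following Python sketch shows one plausible reading, in which each series' forecast is a similarity-weighted average of the next-step values associated with the recurring trends matched in its current profile. The weighting rule, the fall-back behaviour and all names are assumptions made purely for illustration.

```python
import numpy as np

def predict_next(current_trend, matched_trends, next_values):
    """Similarity-weighted forecast for one series (illustrative reading only).

    current_trend  : kernel-weight vector of the series' current snapshot.
    matched_trends : trend vectors of the recurring trends found in the matching profile.
    next_values    : the next-step value associated with each matched trend (cf. Eq. 5).
    """
    sims = np.array([np.corrcoef(current_trend, t)[0, 1] for t in matched_trends])
    sims = np.clip(sims, 0.0, None)              # ignore negatively correlated trends
    if sims.sum() == 0.0:
        return float(np.mean(next_values))       # fall back to an unweighted average
    return float(np.dot(sims, next_values) / sims.sum())
```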
4 Experiments and Evaluation of Results
4.1 New Zealand Climate Data
Air pressure data collected from various locations in New Zealand by the National Institute of Weather and Atmosphere (NIWA, http://www.niwa.co.nz) constitutes the multiple time-series in this research. The data covers a period of three years, ranging from 1st January 2007 to 31st December 2009. Findings from a previous study of the global weather system [16], which argues that a small change in one part of the system can lead to a complete change in the weather system as a whole, are the key reason for using this dataset. Consequently, being able to reveal profiles of relationship between air pressure at different locations at various time-points would help us to understand more about the behavior of our weather system and would also facilitate constructing a reliable prediction model.
4.2 Prediction Accuracy Evaluation
To examine the performance of the proposed algorithm in predicting changes in weather patterns across the multiple stations, a prediction of future air pressure levels at four locations from 1st October 2009 to 31st December 2009 was performed. Additionally, to confirm that forecasting the movement of multiple time-series simultaneously by exploiting relationships between the series provides better accuracy, prediction outcomes are also compared to results from commonly-used single time-series prediction methods, namely multiple linear regression and the Multi-Layer Perceptron. Throughout the conducted experiments, the window size used to perform bootstrap sampling in extracting profiles of relationships and recurring trends from the training dataset is set to 5, as suggested by the results of the autocorrelation analysis performed on the New Zealand air pressure dataset. Furthermore, the parameters of the recurring-trend clustering process (Algorithm 2) are set to dx = 0.1, α = 0.1, and Dthr = 0.5. Generally, the prediction error rates (RMSE, Root Mean Square Error) and estimated trajectories, as outlined in Table 1 and Figure 4, reveal that the proposed algorithm demonstrates its superiority in predicting the movement of dynamic and oscillating time-series data with a consistently high degree of accuracy in relation to multiple linear regression and the Multi-Layer Perceptron. These
Table 1. RMSE of the proposed algorithm compared to the commonly-used single time-series prediction method, multiple linear regression (MLR) and the Multi-Layer Perceptron (MLP)
Location   Proposed Algorithm   MLR      MLP
Paeroa     1.3890               3.4257   3.1592
Auckland   1.3219               3.5236   3.0371
Hamilton   1.4513               3.7263   3.4958
Reefton    1.8351               4.1725   3.9125
Fig. 4. Results of 100 days (1st October 2009 to 31st December 2009) air pressure level prediction at four observation locations (Paeroa, Auckland, Hamilton and Reefton) in New Zealand
results confirm proposals in previous studies which state that, by being able to reveal and understand the characteristics of relationships between variables in multiple time-series data, one can predict their future states or behavior accurately [2],[3],[4]. In addition, as expected, the proposed algorithm is not only able to provide excellent accuracy in predicting future values, but is also capable of extracting knowledge about profiles of relationship between different locations in New Zealand (in terms of movement of air pressure level) and of clustering recurring trends which exist in the series, as illustrated in Figures 5 and 6. Consequently, our study is also able to reveal that the air pressure levels at the four locations are highly correlated and tend to move in a similar fashion through time. This is shown by the circle in the lower left corner of Figure 5, where the number of occurrences of this profile is 601.
Fig. 5. Extracted profiles of relationship from air pressure data. The radius represents average normalized correlation coefficient, while N indicates number of occurrences of a distinct profile.
Fig. 6. Created clusters of recurring trends when Paeroa, Auckland, Hamilton and Reefton are detected to be progressing in a highly correlated similar manner
5 Conclusion and Future Work
The outcomes of the conducted experiments demonstrate that predicting the movement of multiple time-series on the same real-world phenomenon by using profiles of relationships between the series improves prediction accuracy. Additionally, the algorithm proposed in this study demonstrates the ability to: (1) extract profiles of relationships and recurring trends from multiple time-series data, i.e., the air pressure dataset from four observation stations in New Zealand; (2) predict multiple time-series simultaneously with excellent precision; and (3) evolve, by continuing to extract profiles of relationships and recurring trends over time. As future work, we plan to explore the use of
correlation analysis methods which are capable of detecting non-linear correlations between observed variables (e.g., correlation ratio, copula) in place of the Pearson's correlation coefficient to extract profiles of relationships of multiple time-series.
References 1. Collins, D., Biekpe, N.: Contagion and Interdependence in African Stock Markets. The South African Journal of Economics 71(1), 181–194 (2003) 2. Masih, A., Masih, R.: Dynamic Modeling of Stock Market Interdependencies: An Empirical Investigation of Australia and the Asian NICs. Working Papers 98-18, pp. 1323–9244. University of Western Australia (1998) 3. Antoniou, A., Pescetto, G., Violaris, A.: Modelling International Price Relationships and Interdependencies between the Stock Index and Stock Index Future Markets of Three EU Countries: A Multivariate Analysis. Journal of Business, Finance and Accounting 30, 645–667 (2003) 4. Kasabov, N., Chan, Z., Jain, V., Sidorov, I., Dimitrov, D.: Gene Regulatory Network Discovery from Time-series Gene Expression Data: A Computational Intelligence Approach. In: Pal, N.R., Kasabov, N., Mudi, R.K., Pal, S., Parui, S.K. (eds.) ICONIP 2004. LNCS, vol. 3316, pp. 1344–1353. Springer, Heidelberg (2004) 5. Friedman, L., Nachman, P.: Using Bayesian Networks to Analyze Expression Data. Journal of Computational Biology 7, 601–620 (2000) 6. Liu, B., Liu, J.: Multivariate Time Series Prediction via Temporal Classification. In: Proc. IEEE ICDE 2002, pp. 268–275. IEEE, Los Alamitos (2002) 7. Kim, T., Adali, T.: Approximation by Fully Complex Multilayer Perceptrons. Neural Computation 15, 1641–1666 (2003) 8. Yang, H., Chan, L., King, I.: Support Vector Machine Regression for Volatile Stock Market Prediction. In: Yellin, D.M. (ed.) Attribute Grammar Inversion and Sourceto-source Translation. LNCS, vol. 302, pp. 143–152. Springer, Heidelberg (1988) 9. Zanghui, Z., Yau, H., Fu, A.M.N.: A new stock price prediction method based on pattern classification. In: Proc. IJCNN 1999, pp. 3866–3870. IEEE, Los Alamitos (1999) 10. Holland, J.H., Holyoak, K.J., Nisbett, R.E., Thagard, P.R.: Induction: Processes of Inference, Learning and Discovery, Cambridge, MA, USA (1989) 11. Kasabov, N.: Global, Local and Personalised Modelling and Pattern Discovery in Bioinformatics: An Integrated Approach. Pattern Recognition Letters 28, 673–685 (2007) 12. Song, Q., Kasabov, N.: ECM - A Novel On-line Evolving Clustering Method and Its Applications. In: Posner, M.I. (ed.) Foundations of Cognitive Science, pp. 631–682. MIT Press, Cambridge (2001) 13. Rodrigues, P., Gama, J., Pedroso, P.: Hierarchical Clustering of Time-Series Data Streams. IEEE TKDE 20(5), 615–627 (2008) 14. Ben-Dor, A., Shamir, R., Yakhini, Z.: Clustering Gene Expression Patterns. Journal of Computational Biology 6(3/4), 281–297 (1999) 15. Widiputra, H., Kho, H., Lukas, Pears, R., Kasabov, N.: A Novel Evolving Clustering Algorithm with Polynomial Regression for Chaotic Time-Series Prediction. In: Leung, C.S., Lee, M., Chan, J.H. (eds.) ICONIP 2009. LNCS, vol. 5864, pp. 114–121. Springer, Heidelberg (2009) 16. Vitousek, P.M.: Beyond Global Warming: Ecology and Global Change. Ecology 75(7), 1861–1876 (1994)
Probabilistic Feature Extraction from Multivariate Time Series Using Spatio-Temporal Constraints Michal Lewandowski, Dimitrios Makris, and Jean-Christophe Nebel Digital Imaging Research Centre, Kingston University, London, United Kingdom {m.lewandowski,d.makris,j.nebel}@kingston.ac.uk
Abstract. A novel nonlinear probabilistic feature extraction method, called Spatio-Temporal Gaussian Process Latent Variable Model, is introduced to discover generalised and continuous low dimensional representation of multivariate time series data in the presence of stylistic variations. This is achieved by incorporating a new spatio-temporal constraining prior over latent spaces within the likelihood optimisation of Gaussian Process Latent Variable Models (GPLVM). As a result, the core pattern of multivariate time series is extracted, whereas a style variability is marginalised. We validate the method by qualitative comparison of different GPLVM variants with their proposed spatio-temporal versions. In addition we provide quantitative results on a classification application, i.e. view-invariant action recognition, where imposing spatiotemporal constraints is essential. Performance analysis reveals that our spatio-temporal framework outperforms the state of the art.
1 Introduction
A multivariate time series (MTS) is a sequential collection of high dimensional observations generated by a dynamical system. The high dimensionality of MTS creates challenges for machine learning and data mining algorithms. To tackle this, feature extraction techniques are required to obtain computationally efficient and compact representations. Gaussian Process Latent Variable Model [5] (GPLVM) is one of the most powerful nonlinear feature extraction algorithms. GPLVM emerged in 2004 and instantly made a breakthrough in dimensionality reduction research. The novelty of this approach is that, in addition to the optimisation of low dimensional coordinates during the dimensionality reduction process as other methods did, it marginalises parameters of a smooth and nonlinear mapping function from low to high dimensional space. As a consequence, GPLVM defines a continuous low dimensional representation of high dimensional data, which is called latent space. Since GPLVM is a very flexible approach, it has been successfully applied in a range of application domains including pose recovery [2], human tracking [20], computer animation [21], data visualization [5] and classification [19]. However, extensive study of the GPLVM framework has revealed some essential limitations of the basic algorithm [5, 6, 8, 12, 19, 21, 22]. First, since GPLVM
aims at retaining the global structure of the data in the latent space, there is no guarantee that local features are preserved. As a result, the natural topology of the data manifold may not be maintained. This is particularly problematic when data, such as MTS, have a strong and meaningful intrinsic structure. In addition, when data are captured from different sources, even after normalisation, GPLVM tends to produce latent spaces which fail to represent common local features [8, 21]. This prevents successful utilisation of the GPLVM framework for feature extraction of MTS. In particular, GPLVM cannot be applied in many classification applications such as speech and action recognition, where latent spaces should be inferred from time series generated by different subjects and used to classify data produced by unknown individuals. Another drawback of GPLVM is its computationally expensive learning process [5, 6, 21], which may converge towards a local minimum if the initialization of the model is poor [21]. Although recent extensions of GPLVM, i.e. back constrained GPLVM [12] (BC-GPLVM) and the Gaussian Process Dynamical Model [22] (GPDM), allow a satisfactory representation of time series, the creation of generalised latent spaces from data issued from several sources is still an unsolved problem which has never been addressed by the research community. In this paper, we define 'style' as the data variations between two or more datasets representing a similar phenomenon. They can be produced by different sources and/or repetitions from a single source. Here, we propose an extension of the GPLVM framework, i.e. Spatio-Temporal GPLVM (ST-GPLVM), which produces a generalised and probabilistic representation of MTS in the presence of stylistic variations. Our main contribution is the integration of a spatio-temporal 'constraining' prior distribution over the latent space within the optimisation process. After a brief review of the state of the art, we introduce the proposed methodology. Then, we validate our method qualitatively on a real dataset of human behavioral time series. Afterwards, we apply our method to a challenging view independent action recognition task. Finally, conclusions are presented.
2 Related Work
Feature extraction methods can be divided into two general categories, i.e., deterministic and probabilistic frameworks. The deterministic methods can be further classified into two main classes: linear and non linear methods. Linear methods like PCA cannot model the curvature and nonlinear structures embedded in observed spaces. As a consequence, nonlinear methods, such as Isomap [17], locally linear embedding [13] (LLE), Laplacian Eigenmaps [1] (LE) and kernel PCA [14], were proposed to address this issue. Isomap, LLE and LE aim at preserving a specific geometrical property of the underlying manifold by constructing graphs which encapsulate nonlinear relationships between points. However, they do not provide any mapping function between spaces. In contrast, kernel PCA obtains embedded space through nonlinear kernel based mapping from a high to low dimensional space. In order to deal with MTS, extensions of Isomap [3], LE [8] and the kernel based approach [15] were proposed.
Since previously described methods do not model uncertainty, another class of feature extraction methods evolved, the so-called latent variable models (LVM). They define a probability distribution over the observed space that is conditioned on a latent variable and mapping parameters. Consequently, it produces a probabilistic generative mapping from the latent space to the data space. In order to address intrinsic limitations of probabilistic linear methods, such as probabilistic principal component analysis [18] (PPCA), Lawrence [5] reformulated the PPCA model to the nonlinear GPLVM by establishing nonlinear mappings from the latent variable space to the observed space. From a Bayesian perspective, the Gaussian process prior is placed over these mappings rather than the latent variables with a nonlinear covariance function. As a result, GPLVM produces a complete joint probability distribution over latent and observed variables. Recently, many researchers have exploited GPLVM in a variety of applications [2, 5, 19, 20, 21], thus designing a number of GPLVM-based extensions which address some of the limitations of standard GPLVM. Lawrence [5, 6] proposed to use sparse approximation of the full Gaussian process which allows decreasing the complexity of the learning process. Preservation of observed space topology was supported by imposing high dimensional constraints on the latent space [12, 21]. BC-GPLVM [12] enforces local distance preservation through the form of a kernel-based regression mapping from the observed space to the latent space. Locally linear GPLVM [21] (LL-GPLVM) extends this concept by defining explicitly a cylindrical topology to maintain. This is achieved by constructing advanced similarity measures for the back constrained mapping function and incorporation of the LLE objective function [13] into the GPLVM framework to reflect a domain specific prior knowledge about observed data. Another line of work, i.e. GPDM [22], augments GPLVM with a nonlinear dynamical mapping on the latent space based on the auto-regressive model to take advantage of temporal information provided with time series data. Current GPLVM based approaches have proven very effective when modelling of MTS variability is desired in the latent space. However, these methods are inappropriate in a context of recognition based applications where the discovery of a common content pattern is more valuable than modelling stylistic variations. In this work, we tackle this fundamental problem by introducing the idea of a spatio-temporal interpretation of GPLVM. This concept is formulated by incorporating a constraining prior distribution over the latent space in the GPLVM framework. In contrast to LL-GPLVM and BC-GPLVM, we aim at implicitly preserving a spatio-temporal structure of the observed time series data in order to discard style variability and discover a unique low dimensional representation for a given set of MTS. The proposed extension is easily adaptable to any variant of GPLVM, for instance BC-GPLVM or GPDM.
3 Methodology
Let a set of multivariate time series Y consist of multiple repetitions (or cycles) of the same phenomenon from the same or different sources, and let all data points {yi} (i=1..N) in this set be distributed on a manifold in a high dimensional space
(yi ∈ R^D). ST-GPLVM is able to discover a low dimensional representation X = {xi} (i=1..N) (xi ∈ R^d with d ≪ D) by giving a Gaussian process prior to the mapping functions from the latent variable space X to the observed space Y, under a constraint L that preserves the spatio-temporal patterns of the underlying manifold. The entire process is summarized in figure 1. Initially the spatio-temporal constraints L are constructed. These constraints are exploited in two ways. First, they are used to better initialise the model by discovering a low dimensional embedded space which is close to the expected representation. Secondly, they constrain the GPLVM optimisation process so that it converges faster and maintains the spatio-temporal topology of the data. The learning process is performed using the standard two-stage maximum a posteriori (MAP) estimation used in GPLVM. The latent positions X and the hyperparameters Φ are optimised iteratively until the optimal solution is reached under the introduced constraining prior p(X|L). The key novelty of the proposed methodology is its style generalisation potential. ST-GPLVM discovers a coherent and continuous low dimensional representation by identifying common spatio-temporal patterns, which results in discarding style variability among all conceptually similar time series.
3.1 Gaussian Process Latent Variable Model
GPLVM [5] was derived from the observation that a particular probabilistic interpretation of PCA is a product of Gaussian Process (GP) models, where each of them is associated with a linear covariance function (i.e. kernel function). Consequently, the design of a non-linear probabilistic model could be achieved by replacing the linear kernel function with a non-linear covariance function. From a Bayesian perspective, by marginalising over the mapping function [5], the complete joint likelihood of all observed data dimensions given the latent positions is:

p(Y|X, \Phi) = \frac{1}{(2\pi)^{DN/2} |K|^{D/2}} \exp\left(-0.5\, \mathrm{tr}(K^{-1} Y Y^T)\right)   (1)

where Φ denotes the kernel hyperparameters and K is the kernel matrix of the GP model which is assembled with a combination of covariance functions: K = {k(x_i, x_j)} (i,j = 1..N). The covariance function is usually expressed by the nonlinear radial basis function (RBF):

k(x_i, x_j) = \alpha \exp\left(-\frac{\gamma}{2} \|x_i - x_j\|^2\right) + \beta^{-1} \delta_{x_i x_j}   (2)

where the kernel hyperparameters Φ = {α, β, γ} respectively determine the output variance, the variance of the additive noise and the RBF width. δ_{x_i x_j} is the Kronecker delta function. Learning is performed using two stage MAP estimation. First, latent variables are initialized, usually using PPCA. Secondly, latent positions and the hyperparameters are optimised iteratively until the optimal solution is reached. This can be achieved by maximising the likelihood (1) with respect to the latent positions, X, and the hyperparameters, Φ, using the following posterior:
p(X, Φ|Y) ∝ p(Y|X, Φ) p(X) p(Φ)   (3)
where the priors of the unknowns are: p(X) = N(0, I), p(Φ) ∝ \prod_i \Phi_i^{-1}. The maximisation of the above posterior is equivalent to minimising the negative log posterior of the model:

-\ln p(X, \Phi|Y) = 0.5\left((DN + 1)\ln 2\pi + D \ln|K| + \mathrm{tr}(K^{-1} Y Y^T) + \sum_i \|x_i\|^2\right) + \sum_i \Phi_i   (4)

This optimization process can be achieved numerically using the scaled conjugate gradient [11] (SCG) method with respect to Φ and X. However, the learning process is computationally very expensive, since O(N³) operations are required in each gradient step to invert the kernel matrix K [5]. Therefore, in practice, a sparse approximation to the full Gaussian process, such as the 'fully independent training conditional' (FITC) approximation [6] or active set selection [5], is exploited to reduce the computational complexity to a more manageable O(k²N), where k is the number of points involved in the lower rank sparse approximation of the covariance [6].
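Before moving to the spatio-temporal extension, the following numpy sketch spells out the RBF covariance of Equation (2) and the data-fit part of the objective in Equations (1) and (4); the hyperparameter prior term is omitted, SCG optimisation is not shown, and all names are illustrative rather than taken from the paper.

```python
import numpy as np

def rbf_kernel(X, alpha, beta, gamma):
    """Kernel matrix of Equation (2): RBF term plus white-noise term beta^{-1} I."""
    sq = np.sum(X ** 2, axis=1, keepdims=True)
    d2 = sq + sq.T - 2.0 * X @ X.T                       # pairwise squared distances
    return alpha * np.exp(-0.5 * gamma * d2) + np.eye(len(X)) / beta

def gplvm_objective(X, Y, alpha, beta, gamma):
    """Negative log likelihood of Equation (1) plus the simple prior on X from Equation (4)."""
    N, D = Y.shape
    K = rbf_kernel(X, alpha, beta, gamma)
    _, logdetK = np.linalg.slogdet(K)
    data_fit = np.trace(Y.T @ np.linalg.solve(K, Y))     # tr(K^{-1} Y Y^T)
    nll = 0.5 * (D * N * np.log(2 * np.pi) + D * logdetK + data_fit)
    return nll + 0.5 * np.sum(X ** 2)                    # prior p(X) = N(0, I)
```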
3.2 Spatio-Temporal Gaussian Process Latent Variable Model
The proposed ST-GPLVM relies on the novel concept of a spatio-temporal constraining prior which is introduced into the standard GPLVM framework in order to maintain temporal coherence and marginalise style variability. This is achieved by designing an objective function, where the prior p(X) in Eq. 3 is replaced by the proposed conditioned prior p(X|L):

p(X, Φ|Y, L) ∝ p(Y|X, Φ) p(X|L) p(Φ)   (5)
where L denotes the spatio-temporal constraints imposed on the latent space. Although p(X|L) is not a proper prior, conceptually it can be seen as an equivalent of a prior for a given set of weights L [21]. These constraints are derived from graph theory, since neighbourhood graphs have been powerful in designing nonlinear geometrical constraints for dimensionality reduction using spectral based approaches [1,13,17]. In particular, the Laplacian graph allows preserving approximated distances between all data points in the low dimensional space [1]. This formulation is extensively exploited in our approach by constructing a cost matrix, L, which emphasizes spatio-temporal dependencies between similar time series. This is achieved by designing two types of neighbourhood for each high dimensional data point Pi (figure 2):
– Temporal neighbours (T): the 2m closest points in the sequential order of input (figure 2a): Ti ∈ {Pi−m, ..., Pi−1, Pi, Pi+1, ..., Pi+m}
– Spatial neighbours (S): let's associate to each point, Pi, 2s temporal neighbours which define a time series fragment Fi. The spatial neighbours, Si, of Pi are the centres of the qi time series fragments, Fik, which are similar to Fi (figure 2b): Si ∈ {Fi,1(C), ..., Fi,qi(C)}. Here Fi,k(C) returns the centre point of Fi,k. The neighbourhood Si is determined automatically using either dynamic time warping [8] or motion pattern detection [7].
Fig. 1. Spatio-Temporal Gaussian Process Latent Variable Model pipeline
Fig. 2. Temporal (a) and spatial (b) neighbours (green dots) of a given data point, Pi , (red dots)
Neighbourhood connections defined in the Laplacian graphs implicitly impose closeness of points in the latent space. Consequently, the temporal neighbours allow modelling the temporal continuity of MTS, whereas spatial neighbours remove style variability by aligning MTS in the latent space. The constraint matrix, L, is obtained, first, by assigning weights, W, to the edges of each graph, G ∈ {T, S}, using the standard LE heat kernel [1]:

W^G_{ij} = \begin{cases} \exp(-\|y_i - y_j\|^2) & \text{if } i, j \text{ are connected} \\ 0 & \text{otherwise} \end{cases}   (6)
Then, the information from both graphs is combined as L = L_T + L_S, where L_G = D_G − W_G is the Laplacian matrix. D_G = diag{D^G_{11}, D^G_{22}, ..., D^G_{NN}} denotes a diagonal matrix with entries D^G_{ii} = \sum_{j=1}^{N} W^G_{ij}. The prior probability of the latent variables, which forces each latent point to preserve the spatio-temporal topology of the observed data, is expressed by:

p(X|L) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{\mathrm{tr}(X L X^T)}{2\sigma^2}\right)   (7)
where σ represents a global scaling of the prior and controls the 'strength' of the constraining prior. Note that although distance between neighbours (especially spatial ones) may be large in L, it is infinite between unconnected points. The maximisation of the new objective function (5) is equivalent to minimising the negative log posterior of the model:

-\ln p(X, \Phi|Y, L) = 0.5\left(D \ln|K| + \mathrm{tr}(K^{-1} Y Y^T) + \sigma^{-2}\, \mathrm{tr}(X L X^T) + C\right) + \sum_i \Phi_i   (8)

where C is a constant: (DN + 1) ln 2π + ln σ². Following the standard GPLVM approach, the learning process involves minimising Eq. 8 with respect to Φ and X iteratively using SCG method [11] until convergence. ST-GPLVM is initialised using a nonlinear feature extraction method, i.e. temporal LE [8] which is able to preserve the constraints L in the produced embedded space. Consequently, compared to the standard usage of linear PPCA, initialisation is more likely to be closer to the global optimum. In addition, the enhancement of the objective function (3) with the prior (7) constrains the optimisation process and therefore further mitigates the problem of local minima. The topological structure in terms of spatio-temporal dependencies is implicitly preserved in the latent space without enforcing any domain specific prior knowledge. The proposed methodology can be applied to other GPLVM based approaches, such as BC-GPLVM [12] and GPDM [22] by integrating the prior (7) in their cost function. The extension of BC-GPLVM results in a spatio-temporal model which provides bidirectional mapping between latent and high dimensional spaces. Alternatively, ST-GPDM produces a spatio-temporal model with an associated nonlinear dynamical process in the latent space. Finally, the proposed extension is compatible with a sparse approximation of the full Gaussian process [5, 6] which allows reducing further processing complexity.
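The numpy sketch below shows how the constraint matrix of this section can be assembled from temporal neighbours and precomputed spatial neighbours, and how the penalty term that Equation (8) adds to the GPLVM objective is evaluated. The neighbour search by DTW or motion pattern detection is left abstract, latent points are stored as rows of X (so tr(XLX^T) becomes trace(X^T L X)), the default σ follows the value used in Section 4, and all names are illustrative assumptions.

```python
import numpy as np

def heat_kernel_weights(Y, neighbours):
    """Weight matrix of Equation (6) for one neighbourhood graph (T or S).

    Y          : (N x D) array of observations.
    neighbours : dict mapping each index i to the set of its neighbour indices.
    """
    N = len(Y)
    W = np.zeros((N, N))
    for i, nbrs in neighbours.items():
        for j in nbrs:
            W[i, j] = W[j, i] = np.exp(-np.sum((Y[i] - Y[j]) ** 2))
    return W

def laplacian(W):
    return np.diag(W.sum(axis=1)) - W

def spatio_temporal_constraints(Y, spatial_neighbours, m=1):
    """L = L_T + L_S built from temporal neighbours and precomputed spatial neighbours."""
    N = len(Y)
    temporal = {i: {j for j in range(i - m, i + m + 1) if 0 <= j < N and j != i}
                for i in range(N)}
    L_T = laplacian(heat_kernel_weights(Y, temporal))
    L_S = laplacian(heat_kernel_weights(Y, spatial_neighbours))
    return L_T + L_S

def constraint_penalty(X, L, sigma=1e4):
    """The term 0.5 * sigma^{-2} * tr(X L X^T) added to the negative log posterior (Eq. 8)."""
    return 0.5 * np.trace(X.T @ L @ X) / sigma ** 2
```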
4 Validation of ST-GPLVM Approach
Our new approach is evaluated qualitatively through a comparative analysis of latent spaces discovered by standard non-linear probabilistic latent variable models, i.e. GPLVM, BC-GPLVM and GPDM, and their extensions, i.e. ST-GPLVM, ST-BC-GPLVM and ST-GPDM, where the proposed spatio-temporal constraints have been included.
Fig. 3. 3D models learned from walking sequences of 3 different subjects with corresponding first 2 dimensions and processing times: a) GPLVM, b) ST-GPLVM, c) BC-GPLVM, d) ST-BC-GPLVM, e) GPDM and f) ST-GPDM. Warm-coloured regions correspond to high reconstruction certainty.
Our evaluation is conducted using time series of MoCap data, i.e. repeated actions provided by the HumanEva dataset [16]. The MoCap time series are first converted into normalized sequences of poses, i.e. invariant to the subject's rotation and translation. Each pose is then represented as a set of quaternions, i.e. a 52-dimension feature vector. In this experiment, we consider three different subjects performing a walking action comprising 500 frames each. The dimensionality of the walking action space is reduced to 3 dimensions [20, 22]. During the learning process, the computational complexity is reduced using FITC [6], where the number of inducing variables is set to 10% of the data. The global scaling of the constraining prior, σ, and the width of the back constrained kernel [12] were set empirically to 10^4 and 0.1 respectively. Values of all the other parameters of the models were estimated automatically using maximum likelihood optimisation. The back constrained models used an RBF kernel [12]. The learned latent spaces for the walking sequences, with the corresponding first two dimensions and processing times, are presented in figure 3. Qualitative analysis confirms the generalisation property of the proposed extension. Standard GPLVM based approaches discriminate between subjects in spatially distinct latent space regions. Moreover, action repetitions by a given subject are represented separately. In contrast, the introduction of our spatio-temporal
constraint in objective functions allows producing consistent and smooth representation by discarding style variability in all considered models. In addition, the extended algorithms converge significantly faster than standard versions. Here, we achieve a speed-up of a factor 4 to 6.
5 Application of ST-GPLVM to Activity Recognition
We demonstrate the effectiveness of the novel methodology in a realistic computer vision application by integrating ST-GPLVM within the view independent human action recognition framework proposed in [7]. Here, the training data comprises time series of action images obtained from evenly spaced views located around the vertical axis of a subject. In order to deal with this complicated scenario, the introduced methodology is extended by a new initialisation procedure and a new advanced constraining prior. The learning process of the action recognition framework is summarised in figure 4. First, for each view (z = 1..Z), silhouettes are extracted from videos, normalized and represented as 3364-dimensional vectors of local space-time saliency features [7]. Then, the spatio-temporal constraints L_z are calculated. During initialisation, style invariant one-dimensional action manifolds X_z are obtained using temporal LE [8] and subsequently all these view-dependent models are combined to generate a coherent view invariant representation of the action [7]. The outcome of this procedure reveals a torus-like structure which is used to initialise GPLVM and encapsulates both style and view. Finally, the latent space and the parameters of the model are optimised jointly under a new combined prior p(X|L). This prior is derived by taking into account the constraints associated with each view:

p(X|L) = \prod_{z=1}^{Z} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{\mathrm{tr}(X_z L_z X_z^T)}{2\sigma^2}\right)   (9)

where L is a block diagonal matrix formed by all L_z. Action recognition is performed by maximum likelihood estimation. Performance of the system is evaluated using the multi-view IXMAS dataset [23], which is considered the benchmark for view independent action recognition. This dataset is comprised of 12 actions which are performed 3 times by 12 different actors. In this dataset, actors' positions and orientations in videos are arbitrary since no specific instruction was given during acquisition. As a consequence, the action viewpoints are arbitrary and unknown. Here, we use one action repetition of each subject for training, whereas testing is performed with all action repetitions. Experiments are conducted using the popular leave-one-out schema [4,23,24]. Two recognition tasks were evaluated using either a single view or multiple views. In line with other experiments made on this dataset [9,10,24], the top view was discarded for testing. The global scaling of the constraining prior and the number of inducing variables in FITC [6] were set to 10^4 and 25% of the data respectively. Values of all the other parameters of the models were estimated automatically using maximum likelihood optimisation.
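To make Equation (9) concrete, the short sketch below combines the per-view constraint matrices into the block diagonal L and evaluates the log of the combined prior, assuming the per-view latent blocks X_z are supplied in the same order as the L_z; this is an illustrative sketch, not the authors' code.

```python
import numpy as np
from scipy.linalg import block_diag

def combined_log_prior(X_blocks, L_blocks, sigma=1e4):
    """log p(X|L) of Equation (9): a product of the Z view-dependent priors."""
    log_p = 0.0
    for X_z, L_z in zip(X_blocks, L_blocks):
        log_p += -0.5 * np.log(2 * np.pi * sigma ** 2) \
                 - np.trace(X_z.T @ L_z @ X_z) / (2 * sigma ** 2)
    return log_p

def block_diagonal_constraints(L_blocks):
    """The overall constraint matrix L formed from all per-view L_z."""
    return block_diag(*L_blocks)
```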
Fig. 4. Pipeline for generation of probabilistic view and style invariant action descriptor
Table 1. Left, average recognition accuracy over all cameras using either single or multiple views for testing. Right, class-confusion matrix using multiple views.

Method            Subjects/Actions   Single view (%)   All views (%)
Weinland [23]     10 / 11            63.9              81.3
Yan [24]          12 / 11            64.0              78.0
Junejo [4]        10 / 11            74.1              -
Liu [9]           12 / 13            71.7              78.5
Liu [10]          12 / 13            73.7              82.8
Lewandowski [7]   12 / 12            73.2              83.1
Our               12 / 12            76.1              85.4
Action recognition results are compared with the state of the art in table 1 (top view excluded). Examples of learned view and style invariant action descriptors using ST-GPLVM are shown in figure 5. Although different approaches may use slightly different experimental settings, table 1 shows that our framework produces the best performances. In particular, it improves the accuracy of the standard framework [7]. The confusion matrix of recognition for the ’all-view’ experiment reveals that our framework performed better when dealing with motions involving the whole body, i.e. ”walk”, ”sit down”, ”get up”, ”turn around”
Fig. 5. Probabilistic view and style invariant action descriptors obtained using ST-GPLVM for a) sit down, b) cross arms, c) turn around and d) kick
and ”pick up”. As expected, the best recognition rates 78.7%, 80.7% are obtained for camera 2 and 4 respectively, since those views are similar to those used for training, i.e. side views. Moreover, when dealing with either different, i.e. camera 1, or even significantly different views, i.e. camera 3, our framework still achieves good recognition rate, i.e. 75.2% and 69.9% respectively.
6 Conclusion
This paper introduces a novel probabilistic approach for nonlinear feature extraction called Spatio-Temporal GPLVM. Its main contribution is the inclusion of spatio-temporal constraints, in the form of a conditioned prior, into the standard GPLVM framework in order to discover generalised latent spaces of MTS. All conducted experiments confirm the generalisation power of the proposed concept in the context of classification applications where marginalising style variability is crucial. We applied the proposed extension to different GPLVM variants and demonstrated that their Spatio-Temporal versions produce smoother, more coherent and visually more convincing descriptors at a lower computational cost. In addition, the methodology has been validated in a view independent action recognition framework and produced state-of-the-art accuracy. Consequently, the concept of a consistent representation of time series should benefit many other applications beyond action recognition, such as gesture, sign-language and speech recognition.
References 1. Belkin, M., Niyogi, P.: Laplacian eigenmaps and spectral techniques for embedding and clustering. In: Proc. NIPS, vol. 14, pp. 585–591 (2001) 2. Ek, C., Torr, P., Lawrence, N.D.: Gaussian process latent variable models for human pose estimation. Machine Learning for Multimodal Interaction, 132–143 (2007) 3. Jenkins, O., Matari´c, M.: A spatio-temporal extension to isomap nonlinear dimension reduction. In: Proc. ICML, pp. 441–448 (2004) 4. Junejo, I., Dexter, E., Laptev, I., P´erez, P.: Cross-view action recognition from temporal self-similarities. In: Proc. ECCV, vol. 12 (2008)
5. Lawrence, N.: Gaussian process latent variable models for visualisation of high dimensional data. In: Proc. NIPS, vol. 16 (2004) 6. Lawrence, N.: Learning for larger datasets with the Gaussian process latent variable model. In: Proc. AISTATS (2007) 7. Lewandowski, J., Makris, D., Nebel, J.C.: View and style-independent action manifolds for human activity recognition. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010. LNCS, vol. 6316, pp. 547–560. Springer, Heidelberg (2010) 8. Lewandowski, M., Martinez-del-Rincon, J., Makris, D., Nebel, J.-C.: Temporal extension of laplacian eigenmaps for unsupervised dimensionality reduction of time series. In: Proc. ICPR (2010) 9. Liu, J., Ali, S., Shah, M.: Recognizing human actions using multiple features. In: Proc. CVPR (2008) 10. Liu, J., Shah, M.: Learning human actions via information maximization. In: Proc. CVPR (2008) 11. M¨ oller, M.: A scaled conjugate gradient algorithm for fast supervised learning. Neural Networks 6(4), 525–533 (1993) 12. Lawrence, N.D., Quinonero-Candela, J.: Local Distance Preservation in the GP-LVM Through Back Constraints. In: Proc. ICML, pp. 513–520 (2006) 13. Roweis, S., Saul, L.: Nonlinear dimensionality reduction by locally linear embedding. Science 290(5500), 2323–2326 (2000) 14. Sch¨ olkopf, B., Smola, A., M¨ uller, K.: Kernel principal component analysis. In: ICANN, pp. 583–588 (1997) 15. Shyr, A., Urtasun, R., Jordan, M.: Sufficient dimension reduction for visual sequence classification. In: Proc. CVPR (2010) 16. Sigal, L., Black, M.: HumanEva: Synchronized Video and Motion Capture Dataset for Evaluation of Articulated Human Motion. Brown Univertsity (2006) 17. Tenenbaum, J., Silva, V., Langford, J.: A global geometric framework for nonlinear dimensionality reduction. Science 290(5500), 2319–2323 (2000) 18. Tipping, M., Bishop, C.: Probabilistic principal component analysis. Journal of the Royal Statistical Society, Series B 61, 611–622 (1999) 19. Urtasun, R., Darrell, T.: Discriminative Gaussian process latent variable model for classification. In: Proc. ICML, pp. 927–934 (2007) 20. Urtasun, R., Fleet, D.J., Fua, P.: 3D people tracking with gaussian process dynamical models. In: Proc. CVPR, vol. 1, pp. 238–245 (2006) 21. Urtasun, R., Fleet, D., Geiger, A., Popovi´c, J., Darrell, T., Lawrence, N.: Topologically-constrained latent variable models. In: Proc. ICML (2008) 22. Wang, J., Fleet, D., Hertzmann, A.: Gaussian process dynamical models. In: Proc. NIPS, vol. 18, pp. 1441–1448 (2006) 23. Weinland, D., Boyer, E., Ronfard, R.: Action recognition from arbitrary views using 3D exemplars. In: Proc. ICCV, vol. 5(7), p. 8 (2007) 24. Yan, P., Khan, S., Shah, M.: Learning 4D action feature models for arbitrary view action recognition. In: Proc. CVPR, vol. 12 (2008)
Real-Time Change-Point Detection Using Sequentially Discounting Normalized Maximum Likelihood Coding Yasuhiro Urabe1 , Kenji Yamanishi1 , Ryota Tomioka1 , and Hiroki Iwai2 1
2
The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo, Japan
[email protected] The first author’s current affiliation is Faculty of Medicine, University of Miyazaki
[email protected],
[email protected] Little eArth Corporation Co., Ltd, 2-16-1 Hirakawa-cho, Chiyoda-ku, Tokyo, Japan
[email protected]
Abstract. We are concerned with the issue of real-time change-point detection in time series. This technology has recently received vast attentions in the area of data mining since it can be applied to a wide variety of important risk management issues such as the detection of failures of computer devices from computer performance data, the detection of masqueraders/malicious executables from computer access logs, etc. In this paper we propose a new method of real-time change point detection employing the sequentially discounting normalized maximum likelihood coding (SDNML). Here the SDNML is a method for sequential data compression of a sequence, which we newly develop in this paper. It attains the least code length for the sequence and the effect of past data is gradually discounted as time goes on, hence the data compression can be done adaptively to nonstationary data sources. In our method, the SDNML is used to learn the mechanism of a time series, then a change-point score at each time is measured in terms of the SDNML code-length. We empirically demonstrate the significant superiority of our method over existing methods, such as the predictive-coding method and the hypothesis testing method, in terms of detection accuracy and computational efficiency for artificial data sets. We further apply our method into real security issues called malware detection. We empirically demonstrate that our method is able to detect unseen security incidents at significantly early stages.
1 1.1
Introduction Motivation
We are concerned with the issue of detecting change points in time series. Here a change-point is the time point at which the statistical nature of time series suddenly changes. Hence the detection of that point may lead to the discovery of a novel event. The issue of change-point detection has recently received vast attentions in the area of data mining ([1],[9],[2],etc.).This is because it can be J.Z. Huang, L. Cao, and J. Srivastava (Eds.): PAKDD 2011, Part II, LNAI 6635, pp. 185–197, 2011. c Springer-Verlag Berlin Heidelberg 2011
186
Y. Urabe et al.
applied to a wide variety of important data mining problems such as the detection of failures of computer devices from computer performance data such as CPU loads, the detection of malicious executables from computer access logs. We require that the change-point detection be conducted in real-time. This requirement is crucial in real environments as in security monitoring, system monitoring, etc. Hence we wish to design a real-time change-point detection algorithm s.t. every time a datum is input, it gives a score measuring to what extent it is likely to be a change-point. Further it is desired that such an algorithm detects change-points as early as possible with least false alarms. We attempt to design a change-point detection algorithm on the basis of data compression. The basic idea is that a change point may be considered as a time point when the data is no longer compressed using the same nature as the one which have ever been observed. An important notion of sequentially normalized maximum likelihood (SNML) coding has been developed in the scenario of sequential source coding [4],[6],[5]. It has turned out to attain the shortest codelength among possible coding methods. Hence, from the information-theoretic view point, it is intuitively reasonable that the time point when the SNML codelength suddenly changes can be thought of as a change point. However, SNML coding has never been applied to the issue of change-point detection. Further in the case where data sources are non-stationary, SNML should be extended so the data compression is adaptive to the time-varying nature of the sources. 1.2
Purpose and Significances of This Paper
The purpose of this paper is twofolds. One is to propose a new method of realtime change point detection using the sequentially discounting normalized maximum likelihood coding(SDNML). SDNML is a variant of SNML, which we newly develop in this paper. It is obtained by extending SNML so that out-of-date statistics is gradually discounted as time goes on, and eventually the coding can be adaptive to the time-varying nature of data sources. In our method, we basically employ the two-stage learning framework proposed in [9] for real-time change-point detection. In it there are two-stages for learning; one is the stage for learning a probabilistic model from an original data sequence and giving a score for each time on the basis of the learned model, and the other is the stage for learning another probabilistic model from a score sequence obtained by smoothing scores calculated at the first stage and giving a change-point score at each time. In this framework we use SDNML code-length as a change-point score. Note that in [9], the predictive code-length was used as a change-point score instead of SDNML code-length. Since the SDNML coding is optimal as shown in [6],[5], we expect that our method will lead to a better strategy than the one proposed in [9]. The theoretical background behind this intuition is Rissanen’s minimum description length(MDL) principle [4], which asserts that the shorter code-length leads to the better estimation of an underlying statistical model. The other purpose is to empirically demonstrate the significant superiority of our method over existing methods in terms of detection accuracy and computational efficiency. We demonstrate that using both artificial data and real data. As
Real-Time Change-Point Detection
187
for artificial data demonstration, we evaluate the performance of our method for two types of change-points; continuous change points and discontinuous ones. As for real data demonstration, we apply our method into real security issues called malware detection. We empirically demonstrate that our method is able to detect unseen security incidents or their symptoms at significantly early stages. Through this demonstration we develop a new method for discovering unseen malware by means of change-point detection from web server logs. 1.3
Related Works
There exist several earlier works on change-point detection. A standard approach to this issue has been to employ the hypothesis testing method [3], [8], i.e., testing whether the probabilistic models before and after the change-point are identical or not. Guralnik and Srivastava proposed a hypothesis-testing based event detection [2]. In it a piecewise segmented function was used to fit the time-dependent data and a change-point was detected by finding the point such that the total errors of local model fittings of segments to the data before and after that point is minimized. However, it is basically computationally expensive to find such a point since the local model fitting task is required as many times as the number of points between the successive points every time a datum is input. As for real-time change-point detection, Takeuchi and Yamanishi [9] (see also [11]) have proposed ChangeFinder, in which the two-stage learning framework has been employed. It has been reported in [9] that ChangeFinder outperforms the hypothesis testing-based method both in detection accuracy and computational efficiency. In the two-stage learning framework the choice of scoring function is crucial. In ChangeFinder the score is calculated as the predictive code-length, which will be replaced with the SDNML code-length in this paper. The technology of change-point detection has been applied to a variety of application domains (failure detection, marketing, security, etc.). The security has been recognized as one of most important application areas among them since it has critical issues of how to detect cyber-threat caused by malicious hackers. Although various classification-based pattern matching methods have been applied to security issues [10],[12], to the best of our knowledge, there is few works that gives a clear relation of security issues to change-point detection. The rest of this paper is organized as follows: Section 2 introduces the notion of the SDNML. Section 3 describes our proposed method. Section 4 gives empirical evaluation of the proposed method for artificial data sets. Section 5 shows an application of our method to security. Section 6 yields concluding remarks.
2
Sequentially Discounting Normalized Maximum Likelihood Coding
This section introduces the SDNML coding. Suppose that we observe a discretetime series, which we denote as, {xt : t = 1, 2, · · · }. We denote xt = x1 · · · xt . Consider the parametric class F = {p(xt |xt−1 : θ) : t = 1, 2, · · · } of conditional
188
Y. Urabe et al.
probability density functions where θ denotes the k-dimensional parameter vect−1 ˆ tor. For this class, letting θ(x·x ) be the maximum likelihood estimate of θ from t−1 t−1 x·x = x1 · · · xt−1 · x (i.e., θˆ = arg maxθ {p(x|xt−1 : θ) j=1 p(xj |xj−1 : θ)}, we consider the following minimax problem: min max log t−1
q(x|x
)
x
ˆ · xt−1 )) p(x · xt−1 |θ(x . q(x|xt−1 )
(1)
This is known as the conditional minimax criterion [5], which is a conditional variant of Shatarkov’s minimax risk [7]. The solution to this yields the distribution having the shortest code-length relative to the model class F. It is known from [4] that the solution of the minimum in (1) is achieved by the sequentially normalized maximum likelihood (SNML) density function defined as: def
pSNML (xt |xt−1 ) =
ˆ t )) p(xt |θ(x t−1 def ˆ · xt−1 ))dx. , K (x ) = p(x · xt−1 |θ(x t Kt (xt−1 )
We call the quantity − log pSNML (xt |xt−1 ) the SNML code-length. It is known from [5],[6] that the cumulative SNML code-length, which is the sum of SNML code-length over the sequence, is optimal in the sense that it asymptotically achieves the shortest code-length. According to Rissanen’s MDL principle [4], the SNML leads to the best statistical model for explaining data. We employ here the AR model as a probabilistic model and introduce SDNML(sequentially discounting normalized maximum likelihood) coding for this model by extending SNML so that the effect of past data can gradually be discounted as time goes on. The function of ”discounting” is important in real situations where the data source is non-stationary and the coding should be adaptive to it. Let X ⊂ R be 1-dimensional and let xt ∈ X for each t. We define the kth order auto-regression (AR) model as follows: 1 1 t−1 2 p(xt |xt−k : θ) = exp − 2 (xt − w) , (2) 2σ (2πσ 2 )1/2 k where w = i=1 A(i) xt−i and θ = (A(1) , · · · , A(k) , σ 2 ). Let r (0 < r < 1) be the discounting coefficient. Let m be the the least sample (1) (k) size such that Eq.(3) is uniquely solved. Let Aˆt = (Aˆt , · · · Aˆt )T be the dis(1) (k) counting maximum likelihood estimate of the parameter At = (At , · · · , At )T t from x i.e., Aˆt = arg min A
t
r(1 − r)t−j (xj − AT x ¯j )2 ,
(3)
j=m+1
where x ¯j = (xj−1 , xj−2 , . . . , xj−k )T . Here the discounting maximum likelihood estimate can be thought of as a modified variant of maximum likelihood estimate so that the weighted likelihood is maximum where the weight of the jth past
Real-Time Change-Point Detection
189
data is given r(1 − r)t−j . Hence the larger the discounting coefficient r is, the exponentially smaller the effect of past data becomes. def We further let eˆt = xt − AˆTt x¯t . Then let us define the discounting maximum likelihood estimate of the variance from xt by def
τˆt = argmax σ2
=
t
ˆ 2 p(xt |xt−1 t−k : At , σ )
j=m+1
t 1 eˆ2 . t − m j=m+1 j
Below we give a method of sequential computation of Aˆt and τˆt so that they def can be computed every time a datum xt is input. Let Xt = (¯ xk+1 , x ¯k+2, . . . , x ¯t ). Let us recursively define V˜t and Mt as follows: def −1 V˜t−1 = (1 − r)V˜t−1 + r¯ xt x ¯Tt ,
def
Mt = (1 − r)Mt−1 + r¯ xt xt .
Then we obtain the following iterative relation for the parameter estimation: 1 ˜ r V˜t−1 x ¯t x ¯Tt V˜t−1 Vt−1 − , 1−r 1 − r 1 − r + c˜t eˆt = xt − AˆTt x ¯t , c ˜ t d˜t = . 1 − r + c˜t
V˜t =
Aˆt = V˜t Mt , c˜t = r¯ xTt V˜t−1 x ¯t , (4)
Setting r = 1/(t − m) yields the iteration developed by Rissanen et.al. [5] and Roos et.al. [6]. We employ (4) for parameter estimation. Define st by def
st =
t
eˆ2j = (t − m)ˆ τt .
(5)
j=m+1
We define a SDNML density function by normalizing the discounting maximum likelihood, which is given by −(t−m)/2
pSDNML (xt |xt−1 ) = Kt−1 (xt−1 )(st
−(t−m−1)/2
/st−1
where the normalizing factor Kt (xt−1 ) is calculated as follows: √ π Γ ((t − m − 1)/2) t−1 Kt (x ) = . 1 − dt Γ ((t − m)/2)
),
(6)
(7)
The SDNML code-length for xt is calculated as follows: √ π Γ ((t − m − 1)/2) − log pSDNML (xt |xt−1 ) = log 1 − dt Γ ((t − m)/2) 1 t−m (t − m)ˆ τt + log((t − m − 1)ˆ τt−1 ) + log .(8) 2 2 (t − m − 1)ˆ τt−1 We may employ the SDNML code-length (8) as the scoring function in the context of change-point detection.
190
3
Y. Urabe et al.
Proposed Method
The main features of our proposed method are summarized as follows: 1)Two-stage learning framework with SDNML code-length: We basically employ the two-stage learning framework proposed in [9] to realize real-time changepoint detection. The key idea is that probabilistic models are learned at the two stages; in the first stage a probabilistic model is learned from the original time series and a score is given for each time on the basis of the model, and in the second stage another probabilistic model is learned from a score sequence obtained by smoothing scores calculated at the first stage and a change-point score is calculated on the basis of the learned model. We use the SDNML code-length for the scoring in each stage. 2)Efficiently computing the estimates of parameters: Although the Yule-Walker equation must be solved for the parameter estimation in ChangeFinder [9], we can use an iterative relation to more efficiently estimate parameters than ChangeFinder. Below we give details of our version of the two-stage learning framework. Two-stage Learning Based Change-point Detection: We observe a discrete-time series, which we denote as, {xt : t = 1, 2, · · · }. The following steps are executed every time xt is input. Step 1 (First Learning Stage). We employ the AR model as in (2) to learn from the time series {xt } a series of the SDNML density functions, which we denote as {pSDNML (xt+1 |xt ) : t = 1, 2, · · · }. This is computed according to the iterative learning procedure (4),(5),(6),(7). Step 2 (First Scoring Stage). A score for xt is calculated in terms of the SDNML code-length of xt relative to pSDNML (·|xt−1 ), according to (8): Score(xt ) = − log pSDNML (xt |xt−1 ).
(9)
This score measures how largely xt is deviated relative to pSDNML (·|xt−1 ). Step 3 (Smoothing). We construct another time series {yt } on the basis of the score sequence as follows: For a fixed size T (> 0), 1 yt = T
t
Score(xi ).
(10)
i=t−T +1
This implies that {yt : t = 1, 2, · · · } is obtained by smoothing: i.e., taking an average of scores over a window of fixed size T and then sliding the window over time. The role of smoothing is to reduce influence from isolated outliers. Our goal is to detect bursts of outliers rather than isolated ones. Step 4 (Second Learning Stage). We sequentially learn again SDNML density functions associated AR model from the score sequence {yt}, which we denote as {qSDNML (yt+1 |y t ) : t = 1, 2, · · · }. This is also computed according to the iterative relation (4),(5),(6),(7).
Real-Time Change-Point Detection
191
Step 5 (Second Scoring Stage). We calculate the SDNML code-length for yt according to (8) and make a smoothing of the new score values over a window of fixed size T to get the following change-point score: Score(t) =
1 T
t
− log qSDNML (yi |y i−1 ) .
(11)
i=t−T +1
This indicates how drastically the nature of the original sequence xt has changed at time point t. In [9],[11], the score is calculated as the negative logarithm of the plug-in density defined as: − log p(xt |θˆ(t−1) ), where θˆ(t−1) is the estimate of θ obtained by using the discounting learning algorithm from xt−1 . In our method, it is replaced by the SDNML code-length. In updating each parameter estimate, there is only one iteration every time a datum is input. The computation time for our method is O(k 2 n) while that for the original two-stage learning-based method: ChangeFinder is O(k 3 n).
4 4.1
Empirical Evaluation for Artificial Datasets Methods to Be Compared
We evaluated the proposed method in comparison with existing algorithms: ChangeFinder in [9], and the hypothesis testing method. Guralnik and Srivastava [2] proposed an algorithm for hypothesis testingbased change-point detection, which we denote as HT. In it the square loss function was used as an error measure. It was extended in [9] to the one using the predictive code-length as an error measure. Below we briefly sketch the basic idea of HT using the predictive code-length. Let xtu = xu · · · xt . Let us define the cumulative predictive code-length, which we denote as def
I(xtu ) = −
t
i−1 log p(xi |xi−1 i−k : θ(xu )),
i=u i−1 where θ(xi−1 u ) is the maximum likelihood estimator of θ from xu . In HT, if t there exists a change point v (u < v < t) in xu , the total amount of predictive code-length will be reduced by fitting different statistical models before and after the change point, that is, I(xvu ) + I(xtv+1 ) is significantly smaller than I(xtu ). On the basis of the principle as above, we can detect change points in an incremental manner. Let ta be the last data point detected as a change point and tb be the latest data point. Every time a datum is input, we examine whether the following inequality holds or not: 1 tb tb i I(xta ) − min (I(xta ) + I(xi+1 )) > δ, i:ta
192
Y. Urabe et al.
where δ is a predetermined threshold. If it holds, we recognize the time point giving the minimum in the left hand side as a change point. Once a change point is detected, which we denote as tcp , then the detection process restarts by letting tcp be the final detected change point. This procedure continues recursively. The computation time is O(n2 ) where n is the data size. 4.2
Discontinuous Change-Point Detection
In our experimental setting, we formally define two-types of change-points. One is a discontinuous type of change points and the other is a continuous type. Letting pB and pA be the probability density functions for data generation before and after the change point t, respectively. We define the dissimilarity between pB and pA in terms of the Kullback-Leibler divergence defined as follows: 1 pA (X n ) def Δ(t) = D(pA ||pB ) = lim EpA log . n→∞ n pB (X n ) First we consider the case where change points are discontinuous in the sense that the value of Δ(t) discontinuously changes at a change point t. As for the data-generation model we employed the following AR model: xt = A1 xt−1 + A2 xt−2 + ε, where A1 = 0.6, A2 = −0.5, and ε ∼ N (μ, 1). We generated 1,000 records and set change-points so that the jump of the mean value μ occurred at x × 100 (x = 1, 2, · · · ). Let the amount of jump of mean at the kth change-point be Dk . We set Dk = 10 − k. The dissimilarity at the ith change point is given by Δi =
D(i)2 (1 − A21 − A22 )2 = 0.405Δ(i)2 . 2σ 2
The amount of jump in this model is discrete. Hence we call the change points in this case discontinuous change points. Figure 1(a) shows the data set including discontinuous change points. Note that the change points tend to be more difficult to be identified as i increases since they are more affected by noise. Hence it is non-trivial to detect all of them regardless they are discontinuous. In the applications of our method and ChangeFinder, we set k = 4 (the degree of AR model), r = 0.01 (discounting parameter), and T = 3 (smoothing parameter). Through the paper all of the parameters are systematically chosen so that they are best fit for a fixed percentage (say, 5%) of training data. Below we give a measure of performance for change-point detection. By setting a threshold of scores, an alarm was made if the score value exceeds the threshold. Letting t∗ be the true change-point, we define the benefit of the time point t when an alarm is made as follows:
1 − (t − t∗ )/20 : 0 ≤ t − t∗ ≤ 20 def benef it(t) = 0 : otherwise
Real-Time Change-Point Detection
193
60
50
40
30
20
10
0
-10
1
101
201
301
401
501
601
701
801
901
(a) Data: Discontinuous Change Points
SDNML
(b) Benefit-FDR Curves
Fig. 1. Detecting Discontinuous Change Points
Log10 SDNML
Fig. 2. Comparison of computation times of SDNML and CF
The benefit measures how early the true change-point is detected. It takes the maximum value 1 when the true change-point is detected at that point, and is zero when |t − t∗ | exceeds 20. The false discovery rate (FDR) is the ratio of the number of false positive alarms over the number of total alarms. Considering the trade-off between benefit and FDR, we used the benefit-FDR curve as proposed in [1] for the performance comparison. It is a concept similar to ROC curve. Figure 1(b) shows the results of the benefit-FDR curves for our method and existing methods. The horizontal axis shows FDR while the vertical axis shows the average benefit where the average was taken over all of the change points. SDNML is our method, CF is ChangeFinder, the conventional two-stage learning based method. HT is the hypothesis-testing based method in which the fourthdegree AR model is used for model fitting and the score is measured in terms of the logarithmic loss. We observe from Figure 1(b) that SDNML performs better than HT and CF. The AUC (Area Under Curve) for SDNML was about 12 % larger than that for CF. Figure 2 shows the computation time (sec) of SDNML in comparison with CF for this data set. We see that SDNML is significantly more efficient than CF.
194
Y. Urabe et al.
600
500
400
300
200
100
0 1
26
51
76
101
126
151
176
SDNML
-100
(a) Data: Continuous Change Points
(b) Benefit-FDR Curves
Fig. 3. Detecting Continuous Change Points
4.3
Continuous Change-Point Detection
Next we consider the case where change points are continuous in the sense that the value of dissimilarity Δ continuously changes at each of the points. We consider the following data generation model xt = v(t) + ε where ε ∼ N (0, 1) and v(t) = 0 for 0 ≤ t ≤ 100, and v(t) = c(t − 100)(t − 99)/2 for t > 100. Letting the dissimilarity at time t be Δ(t), then it is calculated as: For a given c > 0, Δ(t) = 0 for 0 ≤ t ≤ 100, and Δ(t) = c2 (t − 100)2 /2 for t > 100. This shows that the dissimilarity of change points is continuous with respect to t. We call such change points continuous change points. They are more difficult to be identified than discontinuous ones. We generated 6 times 200 records according to the model as above. Figure 3(a) shows an example of such data sets. We evaluated the detection accuracies for CF,HT, and SDNML for this data set. Parameter values for all of the methods are systematically chosen as with the discontinuous case. Figure 3(b) shows the results of the benefit-FDR curves for CF and SDNML where the average-benefit was computed as the average of the benefits taken over the 6-times randomized data generation. Note that HT was much worse than CF, and was omitted from the Figure3(b). We observe from Figure 3(b) that SDNML performs significantly better than CF. The AUC for SDNML is about 46 % larger than that for CF. Through the empirical evaluation using artificial data sets including continuous and discontinuous change-points, we observe that our method performs significantly better than the existing methods both in detection accuracy and computational efficiency. The superiority of SDNML over CF may be justified from the view of the minimum description length (MDL) principle. Indeed, SDNML is designed as the optimal strategy that sequentially attains the least code-length while CF using the predictive code produces longer code-lengths than SDNML. It is theoretically guaranteed from the theory of the MDL principle that the shorter the code-length for data sequence is, the better model is learned from data. Hence the better strategy in the light of the MDL principle yields a better strategy for statistical
Real-Time Change-Point Detection
195
modeling, eventually leads to a better strategy for change-point detection. This insight was demonstrated experimentally. The reason why SDNML and CF are significantly better than HT is that SDNML and CF are more adaptive to non-stationary data sources than HT. Indeed, SDNML and CF have the function of sequential discounting learning while HT has no such a function. It was also demonstrated experimentally. It is interesting to see that the difference between SDNML and CF becomes much larger in detecting continuous change points rather than discontinuous ones. This is due to the fact that the statistical modeling is more critical for the cases where the change-points are more difficult to be detected.
5
Applications to Malware Detection
We show an application of our method to malware detection. Malware is a generic term indicating unwanted software (e.g., viruses, backdoors, spywares, torojans, worms etc.). Most of conventional methods against malware are signature-based ones such as anti-virus software. Here signature is a short string characterizing malware’s features. In the signature-based methods, pattern matching of any given input data with signatures is conducted to detect malware. Hence unseen malware or those whose signatures are difficult to describe may not be detected by the signature-based methods. Furthermore it is desired that the symptom of malware is detected earlier than its malicious action actually occurs. We expect that change-point detection from access log data is one of promising technologies for detecting such malware at early stages. We are concerned with the issue of detecting backdoor, which is one of typical malware. In our experiment we used access logs, each of which consists of a number of attributes, including time stamp, IP address, URL, server name, kinds of action, etc. All of the data were collected at a server. URL means the URL accessed by a user. We used only three attributes from among them; time stamp, IP address, URL. We constructed two kinds of time series. One is a time series of IP address counts, where a datum was generated every 1 minutes and its value is the maximum number of identical IPs which occurred within past 15 minutes. The other is a time series of URL counts, where a datum was generated every 1 minutes and its value was the maximum number of identical URLs which occurred within past 15 minutes. In the data set there are a number of bursts of logs including the message 500ServerError, which are considered as actions related to backdoors. From the view of security analysts, they can be thought of as a symptom of them. Hence we are interested in how early our method is able to detect such bursts without any knowledge of the message 500ServerError. We applied our method to the two time series as above. The original data set consisted of 5538 records, and the length of the time series obtained after the preprocessing as above was 536. In the original data set, there were included three bursts of logs including the message 500ServerEroor. Figure 4(a) shows graphs of a time series of IP address counts, SDNML score curve, and CF score curve. Figure 4(b) shows graphs of URL counts data, SDNML score curve, and CF score curve.
196
Y. Urabe et al.
SDNML
SDNML
(a) IP access counting data
(b) URL counting data
Fig. 4. Malware detection results Table 1. Malware Detection ServerError Time SDNML Alert Time CF Alert Time
Begin End IP URL IP URL
Alert Time Alarms 8:49:15 9:17:15 9:59:45 – 8:51:15 9:27:15 9:59:45 – 8:48:45 9:10:15 10:15:45 7 8:48:45 9:10:15 10:00:45 12 – 9:10:15 10:04:15 4 – 9:10:15 10:00:45 5
Table 1 summarizes the performance of SDNML and CF for the two time series (IP counting data and URL counting data) in terms of alert time and the total number of alarms. In the row of ServerError Time, for each burst of messages 500ServerError, the starting time point and ending time point of the burst are shown. In the table ”-” indicates the fact that the burst associated with the message: 500ServerError was not detected. We observe from Table 1 that our method was able to detect all of the bursts associated with the message: 500ServerError, while CF overlooked some of them. It was confirmed by security analysts that all of the detected bursts were related to backdoor, and were considered as symptoms of backdoor. Further there were no logs related to backdoor other than the bursts of the message: 500ServerError. It implies that our method was able to detect backdoor at early stages when its symptoms appeared. This demonstrates the validity of our method in the scenario of malware detection.
6
Conclusion
We have proposed a new method of real-time change point detection, in which we employ the sequentially discounting normalized maximum likelihood (SDNML) coding as a scoring function within the two-stage learning framework. The intuition behind the design of this method is that SDNML coding, which sequentially attains the shortest code-length, would improve the accuracy of change-point
Real-Time Change-Point Detection
197
detection. This is because according to the theory of the minimum description length principle, the shorter code-length leads to the better statistical modeling. This paper has empirically demonstrated the validity of our method using artificial data sets and real data sets. It has turned out that our method is able to detect change-points with significantly higher accuracy and efficiency than the existing real-time change-point detection method and the hypothesis-testing based method. Specifically, through the application of our method to malware detection, we have shown that real-time change-point detection is a promising approach to the detection of symptoms of malware at early stages.
Acknowledgments This research was supported by Microsoft Corporation (Microsoft Research CORE Project) and NTT Corporation.
References 1. Fawcett, T., Provost, F.: Activity monitoring: noticing interesting changes in behavior. In: Proc. of ACM-SIGKDD Int’l Conf. Knowledge Discovery and Data Mining, pp. 53–62 (1999) 2. Guralnik, V., Srivastava, J.: Event detection from time series data. In: Proc. ACMSIGKDD Int’l Conf. Knowledge Discovery and Data Mining, pp. 33–42 (1999) 3. Hawkins, D.M.: Point estimation of parameters of piecewise regression models. J. Royal Statistical Soc. Series C 25(1), 51–57 (1976) 4. Rissanen, J.: Information and Complexity in Statistical Modeling. Springer, Heidelberg (2007) 5. Rissanen, J., Roos, T., Myllym¨ aki, P.: Model selection by sequentially normalized least squares. Jr. Multivariate Analysis 101(4), 839–849 (2010) 6. Roos, T., Rissanen, J.: On sequentially normalized maximum likelihood models. In: Proc. of 1st Workshop on Information Theoretic Methods in Science and Engineering, WITSME 2008 (2009) 7. Shtarkov, Y.M.: Universal sequential coding of single messages. Problems of Information Transmission 23(3), 175–186 (1987) 8. Song, X., Wu, M., Jermaine, C., Ranka, S.: Statistical change detection for multidimensional data. In: Proc. Fifteenth ACM-SIGKDD Int’l Conf. Knowledge Discovery and Data Mining, pp. 667–675 (2009) 9. Takeuchi, J., Yamanishi, K.: A unifying framework for detecting outliers and change-points from time series. IEEE Transactions on Knowledge and Data Engineering 18(44), 482–492 (2006) 10. Wang, J., Deng, P., Fan, Y., Jaw, L., Liu, Y.: Virus detection using data mining techniques. In: Proc. of ICDM 2003 (2003) 11. Yamanishi, K., Takeuchi, J.: A unifying approach to detecting outliers and changepoints from nonstationary data. In: Proc. of the Eighth ACM SIGKDD Int’l Conf. Knowledge Discovery and Data Mining (2002) 12. Ye, Y., Li, T., Jiang, Q., Han, Z., Wan, L.: Intelligent file scoring system for malware detection from the gray list. In: Proc. of the Fifteenth ACM SIGKDD Int’l Conf. Knowledge Discovery and Data Mining (2009)
Compression for Anti-Adversarial Learning Yan Zhou1 , Meador Inge2 , and Murat Kantarcioglu1 1
Erik Jonnson School of Engineering and Computer Science University of Texas at Dallas Richardson, TX 75080 {yan.zhou2,muratk}@utdallas.edu 2 Mentor Graphics Corporation 739 N University Blvd. Mobile, AL 36608
[email protected]
Abstract. We investigate the susceptibility of compression-based learning algorithms to adversarial attacks. We demonstrate that compressionbased algorithms are surprisingly resilient to carefully plotted attacks that can easily devastate standard learning algorithms. In the worst case where we assume the adversary has a full knowledge of training data, compression-based algorithms failed as expected. We tackle the worst case with a proposal of a new technique that analyzes subsequences strategically extracted from given data. We achieved near-zero performance loss in the worst case in the domain of spam filtering. Keywords: adversarial differentiation.
1
learning,
data
compression,
subsequence
Introduction
There is an increasing use of machine learning algorithms for more efficient and reliable performance in areas of computer security. Like any other defense mechanisms, machine learning based security systems have to face increasingly aggressive attacks from the adversary. The attacks are often carefully crafted to target specific vulnerabilities in a learning system, for example, the data set used to train the system, or the internal logic of the algorithm, or often both. In this paper we focus on both causative attacks and exploratory attacks. Causative attacks influence a learning system by altering its training set while their exploratory counterparts do not alter training data but rather exploit misclassifications for new data. When it comes to consider security violations, we focus on integrity violations where the adversary’s goal is injecting hostile input into the system to cause more false negatives. In contrast, availability violations refer to cases where the adversary aims at increasing false positives by preventing “good” input into the system1 . 1
For a complete taxonomy defining attacks against machine learning systems, the readers are referred to the Berkeley technical report [1].
J.Z. Huang, L. Cao, and J. Srivastava (Eds.): PAKDD 2011, Part II, LNAI 6635, pp. 198–209, 2011. c Springer-Verlag Berlin Heidelberg 2011
Anti-Adversarial Learning
199
Many security problems, such as intrusion detection, malware detection, and spam filtering, involve learning on strings. Recent studies demonstrate that superior classification performance can be achieved with modern statistical data compression models instrumented for such learning tasks [2,3,4]. Unlike traditional machine learning methods, compression-based learning models can be used directly on raw strings without error-prone preprocessing, such as tokenization, stemming, and feature selection. The methods treat every string input as a sequence of characters instead of a vector of terms (words). This effectively eliminates the need for defining word boundaries. In addition, compression based methods naturally take into consideration both alphabetic and non-alphabetic characters, which prevents information loss as a result of preprocessing. The robustness of compression-based learning algorithms was briefly discussed recently [5]. To the best of our knowledge, there is not a full scale investigation on the susceptibility of this type of learning algorithm to adversarial attacks, and thus no counter-attack techniques have been developed to address potential attacks against compressors trained in learning systems. In this paper, we demonstrate that compression-based learning algorithms are surprisingly resilient to carefully plotted attacks that can easily devastate standard learning algorithms. We further demonstrate that as we launch more aggressive attacks as in the worst case where the adversary has a full knowledge of training data, compression-based algorithms failed as expected. We tackle the worst case with a proposal of a new technique that analyzes subsequences strategically extracted from given data. We achieved near-zero performance loss in the worst case in the domain of spam filtering. The remainder of this paper is organized as follows. In Section 2, we briefly review the current state-of-the-art data compression model that has been frequently used in machine learning and data mining techniques. Section 3 demonstrates that learning systems with modern compressors are resilient to attacks that have a significant impact on standard learning algorithms. We show that modern compressors are susceptible to attacks when the adversary alters data with negative instances in training data. We propose a counter-attack learning method that enhances the compression-based algorithm in Section 4. Section 5 presents the experimental results. Section 6 concludes the work and discusses future directions.
2
Context-Based Data Compression Model—Prediction by Partial Matching
Statistical data modeling plays an important role in arithmetic encoding [6] which turns a string of symbols into a rational number between [0, 1). The number of bits needed to encode a symbol depends on the probability of its appearance in the current context. Finite-context modeling estimates the probability of an incoming symbol based on a finite number of symbols previously encountered. The current state-of-the-art adaptive model is prediction by partial matching (PPM) [7,8,9]. PPM is one of the best dynamic finite-context models
200
Y. Zhou, M. Inge, and M. Kantarcioglu
that provide good estimate of the true entropy of data by using symbol-level dynamic Markov modeling. PPM predicts the symbol probability conditioned on its k immediately prior symbols, forming a k th order Markov model. For example, the context cki of the ith symbol xi in a given sequence is {xi−k , . . . , xi−1 }. The total number of contexts of an order-k model is O(|Σ|k+1 ), where Σ is the alphabet of input symbols. As the order of the model increases the number of contexts increases exponentially. High-order models are more likely to capture longer-range correlations among adjacent symbols, if they exist; however, an unnecessarily high order can result in context dilution leading to inaccurate probability estimate. PPM solves the dilemma by using dynamic context match between the current sequence and the ones that occurred previously. It uses high-order predictions if they exist, otherwise “drops gracefully back to lower order predictions” [10]. More specifically, the algorithm first looks for a match of an order-k context. If such a match does not exist, it looks for a match of an order k − 1 context, until it reaches order-0. Whenever a match is not found in the current context, the model falls back to a lower-order context and the total probability is adjusted by what is called an escape probability. The escape probability models the probability that a symbol will be found in a lower-order context. When an input symbol xi is found in context cki where k ≤ k, the conditional probability of xi given its k th order context cki is: ⎛ p(xi |cki ) = ⎝
⎞
k
p(Esc|cji )⎠ · p(xi |cki )
j=k +1
where p(Esc|cji ) is the escape probability conditioned on context cji . If the symbol is not predicted by the order-0 model, a probability defined by a uniform distribution is predicted. PPMC [11] and PPMD [12] are two well known variants of the PPM algorithm. Their difference lies in the estimate of the escape probabilities. In both PPMC and PPMD, an escape event is counted every time a symbol occurs for the first time in the current context. In PPMC, the escape count and the new symbol count are each incremented by 1 while in PPMD both counts are incremented by 1/2. Therefore, in PPMC, the total symbol count increases by 2 every time a new symbol is encountered, while in PPMD the total count only increases by 1. When implemented on a binary computer, PPMD sets the escape probability |d| to 2|t| , where |d| is the number of distinct symbols in the current context and |t| is the total number of symbols in the current context. Now, given an input X = x1 x2 . . . xd of length d, where x1 , x2 , . . . , xi is a sequence of symbols, its probability given a compression model M can be estimated as d p(X|M ) = p(xi |xi−1 i−k , M ) i=1
where xji = xi xi+1 xi+2 . . . xj for i < j.
Anti-Adversarial Learning
3
201
Compression-Based Classification and Adversarial Attacks
Consider binary classification problems: X → Y where Y ∈ {+, −}. Given a set of training data, compression-based classification works as follows. First, the algorithm builds two compression models, one from each class. Each compression model maintains a context tree, together with context statistics, for training data in one of the two classes. Then, to classify an unlabeled instance, it requires the instance to run through both compression models. The model that provides a better compression of the instance makes the prediction. A common measure for classification based on compression is minimum cross-entropy [13,3]: |X| 1 c = argmin − log p(xi |xi−1 i−k , Mc ) |X| c∈{+,−} i=1
where |X| is the length of the instance, xi−k , . . . , xi is a subsequence in the instance, k is the length of the context, and Mc is the compression model associated with class c. When classifying an unlabeled instance, a common practice is to compress it with both compression models and check to see which one compresses it more efficiently. However, PPM is an incremental algorithm, which means once an unlabeled instance is compressed, the model that compresses it will be updated as well. This requires the changes made to both models be reverted every time after an instance is classified. Although, the PPM algorithm has a linear time complexity, the constants in its complexity are by no means trivial. It is desirable to eliminate the redundancy of updating and then reverting changes made to the models. We propose an approximation algorithm (See Algorithm 1) that we found works quite well in practice. Given context C = xi−k . . . xi−1 in an unlabeled instance, if any suffix c of C has not occurred in the context trees 1 of the compression models, we set p(Esc|c) = |A| , thus the probability of xi is 1 discounted by |A| , where |A| |Σ|, the size of the alphabet. More aggressive discount factors set the prediction further away from the decision boundary. Empirical results will be discussed in Section 5. Compression-based algorithms have demonstrated superior classification performance in learning tasks where the input consists of strings [3,4]. However, it is not clear whether this type of learning algorithm is susceptible to adversarial attacks. We investigate several ways to attack the compression-based classifier on real data in the domain of e-mail spam filtering. We choose this domain in our study for the following reasons: 1.) previous work has demonstrated great success of compression-based algorithms in this domain [3]; 2.) it is conceptually simple to design various adversarial attacks and establish a ground truth; 3.) there have been studies of several variations of adversarial attacks against standard learning algorithms in this domain [14,5]. Good word attacks are designed specifically to target the integrity of statistical spam filters. Good words are those that appear frequently in normal e-mail
202
Y. Zhou, M. Inge, and M. Kantarcioglu
Algorithm 1. Symbol Probability Estimate Input: xi−k . . . xi−1 , xi , Mc Output: p(xi |xi−1 i−k ) p = 1.0; foreach s = suffix(xi−k . . . xi−1 ) do if s ∈ / Mc then 1 p = p · |A| ; else p = p · p(xi |s, Mc ); end end return p;
but rarely in spam e-mail. Existing studies show that good word attacks are very effective against the standard learning algorithms that are considered the state of the art in text classification [14]. The attacks against multinomial na¨ıve Bayes and support vector machines with 500 good words caused more than 45% decrease in recall while the precision was fixed at 90% in a previous study [5]. We repeated the test on the 2006 TREC Public Spam Corpus [15] using our implementation of PPMD compressors. It turns out that the PPMD classifier is surprisingly robust against good word attacks. With 500 good words added to 50% of the spam in e-mail, we observed no significant change in both precision and recall (See Figure 1)2 . This remains true even when 100% of spam is appended with 500 highly ranked good words. Its surprising resilience to good word attacks led us to more aggressive attacks against the PPMD classifier. We randomly chose a legitimate e-mail message from the training corpus and appended it to a spam e-mail during testing. 50% of the spam was altered this way. This time we were able to bring the average recall value down to 57.9%. However, the precision value remains above 96%. Figure 1 shows the accuracy, precision and recall values when there are no attacks, 500-goodword attacks, and in the worst case, attacks with legitimate training e-mail. Details on experimental setup will be given in Section 5.
4
Robust Classification via Subsequence Differentiation
In this section, we present a robust anti-adversarial learning algorithm that is resilient to attacks with no assumptions on the adversary’s knowledge of the system. We do, however, assume that the adversary cannot alter the negative instances, such as legitimate e-mail messages, normal user sessions, and benign software, in training data. This is a reasonable assumption in practice since in many applications such as intrusion detection, malware detection, and spam 2
It has been reported previously that PPM compressors appeared to be vulnerable to good word attacks [5]. The results were produced by using the PPMD implementation in the PSMSLib C++ library [16]. Since we do not have access to the source code, we cannot investigate the cause of the difference and make further conclusions.
Anti-Adversarial Learning
203
Accuracy/Precision/Recall Values
100% Accuracy Precision Recall
80%
60%
40%
20%
0%
PPMD−No−Attack
PPMD−500GW−Attack
PPMD−WorstCase−Attack
Fig. 1. The Accuracy/precision/recall values of the PPMD classifier with no attacks, 500-goodword attacks, and attacks where original training data is used
filtering, the adversary’s attempts are mostly focused on altering positive instances to make them less distinguishable among ordinary data in that domain. Now that we know the adversary can alter the “bad” data to make it appear to be good, our goal is to find a way to separate the target from its innocent looking. Suppose we have two compression models M+ and M− . Intuitively, a positive instance would compress better with the positive model M+ than it would with M− , the negative model. When a positive instance is altered with features that would ordinarily appear in negative data, we expect the subsequences in the altered data that are truly positive to retain relatively higher compression rates when compressed against the positive model. We apply a sliding window approach to scan through each instance and extract subsequences in the sliding window that require a smaller number of bits when compressed against M+ than M− . Ideally, more subsequences would be identified in a positive instance than in a negative instance. In practice, there are surely exceptional cases where the number of subsequences in a negative instance would exceed its normal average. For this reason, we decide to compute the difference between the total number of bits required to compress the extracted subsequences S using M− and M+ , respectively. If an instance is truly positive, we expect BitsM− (S) BitsM+ (S), where BitsM− (S) is the number of bits needed to compress S using the negative compression model, and BitsM+ (S) is the bits needed using the positive model. Now for a positive instance, not only we expect a longer list of subsequences extracted, but also a greater discrepancy between the bits after they are compressed using the two different models. For the adversary, any attempt to attack this counter-attack strategy will always boil down to finding a set of “misleading” features and seamlessly blend them into the target (positive instance). To break down the first step of our counter-attack strategy, that is, extracting subsequences that compress better against the positive compression model, the adversary would need to select a set of good words {wi |BitsM+ (wi ) < BitsM− (wi )} so that the good words can pass, together with the “bad” ones, our first round screening. To attack the second step, the adversary must find good words that compress better against the negative compression model, that is, {wi |BitsM+ (wi ) > BitsM− (wi )}, to offset the impact of the “bad” words in the extracted subsequences. These two
204
Y. Zhou, M. Inge, and M. Kantarcioglu
goals inevitably contradict each other, thus making strategically attacking the system much more difficult. We now formally present our algorithm. Given a set of training data T , where T = T+ ∪ T− , we first build two compression models M+ and M− from T+ and T− , respectively. For each training instance t in T , we scan t using a sliding window W of size n, and extract the subsequence si in the current sliding window W if BitsM+ (si ) < BitsM− (Si ). This completes the first step of our algorithm— subsequence extraction. Next, for each instance t in the training set, we compute dt = si (BitsM− (si ) − BitsM+ (si )), where si is a subsequence in t that has been extracted in the first step. We then compute the classification threshold by maximizing the information gain: r=
argmax InfoGain(T ). r∈{d1 ,...,d|T | }
For a more accurate estimate, r should be computed using k-fold cross validation. To classify an unlabeled instance u, we first extract the set of subsequences S from u in the same manner, then compute du = BitsM− (S) − BitsM+ (S). If du ≤ r, u is classified as a negative instance, otherwise, u is positive. Detailed description of the algorithm is given in Algorithm 2.
5
Experimental Results
We evaluate our counter-attack algorithm on e-mail data from the 2006 TREC Public Spam Corpus [15]. The data consists of 36,674 spam and legitimate email messages, sorted chronologically by receiving date and evenly divided into 11 subsets D1 , · · · , D11 . Experiments were run in an on-line fashion by training on the ith subset and testing on the subset i + 1. The percentage of spam messages in each subset varies from approximately 60% to a little bit over 70%. The good word list consists of the top 1,000 unique words from the entire corpus ranked according to the frequency ratio. In order to allow a true apples-toapples comparison among compression-based algorithms and standard learning algorithms, we preprocessed the entire corpus by removing HTML and nontextual parts. We also applied stemming and stop-list to all terms. The to, from, cc, subject, and received headers were retained, while the rest of the headers were removed. Messages that had an empty body after preprocessing were discarded. In all of our experiments, we used 6th -order context and a sliding window of 5 where applicable. We first test our counter-attack algorithm under the circumstances where there are no attacks and there are only good word attacks. In the case of good words attacks, the adversary has a general knowledge about word distributions in the entire corpus, but lacks a full knowledge about the training data. As discussed in Section 3, the PPMD-based classifier demonstrated superior performance in these two cases. We need to ensure that our anti-adversarial classifier would perform the same. Figure 2 shows the comparison between the PPMD-based classifier and our anti-adversarial classifier when there are no attacks and 500goodword attacks on 50% of spam in each test set.
Anti-Adversarial Learning
205
Algorithm 2. Anti-AD Learn Input: T = T+ ∪ T− —a set of training data Output: c(u)—the class label of the unlabeled instance u // Training Build M+ from T+ and M− from T− ; // K-fold Cross Validation for finding threshold r Partition T into {T1 , . . . , TK }; i = 0; while i < K do i i Build M+ and M− from {T1 , · · · , Ti−1 , Ti+1 , · · · , Tk }; S = ∅; /* sequence vector */ D = ∅; /* bit difference vector */ foreach t ∈ Ti do initialize W to cover the first n-symbols sj in t; while W = ∅ do if BITSM i (sj ) < BITSM i (sj ) then + − S = S ∪ {sj }; Shift W n symbols down the input; else Shift W 1 symbol down the input; end end // bit difference of instance t D = D ∪ { sj ∈S (BIT SM i (sj ) − BIT SM i (sj ))}; −
+
end i++; end r = argmaxInfoGain(T ); r∈D
/* threshold r */
// Classifying Extract S from u such that BITSM+ (S) < BITSM− (S); if BITSM− (S) − BITSM+ (S) ≤ r then return Negative; else return Positive; end
As can be observed, the performance are comparable between both algorithms, with and without good word attacks. For each algorithm, both precision and recall values remain nearly unchanged before and after attacks. Note that for PPMD-based classifiers, we tried both incremental compression and bit approximation. The results shown were obtained using incremental compression. With bit approximation, we achieved significant run-time speedup, however, with 5% performance drop, from 94.2% to 89.2%, in terms of recall values. Due to its run-time efficiency, we used bit approximation in all experiments involving our anti-adversarial classifiers. We used 10-fold cross validation to determine the threshold value r by maximizing the information gain. In practice, overfitting may occur and several k (number of folds) values should be tried in cross
206
Y. Zhou, M. Inge, and M. Kantarcioglu Accuracy Precision Recall
Accuracy/Precision/Recall Values
100%
80%
60%
40%
20%
0%
PPMD−No−Attack
AADL−NO−Attack
PPMD−500GW−Attack
AADL−500GW−Attack
Fig. 2. The accuracy, precision and recall values of the PPMD classifier and the antiadversarial classifier (AADL) with no attacks and 500 good words attacks Accuracy Precision Recall
Accuracy/Precision/Recall Values
100%
80%
60%
40%
20%
0%
AADL−No−Attack
AADL−500GW−Attack
AADL−WorstCase(1)−Attack
AADL−WorstCase(5)−Attack
Fig. 3. The accuracy, precision and recall values of the anti-adversarial classifier (AADL) with no attacks, 500 good word attacks, 1 legit mail attacks, and 5 legit mail attacks
validation. We also experimented when the number of good words varied from 50 to 500, and the percentage of spam altered from 50% to 100% in each test set, the results remain similar. We conducted more aggressive attacks with exact copies of legitimate messages randomly chosen from the training set. We tried two attacks of increasing strength by attaching one legitimate message and five legitimate messages, respectively, to spam in the test set. In total 50% of the spam in the test set was altered in each attack. Figure 3 illustrates the results. As can be observed, our anti-adversarial classifier is very robust to any of the attacks, while the PPMD-based classifier, for which the results are shown in Figure 4, is obviously vulnerable to the more aggressive attacks. Furthermore, similar results were obtained when: 1.) the percentage of altered spam increased to 100%; 2.) legitimate messages used to alter spam were randomly selected from the test set, and 3.) injected legtimate messages were randomly selected from data sets that are neither training nor test sets. To make the matter more complicated, we also tested our algorithm when legitimate messages were randomly scattered into spam. This was done in two different fashions. In the first case, we first randomly picked a position in spam; then we took a random length (no greater than 10% of the total length) of a legitimate message and inserted it to the selected position in spam. The two steps
Anti-Adversarial Learning
207
Accuracy/Precision/Recall Values
100% Accuracy Precision Recall
80%
60%
40%
20%
0%
PPMD−No−Attack
PPMD−500GW−Attack
PPMD−WorstCase(1)−Attack
PPMD−WorstCase(5)−Attack
Fig. 4. The accuracy, precision and recall values of the PPMD-based classifier with no attacks, 500 good word attacks, 1 legit mail attacks, and 5 legit mail attacks Accuracy/Precision/Recall Values
Fig. 5. The accuracy, precision and recall values of the PPMD-based classifier and the anti-adversarial classifier (AADL) with 500-good-word attacks and 1-legit-mail attacks applied to 50% of the spam in both training and test data
were repeated until the entire legitimate message had been inserted. Empirical results show that random insertion does not appear to affect the classification performance. In the second case, we inserted terms from a legitimate message in a random order after every ℓ terms in the spam. The process is as follows: 1.) tokenize the legitimate message and randomly shuffle the tokens; 2.) insert a random number of tokens (less than 10% of the total number of tokens) after every ℓ terms in the spam. Repeat 1.) and 2.) until all tokens are inserted into the spam. We observed little performance change when ℓ ≥ 3. When ℓ ≤ 2, nearly all local context is completely lost in the altered spam, and the average recall values dropped to below 70%. Note that in the latter case (ℓ ≤ 2), the attack is most likely useless to the adversary in practice, since the scrambled instance would also fail to deliver the malicious attacks the adversary has set out to accomplish. Previous studies [14,17,5] show that retraining on data altered as a result of adversarial attacks may improve the performance of classifiers against the attacks. This observation is further verified in our experiments, in which we randomly selected 50% of the spam in the training set and appended good words and legitimate e-mail to it, separately. Figure 5 shows the results of the PPMD classifier and the anti-adversarial classifier with 500-good-word attacks and 1-legitimate-mail attacks in both training and test data. As can be observed, retraining improved, to an extent, the classification results in all cases.
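A minimal sketch of the second (scattering) attack described above, for concreteness; the tokenization, the 10% cap, and the parameter ℓ come from the description, while the function and variable names are illustrative assumptions.

import random

def scatter_attack(spam_tokens, legit_tokens, ell, rng=random):
    # Step 1: shuffle the tokens of the legitimate message.
    pool = list(legit_tokens)
    rng.shuffle(pool)
    limit = max(1, len(pool) // 10)   # at most 10% of the legitimate tokens per insertion
    out = []
    for i, tok in enumerate(spam_tokens, 1):
        out.append(tok)
        # Step 2: after every ell spam terms, insert a random number of shuffled tokens.
        if i % ell == 0 and pool:
            k = rng.randint(1, limit)
            out.extend(pool[:k])
            del pool[:k]
    out.extend(pool)                   # append any remaining tokens at the end
    return out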
6 Concluding Remarks
We demonstrate that compression-based classifiers are much more resilient to good word attacks compared to standard learning algorithms. On the other
hand, this type of classifier is vulnerable to attacks when the adversary has full knowledge of the training set. We propose a counter-attack technique that extracts and analyzes the subsequences that are most informative in a given instance. We demonstrate that the proposed technique is robust against all of the attacks we considered, even in the worst case where the adversary can alter positive instances with exact copies of negative instances taken directly from the training set. A fundamental theory needs to be developed to explain the strength of the compression-based algorithm and the anti-adversarial learning algorithm. It remains unclear, in theory, why the compression-based algorithms are remarkably resilient to strategically designed attacks that would normally defeat classifiers trained using standard learning algorithms. It is certainly of great interest to us to find out how well the proposed counter-attack strategy performs in other domains, and under what circumstances this seemingly bullet-proof algorithm would break down.
Acknowledgement. The authors would like to thank Zach Jorgensen for his valuable input. This work was partially supported by Air Force Office of Scientific Research MURI Grant FA9550-08-1-0265, National Institutes of Health Grant 1R01LM009989, National Science Foundation (NSF) Grant Career-0845803, and NSF Grants CNS-0964350 and CNS-1016343.
References 1. Barreno, M., Nelson, B.A., Joseph, A.D., Tygar, D.: The security of machine learning. Technical Report UCB/EECS-2008-43, EECS Department, University of California, Berkeley (April 2008) 2. Sculley, D., Brodley, C.E.: Compression and machine learning: A new perspective on feature space vectors. In: DCC 2006: Proceedings of the Data Compression Conference, pp. 332–332. IEEE Computer Society, Washington, DC (2006) 3. Bratko, A., Filipiˇc, B., Cormack, G.V., Lynam, T.R., Zupan, B.: Spam filtering using statistical data compression models. J. Mach. Learn. Res. 7, 2673–2698 (2006) 4. Zhou, Y., Inge, W.: Malware detection using adaptive data compression. In: AISec 2008: Proceedings of the 1st ACM Workshop on Artificial Intelligence and Security, Alexandria, Virginia, USA, pp. 53–60 (2008) 5. Jorgensen, Z., Zhou, Y., Inge, M.: A multiple instance learning strategy for combating good word attacks on spam filters. Journal of Machine Learning Research 9, 1115–1146 (2008) 6. Witten, I., Neal, R., Cleary, J.: Arithmetic coding for data compression. Communications of the ACM, 520–540 (June 1987) 7. Cleary, J., Witten, I.: Data compression using adaptive coding and partial string matching. IEEE Transactions on Communications COM-32(4), 396–402 (1984) 8. Cormack, G., Horspool, R.: Data compression using dynamic markov modeling. The Computer Journal 30(6), 541–550 (1987) 9. Cleary, J., Witten, I.: Unbounded length contexts of ppm. The computer Journal 40(2/3), 67–75 (1997)
10. Moffat, A., Turpin, A.: Compression and Coding Algorithms. Kluwer Academic Publishers, Boston (2002) 11. Moffat, A.: Implementing the ppm data compression scheme. IEEE Trans. Comm. 38, 1917–1921 (1990) 12. Howard, P.: The design and analysis of efficient lossless data compression systems. Technical report, Brown University (1993) 13. Teahan, W.J.: Text classification and segmentation using minimum cross-entropy. In: RIAO 2000, 6th International Conference Recherche d’Informaiton Assistee par ordinateur (2000) 14. Lowd, D., Meek, C.: Good word attacks on statistical spam filters. In: Proceedings of the 2nd Conference on Email and Anti-Spam (2005) 15. Cormack, G.V., Lynam, T.R.: Spam track guidelines – TREC 2005-2007 (2006), http://plg.uwaterloo.ca/~ gvcormac/treccorpus06/ 16. Bratko, A.: Probabilistic sequence modeling shared library (2008), http://ai.ijs.si/andrej/psmslib.html 17. Webb, S., Chitti, S., Pu, C.: An experimental evaluation of spam filter performance and robustness against attack. In: The 1st International Conference on Collaborative Computing: Networking, Applications and Worksharing, pp. 19–21 (2005)
Mining Sequential Patterns from Probabilistic Databases Muhammad Muzammal and Rajeev Raman Department of Computer Science, University of Leicester, UK {mm386,r.raman}@mcs.le.ac.uk
Abstract. We consider sequential pattern mining in situations where there is uncertainty about which source an event is associated with. We model this in the probabilistic database framework and consider the problem of enumerating all sequences whose expected support is sufficiently large. Unlike frequent itemset mining in probabilistic databases [C. Aggarwal et al. KDD'09; Chui et al., PAKDD'07; Chui and Kao, PAKDD'08], we use dynamic programming (DP) to compute the probability that a source supports a sequence, and show that this suffices to compute the expected support of a sequential pattern. Next, we embed this DP algorithm into candidate generate-and-test approaches, and explore the pattern lattice both in a breadth-first (similar to GSP) and a depth-first (similar to SPAM) manner. We propose optimizations for efficiently computing the frequent 1-sequences, for re-using previously-computed results through incremental support computation, and for eliminating candidate sequences without computing their support via probabilistic pruning. Preliminary experiments show that our optimizations are effective in reducing the CPU cost. Keywords: Mining Uncertain Data, Mining complex sequential data, Probabilistic Databases, Novel models and algorithms.
1 Introduction
The problem of sequential pattern mining (SPM), or finding frequent sequences of events in data with a temporal component, has been studied extensively [23,17,4] since its introduction in [18,3]. In classical SPM, the data to be mined is deterministic, but it is recognized that data obtained from a wide range of data sources is inherently uncertain [1]. This paper is concerned with SPM in probabilistic databases [19], a popular framework for modelling uncertainty. Recently several data mining and ranking problems have been studied in this framework, including top-k [24,8] and frequent itemset mining (FIM) [2,5,6,7]. In classical SPM, the event database consists of tuples ⟨eid, e, σ⟩, where e is an event, σ is a source and eid is an event-id which incorporates a time-stamp. A tuple may record a retail transaction (event) by a customer (source), or an observation of an object/person (event) by a sensor/camera (source). Since event-ids have a timestamp, the event database can be viewed as a collection of source sequences, one
per source, containing a sequence of events (ordered by time-stamp) associated with that source, and classical SPM problem is to find patterns of events that have a temporal order that occur in a significant number of source sequences. Uncertainty in SPM can occur in three different places: the source, the event and the time-stamp may all be uncertain (in contrast, in FIM, only the event can be uncertain). In a companion paper [16] the first two kinds of uncertainty in SPM were formalized as source-level uncertainty (SLU) and event-level uncertainty (ELU), which we now summarize. In SLU, the “source” attribute of each tuple is uncertain: each tuple contains a probability distribution over possible sources (attribute-level uncertainty [19]). As noted in [16], this formulation applies to scenarios such as the ambiguity arising when a customer makes a retail transaction, but the customer is either not identified exactly, or the customer database itself is probabilistic as a result of “deduplication” or cleaning [11]. In ELU, the source of the tuple is certain, but the events are uncertain. For example, the PEEX system [13] aggregates unreliable observations of employees using RFID antennae at fixed locations into uncertain higher-level events such as “with probability 0.4, at time 103, Alice and Bob had a meeting in room 435”. Here, the source (Room 435) is deterministic, but the event ({Alice, Bob}) only occurred with probability 0.4. Furthermore, in [16] two measures of “frequentness”, namely expected support and probabilistic frequentness, used for FIM in probabilistic databases [5,7], were adapted to SPM, and the four possible combinations of models and measures were studied from a computational complexity viewpoint. This paper is focussed on efficient algorithms for the SPM problem in SLU probabilistic databases, under the expected support measure, and the contributions are as follows: 1. We give a dynamic-programming (DP) algorithm to determine efficiently the probability that a given source supports a sequence (source support probability), and show that this is enough to compute the expected support of a sequence in an SLU event database. 2. We give depth-first and breadth-first methods to find all frequent sequences in an SLU event database according to the expected support criterion. 3. To speed up the computation, we give subroutines for: (a) highly efficient computation of frequent 1-sequences, (b) incremental computation of the DP matrix, which allows us to minimize the amount of time spent on the expensive DP computation, and (c) probabilistic pruning, where we show how to rapidly compute an upper bound on the probability that a source supports a candidate sequence. 4. We empirically evaluate our algorithms, demonstrating their efficiency and scalability, as well as the effectiveness of the above optimizations. Significance of Results. The source support probability algorithm ((1) above) shows that in probabilistic databases, FIM and SPM are very different – there is no need to use DP for FIM under the expected support measure [2,6,7]. Although the proof that source support probability allows the computation of the expected support of a sequence in an SLU database is simple, it is
unexpected, since in SLU databases, there are dependencies between different sources – in any possible world, a given event can only belong to one source. In contrast, determining if a given sequence is probabilistically frequent in an SLU event database is #P-complete because of the dependencies between sources [16]. Also, as noted in [16], (1) can be used to determine if a sequence is frequent in an ELU database using both expected support and probabilistic frequentness. This implies efficient algorithms for enumerating frequent sequences under both frequentness criteria for ELU databases, and by using the framework of [10], we can also find maximal frequent sequential patterns in ELU databases. The breadth-first and depth-first algorithms (2) have a high-level similarity to GSP [18] and SPADE/SPAM [23,4], but checking the extent to which a sequence is supported by a source requires an expensive DP computation, and major modifications are needed to achieve good performance. It is unclear how to use either the projected database idea of PrefixSpan [17], or bitmaps as in SPAM; we instead use the ideas ((3) above) of incremental computation and probabilistic pruning. Although there is a high-level similarity between this pruning and a technique of [6] for FIM in probabilistic databases, the SPM problem is more complex, and our pruning rule is harder to obtain. Related Work. Classical SPM has been studied extensively [18,23,17,4]. Modelling uncertain data as probabilistic databases [19,1] has led to several ranking/mining problems being studied in this context. The top-k problem (a ranking problem) has been studied intensively (see [12,24,8] and references therein). FIM in probabilistic databases was studied under the expected support measure in [2,7,6] and under the probabilistic frequentness measure in [5]. To the best of our knowledge, apart from [16], the SPM problem in probabilistic databases has not been studied. Uncertainty in the time-stamp attribute was considered in [20] – we do not consider time to be uncertain. Also [22] studies SPM in "noisy" sequences, but the model proposed there is very different from ours and does not fit in the probabilistic database framework.
2 Problem Statement
Classical SPM [18,3]. Let I = {i_1, i_2, . . . , i_q} be a set of items and S = {1, . . . , m} be a set of sources. An event e ⊆ I is a collection of items. A database D = ⟨r_1, r_2, . . . , r_n⟩ is an ordered list of records such that each r_i ∈ D is of the form (eid_i, e_i, σ_i), where eid_i is a unique event-id, including a time-stamp (events are ordered by this time-stamp), e_i is an event and σ_i is a source. A sequence s = ⟨s_1, s_2, . . . , s_a⟩ is an ordered list of events. The events s_i in the sequence are called its elements. The length of a sequence s is the total number of items in it, i.e. Σ_{j=1}^{a} |s_j|; for any integer k, a k-sequence is a sequence of length k. Let s = ⟨s_1, s_2, . . . , s_q⟩ and t = ⟨t_1, t_2, . . . , t_r⟩ be two sequences. We say that s is a subsequence of t, denoted s ⪯ t, if there exist integers 1 ≤ i_1 < i_2 < · · · < i_q ≤ r such that s_k ⊆ t_{i_k}, for k = 1, . . . , q. The source sequence corresponding to a source i is just the multiset {e | (eid, e, i) ∈ D}, ordered by eid. For a sequence s and source i, let X_i(s, D) be an indicator variable, whose value is 1 if s is
a subsequence of the source sequence for source i, and 0 otherwise. For any sequence s, define its support in D, denoted Sup(s, D) = Σ_{i=1}^{m} X_i(s, D). The objective is to find all sequences s such that Sup(s, D) ≥ θm for some user-defined threshold 0 ≤ θ ≤ 1. Probabilistic Databases. We define an SLU probabilistic database D^p to be an ordered list r_1, . . . , r_n of records of the form (eid, e, W), where eid is an event-id, e is an event and W is a probability distribution over S; the list is ordered by eid. The distribution W contains pairs of the form (σ, c), where σ ∈ S and 0 < c ≤ 1 is the confidence that the event e is associated with source σ, and Σ_{(σ,c)∈W} c = 1. An example can be found in Table 1(L).

Table 1. A source-level uncertain database (L) transformed to p-sequences (R). Note that events like e1 (marked with † in (R)) can only be associated with one of the sources X and Y in any possible world.

eid   event    W
e1    (a, d)   (X : 0.6)(Y : 0.4)
e2    (a)      (Z : 1.0)
e3    (a, b)   (X : 0.3)(Y : 0.2)(Z : 0.5)
e4    (b, c)   (X : 0.7)(Z : 0.3)

p-sequence
D^p_X   (a, d : 0.6)†  (a, b : 0.3)  (b, c : 0.7)
D^p_Y   (a, d : 0.4)†  (a, b : 0.2)
D^p_Z   (a : 1.0)  (a, b : 0.5)  (b, c : 0.3)
The possible worlds semantics of D^p is as follows. A possible world D* of D^p is generated by taking each event e_i in turn, and assigning it to one of the possible sources σ_i ∈ W_i. Thus every record r_i = (eid_i, e_i, W_i) ∈ D^p takes the form r_i = (eid_i, e_i, σ_i), for some σ_i ∈ S, in D*. By enumerating all such possible combinations, we get the complete set of possible worlds. We assume that the distributions associated with each record r_i in D^p are stochastically independent; the probability of a possible world D* is therefore Pr[D*] = Π_{i=1}^{n} Pr_{W_i}[σ_i]. For example, a possible world D* for the database of Table 1 can be generated by assigning events e1, e3 and e4 to X with probabilities 0.6, 0.3 and 0.7 respectively, and e2 to Z with probability 1.0, and Pr[D*] = 0.6 × 1.0 × 0.3 × 0.7 = 0.126. As every possible world is a (deterministic) database, concepts like the support of a sequence in a possible world are well-defined. The definition of the expected support of a sequence s in D^p follows naturally:

ES(s, D^p) = Σ_{D* ∈ PW(D^p)} Pr[D*] · Sup(s, D*),    (1)
The problem we consider is: Given an SLU probabilistic database D^p, determine all sequences s such that ES(s, D^p) ≥ θm, for some user-specified threshold θ, 0 ≤ θ ≤ 1. Since there are potentially an exponential number of possible worlds, it is infeasible to compute ES(s, D^p) directly using Eq. 1; next we show how to do this computation more efficiently using linearity of expectation and DP.
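For concreteness, a brute-force evaluation of Eq. 1 might look like the sketch below (illustrative only, not part of the paper): it enumerates every possible world explicitly and is therefore exponential in the number of uncertain events, which is exactly what the DP of the next section avoids. Records are assumed to be (event, distribution) pairs with events as Python frozensets.

from itertools import product

def _supports(seq_events, s):
    # classical subsequence test: each element of s must be contained in a later event
    idx = 0
    for e in seq_events:
        if idx < len(s) and s[idx] <= e:
            idx += 1
    return idx == len(s)

def expected_support_bruteforce(records, s):
    # records: [(event_frozenset, {source: prob, ...}), ...]; direct evaluation of Eq. 1
    choices = [list(dist.items()) for _, dist in records]
    total = 0.0
    for assignment in product(*choices):
        prob, world = 1.0, {}
        for (event, _), (source, p) in zip(records, assignment):
            prob *= p
            world.setdefault(source, []).append(event)
        support = sum(1 for events in world.values() if _supports(events, s))
        total += prob * support
    return total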
3 Computing Expected Support
p-sequences. A p-sequence is analogous to a source sequence in classical SPM, and is a sequence of the form (e_1, c_1) . . . (e_k, c_k), where e_j is an event and c_j is a confidence value. In examples, we write a p-sequence ({a, d}, 0.4), ({a, b}, 0.2) as (a, d : 0.4)(a, b : 0.2). An SLU database D^p can be viewed as a collection of p-sequences D^p_1, . . . , D^p_m, where D^p_i is the p-sequence of source i, and contains a list of those events in D^p that have non-zero confidence of being assigned to source i, ordered by eid, together with the associated confidence (see Table 1(R)). However, the p-sequences corresponding to different sources are not independent, as illustrated in Table 1(R). Thus, one may view an SLU event database as a collection of p-sequences with dependencies in the form of x-tuples [8]. Nevertheless, we show that we can still process the p-sequences independently for the purposes of expected support computation:

ES(s, D^p) = Σ_{D* ∈ PW(D^p)} Pr[D*] · Sup(s, D*) = Σ_{D*} Pr[D*] · Σ_{i=1}^{m} X_i(s, D*)
           = Σ_{i=1}^{m} Σ_{D*} Pr[D*] · X_i(s, D*) = Σ_{i=1}^{m} E[X_i(s, D^p)],    (2)

where E denotes the expected value of a random variable. Since X_i is a 0-1 variable, E[X_i(s, D^p)] = Pr[s ⪯ D^p_i], and we calculate the right-hand quantity, which we refer to as the source support probability. This cannot be done naively: e.g., if D^p_i = (a, b : c_1)(a, b : c_2) . . . (a, b : c_q), then there are O(q^{2k}) ways in which (a)(a, b) . . . (a)(a, b) (the pair of elements repeated k times) could be supported by source i, and so we use DP.
Computing the Source Support Probability. Given a p-sequence D^p_i = (e_1, c_1), . . . , (e_r, c_r) and a sequence s = ⟨s_1, . . . , s_q⟩, we create a (q + 1) × (r + 1) matrix A_{i,s}[0..q][0..r] (we omit the subscripts on A when the source and sequence are clear from the context). For 1 ≤ k ≤ q and 1 ≤ ℓ ≤ r, A[k, ℓ] will contain Pr[⟨s_1, . . . , s_k⟩ ⪯ (e_1, c_1), . . . , (e_ℓ, c_ℓ)], so A[q, r] is the desired value of Pr[s ⪯ D^p_i]. We set A[0, ℓ] = 1 for all ℓ, 0 ≤ ℓ ≤ r, and A[k, 0] = 0 for all 1 ≤ k ≤ q, and compute the other values row-by-row. For 1 ≤ k ≤ q and 1 ≤ ℓ ≤ r, define:

c*_{kℓ} = c_ℓ if s_k ⊆ e_ℓ, and c*_{kℓ} = 0 otherwise.    (3)

The interpretation of Eq. 3 is that c*_{kℓ} is the probability that e_ℓ allows the element s_k to be matched in source i; this is 0 if s_k ⊄ e_ℓ, and is otherwise equal to the probability that e_ℓ is associated with source i. Now we use the equation:

A[k, ℓ] = (1 − c*_{kℓ}) · A[k, ℓ − 1] + c*_{kℓ} · A[k − 1, ℓ − 1].    (4)
Table 2 shows the computation of the source support probability of an example sequence s = (a)(b) for source X in the probabilistic database of Table 1. Similarly, we can compute Pr[s ⪯ D^p_Y] = 0.08 and Pr[s ⪯ D^p_Z] = 0.65, so the expected support of (a)(b) in the database of Table 1 is 0.558 + 0.08 + 0.65 = 1.288, the same value obtained by direct application of Eq. 1.
Table 2. Computing Pr[s ⪯ D^p_X] for s = (a)(b) using DP in the database of Table 1

          (a, d : 0.6)               (a, b : 0.3)                  (b, c : 0.7)
(a)       0.4 × 0 + 0.6 × 1 = 0.6    0.7 × 0.6 + 0.3 × 1 = 0.72    0.72
(a)(b)    0                          0.7 × 0 + 0.3 × 0.6 = 0.18    0.3 × 0.18 + 0.7 × 0.72 = 0.558
The reason Eq. 4 is correct is that if s_k ⊄ e_ℓ then the probability that ⟨s_1, . . . , s_k⟩ ⪯ (e_1, . . . , e_ℓ) is the same as the probability that ⟨s_1, . . . , s_k⟩ ⪯ (e_1, . . . , e_{ℓ−1}) (note that if s_k ⊄ e_ℓ then c*_{kℓ} = 0 and A[k, ℓ] = A[k, ℓ − 1]). Otherwise, c*_{kℓ} = c_ℓ, and we have to consider two disjoint sets of possible worlds: those where e_ℓ is not associated with source i (the first term in Eq. 4) and those where it is (the second term in Eq. 4). In summary: Lemma 1. Given a p-sequence D^p_i and a sequence s, by applying Eq. 4 repeatedly, we correctly compute Pr[s ⪯ D^p_i].
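A minimal sketch of the DP of Eq. 3–4 and of the expected-support computation of Eq. 2 (illustrative only, not the authors' implementation); a p-sequence is assumed to be a list of (event, confidence) pairs with events as Python frozensets, and the variable names are assumptions.

def source_support_probability(p_seq, s):
    # p_seq: [(event_set, confidence), ...] for one source; s: list of element sets.
    q, r = len(s), len(p_seq)
    # A[k][l] = Pr[<s_1..s_k> is supported by the first l uncertain events]
    A = [[1.0] * (r + 1)] + [[0.0] * (r + 1) for _ in range(q)]
    for k in range(1, q + 1):
        for l in range(1, r + 1):
            e, c = p_seq[l - 1]
            c_star = c if s[k - 1] <= e else 0.0          # Eq. 3
            A[k][l] = (1 - c_star) * A[k][l - 1] + c_star * A[k - 1][l - 1]   # Eq. 4
    return A[q][r]

def expected_support(p_db, s):
    # Eq. 2: sum the source support probabilities over all sources.
    return sum(source_support_probability(p_seq, s) for p_seq in p_db.values())

For the database of Table 1 and s = [frozenset("a"), frozenset("b")], the DP returns 0.558 for source X, matching Table 2.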
4 Optimizations
We now describe three optimized sub-routines: for computing all frequent 1-sequences, for incremental support computation, and for probabilistic pruning. Fast L1 Computation. Given a 1-sequence s = {x}, a simple closed-form expression for Pr[s ⪯ D^p_i] is 1 − Π_{ℓ=1}^{r} (1 − c*_{1ℓ}). It is easy to verify by induction that Eq. 4 gives the same answer, since (1 − Π_{ℓ=1}^{t−1} (1 − c*_{1ℓ}))(1 − c*_{1t}) + c*_{1t} = 1 − Π_{ℓ=1}^{t} (1 − c*_{1ℓ}) – recall that A[0, ℓ − 1] = 1 for all ℓ ≥ 1. This allows us to compute ES(s, D^p) for all 1-sequences s in just one (linear-time) pass through D^p. Initialize two arrays F and G, each of size q = |I|, to zero and consider each source i in turn. If D^p_i = (e_1, c_1), . . . , (e_r, c_r), for k = 1, . . . , r take the pair (e_k, c_k) and iterate through each x ∈ e_k, setting F[x] := 1 − ((1 − F[x]) · (1 − c_k)). Once we are finished with source i, if F[x] is non-zero, we update G[x] := G[x] + F[x] and reset F[x] to zero (we use a linked list to keep track of which entries of F are non-zero for a given source). At the end, for any item x ∈ I, G[x] = ES(x, D^p). Incremental Support Computation. Let s and t be two sequences of length j and j + 1 respectively. Say that t is an S-extension of s if t = s · {x} for some item x, where · denotes concatenation (i.e. we obtain t by appending a single item as a new element to s). We say that t is an I-extension of s if s = ⟨s_1, . . . , s_q⟩ and t = ⟨s_1, . . . , s_q ∪ {x}⟩ for some x ∉ s_q, and x is lexicographically not less than any item in s_q (i.e. we obtain t by adding a new item to the last element of s). For example, if s = (a)(b, c) and x = d, the S- and I-extensions of s are (a)(b, c)(d) and (a)(b, c, d) respectively. Similar to classical SPM, we generate candidate sequences t that are either S- or I-extensions of existing frequent sequences s, and compute ES(t, D^p) by computing Pr[t ⪯ D^p_i] for all sources i. While computing
Pr[t ⪯ D^p_i] for source i, we would like to exploit the similarity between s and t to compute Pr[t ⪯ D^p_i] more rapidly. Let i be a source, D^p_i = (e_1, c_1), . . . , (e_r, c_r), and s = ⟨s_1, . . . , s_q⟩ be any sequence. Now let A_{i,s} be the (q + 1) × (r + 1) DP matrix used to compute Pr[s ⪯ D^p_i], and let B_{i,s} denote the last row of A_{i,s}, that is, B_{i,s}[ℓ] = A_{i,s}[q, ℓ] for ℓ = 0, . . . , r. We now show that if t is an extension of s, then we can quickly compute B_{i,t} from B_{i,s}, and thereby obtain Pr[t ⪯ D^p_i] = B_{i,t}[r]: Lemma 2. Let s and t be sequences such that t is an extension of s, and let i be a source whose p-sequence has r elements in it. Then, given B_{i,s} and D^p_i, we can compute B_{i,t} in O(r) time. Proof. We only discuss the case where t is an I-extension of s, i.e. t = ⟨s_1, . . . , s_q ∪ {x}⟩ for some x ∉ s_q. Firstly, observe that since the first q − 1 elements of s and t are pairwise equal, the first q − 1 rows of A_{i,s} and A_{i,t} are also equal. The (q − 1)-st row of A_{i,s} is enough to compute the q-th row of A_{i,t}, but we only have B_{i,s}, the q-th row of A_{i,s}. If t_q = s_q ∪ {x} ⊄ e_ℓ, then A_{i,t}[q, ℓ] = A_{i,t}[q, ℓ − 1], and we can move on to the next value of ℓ. If t_q ⊆ e_ℓ, then s_q ⊆ e_ℓ and so: A_{i,s}[q, ℓ] = (1 − c_ℓ) · A_{i,s}[q, ℓ − 1] + c_ℓ · A_{i,s}[q − 1, ℓ − 1]. Since we know B_{i,s}[ℓ] = A_{i,s}[q, ℓ], B_{i,s}[ℓ − 1] = A_{i,s}[q, ℓ − 1] and c_ℓ, we can compute A_{i,s}[q − 1, ℓ − 1]. But this value is equal to A_{i,t}[q − 1, ℓ − 1], which is the value from the (q − 1)-st row of A_{i,t} that we need to compute A_{i,t}[q, ℓ]. Specifically, we compute: B_{i,t}[ℓ] = (1 − c_ℓ) · B_{i,t}[ℓ − 1] + (B_{i,s}[ℓ] − B_{i,s}[ℓ − 1] · (1 − c_ℓ)) if t_q ⊆ e_ℓ (otherwise B_{i,t}[ℓ] = B_{i,t}[ℓ − 1]). The (easier) case of S-extensions and an example illustrating incremental computation can be found in [15]. Probabilistic Pruning. We now describe a technique that allows us to prune non-frequent sequences s without fully computing ES(s, D^p). For each source i, we obtain an upper bound on Pr[s ⪯ D^p_i] and add up all the upper bounds; if the sum is below the threshold, s can be pruned. We first show (proof in [15]): Lemma 3. Let s = ⟨s_1, . . . , s_q⟩ be a sequence, and let D^p_i be a p-sequence. Then: Pr[s ⪯ D^p_i] ≤ Pr[⟨s_1, . . . , s_{q−1}⟩ ⪯ D^p_i] · Pr[⟨s_q⟩ ⪯ D^p_i]. We now indicate how Lemma 3 is used. Suppose, for example, that we have a candidate sequence s = (a)(b, c)(a), and a source X. By Lemma 3:

Pr[(a)(b, c)(a) ⪯ D^p_X] ≤ Pr[(a)(b, c) ⪯ D^p_X] · Pr[(a) ⪯ D^p_X]
                         ≤ Pr[(a) ⪯ D^p_X] · Pr[(b, c) ⪯ D^p_X] · Pr[(a) ⪯ D^p_X]
                         ≤ (Pr[(a) ⪯ D^p_X])^2 · min{Pr[(b) ⪯ D^p_X], Pr[(c) ⪯ D^p_X]}
Note that the quantities on the RHS are computed for each source by the fast L1 computation, and can be stored in a small data structure. However, the last line is the least accurate upper bound: if Pr[(a)(b, c) ⪯ D^p_X] is available when pruning, a tighter bound is Pr[(a)(b, c) ⪯ D^p_X] · Pr[(a) ⪯ D^p_X].
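Two illustrative helpers for the optimizations above, under the same assumed data layout as the earlier DP sketch (these are sketches, not the authors' code): the first applies the I-extension update of Lemma 2 to a saved last DP row B_s; the second computes a Lemma 3 style upper bound from pre-computed 1-sequence probabilities for one source.

def incremental_i_extension(p_seq, B_s, t_last):
    # Lemma 2, I-extension case: compute the last DP row for t from that of s in O(r).
    r = len(p_seq)
    B_t = [0.0] * (r + 1)
    for l in range(1, r + 1):
        e, c = p_seq[l - 1]
        if t_last <= e:
            # recover c * A[q-1, l-1] from B_s and apply Eq. 4 for the new last row
            B_t[l] = (1 - c) * B_t[l - 1] + (B_s[l] - B_s[l - 1] * (1 - c))
        else:
            B_t[l] = B_t[l - 1]
    return B_t          # Pr[t supported by this source] = B_t[r]

def pruning_upper_bound(item_prob, s):
    # item_prob[x] = Pr[(x) supported by this source], from the fast L1 pass.
    # Applying Lemma 3 repeatedly bounds Pr[s supported] by a product of
    # per-element bounds, each taken as the minimum over the items in the element.
    bound = 1.0
    for element in s:
        bound *= min(item_prob[x] for x in element)
    return bound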
5 Candidate Generation
We now describe two candidate generation methods for enumerating all frequent sequences, one each based on breadth-first and depth-first exploration of the sequence lattice, which are similar to GSP [18,3] and SPAM [4] respectively. We first note that an “Apriori” property holds in our setting: Lemma 4. Given two sequences s and t, and a probabilistic database Dp , if s is a subsequence of t, then ERS(s, D p ) ≥ ERS(t, Dp ). Proof. In Eq. 1 note that for all D ∗ ∈ P W (Dp ), Sup(s, D∗ ) ≥ Sup(t, D∗ ). Breadth-First Exploration. An overview of our BFS approach is in Fig. 1(L). We now describe some details. Each execution of lines (6)-(10) is called a phase. Line 2 is done using the fast L1 computation (see Section 4). Line 4 is done as in [18,3]: two sequences s and s in Lj are joined iff deleting the first item in s and the last item in s results in the same sequence, and the result t comprises s extended with the last item in s . This item is added the way it was in s i.e. either a separate element (t is an S-extension of s) or to the last element of s (t is an I-extension of s). We apply apriori pruning to the set of candidates in the (j + 1)-st phase, Cj+1 , and probabilistic pruning can additionally be applied to Cj+1 (note that for C2 , probabilistic pruning is the only possibility). In Lines 6-7, the loop iterates over all sources, and for the i-th source, first consider only those sequences from Cj+1 that could potentially be supported by source i, Ni,j+1 , (narrowing). For the purpose of narrowing, we put all the sequences in Cj+1 in a hashtree, similar to [18]. A candidate sequence t ∈ Cj+1 is stored in the hashtree by hashing on each item in t upto the j-th item, and the leaf node contains the (j + 1)-st item. In the (j + 1)-st phase, when considering source i, we recursively traverse the hashtree by hashing on every item in Li,1 until we have traversed all the leaf nodes, thus obtaining Ni,j+1 for source i. Given Ni,j+1 we compute the support of t in source i as follows. Consider s = s1 , . . . , sq and t = t1 , . . . , tr be two sequences, then if s and t have a common prefix, i.e. for k = 1, 2, . . . , z, sk = tk , then we start the computation of Pr[t Dip ] from tz+1 . Observe that our narrowing method naturally tends to place sequences with common prefixes in consecutive positions of Ni,j+1 . Depth-First Exploration. An overview of our depth-first approach is in Fig. 1 (R) [23,4]. We first compute the set of frequent 1-sequences, L1 (Line 1) (assume L1 is in ascending order). We then explore the pattern sub-lattice as follows. Consider a call of TraverseDFS(s), where s is some k-sequence. We first check that all lexicographically smaller k-subsequences of t are frequent, and reject t as infrequent if this test fails (Line 7). We can then apply probabilistic pruning to t, and if t is still not pruned we compute its support (Line 8). If at any stage t is found to be infrequent, we do not consider x, the item used to extend s to t, as a possible alternative in the recursive tree under s (as in [4]). Observe that for sequences s and t, where t is an S- or I- extension of s, if Pr[s Dip ] = 0, then Pr[t Dip ] = 0. When computing ES(s, Dp ), we keep track of all the sources
BFS algorithm:
1: j ← 1
2: L_1 ← ComputeFrequent-1(D^p)
3: while L_j ≠ ∅ do
4:   C_{j+1} ← Join L_j with itself
5:   Prune C_{j+1}
6:   for all s ∈ C_{j+1} do
7:     Compute ES(s, D^p)
8:   L_{j+1} ← all sequences s ∈ C_{j+1} s.t. ES(s, D^p) ≥ θm
9:   j ← j + 1
10: Stop and output L_1 ∪ . . . ∪ L_j

DFS algorithm:
1: L_1 ← ComputeFrequent-1(D^p)
2: for all sequences x ∈ L_1 do
3:   Call TraverseDFS(x)
4: Output all frequent sequences
5: function TraverseDFS(s)
6:   for all x ∈ L_1 do
7:     t ← s · {x}   {S-extension}
8:     Compute ES(t, D^p)
9:     if ES(t, D^p) ≥ θm then
10:      TraverseDFS(t)
11:     t ← ⟨s_1, . . . , s_q ∪ {x}⟩   {I-extension}
12:     Compute ES(t, D^p)
13:     if ES(t, D^p) ≥ θm then
14:      TraverseDFS(t)
15: end function

Fig. 1. BFS (L) and DFS (R) Algorithms. D^p is the input database and θ the threshold.
where Pr[s ⪯ D^p_i] > 0, denoted by S^s. If s is frequent, then when computing ES(t, D^p) we need only visit the sources in S^s. Furthermore, for every source i ∈ S^s, we assume that the array B_{i,s} (see Section 4) has been saved prior to calling TraverseDFS(s), allowing us to use incremental computation. By implication, the arrays B_{i,r} for all prefixes r of s are also stored (for all sources i ∈ S^r), so in the worst case, each source may store up to k arrays, if s is a k-sequence. The space usage of the DFS traversal is quite modest in practice, however.
6 Experimental Evaluation
We report on an experimental evaluation of our algorithms. Our implementations are in C# (Visual Studio .Net), executed on a machine with a 3.2GHz Intel CPU and 3GB RAM running XP (SP3). We begin by describing the datasets used for experiments. Then, we demonstrate the scalability of our algorithms (reported running times are averages from multiple runs), and also evaluate probabilistic pruning. In our experiments, we use both real (Gazelle from Blue Martini [14]) and synthetic (IBM Quest [3]) datasets. We transform these deterministic datasets to probabilistic form in a way similar to [2,5,24,7]; we assign probabilities to each event in a source sequence using a uniform distribution over (0, 1], thus obtaining a collection of p-sequences. Note that we in fact generate ELU data rather than SLU data: a key benefit of this approach is that it tends to preserve the distribution of frequent sequences in the deterministic data. We follow the naming convention of [23]: a dataset named CiDjK means that the average number of events per source is i and the number of sources is j (in thousands). Alphabet size is 2K and all other parameters are set to default. We study three parameters in our experiments: the number of sources D, the average number of events per source C, and the threshold θ. We test our
algorithms for one of the three parameters by keeping the other two fixed. Evidently, all other parameters being fixed, increasing D and C, or decreasing θ, all make an instance harder. We choose our algorithm variants according to two "axes":
– Lattice traversal could be done using BFS or DFS.
– Probabilistic Pruning (P) could be ON or OFF.
We thus report on four variants in all; for example, "BFS+P" represents the variant with BFS lattice traversal and with probabilistic pruning ON. Probabilistic Pruning. To show the effectiveness of probabilistic pruning, we kept statistics on the number of candidates both for BFS and for DFS. Due to space limitations, we report statistics only for the dataset C10D20K here. For more details, see [15]. Table 3 shows that probabilistic pruning is highly effective at eliminating infrequent candidates in phase 2 — for example, in both BFS and DFS, over 95% of infrequent candidates were eliminated without support computation. However, probabilistic pruning was less effective in BFS compared to DFS in the later phases. This is because we compute a coarser upper bound in BFS than in DFS, as we only store L_{i,1} probabilities in BFS, whereas we store both L_{i,1} and L_{i,j} probabilities in DFS. We therefore turn probabilistic pruning OFF after Phase 2 in BFS in our experiments. If we could also store L_{i,j} probabilities in BFS, a more refined upper bound could be attained (as mentioned after Lemma 3 and shown in Section 6 of [15]).

Table 3. Effectiveness of probabilistic pruning at θ = 2%, for dataset C10D20K in BFS (L) and in DFS (R). The columns from L to R indicate the numbers of candidates created by joining, remaining after apriori pruning, remaining after probabilistic pruning, and deemed as frequent, respectively.

BFS:
Phase  Joining  Apriori  Prob. prun.  Frequent
2      15555    15555    246          39
3      237      223      208          91

DFS:
Phase  Joining  Apriori  Prob. prun.  Frequent
2      15555    15555    246          39
3      334      234      175          91
Fig. 2. Effectiveness of probabilistic pruning for decreasing values of θ, for synthetic dataset (C10D10K) (L) and for real dataset Gazelle (R)
Fig. 3. Scalability of our algorithms for increasing number of sources D (L: C = 10, θ = 1%), and for increasing number of events per source C (R: D = 10K, θ = 25%)
In Fig. 2, we show the effect of probabilistic pruning on overall running time as θ decreases, for both synthetic (C10D10K) and real (Gazelle) datasets. It can be seen that pruning is effective particularly for low θ, for both datasets. Scalability Testing. We test the scalability of our algorithms by fixing C = 10 and θ = 1%, for increasing values of D (Fig. 3(L)), and by fixing D = 10K and θ = 25%, for increasing values of C (Fig. 3(R)). We observe that all our algorithms scale essentially linearly in both sets of experiments.
7 Conclusions and Future Work
We have considered the problem of finding all frequent sequences in SLU databases. This is a first study on efficient algorithms for this problem, and naturally a number of open directions remain e.g. exploring further the notion of ”interestingness”. In this paper, we have used the expected support measure which has the advantage that it can be computed efficiently for SLU databases – probabilistic frequentness [5] is provably intractable for SLU databases [16]. Our approach yields (in principle) efficient algorithms for both measures in ELU databases, and comparing both measures in terms of computational cost versus solution quality is an interesting future direction. A number of longer-term challenges remain, including creating a data generator that gives an “interesting” SLU database and considering more general models of uncertainty (e.g. it is not clear that the assumption of independence between successive uncertain events is justified).
References 1. Aggarwal, C.C. (ed.): Managing and Mining Uncertain Data. Springer, Heidelberg (2009) 2. Aggarwal, C.C., Li, Y., Wang, J., Wang, J.: Frequent pattern mining with uncertain data. In: Elder et al. [9], pp. 29–38 3. Agrawal, R., Srikant, R.: Mining sequential patterns. In: Yu, P.S., Chen, A.L.P. (eds.) ICDE, pp. 3–14. IEEE Computer Society, Los Alamitos (1995) 4. Ayres, J., Flannick, J., Gehrke, J., Yiu, T.: Sequential pattern mining using a bitmap representation. In: KDD, pp. 429–435 (2002)
5. Bernecker, T., Kriegel, H.P., Renz, M., Verhein, F., Z¨ ufle, A.: Probabilistic frequent itemset mining in uncertain databases. In: Elder et al. [9], pp. 119–128 6. Chui, C.K., Kao, B.: A decremental approach for mining frequent itemsets from uncertain data. In: Washio, T., Suzuki, E., Ting, K.M., Inokuchi, A. (eds.) PAKDD 2008. LNCS (LNAI), vol. 5012, pp. 64–75. Springer, Heidelberg (2008) 7. Chui, C.K., Kao, B., Hung, E.: Mining frequent itemsets from uncertain data. In: Zhou, Z.-H., Li, H., Yang, Q. (eds.) PAKDD 2007. LNCS (LNAI), vol. 4426, pp. 47–58. Springer, Heidelberg (2007) 8. Cormode, G., Li, F., Yi, K.: Semantics of ranking queries for probabilistic data and expected ranks. In: ICDE, pp. 305–316. IEEE, Los Alamitos (2009) 9. Elder, J.F., Fogelman-Souli´e, F., Flach, P.A., Zaki, M.J. (eds.): Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Paris, France, June 28-July 1. ACM, New York (2009) 10. Gunopulos, D., Khardon, R., Mannila, H., Saluja, S., Toivonen, H., Sharm, R.S.: Discovering all most specific sentences. ACM Trans. DB Syst. 28(2), 140–174 (2003) 11. Hassanzadeh, O., Miller, R.J.: Creating probabilistic databases from duplicated data. The VLDB Journal 18(5), 1141–1166 (2009) 12. Hua, M., Pei, J., Zhang, W., Lin, X.: Ranking queries on uncertain data: a probabilistic threshold approach. In: Wang [21], pp. 673–686 13. Khoussainova, N., Balazinska, M., Suciu, D.: Probabilistic event extraction from RFID data. In: ICDE, pp. 1480–1482. IEEE, Los Alamitos (2008) 14. Kohavi, R., Brodley, C., Frasca, B., Mason, L., Zheng, Z.: KDD-Cup 2000 organizers’ report: Peeling the onion. SIGKDD Explorations 2(2), 86–98 (2000) 15. Muzammal, M., Raman, R.: Mining sequential patterns from probabilistic databases. Tech. Rep. CS-10-002, Dept. of Comp. Sci. Univ. of Leicester, UK (2010), http://www.cs.le.ac.uk/people/mm386/pSPM.pdf 16. Muzammal, M., Raman, R.: On probabilistic models for uncertain sequential pattern mining. In: Cao, L., Feng, Y., Zhong, J. (eds.) ADMA 2010, Part I. LNCS, vol. 6440, pp. 60–72. Springer, Heidelberg (2010) 17. Pei, J., Han, J., Mortazavi-Asl, B., Wang, J., Pinto, H., Chen, Q., Dayal, U., Hsu, M.: Mining sequential patterns by pattern-growth: The PrefixSpan approach. IEEE Trans. Knowl. Data Eng. 16(11), 1424–1440 (2004) 18. Srikant, R., Agrawal, R.: Mining sequential patterns: Generalizations and performance improvements. In: Apers, P.M.G., Bouzeghoub, M., Gardarin, G. (eds.) EDBT 1996. LNCS, vol. 1057, pp. 3–17. Springer, Heidelberg (1996) ¨ 19. Suciu, D., Dalvi, N.N.: Foundations of probabilistic answers to queries. In: Ozcan, F. (ed.) SIGMOD Conference, p. 963. ACM, New York (2005) 20. Sun, X., Orlowska, M.E., Li, X.: Introducing uncertainty into pattern discovery in temporal event sequences. In: ICDM, pp. 299–306. IEEE Computer Society, Los Alamitos (2003) 21. Wang, J.T.L. (ed.): Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2008, Vancouver, BC, Canada, June 10-12. ACM, New York (2008) 22. Yang, J., Wang, W., Yu, P.S., Han, J.: Mining long sequential patterns in a noisy environment. In: Franklin, M.J., Moon, B., Ailamaki, A. (eds.) SIGMOD Conference, pp. 406–417. ACM, New York (2002) 23. Zaki, M.J.: SPADE: An efficient algorithm for mining frequent sequences. Machine Learning 42(1/2), 31–60 (2001) 24. Zhang, Q., Li, F., Yi, K.: Finding frequent items in probabilistic data. In: Wang [21], pp. 819–832
Large Scale Real-Life Action Recognition Using Conditional Random Fields with Stochastic Training Xu Sun1 , Hisashi Kashima1 , Ryota Tomioka1 , and Naonori Ueda2 1
Department of Mathematical Informatics, The University of Tokyo {xusun,kashima,tomioka}@mist.i.u-tokyo.ac.jp 2 NTT Communication Science Laboratories, Kyoto, Japan
[email protected]
Abstract. Action recognition is usually studied with limited lab settings and a small data set. Traditional lab settings assume that the start and the end of each action are known. However, this is not true for the real-life activity recognition, where different actions are present in a continuous temporal sequence, with their boundaries unknown to the recognizer. Also, unlike previous attempts, our study is based on a large-scale data set collected from real world activities. The novelty of this paper is twofold: (1) Large-scale non-boundary action recognition; (2) The first application of the averaged stochastic gradient training with feedback (ASF) to conditional random fields. We find the ASF training method outperforms a variety of traditional training methods in this task. Keywords: Continuous Action Recognition, Conditional Random Fields, Online Training.
1 Introduction
Acceleration sensor based action recognition is useful in practical applications [1,2,3,4]. For example, in some medical programmes, researchers hope to prevent lifestyle diseases from being exacerbated. However, the traditional way of counseling is ineffective both in time and accuracy, because it requires many manual operations. In sensor-based action recognition, an accelerometer is employed (e.g., attached to a person's wrist) to automatically capture the acceleration statistics (e.g., a temporal sequence of three-dimensional acceleration data) in the daily life of counselees, and the corresponding categories of behaviors (actions) can be automatically identified with a certain level of accuracy. Although there is a considerable literature on action recognition, most of the prior work discusses action recognition in a pre-defined limited environment [1,2,3]. It is unclear whether or not the previous methods perform well in a more natural real-life environment. For example, most of the prior work assumes that the beginning and ending time of each action are known to the target recognizing system, and the produced system only performs simple classifications of the
Fig. 1. An example of real-life continuous actions in our data, in which the corresponding 3D acceleration signals are collected from the attached sensors. See Section 5 for the meaning of 'g' and the action types, act-0 to act-5.
action signals [1,2,3]. However, this is not the case for real-life action sequences of human beings, in which different types of actions are performed one by one without an explicit segmentation on the boundaries. For example, people may first walk, and then take a taxi, and then take an elevator, in which the boundaries of the actions are unknown to the target action recognition system. An example of real-life actions with continuous sensor signals is shown in Figure 1. For this concern, it is necessary and important to develop a more powerful system not only to predict the types of the actions, but also to disambiguate the boundaries of those actions. With this motivation, we collected a large-scale real-life action data (continuous sensor-based three-dimension acceleration signals) from about one hundred people for continuous real-life action recognition. We adopt a popular structured classification model, conditional random fields (CRFs), for recognizing the action types and at the same time disambiguate the action boundaries. Moreover, good online training methods are necessary for training CRFs on a large-scale data in our task. We will compare different online training methods for training CRFs on this action recognition data.
2 Related Work and Motivations
Most of the prior work on action recognition treated the task as a single-label classification problem [1,2,3]. Given a sequence of sensor signals, the action recognition system predicts a single label (representing a type of action) for the whole
sequence. Ravi et al. [3] used decision trees, support vector machines (SVMs) and K-nearest neighbors (KNN) models for classification. Bao and Intille [1] and Pärkkä et al. [2] used decision trees for classification. A few other works treated the task as a structured classification problem. Huynh et al. [4] tried to discover latent activity patterns by using a Bayesian latent topic model. Most of the prior work on action recognition used a relatively small data set. For example, in Ravi et al. [3], the data was collected from two persons. In Huynh et al. [4], the data was collected from only one person. In Pärkkä et al. [2], the data was collected from 16 persons. There are two major approaches for training conditional random fields: batch training and online training. Standard gradient descent methods are normally batch training methods, in which the gradient computed by using all training instances is used to update the parameters of the model. The batch training methods include, for example, steepest gradient descent, conjugate gradient descent (CG), and quasi-Newton methods like Limited-memory BFGS (LBFGS) [5]. The true gradient is usually the sum of the gradients from each individual training instance. Therefore, batch gradient descent requires the training method to go through the entire training set before updating parameters. Hence, the batch training methods are slow at training CRFs. A promising fast online training method is the stochastic gradient method, for example, stochastic gradient descent (SGD) [6,7]. The parameters of the model are updated much more frequently, and far fewer iterations are needed before convergence. For large-scale data sets, the SGD can be much faster than batch gradient based training methods. However, there are problems in the current SGD literature: (1) The SGD is sensitive to noise. The accuracy of the SGD training is limited when the data is noisy (for example, the data inconsistency problem that we will discuss in the experiment section). (2) The SGD is not robust. It contains many hyper-parameters (not only regularization, but also the learning rate) and it is quite sensitive to them. Tuning the hyper-parameters for SGD is not an easy task. To deal with the problems of the traditional training methods, we use a new online gradient-based learning method, the averaged SGD with feedback (ASF) [8], for training conditional random fields. According to the experiments, the ASF training method is quite robust for training CRFs for the action recognition task.
3 Conditional Random Fields
Many traditional structured classification models may suffer from a problem, which is usually called “the label bias problem” [9,10]. Conditional random fields (CRFs) are proposed as an alternative solution for structured classification by solving “the label bias problem” [10]. Assuming a feature function that maps a pair of observation sequence x and label sequence y to a global feature vector f , the probability of a label sequence y conditioned on the observation sequence x is modeled as follows [10,11,12]:
P(y|x, Θ) = exp[Θ · f(y, x)] / Σ_{∀y} exp[Θ · f(y, x)],    (1)
where Θ is a parameter vector. Typically, computing Σ_{∀y} exp[Θ · f(y, x)] could be computationally intractable: it is too large to explicitly sum over all possible label sequences. However, if the dependencies between labels have a linear-chain structure, this summation can be computed using dynamic programming techniques [10]. To make the dynamic programming techniques applicable, the dependencies of labels must be chosen to obey the Markov property. More precisely, we use the Forward–Backward algorithm for computing the summation in a dynamic programming style. This has a computational complexity of O(N K^M), where N is the length of the sequence, K is the dimension of the label set, and M is the length of the Markov order used by local features. Given a training set consisting of n labeled sequences, (x_i, y_i), for i = 1 . . . n, parameter estimation is performed by maximizing the objective function,

L(Θ) = Σ_{i=1}^{n} log P(y_i | x_i, Θ) − R(Θ).    (2)
The first term of this equation represents the conditional log-likelihood of the training data. The second term is a regularizer for reducing overfitting. In what follows, we denote the conditional log-likelihood of each sample, log P(y_i | x_i, Θ), as L_s(i, Θ), and therefore:

L(Θ) = Σ_{i=1}^{n} L_s(i, Θ) − R(Θ).    (3)
3.1 Stochastic Gradient Descent

The SGD uses a small randomly-selected subset of the training samples to approximate the gradient of the objective function given by Equation 3. The number of training samples used for this approximation is called the batch size. By using a smaller batch size, one can update the parameters more frequently and speed up the convergence. The extreme case is a batch size of 1, which gives the maximum frequency of updates and which we adopt in this work. Then, the model parameters are updated as follows:

Θ_{k+1} = Θ_k + γ_k · ∂/∂Θ [L_s(i, Θ) − R(Θ)],

where k is the update counter and γ_k is the learning rate. A proper learning rate can guarantee the convergence of the SGD method [6,7]. A typical convergent choice of learning rate can be found in Collins et al. [13]:

γ_k = γ_0 / (1 + k/n),
where γ0 is a constant. This scheduling guarantees ultimate convergence [6,7]. In this paper we adopt this learning rate schedule for the SGD.
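A minimal sketch of this update schedule (illustrative only, not the authors' CRF implementation); the gradient function grad and the parameter vector are placeholder assumptions.

import random

def sgd(grad, theta, samples, gamma0, epochs):
    # grad(i, theta) is assumed to return d/dtheta [Ls(i, theta) - R(theta)] as a list.
    k = 0
    n = len(samples)
    for _ in range(epochs):
        random.shuffle(samples)
        for i in samples:
            gamma = gamma0 / (1.0 + k / n)      # decaying learning rate schedule
            g = grad(i, theta)
            theta = [w + gamma * gw for w, gw in zip(theta, g)]
            k += 1
    return theta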
Notes: m is the number of periods when the ASF reaches convergence; b is the current period; c is the current iteration; n is the number of training samples. The learning rate γ ← γ_0/(1 + b/Z) is only for theoretical analysis; in practice we can simply set γ ← 1, i.e., remove the learning rate.

Procedure ASF-train
  Initialize Θ with random values
  c ← 0
  for b ← 1 to m
    γ ← γ_0/(1 + b/Z) with Z n, or simply γ ← 1
    for 1 to b
      Θ ← SGD-update(Θ)
    c ← c + b
    Θ ← Θ̄^{iter(c)} in Eq. 4
  Return Θ

Procedure SGD-update(Θ)
  for 1 to n
    select a sample j randomly
    Θ ← Θ + γ · ∂/∂Θ L_s(j, Θ)
  Return Θ

Fig. 2. The major steps of the ASF training
4 Averaged SGD with Feedback
Averaged SGD with feedback (ASF) is a modification and extension of the traditional SGD training method [8]. The naive version of averaged SGD is inspired by the averaged perceptron technique [14]. Let Θ^{iter(c),sample(d)} be the parameters after the d'th training example has been processed in the c'th iteration over the training data. We define the averaged parameters at the end of iteration c as:

Θ̄^{iter(c)} = (1/(nc)) Σ_{c'=1...c, d=1...n} Θ^{iter(c'),sample(d)}.    (4)

However, a straightforward application of parameter averaging is not adequate. A potential problem of traditional parameter averaging is that the model parameters Θ receive no information from the averaged parameters: the model parameters Θ are trained exactly as before (SGD without averaging), and Θ could become misleading as the training goes on. To solve this problem, a natural idea is to reset Θ by using the averaged parameters Θ̄, which are more reliable. The ASF refines the averaged SGD by applying a "periodic feedback": it periodically resets the parameters Θ using the averaged parameters Θ̄. The interval between a feedback operation and the previous one is called a training period, or simply a period. It is important to decide when to do
the feedback, i.e., the length of each period should be adjusted reasonably as the training goes on. For example, at the early stage of the training, the parameters Θ are highly noisy, so the feedback operation to Θ should be performed more frequently. As the training goes on, less frequent feedback would be better in order to adequately optimize the parameters. In practice, the ASF adopts a schedule of linearly slowing-down feedback: the number of iterations increases linearly in each period as the training goes on. Figure 2 shows the steps of the ASF. We denote Θ^{b,c,d} as the model parameters after the d'th sample is processed in the c'th iteration of the b'th period. Equivalently, we denote Θ^{b,c,d} more simply as Θ^{b,cn+d}, where n is the number of samples in the training data. Similarly, we use g^{b,cn+d} to denote ∂/∂Θ L_s(d, Θ) in the c'th iteration of the b'th period. Let γ^{(b)} be the learning rate in the b'th period. Let Θ̄^{(b)} be the averaged parameters produced by the b'th period. We can induce the explicit form of Θ̄^{(1)}:

Θ̄^{(1)} = Θ^{1,0} + γ^{(1)} Σ_{d=1...n} ((n − d + 1)/n) g^{1,d}.    (5)

When the 2nd period ends, the parameters are again averaged over all previous model parameters, Θ^{1,0}, . . . , Θ^{1,n}, Θ^{2,0}, . . . , Θ^{2,2n}, and can be expressed as:

Θ̄^{(2)} = Θ^{1,0} + γ^{(1)} Σ_{d=1...n} ((n − d + 1)/n) g^{1,d} + γ^{(2)} Σ_{d=1...2n} ((2n − d + 1)/(3n)) g^{2,d}.    (6)

Similarly, the averaged parameters produced by the b'th period can be expressed as follows:

Θ̄^{(b)} = Θ^{1,0} + Σ_{i=1...b} ( γ^{(i)} Σ_{d=1...in} ((in − d + 1)/(n·i(i + 1)/2)) g^{i,d} ).    (7)
The best possible convergence result for stochastic learning is the “almost sure convergence”: to prove that the stochastic algorithm converges towards the solution with probability 1 [6]. The ASF guarantees to achieve almost sure convergence [8]. The averaged parameters produced at the end of each period of the optimization procedure of the ASF training are “almost surely convergent” towards the optimum Θ∗ [8]. On the implementation side, there is no need to keep all the gradients in the past for computing the averaged gradient Θ: we can compute Θ on the fly, just like the averaged perceptron case.
5
Experiments and Discussion
We use one month data of the ALKAN dataset [15] for experiments. This is a new data, and the data contains 2,061 sessions, with totally 3,899,155 samples
228
X. Sun et al.
Table 1. Features used in the action recognition task. For simplicity, we only describe the features on x-axis, because the features on y-axis and z-axis are in the same setting like the x-axis. A × B means a Cartesian product between the set A and the set B. The time interval feature do not record the absolute time from the beginning to the current window. This feature only records the time difference between two neighboring windows: sometimes there is a jump of time between two neighboring windows. Signal strength features: {si−2 , si−1 , si , si+1 , si+2 , si−1 si , si si+1 } ×{yi , yi−1 yi } Time interval features: {ti+1 − ti , ti − ti−1 } ×{yi , yi−1 yi } Mean, standard deviation, energy, covariance features: mi ×{yi , yi−1 yi } di ×{yi , yi−1 yi } ei ×{yi , yi−1 yi } {cx,y,i , cy,z,i , cx,z,i } ×{yi , yi−1 yi }
(in a temporal sequence). The data was collected by iPod accelerometers with the sampling frequency of 20HZ. A sample contains 4 values: {time (the seconds past from the beginning of a session), x-axis-acceleration, y-axis-acceleration, z-axisacceleration}, for example, {539.266(s), 0.091(g), -0.145(g), -1.051(g)}1. There are six kinds of action labels: act-0 means “walking or running”, act-1 means “on an elevator or escalator”, act-2 means “taking car or bus”, act-3 means “taking train”, act-4 means “up or down stairs”, and act-5 means “standing or sitting”. 5.1
How to Design and Implement Good Features
We split the data into a training data (85%), a development data for hyperparameters (5%), and the final evaluation data (10%). The evaluation metric are sample-accuracy (%) (equals to recall in this task: the number of correctly predicted samples divided by the number of all the samples). Following previous work on action recognition [1,2,3,4], we use acceleration features, mean features, standard deviation, energy, and correlation (covariance between different axis) features. Features are extracted from the iPod accelerometer data by using a window size of 256. Each window is about 13 seconds long. For two consecutive windows (each one contains 256 samples), they have 128 samples overlapping to each other. Feature extraction on windows with 50% of the window overlapping was shown to be effective in previous work [1]. The features are listed in Table 1. All features are used without pruning. We use exactly the same feature set for all systems. 1
¹ In the example, 'g' is the acceleration of gravity.
The mean feature is simply the averaged signal strength in a window:

    m_i = ( Σ_{k=1..|w|} s_k ) / |w|,

where s_1, s_2, ... are the signal magnitudes in a window. The energy feature is defined as

    e_i = ( Σ_{k=1..|w|} s_k² ) / |w|.

The deviation feature is defined as

    d_i = sqrt( ( Σ_{k=1..|w|} (s_k − m_i)² ) / |w| ),

where m_i is the mean value defined above. The correlation feature is defined as

    c_i(x, y) = covariance_i(x, y) / ( d_i(x) d_i(y) ),

where d_i(x) and d_i(y) are the deviation values of the i'th window on the x-axis and the y-axis, respectively, and covariance_i(x, y) is the covariance between the i'th windows of the x-axis and the y-axis. In the same way, we can define c_i(y, z) and c_i(x, z). A naive implementation of the proposed features is to design several real-valued feature templates representing the mean, standard deviation, energy, and correlation values. However, in preliminary experiments we found that model accuracy is low with such a straightforward implementation of real-valued features. A possible reason is that different values of a real-valued feature (e.g., the standard deviation) may carry different indications about the action, and this difference cannot be reflected directly by comparing the raw values. The easiest way to deal with this problem is to split an original real-valued feature into multiple features (which can still be real-valued). In our case, the feature template function automatically splits the original real-valued features into multiple real-valued features using a heuristic splitting interval of 0.1. For example, standard deviations of 0.21 and 0.31 correspond to two different feature IDs, and hence to two different model parameters, whereas standard deviations of 0.21 and 0.29 correspond to the same feature ID and differ only in the feature value. In our experiments, we found that splitting the real-valued features improves accuracy by more than 1%. It is also important to describe the implementation of the edge features, which are based on the label transitions y_{i−1}y_i. In traditional implementations of CRF systems (e.g., the HCRF package), the edge features usually contain only the information of y_{i−1} and y_i, without information from the observation sequence x. The major reason for this simple implementation of edge features is to reduce the feature dimension; otherwise, there can be an explosion of edge features in some tasks.
Table 2. Comparisons among methods on the sensor-based action recognition task. The number of iterations is decided when a training method reaches its empirical convergence state. The deviation is the standard deviation of the accuracy over four repeated experiments.

Methods         Accuracy  Iteration  Deviation  Training time
ASF             58.97%    60         0.56%      0.6 hour
Averaged SGD    57.95%    50         0.28%      0.5 hour
SGD             55.20%    130        0.69%      1.3 hours
LBFGS (batch)   57.85%    800        0.74%      8 hours
For our action recognition task, since the feature dimension is quite small, we can combine the observation information x with the label transitions y_{i−1}y_i and thereby build "rich edge features". We simply used the same observation templates as for the node features to build the rich edge features (see Table 1). We found that the rich edge features significantly improve the prediction accuracy of the CRFs.
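The following is a minimal sketch of the windowed feature extraction described above (windows of 256 samples with 50% overlap; mean, standard deviation, energy, and pairwise correlation per axis). The function and variable names are illustrative and not from the authors' implementation.

```python
import numpy as np

def window_features(signal, window=256, step=128):
    """Extract mean, standard deviation, energy, and pairwise correlation
    features from a (num_samples, 3) array of x/y/z accelerations.
    Consecutive windows overlap by 50% (step = window // 2)."""
    feats = []
    for start in range(0, len(signal) - window + 1, step):
        w = signal[start:start + window]          # shape (window, 3)
        mean = w.mean(axis=0)                     # m_i per axis
        std = w.std(axis=0)                       # d_i per axis
        energy = (w ** 2).mean(axis=0)            # e_i per axis
        corr = []                                 # c_i(x,y), c_i(y,z), c_i(x,z)
        for a, b in [(0, 1), (1, 2), (0, 2)]:
            cov = np.mean((w[:, a] - mean[a]) * (w[:, b] - mean[b]))
            corr.append(cov / (std[a] * std[b] + 1e-12))
        feats.append(np.concatenate([mean, std, energy, corr]))
    return np.array(feats)
```

Each row then provides the real-valued observations for one window; the 0.1-interval splitting described above can be applied on top of these values when generating feature IDs.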
5.2 Experimental Setting
Three baselines are adopted for comparison with the ASF method: traditional SGD training (SGD), SGD training with parameter averaging but without feedback (averaged SGD), and the popular batch training method, limited-memory BFGS (LBFGS). For LBFGS batch training, which is considered one of the best optimizers for log-linear models such as CRFs, we use the OWLQN open-source package [16]². The hyper-parameters for learning were left at the default settings of the software: the convergence tolerance was 1e-4 and the LBFGS memory parameter was 10. To reduce overfitting, we employed an L2 prior R(Θ) = ||Θ||²/(2σ²) for both SGD and LBFGS, setting σ = 5. For ASF and averaged SGD we did not employ regularization priors, assuming that they contain implicit regularization through parameter averaging. For the stochastic training methods, we set γ0 to 1.0. We also test the speed of the various methods. The experiments are run on an Intel Xeon 3.0 GHz CPU, and the time for feature generation and data input/output is excluded from the training cost.
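For concreteness, a minimal sketch of the L2-regularized stochastic update used by the SGD-style baselines follows; the way the prior is spread over training samples and the constant learning rate are assumptions, since only γ0 = 1.0 is stated above.

```python
def sgd_step(theta, grad_loglik, x, y, sigma=5.0, gamma=1.0, n_train=1):
    """One stochastic update for a log-linear model with an L2 prior
    R(Theta) = ||Theta||^2 / (2 sigma^2), spread over n_train samples."""
    g = grad_loglik(theta, x, y)              # gradient of the log-likelihood term
    g = g - theta / (sigma ** 2 * n_train)    # gradient of the (per-sample) L2 prior
    return theta + gamma * g                  # ascend the regularized log-likelihood
```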
5.3 Results and Discussion
The experimental results are listed in Table 2, and more detailed results for the individual action categories are listed in Table 3. Since recognizing actions from real-life continuous signals requires action identification and boundary disambiguation at the same time, it is expected to be much more difficult than
² Available online at: https://www.cs.washington.edu/homes/galen/ or http://research.microsoft.com/en-us/um/people/jfgao/
Table 3. Comparisons among methods on different action labels

action types          walk/run  on elevat.  car/bus  train  stairs  stand/sit  overall
# samples             2,246     37          3,848    2,264  221     5,275      13,891
Acc.(%) ASF           62.09     0           77.07    25.43  0       62.31      58.97
Acc.(%) Averaged SGD  60.84     0           76.35    25.88  0       61.36      57.95
Acc.(%) SGD           62.40     0           73.75    15.31  1.50    58.08      55.20
Acc.(%) LBFGS         51.66     0           76.41    22.98  1.02    66.57      57.85
the previous work on simple action identification. An additional difficulty is that the data is quite noisy. The number of iterations is decided when a training method reaches its empirical convergence state³. Note that ASF training achieved better sample accuracy than the other online training methods. The ASF method is relatively stable across iterations as training proceeds, while SGD training suffers from severe fluctuation. The averaged SGD training reached its empirical convergence state faster than ASF training, and ASF training converged much faster than SGD training. All of the online training methods converged faster than the batch training method, LBFGS. In Figure 3, we show the curves of sample accuracy against the number of training iterations for ASF, averaged SGD, and traditional SGD. As can be seen, ASF training is much more stable and robust than SGD training. The fluctuation of SGD is quite severe, probably due to the noisy data of the action recognition task. The robustness of the ASF method relates to the stabilizing nature of the averaging technique with feedback. ASF outperformed averaged SGD, which indicates that the feedback technique helps the naive parameter averaging. ASF also outperformed LBFGS batch training with far fewer iterations (and therefore much faster training), which is surprising.
5.4 A Challenge in Real-Life Action Recognition: Axis Rotation
One of the tough problems in this action recognition task is the rotation of the x-, y-, and z-axes in the collected data. Since different people attached the iPod accelerometer at different orientations, the x-, y-, and z-axes face the risk of inconsistency in the collected data. To take an extreme case, while the x-axis may represent a horizontal direction for one instance, the same x-axis may represent a vertical direction for another instance. As a result, the acceleration signals of the same axis may be inconsistent. We suppose this is an important reason that prevented the experimental results from reaching a higher level of accuracy. A candidate solution for keeping consistency is to ask people to adopt a standard orientation when
³ Here, the empirical convergence state means an empirical evaluation of the convergence.
Fig. 3. Curves of accuracies of the different stochastic training methods (ASF, averaged SGD, and SGD) by varying the number of iterations (horizontal axis: number of iterations, 0-60; vertical axis: accuracy (%), 45-65).
collecting the data. However, this method would make the collected data less "natural" or "representative", because in daily life people usually put the accelerometer sensor (e.g., in an iPod or iPhone) in their pocket at an arbitrary orientation.
6 Conclusions and Future Work
In this paper, we studied automatic non-boundary action recognition with a large-scale data set collected from real-life activities. Departing from traditional simple classification approaches to action recognition, we investigated real-life continuous action recognition and adopted a sequential labeling approach using conditional random fields. To achieve good performance in continuous action recognition, we showed how to design and implement useful features for this task. We also compared different online optimization methods for training conditional random fields on this task. The ASF training method proved to be a robust training method on this noisy data, with good performance. As future work, we plan to deal with the axis rotation problem through a principled statistical approach.
Acknowledgments X.S., H.K., and N.U. were supported by the FIRST Program of JSPS. We thank Hirotaka Hachiya for helpful discussion.
References
1. Bao, L., Intille, S.S.: Activity recognition from user-annotated acceleration data. In: Ferscha, A., Mattern, F. (eds.) PERVASIVE 2004. LNCS, vol. 3001, pp. 1-17. Springer, Heidelberg (2004)
2. Pärkkä, J., Ermes, M., Korpipää, P., Mäntyjärvi, J., Peltola, J., Korhonen, I.: Activity classification using realistic data from wearable sensors. IEEE Transactions on Information Technology in Biomedicine 10(1), 119-128 (2006)
3. Ravi, N., Dandekar, N., Mysore, P., Littman, M.L.: Activity recognition from accelerometer data. In: AAAI, pp. 1541-1546 (2005)
4. Huynh, T., Fritz, M., Schiele, B.: Discovery of activity patterns using topic models. In: Proceedings of the 10th International Conference on Ubiquitous Computing, pp. 10-19. ACM, New York (2008)
5. Nocedal, J., Wright, S.J.: Numerical Optimization. Springer, Heidelberg (1999)
6. Bottou, L.: Online algorithms and stochastic approximations. In: Saad, D. (ed.) Online Learning and Neural Networks. Cambridge University Press, Cambridge (1998)
7. Spall, J.C.: Introduction to Stochastic Search and Optimization. Wiley-IEEE (2005)
8. Sun, X., Kashima, H., Matsuzaki, T., Ueda, N.: Averaged stochastic gradient descent with feedback: An accurate, robust and fast training method. In: Proceedings of the 10th International Conference on Data Mining (ICDM 2010), pp. 1067-1072 (2010)
9. Bottou, L.: Une Approche théorique de l'Apprentissage Connexionniste: Applications à la Reconnaissance de la Parole. PhD thesis, Université de Paris XI, Orsay, France (1991)
10. Lafferty, J., McCallum, A., Pereira, F.: Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In: Proceedings of the 18th International Conference on Machine Learning (ICML 2001), pp. 282-289 (2001)
11. Daumé III, H.: Practical Structured Learning Techniques for Natural Language Processing. PhD thesis, University of Southern California (2006)
12. Sun, X.: Efficient Inference and Training for Conditional Latent Variable Models. PhD thesis, The University of Tokyo (2010)
13. Collins, M., Globerson, A., Koo, T., Carreras, X., Bartlett, P.L.: Exponentiated gradient algorithms for conditional random fields and max-margin Markov networks. J. Mach. Learn. Res. (JMLR) 9, 1775-1822 (2008)
14. Collins, M.: Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In: Proceedings of EMNLP 2002, pp. 1-8 (2002)
15. Hattori, Y., Takemori, M., Inoue, S., Hirakawa, G., Sudo, O.: Operation and baseline assessment of large scale activity gathering system by mobile device. In: Proceedings of DICOMO 2010 (2010)
16. Andrew, G., Gao, J.: Scalable training of L1-regularized log-linear models. In: Proceedings of ICML 2007, pp. 33-40 (2007)
Packing Alignment: Alignment for Sequences of Various Length Events Atsuyoshi Nakamura and Mineichi Kudo Hokkaido University, Kita 14 Nishi 9 Kita-ku Sapporo 060-0814, Japan {atsu,mine}@main.ist.hokudai.ac.jp
Abstract. We study an alignment called a packing alignment that is an alignment for sequences of various length events like musical notes. One event in a packing alignment can have a number of consecutive opposing events unless the total length of them exceeds the length of that one event. Instead of using a score function that depends on event length, which was studied by Mongeau and Sankoff [5], packing alignment deals with event lengths explicitly using a simple score function. This makes the problem clearer as an optimization problem. Packing alignment can be calculated efficiently using dynamic programming. As an application of packing alignment, we conducted experiments on frequent approximate pattern extraction from MIDI files of famous musical variations. The patterns and occurrences extracted from the variations using packing alignment have more appropriate boundaries than those using conventional string alignments from the viewpoints of the repetition structure of the variations.
1 Introduction
Sequence alignment is now one of the most popular tools for comparing sequences. In molecular biology, various types of alignments are used in various kinds of problems: global alignments of pairs of proteins related by common ancestry throughout their length, local alignments involving related segments of proteins, multiple alignments of members of protein families, and alignments made during database searches to detect homologies [4]. Dynamic time warping (DTW), a kind of alignment between two time series, is often used in speech recognition [7] and in aligning audio recordings [1]. Most previous work on alignment has dealt with strings in which each component (letter) is assumed to have the same length. In the comparison of musical sequences, there is research on an alignment that considers the length of each note [5]. In that research, the general alignment framework is adapted to deal with note sequences by using a score (distance) function between notes that depends on note length. Their method is very flexible, but it defines its score function heuristically so as to reflect note length. In this paper, we study packing alignment, which explicitly treats the length of each component (event) together with a constraint on length. One event in a packing alignment can have a number of consecutive opposing events unless
their total length exceeds the length of that one event. Compared to the method using a length-dependent score function, our setting reduces flexibility but makes the problem clearer as an optimization problem. We show that an optimal solution of this extended alignment problem for two event sequences s and t can be obtained in O(p(s, t)n(s)n(t)) time and O(p(s, t)(n(s) + n(t))) space using dynamic programming, where n(s) and n(t) are the numbers of events in sequences s and t, respectively, and p(s, t) is the maximum packable number, defined as the maximum number of events in s or t that can be opposed to any one event in the other sequence in a packing alignment. Alignment distance can be shown to be equivalent to edit distance even in packing alignment if two additional 0-cost edit operations, partition and concatenation, are introduced. Alignment of various-length events is also possible indirectly by general string alignment or DTW if all events are partitioned uniformly in preprocessing. There are two significant differences between packing alignment and these conventional alignments. First, one event must not be divided in packing alignment, while gaps can be inserted in the middle of an event divided by uniform partitioning in preprocessed conventional alignment. Second, an optimal solution in packing alignment can be calculated faster than in preprocessed conventional alignment when the number of events increases significantly by uniform partitioning. DTW also allows one event to be opposed to more than one event, but packing alignment is stricter on the length of opposing events and more flexible in that it allows gap insertions. Although alignment becomes flexible by virtue of gap insertion, alignments with long consecutive gaps are not desirable for many applications, so we also developed a gap-constrained version of the packing alignment algorithm. In our experiments, we applied packing alignment to frequent approximate pattern extraction from the note sequence of a musical piece. We used the mining algorithm EnumSubstrFLOO [6], which heavily uses an alignment algorithm as a subprocedure. For two MIDI files of Bach's musical pieces, EnumSubstrFLOO using packing alignment, applied directly to the original sequence, was more than four times faster than the variants using DTW and general alignment, which are applied to the sequence made by uniform partitioning. We also applied EnumSubstrFLOO to the melody tracks in MIDI files of three musical variations in order to check whether themes and variations can be extracted as patterns and occurrences. Of the patterns and occurrences extracted by EnumSubstrFLOO with packing alignment, 80% were nearly whole themes, nearly whole variations, or whole pairs of consecutive variations, while the algorithms using DTW and general alignment, applied directly without uniform partitioning by ignoring note length, could extract almost no such appropriate ranges.
2 Packing Alignment of Event Sequences
Let Σ denote a finite set of event types. The gap ‘-’ is a special event type that does not belong to Σ. Assume existence of a real-valued score function w on (Σ ∪ {-}) × (Σ ∪ {-}), which measures similarity between two event types.
Fig. 1. Parts of the score of 12 Variations on "Ah Vous Dirai-je Maman (Twinkle Twinkle Little Star)" K.265: measures 2-5 (Theme), measures 50-53 (Variation 1), and measures 245-248 (Variation 5).
An event (a, l) ∈ Σ × R+ is a pair of a type a and a length l, where R+ denotes the set of positive real numbers. For an event b = (a, l), we write |b| = l for its length, and for brevity we also use b to refer to its type a when the meaning is clear. An event sequence s is a sequence s[1]s[2]···s[n(s)] whose component s[i] is an event for all i = 1, 2, ..., n(s), where n(s) is the number of events in s. When gap events are also allowed as components of a sequence, we call such an event sequence a gapped event sequence. The length of an event sequence s is defined as Σ_{i=1..n(s)} |s[i]|. The range r(s, j) of the jth event of an event sequence s is [ Σ_{i=1..j−1} |s[i]|, Σ_{i=1..j} |s[i]| ).

Example 1. The melody in measures 2-5 of the Twelve Variations on "Twinkle Twinkle Little Star" shown in Figure 1 is represented as (C5, 1/4)(C5, 1/4)(G5, 1/4)(G5, 1/4)(A5, 1/4)(A5, 1/4)(G5, 1/4)(G5, 1/4) in event sequence representation, where scientific pitch notation is used for the event types. Let s denote this event sequence and let s[i] denote the ith event in s. Then, the length of s is 8/4, n(s) = 8, s[3] = (G5, 1/4), the type of s[3] is G5, |s[3]| = 1/4, and r(s, 3) = [2/4, 3/4). The melody in measures 245-248 is represented as (C5, 1/4)(R, 1/8)(C5, 1/8)(G5, 1/4)(R, 1/8)(G5, 1/8)(A5, 1/4)(R, 1/8)(A5, 1/8)(G5, 1/4)(R, 1/8)(G5, 1/8), where event type 'R' denotes a rest.

A gap insertion into an event sequence s is an operation that inserts (-, l) right before or after s[i] for some i ∈ {1, 2, ..., n(s)} and l ∈ R+. We define a packing alignment of two event sequences as follows.

Definition 1. A packing alignment of two event sequences s and t is a pair (s′, t′) that satisfies the following conditions.
1. s′ and t′ are gapped event sequences with the same length that are made from s and t, respectively, by repeated gap insertions.
2. For all (j, k) ∈ {1, 2, ..., n(s′)} × {1, 2, ..., n(t′)}, r(s′, j) ⊆ r(t′, k) or r(s′, j) ⊇ r(t′, k) holds if r(s′, j) ∩ r(t′, k) ≠ ∅.
3. For all (j, k) ∈ {1, 2, ..., n(s′)} × {1, 2, ..., n(t′)}, r(s′, j) ∩ r(t′, k) = ∅ if s′[j] = t′[k] = -.
Fig. 2. Examples of packing alignments: (s, t) and (s′′, t′′) are packing alignments of s and t, but (s′, t′) is NOT. (The figure draws s, t, s′, t′, s′′, t′′ with each event's length shown as the length of a bar; t is the measures 50-53 melody D5 C5 B4 C5 B4 C5 B4 C5 A5 G5 F#5 G5 F#5 G5 F#5 G5 G#5 A5 C6 B5 D6 C6 B5 A5 A5 G5 E6 D6 C6 B5 A5 G5.)
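For illustration, a minimal sketch of the event-sequence representation used in these examples, with events stored as (type, length) pairs; the helper names are illustrative, not from the authors' implementation.

```python
from fractions import Fraction as F

# An event sequence is a list of (type, length) pairs, e.g. the measures 2-5 melody.
s = [("C5", F(1, 4)), ("C5", F(1, 4)), ("G5", F(1, 4)), ("G5", F(1, 4)),
     ("A5", F(1, 4)), ("A5", F(1, 4)), ("G5", F(1, 4)), ("G5", F(1, 4))]

def length(seq):
    """Total length of an event sequence: sum of the event lengths."""
    return sum(l for _, l in seq)

def event_range(seq, j):
    """Half-open range r(seq, j) of the j-th event (1-indexed)."""
    start = sum(l for _, l in seq[:j - 1])
    return (start, start + seq[j - 1][1])

print(length(s))          # 2, i.e. 8/4
print(event_range(s, 3))  # (Fraction(1, 2), Fraction(3, 4)), i.e. [2/4, 3/4)
```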
Example 2 (Continued from Example 1). Let s and t denote the event sequence representations of the melodies in measures 2-5 and measures 50-53, respectively. By representing the length of each event as the length of a bar, s and t can be illustrated by the diagram shown in Figure 2. In the figure, the pair (s′, t′) is NOT a packing alignment of s and t because r(s′, 2) ∩ r(t′, 1) ≠ ∅ but neither range is contained in the other, which violates condition 2 of Definition 1. The pairs (s, t) and (s′′, t′′) are packing alignments of s and t. For event sequences s and t, let A(s, t) denote the set of packing alignments of s and t. The score S(s′, t′) between s and t for a packing alignment (s′, t′) is defined as
    S(s′, t′) = Σ_{(j,k): r(s′,j) ⊆ r(t′,k)} |s′[j]| w(s′[j], t′[k]) + Σ_{(j,k): r(s′,j) ⊃ r(t′,k)} |t′[k]| w(s′[j], t′[k]).
Definition 2. The packing alignment score S∗(s, t) between event sequences s and t is the maximum score S(s′, t′) over all packing alignments (s′, t′) ∈ A(s, t), namely,

    S∗(s, t) = max_{(s′,t′) ∈ A(s,t)} S(s′, t′).

Problem 1. For given event sequences s and t, calculate the packing alignment score between s and t (and the alignment (s′, t′) that achieves the score).

Example 3 (Continued from Example 2). Define w(a, b) as 1 if a = b and −1 otherwise. Then S(s, t) = −1/2 and S(s′′, t′′) = −7/16. In this case, the alignment (s′′, t′′) is one of the optimal packing alignments of s and t. Let u denote the event sequence representation of the melody in measures 245-248. Then the unique optimal packing alignment of s and u is the alignment (s, u), and S(s, u) = 1.

Let s[i..j] denote s[i]s[i+1]···s[j]. The following proposition holds; the proof is omitted due to space limitations.
Proposition 1.

    S∗(s[1..m], t[1..n]) = Σ_{i=1..m} |s[i]| w(s[i], -)    if n = 0,
    S∗(s[1..m], t[1..n]) = Σ_{i=1..n} |t[i]| w(-, t[i])    if m = 0,

and otherwise S∗(s[1..m], t[1..n]) is the maximum of

    S∗(s[1..m−1], t[1..n0]) + Σ_{i=n0+1..n} |t[i]| w(s[m], t[i]) + ( |s[m]| − Σ_{i=n0+1..n} |t[i]| ) w(s[m], -)
        for all n0 ≤ n with Σ_{i=n0+1..n} |t[i]| ≤ |s[m]|,   and

    S∗(s[1..m0], t[1..n−1]) + Σ_{j=m0+1..m} |s[j]| w(s[j], t[n]) + ( |t[n]| − Σ_{j=m0+1..m} |s[j]| ) w(-, t[n])
        for all m0 ≤ m with Σ_{j=m0+1..m} |s[j]| ≤ |t[n]|.
Remark 1. Mongeau and Sankoff [5] have already proposed a method using a recurrence equation with the same search space constrained by the lengths of s[m] and t[n]. They introduced the constraint heuristically, for efficiency, while the constraint is necessary in order to solve the packing alignment problem.

Let s and t be event sequences with l_s = max_{1≤i≤n(s)} |s[i]| and l_t = max_{1≤i≤n(t)} |t[i]|.
The maximum packable number p(s, t) is the maximum of the following two numbers: (1) the maximum number of events s[i], s[i+1], ..., s[j] with Σ_{k=i..j} |s[k]| ≤ l_t, and (2) the maximum number of events t[i], t[i+1], ..., t[j] with Σ_{k=i..j} |t[k]| ≤ l_s.

Proposition 2. The optimal packing alignment problem for event sequences s and t can be solved in O(p(s, t)n(s)n(t)) time and O(p(s, t)(n(s) + n(t))) space.

Proof. Dynamic programming using an n(s) × n(t) table achieves these bounds. Entry (i, j) of the table is filled with S∗(s[1..i], t[1..j]). By Proposition 1, this is done using at most p(s, t) + 2 entry values that have already been calculated, so O(p(s, t)n(s)n(t)) time and O(n(s)n(t)) space suffice in total. The space complexity can be reduced to O(p(s, t)(n(s) + n(t))) using the technique of the linear-space algorithm proposed in [3].
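The following is a minimal sketch of the dynamic program behind Propositions 1 and 2. It stores the full table for clarity rather than applying the linear-space technique of [3], and the score function w is supplied by the caller.

```python
def packing_alignment_score(s, t, w, gap="-"):
    """Compute S*(s, t) of Proposition 1.  s, t are lists of (type, length)
    events; w(a, b) scores two types, with `gap` standing for the gap type."""
    m, n = len(s), len(t)
    NEG = float("-inf")
    S = [[NEG] * (n + 1) for _ in range(m + 1)]
    S[0][0] = 0
    for i in range(m + 1):
        for j in range(n + 1):
            if i == 0 and j == 0:
                continue
            best = NEG
            if i >= 1:                          # oppose t[j0+1..j] to the single event s[i]
                a, la = s[i - 1]
                packed_len, packed_score, j0 = 0, 0, j
                while True:
                    cand = S[i - 1][j0] + packed_score + (la - packed_len) * w(a, gap)
                    best = max(best, cand)
                    if j0 == 0 or packed_len + t[j0 - 1][1] > la:
                        break
                    packed_len += t[j0 - 1][1]
                    packed_score += t[j0 - 1][1] * w(a, t[j0 - 1][0])
                    j0 -= 1
            if j >= 1:                          # oppose s[i0+1..i] to the single event t[j]
                b, lb = t[j - 1]
                packed_len, packed_score, i0 = 0, 0, i
                while True:
                    cand = S[i0][j - 1] + packed_score + (lb - packed_len) * w(gap, b)
                    best = max(best, cand)
                    if i0 == 0 or packed_len + s[i0 - 1][1] > lb:
                        break
                    packed_len += s[i0 - 1][1]
                    packed_score += s[i0 - 1][1] * w(s[i0 - 1][0], b)
                    i0 -= 1
            S[i][j] = best
    return S[m][n]
```

Under the score function of Example 3 (1 for equal types, −1 otherwise, gaps included), this sketch should reproduce, for instance, S∗(s, u) = 1 for the theme and the measures 245-248 variation.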
3 Properties of Packing Alignment

3.1 Relation to Edit Distance
By using a non-negative score function w satisfying w(a, b) = 0 ⇔ a = b, we can obtain a packing alignment distance d(s, t) = min_{(s′,t′) ∈ A(s,t)} S(s′, t′).
(Weighted) alignment distance is known to be equivalent to (weighted) edit distance in general. In fact, packing alignment distance can be seen as a special case of the more general edit distance proposed in [5], and the speciality of packing alignment allows us to simplify the corresponding edit operations. The (weighted) edit distance between two event sequences s and t is the minimum cost needed to transform s into t using five edit operations: insertion, deletion, substitution and partition of an event, and concatenation of more than one event of the same type. The last two are newly introduced 0-cost operations for dealing with event sequences. A partition of an event b replaces it with an event sequence b1 b2 ··· bl composed of l (> 1) events of the same type as b satisfying |b1| + |b2| + ··· + |bl| = |b|. A concatenation of more than one consecutive event of the same type is the operation in the opposite direction. Define the costs of insertion, deletion and substitution as follows: |b|w(-, b) when b is inserted, |b|w(b, -) when b is deleted, and |b|w(b, b′) when b is substituted by b′ with |b′| = |b|. Note that substitution only changes the event type; the event length cannot be changed. Then the alignment distance between s and t is equal to the edit distance between them, because an alignment (s′, t′) of s and t has a one-to-one correspondence with a set of edit operations transforming s into t with total cost S(s′, t′). Note that any event created by a partition must not be involved in a concatenation, and vice versa. A partition operation that divides s[i] into l events b1, b2, ..., bl corresponds to s[i] having an opposing event subsequence composed of l consecutive events whose jth event has length |bj| in the alignment (s′, t′). A substitution operation that substitutes b by b′ corresponds to (a part of) some s[i] of type b whose opposing event is b′ in (s′, t′). A deletion operation that deletes b corresponds to (a part of) some s[i] of type b whose opposing event is a gap event in (s′, t′). An insertion operation that inserts b corresponds to a gap event whose opposing event sequence contains b in (s′, t′). A concatenation operation that concatenates l consecutive events b1, b2, ..., bl of the same type corresponds to some event subsequence s′[i+1..i+l] with |s′[i+j]| = |bj| (j = 1, 2, ..., l) whose opposing event has that type and length Σ_{j=1..l} |bj| in the alignment (s′, t′).

Remark 2. Packing alignment is stricter on length than the edit distance defined by Mongeau and Sankoff [5]. The operations called fragmentation and consolidation introduced by them correspond to the partition and concatenation, respectively. In a fragmentation one event can be replaced with any consecutive events, and vice versa in a consolidation, regardless of event type and length, while the total length and the event types are preserved by partition and concatenation. Moreover, their substitution is allowed to replace an event with an event of any length. Partition, concatenation and our substitution can be seen as special cases of their fragmentation, consolidation and substitution, and each of their operations can be realized by a series of our operations. Thus our operations are more basic, and their costs can be determined more easily and naturally using the score function w on Σ ∪ {-}.
3.2 Comparison with General String Alignment
Event sequences can be seen as mere strings if all events are partitioned uniformly, namely, partitioned into events of the same length (after quantization if necessary). Then general string alignment can be applied to them. What is the difference between this method and packing alignment? First, the alignment score (multiplied by the unit length) obtained by such string alignment is no smaller, and possibly larger, than that obtained by packing alignment. This is because, for every packing alignment (s′, t′), the pair composed of the uniformly partitioned s′ and t′ can be regarded as a string alignment with the same score. Furthermore, one event from before the uniform-partitioning preprocessing can be divided into non-contiguous events in some string alignment. For example, let s = (C, 1)(D, 1)(C, 1) and t = (C, 2). Then (C, 1)(D, 1)(C, 1) and (C, 1)(-, 1)(C, 1) form an alignment of (C, 1)(D, 1)(C, 1) and (C, 1)(C, 1), but not a packing alignment of s and t. As a result, the packing alignment score of s and t is −1 while their string alignment score is 1, for w(a, b) defined as 1 if a = b and −1 otherwise. Thus packing alignment is favorable when gaps should not be inserted into the middle of events. Another point is the time and space efficiency of the algorithms. The number of events in an event sequence can become very large when uniform partitioning is applied. Let s′′ and t′′ be the uniformly partitioned s and t, respectively. Then the O(n(s′′)n(t′′)) time and O(n(s′′) + n(t′′)) space used by a string alignment algorithm can be significantly larger than the O(p(s, t)n(s)n(t)) time and O(p(s, t)(n(s) + n(t))) space used by a packing alignment algorithm.
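For illustration, a small sketch of the uniform-partitioning preprocessing discussed above, assuming the unit length divides every event length; it makes the blow-up in the number of events explicit.

```python
from fractions import Fraction as F

def uniform_partition(seq, unit):
    """Split every (type, length) event into length/unit copies of (type, unit).
    Assumes each event length is an integer multiple of `unit`."""
    out = []
    for a, l in seq:
        out.extend([(a, unit)] * int(l / unit))
    return out

s = [("C", F(1)), ("D", F(1)), ("C", F(1))]
print(len(uniform_partition(s, F(1, 4))))   # 12 unit events instead of 3
```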
3.3 Comparison with DTW
The most popular alignment method in music is dynamic time warping (DTW), which was first developed for speech recognition. DTW is a kind of string alignment, so it cannot deal with event sequences directly. However, in exchange for prohibiting gap insertions, DTW allows more than one contiguous symbol in a string to be opposed to one symbol in the other string, just as packing alignment does. Unlike packing alignment, there is no limit on the number of contiguous symbols opposed to one symbol in DTW. Packing alignment's strictness on length does not seem to be a bad property, because some constraint on length is effective in practice. In fact, as constraints on a path in an alignment graph, the adjustment window condition (the path must lie within a fixed distance of the diagonal) and the slope constraint condition (the path must go p·m steps in the diagonal direction after m consecutive steps in one axis direction, for fixed p) [7] were proposed for DTW. Packing alignment is more flexible than DTW in that it allows gap insertions.
4 Gap Constraint
When we use alignment score as similarity measure, one problem is that score can be high for alignments with long contiguous gaps. However, in many real
applications, two sequences whose best alignment contains long contiguous gaps should not be considered similar. So we consider a gap-constrained version of the packing alignment score, defined as follows. For a non-negative real number g ≥ 0, let A_g(s, t) denote the set of packing alignments of s and t in which the length of every contiguous gap subsequence (defined below) is at most g. Then the gap-constrained packing alignment score S∗_g is defined as

    S∗_g(s, t) = max_{(s′,t′) ∈ A_g(s,t)} S(s′, t′)  if A_g(s, t) ≠ ∅,  and  S∗_g(s, t) = −∞  otherwise.

We call the parameter g the maximum contiguous gap length. For a packing alignment (s′, t′) of s and t, a contiguous subsequence s′[i..j] is called a contiguous gap subsequence of s′ if s′[i] = ··· = s′[j] = - and no non-gap event in s′ is opposed to the events t′[h] and t′[k] that are opposed to s′[i] and s′[j], respectively. A contiguous gap subsequence of t′ is defined similarly. For example, when s = (C, 1)(E, 1) and t = (C, 2)(D, 1)(E, 2), the pair of s′ = (C, 1)(-, 1)(-, 1)(-, 1)(E, 1) and t is a packing alignment, but none of s′[2..4], s′[2..3] and s′[3..4] is a contiguous gap subsequence of s′, because t[1] and t[3], which are opposed to s′[2] and s′[4], are also opposed to s′[1] and s′[5], respectively. Let

    p_g(s, t) = max{ l : Σ_{k=i..i+l−1} |s[k]| ≤ g or Σ_{k=i..i+l−1} |t[k]| ≤ g for some i }.
Then, the following proposition holds. The proof is omitted. Proposition 3. The optimal packing alignment problem for event sequences s and t with maximum contiguous gap length g can be solved in O((p(s, t) + pg (s, t))n(s)n(t)) time and O((p(s, t) + pg (s, t))(n(s) + n(t))) space.
5 Experiments

5.1 Frequent Approximate Pattern Extraction
By local alignment using packing alignment, we can define similar parts in event sequences, so we can extract frequent approximate patterns from event sequences. Here we consider the task of extracting approximate patterns that appear frequently within one event sequence. In the note sequence of a musical score, such a pattern can be regarded as a most typical and impressive part. We conducted an experiment on this task using MIDI files of famous classical music pieces. As a frequent-pattern mining algorithm based on local alignment, we used EnumSubstrFLOO [6]. For a given event sequence and a minimum support σ, EnumSubstrFLOO extracts contiguous event sequences as approximate patterns that have minimal locally optimal occurrences with frequency of at least σ. Local optimality was first introduced to local alignment by Erickson and Sellers [2], and locally optimal occurrences of approximate patterns are expected to have appropriate boundaries. Unfortunately, EnumSubstrFLOO with packing alignment is not so fast; it is an O(kn³)-time and O(n³)-space algorithm, where n is the number of events in a given sequence s and k is the maximum packable number
p(s, s). Since EnumSubstrFLOO keeps all occurrence candidates in memory for efficiency and there are a lot of frequent patterns of short length, we prevented memory shortage by setting a parameter called the minimum length θ: only the occurrences of patterns with length of at least θ quarter notes were extracted. In our experiments, we used the following score function:

              a = b    a is close to b    a = - or b = -    otherwise
    w(a, b)     3            0                  −1              −2

Here, we say that a is close to b if one of the following conditions is satisfied: (1) exactly one of them is the rest 'R', or (2) the pitch difference between them is at most two semitones or an integral multiple of an octave. We scored each frequent pattern by summing the alignment scores between the pattern and its selected high-scoring occurrences, which are greedily selected so that the ranges of the selected occurrences do not overlap¹. The maximum contiguous gap length was set to the length of one quarter note throughout our experiments. The continuity of a rest does not seem important, so we cut each rest at each beat.
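For illustration, a minimal sketch of this score function follows. The pitch-distance helper assumes scientific pitch names with a single octave digit (e.g. 'C5', 'F#5') converted to semitone numbers; this conversion is an implementation detail not specified in the paper.

```python
PITCH_CLASS = {"C": 0, "C#": 1, "D": 2, "D#": 3, "E": 4, "F": 5,
               "F#": 6, "G": 7, "G#": 8, "A": 9, "A#": 10, "B": 11}

def semitone(note):
    """'C5' -> semitone number; assumes name + single octave digit."""
    name, octave = note[:-1], int(note[-1])
    return 12 * (octave + 1) + PITCH_CLASS[name]

def close(a, b):
    """Closeness as defined above: exactly one rest, or a pitch difference of
    at most two semitones or an integral multiple of an octave."""
    if (a == "R") != (b == "R"):
        return True
    if a == "R" or b == "R":
        return False
    diff = abs(semitone(a) - semitone(b))
    return diff <= 2 or diff % 12 == 0

def w(a, b):
    if a == "-" or b == "-":   # a gap never opposes a gap in a valid alignment
        return -1
    if a == b:
        return 3
    return 0 if close(a, b) else -2
```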
5.2 Running Time Comparison
First, we measured the running time of EnumSubstrFLOO with packing alignment, DTW and general alignment for the melody tracks of two relatively small MIDI files: the 2nd track² of Bach-Menuet.mid³ ("Menuet G dur", BWV Anh. 114) and the 1st track of k140-4.mid⁴ ("Wachet auf, ruft uns die Stimme", BWV 140). DTW and general alignment were applied after uniform partitioning whose unit length was set to the greatest common divisor of the lengths of all notes in a MIDI file. Note that EnumSubstrFLOO runs in O(n³) time and O(n³) space when DTW or general alignment is used, where n is the length of the target string, namely the number of unit notes. The minimum support was set to 5 and the minimum length to 10, except for the case of k140-4.mid with general alignment; for that case the minimum support was set to 5 but the minimum length to 40 because of memory shortage. The computer used in our experiments is a DELL Precision T7500 (CPU: Intel(R) Xeon(R) E5520 2.27GHz, memory: 2GB). The result is shown below.

MIDI file       Bach-Menuet.mid        k140-4.mid
Method          PA     DTW    GA       PA     DTW    GA
#note           387    4632   4632     829    7274   7274
#pat            355    323    1427     4539   552    0
time (sec)      1.08   28.8   60.8     13.1   60.8   1840

(PA: Packing Alignment, GA: General Alignment; #note is the number of notes before / after uniform partitioning.)
¹ Two instances were not regarded as overlapping when the entire overlapped part consists of rests.
² The sequence of the highest-pitch notes is extracted as the target note sequence even if the track contains overlapping notes.
³ je1emu.jpn.org/midiClassic/Bach-Menuet.mid
⁴ www.cyborg.ne.jp/kokoyo/bach/dl/k140-4.lzh
For both MIDI files, the number of notes (#note) becomes 9-12 times larger after uniform partitioning. As a result, EnumSubstrFLOO using DTW or general alignment is slower than EnumSubstrFLOO using packing alignment. The reason why DTW is faster than general alignment is that the pruning of the pattern search space works well for DTW, which means that the DTW alignment score easily becomes negative. Note that the best alignment score can become larger by using gaps under our score function, but DTW does not use gaps. The highest-scored pattern for packing alignment and the longest patterns for the other methods extracted from Bach-Menuet.mid were also compared (the musical score excerpts shown for PA, DTW and GA are omitted here). The pattern extracted by packing alignment looks the most appropriate as a typical melody sequence of the Menuet.
5.3 Musical Variation Extraction Experiment
We conducted an experiment on musical variation extraction from the three famous sets of variations shown in Table 1. For each MIDI file, we used one note sequence from its melody track⁵. We checked whether whole themes or whole variations can be extracted as frequent approximate patterns and their occurrences. The minimum support was set to 5 for all MIDI files. The minimum length was set to 40 for mozart_k265 and mozart-k331-mov1⁶ and to 20 for be-pv-19. Note that the theme lengths of mozart_k265, mozart-k331-mov1 and be-pv-19 are 96 (> 40), 108 (> 40) and 32 (> 20) quarter notes, respectively. The PA-row of each chart in Fig. 3 shows the highest-scored pattern and its occurrences extracted from each MIDI file by EnumSubstrFLOO using packing alignment. The charts also contain three other rows (Ex-row, DTW-row and GA-row) for the comparison described later. First, let us review the results shown in the PA-rows. The extracted patterns are whole pairs of two consecutive variations for mozart-k265 and nearly whole themes for the other two MIDI files. Here, we say that A is nearly whole B if the length of the symmetric difference of A and B is at most 10% of B's length. Except for 4 of the 20 occurrences of the three patterns, all occurrences have appropriate boundaries, namely, they are nearly whole themes, nearly whole variations, or whole pairs of two consecutive variations. The extracted patterns and occurrences are considered to owe their accurate boundaries to their local optimality. Note that variations are extracted correctly even when the musical time changes: variation 12 (2/4 → 3/4) in mozart_k265, variation 6 (6/8 → 4/4) in mozart-k331-mov1, and variation 6 (2/4 → 3/4) in be-pv-19.
⁵ The highest-pitch note is selected if the track contains overlapping notes.
⁶ It was set to 10 for mozart-k331-mov1 in the case of DTW and general alignment because nothing was frequent for 40.
Table 1. MIDI files used in our experiment on variation extraction

File (Track no.) [Composer]     Length   Form           #Event   Musical Time
mozart_k265 (4) [Mozart]        12m16s   AABABA         3423     2/4 [1,589), 3/4 [589,648)
    Title: 12 Variations on "Ah vous dirais-je, Maman" K.265
    Url: tirolmusic.blogspot.com/2007/11/12-variations-on-ah-vous-dirais-je.html
mozart-k331-mov1 (1) [Mozart]   12m20s   AA'AA'BA"BA"   3518     6/8 [1,218), 4/4 [218,262)
    Title: Piano Sonata No. 11 in A major, K 331, Andante grazioso
    Url: www2s.sni.ne.jp/watanabe/classical.html
be-pv-19 (1) [Beethoven]        3m34s    AA'BA          1458     2/4 [1,50), 6/4 [50,66), 2/4 [66,98), 3/4 [98,156), 2/4 [156,181)
    Title: 6 Variations in D on an Original Theme (Op.76)
    Url: www.classicalmidiconnection.com/cmc/beethoven.html

Fig. 3. Result of musical variation extraction for (a) mozart-k265, (b) mozart-k331-mov1 and (c) be-pv-19. Each chart has a PA, Ex, DTW and GA row. The horizontal axes refer to measure numbers in each musical piece. The vertical broken lines show the starting and ending positions of themes and variations. The extracted patterns are shown by thick lines and the other, thin lines are their occurrences. Each thick line also represents an occurrence, except those in the DTW-rows of (a) and (b).
Let us compare these results with those of the other methods. In each Ex-row, a longest frequent exact pattern and its occurrences are shown. For all three MIDI files, the longest frequent exact patterns are very short (2-6 measures) and their occurrences are clustered in narrow ranges. This result indicates the importance of using approximate patterns in this extraction task. In the DTW-rows and GA-rows, a longest approximate pattern extracted by EnumSubstrFLOO using DTW and general alignment, respectively, and their occurrences are shown. Note that we applied DTW and general alignment directly, ignoring note length. No extracted patterns or occurrences are nearly whole themes or nearly whole variations in the DTW case, and the same holds for general alignment except for three nearly whole variations in be-pv-19. These results indicate the importance of taking length into account.
6 Concluding Remarks
By explicitly treating event length, we defined the packing alignment problem for sequences of various-length events as an optimization problem that can be solved efficiently. Direct applicability to such sequences has not only the merit of time and space efficiency but also the merit of non-decomposability of events. By virtue of these merits, we could extract appropriate frequent approximate patterns and their occurrences in our experiments. We would like to apply packing alignment to other applications in the future.
Acknowledgements This work was partially supported by JSPS KAKENHI 21500128.
References
1. Dixon, S., Widmer, G.: MATCH: A Music Alignment Tool Chest. In: Proceedings of ISMIR 2005, pp. 11-15 (2005)
2. Erickson, B.W., Sellers, P.H.: Recognition of patterns in genetic sequences. In: Sankoff, D., Kruskal, J.B. (eds.) Time Warps, String Edits and Macromolecules: The Theory and Practice of Sequence Comparison, ch. 2, pp. 55-91. Addison-Wesley, Reading (1983)
3. Hirschberg, D.S.: A linear space algorithm for computing maximal common subsequences. Communications of the ACM 18(6), 341-343 (1975)
4. Henikoff, S., Henikoff, J.: Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. USA 89, 10915-10919 (1992)
5. Mongeau, M., Sankoff, D.: Comparison of Musical Sequences. Computers and the Humanities 24, 161-175 (1990)
6. Nakamura, A., Tosaka, H., Kudo, M.: Mining Approximate Patterns with Frequent Locally Optimal Occurrences. Division of Computer Science Report Series A, TCS-TR-A-10-41, Hokkaido University (2010), http://www-alg.ist.hokudai.ac.jp/tra.html
7. Sakoe, H., Chiba, S.: Dynamic Programming Algorithm Optimization for Spoken Word Recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing ASSP-26(1), 43-49 (1978)
Multiple Distribution Data Description Learning Algorithm for Novelty Detection Trung Le, Dat Tran, Wanli Ma, and Dharmendra Sharma Faculty of Information Sciences and Engineering, University of Canberra, ACT 2601, Australia {trung.le,dat.tran,wanli.ma, dharmendra.sharma}@canberra.edu.au
Abstract. Current data description learning methods for novelty detection such as support vector data description and small sphere with large margin construct a spherically shaped boundary around a normal data set to separate this set from abnormal data. The volume of this sphere is minimized to reduce the chance of accepting abnormal data. However those learning methods do not guarantee that the single spherically shaped boundary can best describe the normal data set if there exist some distinctive data distributions in this set. We propose in this paper a new data description learning method that constructs a set of spherically shaped boundaries to provide a better data description to the normal data set. An optimisation problem is proposed and solving this problem results in an iterative learning algorithm to determine the set of spherically shaped boundaries. We prove that the classification error will be reduced after each iteration in our learning method. Experimental results on 28 well-known data sets show that the proposed method provides lower classification error rates. Keywords: Novelty detection, one-class classification, support vector data description, spherically shaped boundary.
1 Introduction
Novelty detection (ND), or one-class classification, involves learning a data description of normal data to build a model that can detect any divergence from normality [9]. Data description can be used for outlier detection, i.e. detecting abnormal samples in a data set. Data description is also used for classification problems where one class is well sampled while the other classes are severely undersampled. In real-world applications, collecting normal data is cheap and easy, while abnormal data is expensive to collect and in several situations not available at all [14]. For instance, in machine fault detection, normal data under normal operation is easy to obtain, whereas collecting faulty data would require running the machine to complete failure. One-class classification is therefore more difficult than conventional two-class classification, because the decision boundary of one-class classification is mainly constructed from samples of only the normal class and
hence it is hard to decide how strict the decision boundary should be. ND is widely applied in many application domains such as network intrusion detection, currency validation, user verification in computer systems, medical diagnosis [3], and machine fault detection [16]. There are two main approaches to solving the data description problem: the density estimation approach [1][2][12] and the kernel-based approach [13][14][20]. In the density estimation approach, the task of data description is solved by estimating a probability density of the data set [11]. This approach requires a large number of training samples; in practice the training data is often insufficient and hence does not represent the complete density distribution. The estimation will mainly focus on modeling the high-density areas and can result in a bad data description [14]. The kernel-based approach aims at determining the boundaries of the training set rather than at estimating the probability density. The training data is mapped from the input space into a higher-dimensional feature space via a kernel function. The Support Vector Machine (SVM) is one of the well-known kernel-based methods; it constructs an optimal hyperplane between two classes by focusing on the training samples close to the edge of the class descriptors [17]. These training samples are called support vectors. In the One-Class Support Vector Machine (OCSVM), a hyperplane is determined to separate the normal data such that the margin between the hyperplane and the outliers is maximized [13]. Support Vector Data Description (SVDD) is a more recent SVM learning method for one-class classification [14]. A hyperspherically shaped boundary around the normal data set is constructed to separate this set from abnormal data, and the volume of this data description is minimized to reduce the chance of accepting abnormal data. SVDD has been shown to be one of the best methods for one-class classification problems [19]. Some extensions to SVDD have been proposed to improve the margins of the hyperspherically shaped boundary. The first extension is Small Sphere and Large Margin (SSLM) [20], which surrounds the normal data with an optimal hypersphere such that the margin (the distance from outliers to the hypersphere) is maximized. The SSLM approach is helpful for parameter selection and provides very good detection results on a number of real data sets. We have recently proposed a further extension to SSLM called Small Sphere and Two Large Margins (SS2LM) [7]. SS2LM aims at maximising the margin between the surface of the hypersphere and the abnormal data and the margin between that surface and the normal data while the volume of the data description is being minimised. Other extensions to SVDD regarding data distribution have also been proposed. The first is an application of SVDD to multi-class classification problems [5], in which several class-specific hyperspheres are constructed, each of which encloses all data samples from one class while excluding all data samples from the other classes. The second extension is for one-class classification and proposes to use a number of hyperspheres to describe the normal data set [19]. Normal data samples may have some distinctive distributions, so they will locate in different regions of the feature space; hence, if the single hypersphere in SVDD is used to enclose all normal data,
it will also enclose abnormal data samples, resulting in a high false positive error rate. However, that work was not presented in detail; the proposed method is heuristic and no proof is provided to show that the multi-sphere approach can provide a better data description. We propose in this paper a new and more detailed multi-hypersphere approach to SVDD. A set of hyperspheres is used to describe the normal data set, assuming that the normal data samples have distinctive data distributions. We formulate the optimisation problem for multi-sphere SVDD and show how the SVDD parameters are obtained by solving this problem. An iterative algorithm is also proposed for building the data descriptors, and we prove that the classification error will be reduced after each iteration. Experimental results on 28 well-known data sets show that the proposed method provides lower classification error rates compared with the standard single-sphere SVDD.
2 Single Hypersphere Approach: SVDD
Let xi, i = 1, ..., p be normal data points with label yi = +1 and xi, i = p+1, ..., n be abnormal data points (outliers) with label yi = −1. SVDD [14] aims at determining an optimal hypersphere to include all normal data points while abnormal data points are outside this hypersphere. The optimisation problem is as follows:

    min_{R,c,ξ}  R² + C1 Σ_{i=1..p} ξi + C2 Σ_{i=p+1..n} ξi    (1)

subject to

    ||φ(xi) − c||² ≤ R² + ξi,   i = 1, ..., p
    ||φ(xi) − c||² ≥ R² − ξi,   i = p+1, ..., n
    ξi ≥ 0,   i = 1, ..., n    (2)
where R is the radius of the hypersphere, C1 and C2 are constants, ξ = [ξi]_{i=1,...,n} is the vector of slack variables, φ(.) is a kernel function, and c is the centre of the hypersphere. For classifying an unknown data point x, the following decision function is used: f(x) = sign(R² − ||φ(x) − c||²). The unknown data point x is normal if f(x) = +1 or abnormal if f(x) = −1.
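For illustration, a minimal sketch of this decision function with an RBF kernel follows, assuming the dual coefficients and the squared radius have already been obtained from a standard SVDD solver and that the coefficients already absorb the labels yi; the function names are illustrative.

```python
import numpy as np

def rbf(a, b, gamma=1.0):
    return np.exp(-gamma * np.sum((a - b) ** 2))

def svdd_decision(x, support_vectors, alphas, R2, gamma=1.0):
    """f(x) = sign(R^2 - ||phi(x) - c||^2) with c = sum_i alpha_i phi(sv_i);
    the squared distance is expanded with the kernel trick."""
    k_xx = rbf(x, x, gamma)
    k_xs = np.array([rbf(x, s, gamma) for s in support_vectors])
    k_ss = np.array([[rbf(a, b, gamma) for b in support_vectors]
                     for a in support_vectors])
    dist2 = k_xx - 2.0 * alphas @ k_xs + alphas @ k_ss @ alphas
    return 1 if R2 - dist2 >= 0 else -1
```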
3 Proposed Multiple Hypersphere Approach

3.1 Problem Formulation
Consider a set of m hyperspheres Sj(cj, Rj) with centre cj and radius Rj, j = 1, ..., m. This hypersphere set is a good data description of the normal data set X = {x1, x2, ..., xn} if each of the hyperspheres describes a distribution in this data set and the sum Σ_{j=1..m} Rj² of the squared radii is minimised.
Let U = [uij]_{p×m}, uij ∈ {0, 1}, i = 1, ..., p, j = 1, ..., m, be a matrix where uij is the membership representing the degree of belonging of data point xi to hypersphere Sj. The optimisation problem of multi-sphere SVDD can be formulated as follows:

    min_{R,c,ξ}  Σ_{j=1..m} Rj² + C1 Σ_{i=1..p} ξi + C2 Σ_{i=p+1..n} Σ_{j=1..m} ξij    (3)

subject to

    Σ_{j=1..m} uij ||φ(xi) − cj||² ≤ Σ_{j=1..m} uij Rj² + ξi,   i = 1, ..., p
    ||φ(xi) − cj||² ≥ Rj² − ξij,   i = p+1, ..., n,  j = 1, ..., m
    ξi ≥ 0,  i = 1, ..., p;   ξij ≥ 0,  i = p+1, ..., n,  j = 1, ..., m    (4)
where R = [Rj]_{j=1,...,m} is the vector of radii, C1 and C2 are constants, ξi and ξij are slack variables, φ(.) is a kernel function, and c = [cj]_{j=1,...,m} is the vector of centres. The mapping φ(xi0) of a normal data point xi0, i0 ∈ {1, 2, ..., p}, has to be in one of those hyperspheres, i.e. there exists a hypersphere Sj0, j0 ∈ {1, 2, ..., m}, such that u_{i0 j0} = 1 and u_{i0 j} = 0 for j ≠ j0. Minimising the function in (3) over the variables R, c and ξ subject to (4) will determine the radii and centres of the hyperspheres and the slack variables if the matrix U is given. On the other hand, the matrix U can be determined if the radii and centres of the hyperspheres are given. Therefore an iterative algorithm is applied to find the complete solution. The algorithm consists of two alternating steps: 1) calculate the radii and centres of the hyperspheres and the slack variables, and 2) calculate the membership matrix U. We present the iterative algorithm in the next sections and prove that the classification error in the current iteration will be smaller than that in the previous iteration. For classifying a data point x, the following decision function is used:

    f(x) = sign( max_{1≤j≤m} [ Rj² − ||φ(x) − cj||² ] )    (5)
The unknown data point x is normal if f(x) = +1 or abnormal if f(x) = −1. This decision function implies that the mapping of a normal data point has to be inside one of the hyperspheres and that the mapping of an abnormal data point has to be outside all of them. The following theorem considers the relation of the slack variables to how data points are classified.

Theorem 1. Assume that (R, c, ξ) is a solution of the optimisation problem in (3) and xi, i ∈ {1, 2, ..., n}, is the i-th data point.

1. xi is normal: denote by Sk(ck, Rk), k ∈ {1, 2, ..., m}, the only hypersphere having uik = 1. If xi is misclassified then ξi = ||φ(xi) − ck||² − Rk². If xi is correctly classified then ξi = 0 if φ(xi) ∈ Sk, or ξi = ||φ(xi) − ck||² − Rk² if φ(xi) ∉ Sk.
2. xi is abnormal: if xi is missclassified and φ(xi ) ∈ Sj then ξij = Rj2 −||φ(xi )− cj ||2 . If xi is missclassified and φ(xi ) ∈ Sj then ξij = 0. If xi is correctly classified then ξij = 0. Proof. From(4) we have ξi = max 0, ||φ(xi ) − ck ||2 − Rk2 , if xi is normal, and ξij = max 0, Rj2 − ||φ(xi ) − cj ||2 , if xi is abnormal. 1. xi is normal: if xi is misclassified then φ(xi ) is outside all of the hypersheres. It follows that ||φ(xi ) − cj ||2 > Rj2 , j = 1, . . . , m. So ξi = ||φ(xi ) − ck ||2 − Rk2 with some k. If xi is correctly classified then the proof is obtained using (5). 2. xi is abnormal: it is easy to prove using ξij = max 0, Rj2 − ||φ(xi ) − cj ||2 . The following empirical error can be defined for a data point xi : ⎧ 2 2 ⎪ min ||φ(x ) − c || − R ⎪ j i j j ⎪ ⎪ ⎪ ⎨ if x i is normal and misclassified error(i) = min R2 − ||φ(x ) − c ||2 , x ∈ S (c , R ) (6) j i j i j j j ⎪ j ⎪ ⎪ ⎪ ⎪ ⎩ if xi is abnormal and misclassified 0 otherwise n nReferring to Theorem 1, it is easy to prove that i=1 ξi is an upper bound of i=1 error(i). 3.2
Calculating Radii, Centres and Slack Variables
The Lagrange function for the optimisation problem in in (3) subject to (4) is as follows L(R, c, ξ, α, β) =
m j=1 p
Rj2 + C1
p
ξi + C2
i=1
n m
ξij +
i=p+1 j=1
2 αi ||φ(xi ) − cs(i) ||2 − Rs(i) − ξi −
i=1 n
m
i=p+1 j=1 p
αij ||φ(xi ) − cj ||2 − Rj2 − ξij −
βi ξi −
i=1
n m
βij ξij
(7)
i=p+1 j=1
where s(i) ∈ {1, . . . , m} is index of the hypersphere to which data point xi belongs and satisfies uis(i) = 1 and uij = 0 ∀j = s(i). Setting derivatives of L(R, c, ξ, α, β)with respect to primal variables to 0, we obtain n ∂L =0 ⇒ αi y i + αij yi = 1 (8) ∂Rj −1 p+1 i∈s
(j)
Multiple Distribution Data Description Learning Algorithm
∂L/∂cj = 0  ⇒  cj = Σ_{i∈s⁻¹(j)} αi yi φ(xi) + Σ_{i=p+1}^n αij yi φ(xi),   j = 1, ..., m   (9)

∂L/∂ξi = 0  ⇒  αi + βi = C1,   i = 1, ..., p   (10)

∂L/∂ξij = 0  ⇒  αij + βij = C2,   i = p + 1, ..., n,  j = 1, ..., m   (11)

αi ≥ 0,  ||φ(xi) − c_{s(i)}||² − R²_{s(i)} − ξi ≤ 0,  αi ( ||φ(xi) − c_{s(i)}||² − R²_{s(i)} − ξi ) = 0,   i = 1, ..., p   (12)

αij ≥ 0,  ||φ(xi) − cj||² − Rj² + ξij ≥ 0,  αij ( ||φ(xi) − cj||² − Rj² + ξij ) = 0,   i = p + 1, ..., n,  j = 1, ..., m   (13)

βi ≥ 0,  ξi ≥ 0,  βi ξi = 0,   i = 1, ..., p   (14)

βij ≥ 0,  ξij ≥ 0,  βij ξij = 0,   i = p + 1, ..., n,  j = 1, ..., m   (15)
To get the dual form, we substitute (8)–(15) into the Lagrange function in (7) and obtain the following:

L = Σ_{i=1}^p αi ||φ(xi) − c_{s(i)}||² − Σ_{i=p+1}^n Σ_{j=1}^m αij ||φ(xi) − cj||²

  = Σ_{i=1}^p αi K(xi, xi) − Σ_{i=p+1}^n Σ_{j=1}^m αij K(xi, xi) − 2 Σ_{j=1}^m ( Σ_{i∈s⁻¹(j)} αi φ(xi) − Σ_{i=p+1}^n αij φ(xi) ) · cj + Σ_{j=1}^m ( Σ_{i∈s⁻¹(j)} αi − Σ_{i=p+1}^n αij ) ||cj||²

  = Σ_{j=1}^m ( Σ_{i∈s⁻¹(j)} αi yi K(xi, xi) + Σ_{i=p+1}^n αij yi K(xi, xi) − ||cj||² )   (16)

where the last equality follows from the constraint (8) and the expression for cj in (9).
The result in (16) shows that the optimisation problem in (3) is equivalent to m individual optimisation problems as follows
min_α ( || Σ_{i∈s⁻¹(j)} αi yi φ(xi) + Σ_{i=p+1}^n αij yi φ(xi) ||² − Σ_{i∈s⁻¹(j)} αi yi K(xi, xi) − Σ_{i=p+1}^n αij yi K(xi, xi) )   (17)

subject to

Σ_{i∈s⁻¹(j)} αi yi + Σ_{i=p+1}^n αij yi = 1,
0 ≤ αi ≤ C1 for i ∈ s⁻¹(j),   0 ≤ αij ≤ C2 for i = p + 1, ..., n,   j = 1, ..., m   (18)
After solving all of these individual optimisation problems, we can calculate the updated radii R = [Rj] and centres c = [cj], j = 1, ..., m, using the equations in SVDD.

3.3 Calculating Membership U
We use the radii and centres of the hyperspheres to update the membership matrix U. The following algorithm is proposed:

For each normal data point xi, i = 1 to p do
  If xi is misclassified then
    Let j0 = arg min_j ( ||φ(xi) − cj||² − Rj² )
    Set uij0 = 1 and uij = 0 for j ≠ j0
  Else
    Denote J = {j : φ(xi) ∈ Sj(cj, Rj)}
    Let j0 = arg min_{j∈J} ||φ(xi) − cj||²
    Set uij0 = 1 and uij = 0 for j ≠ j0
  End If
End For
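As a concrete illustration, the following is a minimal NumPy sketch of this membership update, assuming the squared kernel-space distances dist2[i, j] = ||φ(xi) − cj||² for the p normal points have already been computed from the dual solution; the function and variable names are illustrative, not part of the original algorithm.

```python
import numpy as np

def update_membership(dist2, radii):
    """Membership update sketch: dist2 is a (p, m) array of squared distances of
    the p normal points to the m centres, radii is (m,). Returns a one-hot U."""
    p, m = dist2.shape
    U = np.zeros((p, m), dtype=int)
    inside = dist2 <= radii[np.newaxis, :] ** 2       # which spheres contain each point
    for i in range(p):
        if not inside[i].any():
            # misclassified: assign to the sphere whose boundary is nearest
            j0 = np.argmin(dist2[i] - radii ** 2)
        else:
            # correctly classified: among covering spheres, pick the closest centre
            candidates = np.where(inside[i])[0]
            j0 = candidates[np.argmin(dist2[i, candidates])]
        U[i, j0] = 1
    return U
```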
3.4 Iterative Learning Process
The proposed iterative learning process for multi-sphere SVDD runs two alternating steps until convergence, as follows:

Initialise U by clustering the normal data set in the input space
Repeat
  Calculate R, c and ξ using U
  Calculate U using R and c
Until convergence is reached
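A schematic of this outer loop might look as follows, assuming a placeholder solve_spheres routine that solves the m dual problems (17)–(18) for a fixed U (e.g. with a QP solver) and the update_membership sketch above; it is an illustrative outline rather than the authors' implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def train_ms_svdd(X_normal, m, solve_spheres, n_iter=50):
    """Outer-loop sketch: alternate between solving the m SVDD subproblems for
    a fixed membership U and reassigning memberships for the fixed spheres."""
    # Step 1: initialise U by clustering the normal data in input space
    labels = KMeans(n_clusters=m, n_init=10).fit_predict(X_normal)
    U = np.eye(m, dtype=int)[labels]                  # one-hot membership matrix
    for _ in range(n_iter):
        dist2, radii = solve_spheres(X_normal, U)     # placeholder dual/QP solve
        U_new = update_membership(dist2, radii)
        if (U_new == U).all():                        # memberships stable: converged
            break
        U = U_new
    return dist2, radii, U
```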
We can prove that the classification error in the current iteration will be smaller than that in the previous iteration through the following key theorem.

Theorem 2. Let (R, c, ξ, U) and (R′, c′, ξ′, U′) be the solutions at the previous iteration and the current iteration, respectively. The following inequality holds:

Σ_{j=1}^m R′j² + C1 Σ_{i=1}^p ξ′i + C2 Σ_{i=p+1}^n Σ_{j=1}^m ξ′ij ≤ Σ_{j=1}^m Rj² + C1 Σ_{i=1}^p ξi + C2 Σ_{i=p+1}^n Σ_{j=1}^m ξij   (19)

Proof. We prove that (R, c, ξ, U′) is a feasible solution at the current iteration.

Case 1: xi is normal and misclassified.

Σ_{j=1}^m uij ( ||φ(xi) − cj||² − Rj² ) − Σ_{j=1}^m u′ij ( ||φ(xi) − cj||² − Rj² ) = u_{is(i)} ( ||φ(xi) − c_{s(i)}||² − R²_{s(i)} ) − min_j ( ||φ(xi) − cj||² − Rj² ) ≥ 0   (20)

Hence

Σ_{j=1}^m u′ij ( ||φ(xi) − cj||² − Rj² ) ≤ Σ_{j=1}^m uij ( ||φ(xi) − cj||² − Rj² ) ≤ ξi   (21)

The second inequality in (21) holds because (R, c, ξ, U) is the solution at the previous step.

Case 2: xi is normal and correctly classified. Denote J = {j : φ(xi) ∈ Sj(cj, Rj)} and j0 = arg min_{j∈J} ||φ(xi) − cj||²; then

Σ_{j=1}^m u′ij ( ||φ(xi) − cj||² − Rj² ) = ||φ(xi) − c_{j0}||² − R²_{j0} ≤ 0 ≤ ξi   (22)

Case 3: xi is abnormal. It is seen that

||φ(xi) − cj||² ≥ Rj² − ξij,   i = p + 1, ..., n,  j = 1, ..., m   (23)

From (21)–(23), we can conclude that (R, c, ξ, U′) is a feasible solution at the current iteration. In addition, (R′, c′, ξ′, U′) is the optimal solution at the current iteration. That yields our conclusion.
4 Experimental Results
We performed our experiments on 28 well-known data sets related to machine fault detection and bioinformatics. These data sets were originally balanced and some of them contain several classes. For each data set, we picked one class at a time and divided the data of this class into two equal subsets. One subset was used as the training set, and the other subset, together with the data of the other
Table 1. Number of data points in the 28 data sets. #normal: number of normal data points, #abnormal: number of abnormal data points, d: dimension.

Data set        #normal  #abnormal  d
Arrhythmia      237      183        278
Astroparticle   2000     1089       4
Australian      383      307        14
Breast Cancer   444      239        10
Bioinformatics  221      117        20
Biomed          67       127        5
Colon cancer    40       22         2000
DelfPump        1124     376        64
Diabetes        500      268        8
Dna             464      485        180
Duke            44       23         7129
Fourclass       307      555        2
Glass           70       76         9
Heart           303      164        13
Hepatitis       123      32         19
Ionosphere      255      126        34
Letter          594      567        16
Liver           200      145        6
Protein         4065     13701      357
Sonar           97       111        67
Spectf          254      95         44
Splice          517      483        60
SvmGuide1       2000     1089       4
SvmGuide3       296      947        22
Thyroid         3679     93         21
USPS            1194     6097       256
Vehicle         212      217        18
Wine            59       71         13
classes, was used for testing. We repeated this division ten times and report the average classification rates. We also compared our multi-sphere SVDD method with SVDD and OCSVM. The classification rate acc is measured as [6]

acc = √( acc+ · acc− )   (24)

where acc+ and acc− are the classification accuracies on normal and abnormal data, respectively. The popular RBF kernel function K(x, x′) = e^{−γ||x−x′||²} was used in our experiments. The parameter γ was searched in {2^k : k = 2l + 1, l = −8, −7, ..., 2}. For SVDD and multi-sphere SVDD, the trade-off parameter C1 was searched
Table 2. Classification results (in %) on the 28 data sets for OCSVM, SVDD and Multi-sphere SVDD (MS-SVDD).

Data set        OCSVM  SVDD   MS-SVDD
Arrhythmia      63.16  70.13  70.13
Astroparticle   89.66  90.41  93.23
Australian      77.19  80.00  81.80
B. Cancer       95.25  98.64  98.64
Bioinformatics  68.34  68.10  72.00
Biomed          74.98  63.83  74.76
Colon cancer    69.08  67.42  67.42
DelfPump        63.20  70.65  75.27
Diabetes        68.83  72.30  78.72
Dna             76.08  73.70  83.01
Duke cancer     62.55  65.94  65.94
FourClass       93.26  98.48  98.76
Glass           80.60  79.21  79.21
Heart           73.40  77.60  79.45
Hepatitis       76.82  80.17  81.90
Ionosphere      90.90  88.73  92.12
Letter          91.42  95.86  98.03
Liver           73.80  62.45  74.12
Protein         63.65  70.68  71.11
Sonar           65.97  72.91  72.91
Spectf          77.10  70.71  77.36
Splice          64.43  70.51  70.51
SVMGuide1       89.56  87.92  93.05
SvmGuide3       63.14  70.63  70.63
Thyroid         87.88  87.63  91.44
USPS            92.85  92.83  96.23
Vehicle         64.50  70.38  75.04
Wine            88.30  98.31  98.31
over the grid {2^k : k = 2l + 1, l = −8, −7, ..., 2}, and C2 was searched such that the ratio C2/C1 belonged to

{ (1/4)·(p/(n−p)),  (1/2)·(p/(n−p)),  p/(n−p),  2·(p/(n−p)),  4·(p/(n−p)) }   (25)

For OCSVM, the parameter ν was searched in {0.1k : k = 1, ..., 9}. For multi-sphere SVDD, the number of hyperspheres was varied from 1 to 10, and 50 iterations were applied to each training run. Table 2 presents the classification results for OCSVM, SVDD, and multi-sphere SVDD (MS-SVDD). The results over the 28 data sets show that MS-SVDD always performs at least as well as SVDD. The reason is that SVDD can be regarded as the special case of MS-SVDD in which the number of hyperspheres is 1. MS-SVDD provides the highest accuracies on all data sets except the Colon cancer and Biomed data sets. For some cases, MS-SVDD obtains the same result as SVDD. This could be
explained by those data sets having only one underlying distribution. Our new model attains its largest improvements on the larger data sets. This is expected, since large data sets are more likely to contain several distributions that are better described by different hyperspheres.
5 Conclusion
We have proposed a new multiple-hypersphere approach to solving the one-class classification problem using support vector data description, in which a data set is described by a set of hyperspheres. This is an iterative learning process, and we can prove theoretically that the error rate obtained in the current iteration is less than that in the previous iteration. We have compared our proposed method with support vector data description and the one-class support vector machine. Experimental results have shown that our proposed method provides better performance than those two methods over 28 well-known data sets.
References
1. Bishop, C.M.: Novelty detection and neural network validation. In: IEEE Proceedings of Vision, Image and Signal Processing, pp. 217–222 (1994)
2. Barnett, V., Lewis, T.: Outliers in statistical data, 3rd edn. Wiley, Chichester (1978)
3. Campbell, C., Bennet, K.P.: A linear programming approach to novelty detection. Advances in Neural Information Processing Systems 14 (2001)
4. Chang, C.-C., Lin, C.-J.: LIBSVM: A Library for Support Vector Machines, http://www.csie.ntu.edu.tw/~cjlin/libsvm
5. Hao, P.Y., Liu, Y.H.: A New Multi-class Support Vector Machine with Multisphere in the Feature Space. In: Okuno, H.G., Ali, M. (eds.) IEA/AIE 2007. LNCS (LNAI), vol. 4570, pp. 756–765. Springer, Heidelberg (2007)
6. Kubat, M., Matwin, S.: Addressing the curse of imbalanced training set: One-sided selection. In: Proc. 14th International Conference on Machine Learning, pp. 179–186 (1997)
7. Le, T., Tran, D., Ma, W., Sharma, D.: An Optimal Sphere and Two Large Margins Approach for Novelty Detection. In: Proc. IEEE World Congress on Computational Intelligence, WCCI (accepted 2010)
8. Lin, Y., Lee, Y., Wahba, G.: Support vector machine for classification in nonstandard situations. Machine Learning 15, 1115–1148 (2002)
9. Moya, M.M., Koch, M.W., Hostetler, L.D.: One-class classifier networks for target recognition applications. In: Proceedings of World Congress on Neural Networks, pp. 797–801 (1991)
10. Mu, T., Nandi, A.K.: Multiclass Classification Based on Extended Support Vector Data Description. IEEE Transactions on Systems, Man and Cybernetics Part B: Cybernetics 39(5), 1206–1217 (2009)
11. Parra, L., Deco, G., Miesbach, S.: Statistical independence and novelty detection with information preserving nonlinear maps. Neural Computation 8, 260–269 (1996)
12. Roberts, S., Tarassenko, L.: A Probabilistic Resource Allocation Network for Novelty Detection. Neural Computation 6, 270–284 (1994)
13. Schölkopf, B., Smola, A.J.: Learning with kernels. The MIT Press, Cambridge (2002)
14. Tax, D.M.J., Duin, R.P.W.: Support vector data description. Machine Learning 54, 45–56 (2004)
15. Tax, D.M.J.: Datasets (2009), http://ict.ewi.tudelft.nl/~davidt/occ/index.html
16. Towel, G.G.: Local expert autoassociator for anomaly detection. In: Proc. 17th International Conference on Machine Learning, pp. 1023–1030. Morgan Kaufmann Publishers Inc., San Francisco (2000)
17. Vapnik, V.: The nature of statistical learning theory. Springer, Heidelberg (1995)
18. Vert, J., Vert, J.P.: Consistency and convergence rates of one class svm and related algorithm. Journal of Machine Learning Research 7, 817–854 (2006)
19. Xiao, Y., Liu, B., Cao, L., Wu, X., Zhang, C., Hao, Z., Yang, F., Cao, J.: Multi-sphere Support Vector Data Description for Outliers Detection on Multi-Distribution Data. In: Proc. IEEE International Conference on Data Mining Workshops, pp. 82–88 (2009)
20. Yu, M., Ye, J.: A Small Sphere and Large Margin Approach for Novelty Detection Using Training Data with Outliers. IEEE Transactions on Pattern Analysis and Machine Intelligence 31, 2088–2092 (2009)
RADAR: Rare Category Detection via Computation of Boundary Degree Hao Huang, Qinming He, Jiangfeng He, and Lianhang Ma College of Computer Science and Technology, Zhejiang University, Hangzhou, 310027, China {howardhuang,hqm,jerson_hjf2003,badmartin}@zju.edu.cn
Abstract. Rare category detection is an open challenge for active learning. It can help select anomalies and then query their class labels with human experts. Compared with traditional anomaly detection, this task does not focus on finding individual and isolated instances. Instead, it selects interesting and useful anomalies from small compact clusters. Furthermore, the goal of rare category detection is to request as few queries as possible to find at least one representative data point from each rare class. Previous research works can be divided into three major groups, model-based, density-based and clustering-based methods. Performance of these approaches is affected by the local densities of the rare classes. In this paper, we develop a density insensitive method for rare category detection called RADAR. It makes use of reverse k-nearest neighbors to measure the boundary degree of each data point, and then selects examples with high boundary degree for the class-label querying. Experimental results on both synthetic and real-world data sets demonstrate the effectiveness of our algorithm. Keywords: active learning, anomaly detection, rare category detection.
1 Introduction
Rare category detection is an interesting task derived from anomaly detection. It was first proposed by Pelleg et al. [2] to help the user select useful and interesting anomalies. Compared with traditional anomaly detection, it aims to find representative data points of the compact rare classes, which differ from the individual and isolated instances in the low-density regions. Furthermore, a human expert is required to label the selected data point under a known class or a previously undiscovered class. A good rare category detection algorithm should discover at least one example from each class with the fewest label requests. Rare category detection has many applications in the real world. In the Sloan Digital Sky Survey [2], it helps astronomers find useful anomalies in massive sky survey images, which may lead to new astronomical discoveries. In financial fraud detection [3], although most financial transactions are legitimate, there are a few fraudulent ones. Compared with checking them one by one, rare category detection is much more efficient at detecting instances of the fraud patterns. In
intrusion detection [4], the authors adopted an active learning framework to select "interesting traffic" from huge-volume traffic data sets, so that engineers could find the meaningful malicious network activities. In visual analytics [5], by locating the attractive changes in mass remote sensing imagery, geographers can determine which changes in a particular geographic area are significant. Up until now, several approaches have been proposed for rare category detection. The main techniques can be categorized into model-based [2], density-based [7][8][9], and clustering-based [10] methods. The model-based methods assume a mixture model to fit the data, and select the strangest records in the mixture components for class labeling. However, this assumption limits their applicable scope. For example, they require that the majority classes and the rare classes be separable, or they work best in the separable case [7]. The density-based methods essentially employ a local-density-differential-sampling strategy, which selects the points from the regions where the local densities fall the most. This kind of approach can discover examples of the rare classes rapidly, despite non-separability from the majority classes. But when the local densities of some rare classes are not dramatically higher than those of the majority classes, their performance is not as good as in the high density-differential case. The clustering-based methods first perform a hierarchical mean shift clustering, then select the clusters which are compact and isolated and query the cluster modes. Intuitively, if each rare class has a high local density and is isolated, its points will easily converge at the mode of density by using mean shift. But in real-world data sets this is often not the case. First, the rare classes are often hidden in the majority classes. Second, if the local densities of the rare classes are not high enough, their points may converge to other clusters. In short, although the density-based and clustering-based methods work reasonably well compared with model-based methods, their performance is still affected by the local densities of the rare classes. In order to avoid the effect of the local densities of the rare classes, we propose a density insensitive approach called RADAR. To the best of our knowledge, RADAR is the first sophisticated density insensitive method for rare category detection. In our approach, we use the change in the number of RkNN (reverse k-nearest neighbors) to estimate the boundary degree of each data point. A point with a higher boundary degree has a higher probability of being a boundary point of a rare class. We then sort the data points by their boundary degrees and query their class labels with human experts. The key contribution of our work is twofold: (1) we propose a density insensitive method for rare category detection; (2) our approach has a higher efficiency in finding new classes and effectively reduces the number of queries to human experts. The rest of the paper is organized as follows. Section 2 formalizes the problem and defines its scope. Section 3 explains the working principle and working steps of our approach. In Section 4, we compare RADAR with existing approaches on both synthetic data sets and real data sets. Section 5 concludes this paper.
2 Problem Formalization
Following the definition of He et al. [7], we are given a set of unlabeled examples S = {x1, x2, ..., xn}, xi ∈ Rd, which come from m distinct categories, i.e. yi ∈ {1, 2, ..., m}. Our goal is to find at least one example from each category with as few label requests as possible. For convenience, assume that there is only one majority class, which corresponds to yi = 1, and all the other categories are minority classes with priors pc, c = 2, ..., m. Let p1 denote the prior of the majority category. Notice that pi, i ≠ 1, is much smaller than p1. Our rare category detection strategy is to select the points with the highest boundary degree for labeling. To make our approach clear, we introduce the following definitions, which are used throughout the rest of the paper.

Definition 1. (Reverse k-nearest neighbor) The reverse k-nearest neighbors (RkNN) of a point are defined as [6]: given a data set DB, a point p, a positive integer k and a distance metric M, the reverse k-nearest neighbors of p, i.e. RkNNp(k), are the set of points pi such that pi ∈ DB and p ∈ kNNpi(k), where kNNpi(k) are the k-nearest neighbors of point pi.

Definition 2. (Significant point) A point is a significant point if the number of its RkNN is above a certain threshold τ:

τ(k, w) = mean(k) − w · std(k)   (1)

mean(k) = ( Σ_{q∈S} |RkNNq(k)| ) / n   (2)

std(k) = ( Σ_{q∈S} ( |RkNNq(k)| − Σ_{p∈S} |RkNNp(k)| / n )² / n )^{1/2}   (3)

Definition 3. (Coarctation index) The coarctation index of a point p, i.e. CI, is defined as:

CI(p, k) = Σ_{q ∈ {p} ∪ kNNp(k)} |RkNNq(k)|   (4)

Definition 4. (Uneven index) The uneven index of a point p, i.e. UI, is defined as:

UI(p, k) = Σ_{q ∈ {p} ∪ kNNp(k)} ( |RkNNq(k)| − CI(p, k) / (k + 1) )²   (5)

Definition 5. (Boundary degree) The boundary degree of a point p, i.e. BD, is defined as:

BD(p, k) = UI(p, k) / CI(p, k)   (6)
3 RADAR Algorithm

3.1 Working Principle
In this subsection, we explain why we have adopted an RkNN-based measurement for the boundary degree, and illustrate the reason for adopting the conception of the significant point.

RkNN-based boundary point detection. RkNN has some unique properties [6]: (1) the cardinality of a point's reverse k-nearest neighbors varies with the data distribution; (2) the RkNN of the boundary points are fewer than those of the inner points. The first property means that RkNN only considers the relative position between data points, and thus has nothing to do with the Euclidean distance or the local density. A simple example: two clusters have the same data distribution but differ in the absolute distances between data points. Obviously, although the two clusters have different densities, the kNN and the RkNN of corresponding data points in the two clusters remain the same. According to Definitions 3, 4 and 5, the CI, UI and BD of corresponding data points in the two clusters are equal too. In other words, CI, UI and BD are not designed to evaluate the local densities of data points. Instead, they are used to find the local regions where the data distribution changes. The second property is the reason why the boundary degree can be used to detect the change in the data distribution. Generally, the kNN of boundary points include some inner points of the cluster, several outer points and some other boundary points nearby. These three types of data points are very different in the number of RkNN. By contrast, an inner point's kNN are usually still inner points. According to Definition 5, UI indicates the variation of the number of RkNN between a data point and its kNN. Thus, the boundary points have higher UI than the inner points. On the other hand, the CI of a data point represents the sum of the numbers of RkNN. When the number of kNN is fixed, a point with more inner nearest neighbors has a higher CI. In other words, the CI of inner points is higher than that of the boundary points. Therefore, the boundary points have higher UI, lower CI, and thus higher BD. By querying the class labels of the data points with high BD, we can discover examples of rare classes hidden in the majority class. Just like using radar to scan the data set, we do not consider the situation in which the local data distribution is even; however, if there are changes in the local data distribution, we get a feedback signal and locate the target.

Significant point. Before discussing the significant points, which have more than τ RkNN, we begin with the example illustrated in Fig. 1, which comes from the literature [6]. When k = 2, Table 1 shows the kNN and the RkNN of each point in Fig. 1. The cardinality of each point's RkNN is as follows: p2, p3, p5 and p7 have 3 RkNN; p6 has 2 RkNN; p1 and p4 have 1 RkNN; p8 has none. Notice that p8's nearest neighbors are in a relatively compact cluster consisting of p5, p6, p7. However, points in this cluster are each other's kNN. Since the capacity of each
Fig. 1. Example of RkNN

Table 1. The kNN and RkNN of each point

Point  p1      p2          p3          p4      p5          p6      p7          p8
kNN    p2, p3  p1, p3      p2, p4      p2, p3  p6, p7      p5, p7  p5, p6      p5, p7
RkNN   p2      p1, p3, p4  p1, p2, p4  p3      p6, p7, p8  p5, p7  p5, p6, p8  —
point's kNN list is limited, p8 is not in the kNN lists of its nearest neighbors and thus has no RkNN. According to Fig. 1, it is hard to say that p8 is a candidate minority-class point. In other words, if the RkNN of a point are extremely few, the point is relatively far from the other points. It is not worth querying its class label, because of the low probability of this point belonging to a compact cluster. Therefore, in our approach, we will query a point only if it is a significant point.
3.2 Algorithm
The RADAR algorithm is presented in Algorithm 1. It works as follows. Firstly, we estimate the point number ki of each minority category i. Then, we find the ki-nearest neighbors of each point, and calculate the number of RkNN. Furthermore, we calculate ri for each minority class; ri is the global minimum distance between each point and its ki-th nearest neighbor. It will be used for updating the querying-duty-exemption list EL in Step 13. In the outer loop of Step 9, we first choose the smallest undiscovered class i and set k′ to be the corresponding ki. Then, in Step 11, we calculate each point's boundary degree (BD) and determine which points are significant points. By setting the BD of the non-significant points to negative infinity, we can prevent them from the
Algorithm 1. Rare Category Detection via Computation of Boundary Degree (RADAR)
Input: Unlabeled data set S, p2, ..., pm, w
Output: The set I of selected examples and the set L of their labels.
1: for i = 2 to m do
2:   Let ki = |S| · pi.
3:   ∀xj ∈ S, k_dist_xj(ki) is the distance between xj and its ki-th nearest neighbor. Set kNN_xj(ki) = {x | x ∈ S, ||x − xj|| ≤ k_dist_xj(ki), x ≠ xj}.
4:   ∀xj ∈ S, |kNN_x(ki)|_xj is the number of occurrences of xj in kNN_x(ki). Set |RkNN_xj(ki)| = Σ_{x∈S} |kNN_x(ki)|_xj.
5:   Let ri = min_{x∈S}(k_dist_x(ki)).
6: end for
7: Set r1 = min_{i=2,...,m} ri.
8: Build an empty querying-duty-exemption point list EL.
9: while not all the categories have been discovered do
10:   Let k′ = min{ki | 2 ≤ i ≤ m, and category i has not been discovered}.
11:   ∀x ∈ S, calculate BD(x, k′); if |RkNN_x(k′)| < τ(k′, w), set BD(x, k′) = −∞.
12:   for t = 1 to n do
13:     for each xj that has been selected and labeled yj, let EL = EL ∪ {x | x ∈ S, ||x − xj|| ≤ r_yj}.
14:     Query x = arg max_{x∈S\EL} BD(x, k′).
15:     if x belongs to a new category, break.
16:   end for
17: end while
class-label querying and thus save the querying budget. In addition, setting the parameter w to 1 is a suitable experimental choice. In the inner loop of Step 12, we query the point with the maximum boundary degree with human experts. When we find an example from a previously undiscovered class, we quit the inner loop. In order to reduce the number of queries caused by repeatedly selecting examples from the same discovered category, we employ a discreet querying-duty-exemption strategy: (1) in Step 8, we build an empty point list EL to record the points which do not need to be queried; (2) in Step 13, if a point xj from class yj is labeled, the points falling inside a hyper-ball B of radius r_yj centered at xj are added into EL. A good exemption strategy can help us reduce the querying cost. But if the exemption strategy is greedy, more points near the labeled points are added into EL. Then, the risk of preventing some minority classes from being queried is higher, especially when the minority classes are near each other. In order to avoid such cases, we should ensure that the number of querying-duty-exemption points does not become too large. In our discreet exemption strategy, when we label a point under a minority class i, the number of points in the hyper-ball B is not more than ki, i.e. |B| ≤ ki. The reason is that the radius ri is the global
minimum distance between each point and its ki-th nearest neighbor. When we label a point under the majority class, we do the querying-duty exemption more carefully, because this point is usually close to a rare category's boundary. We do not set the corresponding radius of B to min_{x∈S}(k_dist_x(k1)). Instead, for the
sake of discreetness, we set r1 = min_{i=2,...,m} ri so that the nearby rare-category points can keep their querying duties completely or partially.
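Building on the boundary_degrees sketch above, the querying loop of Algorithm 1 could be outlined as follows for a single fixed k, with exempt_radius an assumed mapping from class label to the exemption radius r_y and oracle standing in for the human expert; this is a simplified illustration, not the authors' code.

```python
import numpy as np

def radar_queries(X, bd, rknn_count, tau, exempt_radius, oracle, m):
    """Single-k sketch of Steps 9-17: query by decreasing boundary degree,
    skipping non-significant points and points on the exemption list EL."""
    bd = np.where(rknn_count < tau, -np.inf, bd)   # Step 11: drop non-significant points
    exempt = np.zeros(len(X), dtype=bool)          # EL, the querying-duty-exemption list
    queries, discovered = [], set()
    for idx in np.argsort(-bd):                    # Step 14: highest boundary degree first
        if exempt[idx] or np.isneginf(bd[idx]):
            continue
        label = oracle(idx)                        # ask the human expert for the label
        queries.append((idx, label))
        discovered.add(label)
        if len(discovered) == m:                   # every category found: stop querying
            break
        # Step 13: exempt points inside the hyper-ball of radius r_y around the labeled point
        dists = np.linalg.norm(X - X[idx], axis=1)
        exempt |= dists <= exempt_radius[label]
    return queries
```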
4 Performance Evaluation
In this section, we compare RADAR with NNDM (the density-based method proposed in [7]), HMS (the clustering-based method proposed in [10]) and random sampling (RS) on both synthetic and real data sets. For RS, we run the experiments 50 times and take the average numbers of queries as the results.
4.1 Synthetic Data Sets
In this subsection, we compare RADAR with NNDM, HMS and RS on 3 synthetic data sets. In Fig. 2(a) and Fig. 2(b), the first two synthetic data sets contain the same majority class, which has 1000 examples (green points) with Gaussian distribution. Each minority class (red points) in Fig. 2(a) has 20 examples, which fall inside a hyper-ball of radius 5. In Fig. 2(b), we double the distances between the 20 examples of each minority class (red points). Then, each minority class in the second data set falls inside a hyper-ball of radius 10. Obviously, the densities of the minority classes in Fig. 2(a) are about 4 times higher than those in Fig. 2(b). The corresponding comparison results are illustrated in Fig. 3(a) and Fig. 3(b), respectively. The curves in Fig. 3 show the number of discovered classes as a function of the number of selected examples, which is equal to the number of queries to the user. To discover all the classes in the first two data sets, RS needs
(a) High-density minority classes   (b) Low-density minority classes
Fig. 2. Synthetic data sets
(a) Results of the high-density case   (b) Results of the low-density case
Fig. 3. Comparison results on synthetic data sets (number of classes discovered vs. number of selected examples)
101 and 100 queries respectively; HMS needs 62 and 89 queries respectively; NNDM needs 10 and 31 queries respectively; RADAR needs 8 and 10 queries respectively. From these results we can see that the performance of NNDM and HMS is dramatically affected by the local densities of the rare classes. By contrast, RADAR and RS are more insensitive to these local densities. Furthermore, our approach is much more sophisticated than the straightforward method RS, and has a high efficiency in finding new classes. The third synthetic data set in Fig. 4(a) is a multi-density data set. The majority class has 1000 examples (green points) with Gaussian distribution. Each minority class (red points) has 20 examples, and each has a different density. The comparison results are shown in Fig. 4(b). From this figure, we can learn that the performance of RADAR is better than that of NNDM, HMS and RS on this multi-density data set. To find all the classes, RS needs 103 queries; HMS needs 343 queries; NNDM needs 55 queries; RADAR needs 17 queries.
(a) Data set   (b) Results
Fig. 4. Experiment on the synthetic multi-density data set
4.2 Real Data Sets
In this section, we compare RADAR with NNDM, HMS, and RS on 4 real data sets from the UCI data repository [1]: the Abalone, Statlog, Wine Quality and Yeast data sets. The properties of these data sets are summarized in Table 2. In addition, the Statlog is sub-sampled because the original Image Segmentation (Statlog) data set contains almost the same number of examples for each class. The sub-sampling creates an imbalanced data set which suits the rare category detection scenario. With the sub-sampling, the largest class in Statlog contains 256 examples; the examples of the next class are half as many as that

Table 2. Properties of the real data sets

Data Set      Records  Dimension  Classes  Largest Class  Smallest Class
Abalone       4177     7          20       16.5%          0.34%
Statlog       512      19         7        50%            1.5%
Wine Quality  4898     11         6        44.88%         0.41%
Yeast         1484     8          10       31.68%         0.33%
(a) Abalone   (b) Statlog   (c) Wine Quality   (d) Yeast
Fig. 5. Local densities of the minority classes in real data sets (local density vs. minority class)
Table 3. Number of queries needed to find out all classes for each algorithm

Data Set      RS Algorithm  HMS Algorithm  NNDM Algorithm  RADAR Algorithm
Abalone       462           539            146             131
Statlog       94            33             63              28
Wine Quality  197           51             –               20
Yeast         261           91             124             112
(a) Abalone   (b) Statlog   (c) Wine Quality   (d) Yeast
Fig. 6. Comparison results on real data sets (number of classes discovered vs. number of selected examples)
of the former one; the smallest classes all have 8 examples. The results are summarized in Table 3. The mark '–' indicates that the algorithm cannot find out all classes in the data set. These real data sets are multi-density data sets. To estimate the local density of each minority class, we adopt a measurement for the local density of a data point. We first calculate the average distance between a data point and its k-nearest neighbors. Next, we multiply the reciprocal of this average distance by the global maximum distance between the points. The product is roughly in proportion to the local density of the data point. Finally, we calculate the average value of the products for each minority class and take this value as the
local-density metric. For the sake of generalization and convenience, we set k = min{ki | 2 ≤ i ≤ m}. Fig. 5 shows the local-density values of the minority classes in each real data set. The standard deviation of these local-density values is 55.32 in Abalone, 20.47 in Statlog, 28.03 in Wine Quality and 3.99 in Yeast. Therefore, the Abalone data set is more "extreme" in the change of local densities than the other data sets. By contrast, the Yeast data set is the most "moderate" one. Fig. 6 illustrates the comparison results on the 4 real data sets in detail. From this figure, we can learn that RADAR effectively reduces the number of queries to human experts on each data set. It takes the least number of queries to discover all classes in Abalone, Statlog and Wine Quality. In Yeast, which is the most "moderate" data set, the HMS method has the highest efficiency in finding new classes. But in Abalone, which is the most "extreme" data set, HMS is not as good as RADAR, NNDM or even the RS method. Furthermore, in Wine Quality, the NNDM method falls short in performance and finds only 5 classes. In summary, RADAR has a high efficiency in finding new classes, and is more suitable for processing multi-density data because of its stability.
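As an illustration of the local-density metric described above, a small sketch (assuming Euclidean distance; the quadratic pairwise-distance computation is a simplification acceptable at these data sizes, and the names are illustrative) might be:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def minority_class_density(X, members, k):
    """Reciprocal of the mean k-NN distance, scaled by the global maximum
    pairwise distance, averaged over the members of one minority class."""
    dist, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X[members])
    mean_knn = dist[:, 1:].mean(axis=1)                    # drop the zero self-distance
    pairwise = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    return float(np.mean(pairwise.max() / mean_knn))       # class-level average
```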
5 Conclusion
We have proposed a novel approach (RADAR) for rare category detection. Compared with existing algorithms, RADAR is a density insensitive method, which is based on reverse k-nearest neighbors (RkNN). In this paper, the boundary degree of each point is measured by the variation of RkNN, and data points with high boundary degrees are selected for the class-label querying. Experimental results on both synthetic and real-world data sets demonstrate that our approach dramatically decreases the number of queries. Moreover, RADAR has another attractive property: it is more suitable for handling multi-density data sets. Future work involves automating the setting of the parameter w and adapting our approach to the prior-free case.
References 1. Asuncion, A., Newman, D.J.: UCI machine learning repository (2007) 2. Pelleg, D., Moore, A.W.: Active learning for anomaly and rare-category detection. In: Proc. NIPS 2004, pp. 1073–1080. MIT Press, Boston (2004) 3. Bay, S., Kumaraswamy, K., Anderle, M., Kumar, R., Steier, D.: Large scale detection of irregularities in accounting data. In: ICDM 2006, pp. 75–86 (2006) 4. Stokes, J.W., Platt, J.C., Kravis, J., Shilman, M.: ALADIN: active learning of anomalies to detect intrusions. Technical report, Microsoft Research (2008) 5. Porter, R., Hush, D., Harvey, N., Theiler, J.: Toward interactive search in remote sensing imagery. In: Proc. SPIE, Vol. 7709, pp. 77090V–77090V–10 (2010) 6. Xia, C., Hsu, W., Lee, M.L., Ooi, B.C.: BORDER: efficient computation of boundary points. IEEE Trans. on Knowledge and Data Engineering 18(3), 289–303 (2006)
7. He, J., Carbonell, J.: Nearest-neighbor-based active learning for rare category detection. In: Proc. NIPS 2007, pp. 633–640. MIT Press, Boston (2007) 8. He, J., Liu, Y., Lawrence, R.: Graph-based rare category detection. In: Proc. ICDM 2008, pp. 833–838 (2008) 9. He, J., Carbonell, J.: Prior-free rare category detection. In: Proc. SDM 2009, pp. 155–163 (2009) 10. Vatturi, P., Wong, W.: Category detection using hierarchical mean shift. In: Proc. KDD 2009, pp. 847–856 (2009)
RKOF: Robust Kernel-Based Local Outlier Detection Jun Gao1 , Weiming Hu1 , Zhongfei (Mark) Zhang2 , Xiaoqin Zhang3 , and Ou Wu1 1
National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing, China {jgao,wmhu,wuou}@nlpr.ia.ac.cn 2 Dept. of Computer Science, State Univ. of New York at Binghamton, Binghamton, NY 13902, USA
[email protected] 3 College of Mathematics & Information Science, Wenzhou University, Zhejiang, China
[email protected]
Abstract. Outlier detection is an important and attractive problem in knowledge discovery in large data sets. The majority of the recent work in outlier detection follow the framework of Local Outlier Factor (LOF), which is based on the density estimate theory. However, LOF has two disadvantages that restrict its performance in outlier detection. First, the local density estimate of LOF is not accurate enough to detect outliers in the complex and large databases. Second, the performance of LOF depends on the parameter k that determines the scale of the local neighborhood. Our approach adopts the variable kernel density estimate to address the first disadvantage and the weighted neighborhood density estimate to improve the robustness to the variations of the parameter k, while keeping the same framework with LOF. Besides, we propose a novel kernel function named the Volcano kernel, which is more suitable for outlier detection. Experiments on several synthetic and real data sets demonstrate that our approach not only substantially increases the detection performance, but also is relatively scalable in large data sets in comparison to the state-of-the-art outlier detection methods. Keywords: Outlier detection, Kernel methods, Local density estimate.
1 Introduction
Compared with the other knowledge discovery problems, outlier detection is arguably more valuable and effective in finding rare events and exceptional cases from the data in many applications such as stock market analysis, intrusion detection, and medical diagnostics. In general, there are two definitions of the
This work is supported in part by the NSFC (Grant No. 60825204, 60935002 and 60903147) and the US NSF (Grant No. IIS-0812114 and CCF-1017828).
outlier detection: Regression outliers and Hawkins outliers. A Regression outlier is an observation that does not match the predefined metric model of the interesting data [1]. A Hawkins outlier is an observation that deviates so much from other observations as to arouse suspicion that it is generated by a different mechanism [2]. Compared with Regression outlier detection, Hawkins outlier detection is more challenging because the generative mechanism of the normal data is unknown. In this paper, we focus on unsupervised methods for Hawkins outlier detection; in the rest of this paper, outlier detection refers particularly to Hawkins outlier detection. Over the past several decades, the research on outlier detection has moved from global computation to local analysis, and the descriptions of outliers have moved from binary interpretations to probabilistic representations. Breunig et al. propose the density-estimation-based Local Outlier Factor (LOF) [4]. This work is so influential that there is a rich body of literature on local density-based outlier detection. On the one hand, plenty of local density-based methods have been proposed to compute outlier factors, such as the local correlation integral [5], the connectivity-based outlier factor [8], the spatial local outlier measure [9], and the local peculiarity factor [7]. On the other hand, many efforts are committed to combining machine learning methods with LOF to accommodate large and high dimensional data [10,14]. Although LOF is widely used in the literature, there are two major disadvantages restricting its applications. First, since LOF is based on local density estimation, it is obvious that the more accurate the density estimate, the better the detection performance. The local reachability density used in LOF is the reciprocal of the average of the reach-distances between the given object and its neighbors. This density estimate is an extension of the nearest neighbor density estimate, which is defined as

f(p) = k / ( 2n · dk(p) )   (1)
Fig. 1. (a) Eruption lengths of 107 eruptions of Old Faithful geyser. (b) The density of Old Faithful data based on the nearest neighbor density estimate, redrawn from [3].
where n is the total number of objects, and dk(p) is the distance between object p and its k-th nearest neighbor. As shown in Fig. 1, the heavy tails of the density function and the discontinuities in the derivative reduce the accuracy of the density estimate. This dilemma indicates that, with the LOF method, an outlier is unable to deviate substantially from the normal objects in complex and large databases. Second, like all other local density-based outlier detection methods, the performance of LOF depends on the parameter k, which is defined as the least number of nearest neighbors in the neighborhood of an object [4]. However, in LOF, the value of k is determined based on the average density estimate of the neighborhood, which is statistically vulnerable to the presence of an outlier. Hence, it is hard to determine an appropriate value of this parameter to ensure acceptable performance in complex and large databases. In order to address these two disadvantages of LOF, we propose a Robust Kernel-based Outlier Factor (RKOF) in this paper. Specifically, the main contributions of our work are as follows:
– We propose a kernel-based outlier detection method which brings the variable kernel density estimate into the computation of outlier factors, in order to achieve a more accurate density estimate. Besides, we propose a new kernel function named the Volcano kernel, which requires a smaller value of the parameter k for outlier detection than other kernels, resulting in less detection time.
– We propose the weighted density estimate of the neighborhood of a given object to improve the robustness of determining the value of the parameter k. Furthermore, we demonstrate that this weighted density estimate method is superior to the average density estimate method used in LOF for robust outlier detection.
– We keep the same framework of local density-based outlier detection as LOF. This means that RKOF can be directly used in the extensions of LOF, such as Feature Bagging [10], Top-n outlier detection [14], and Local Kernel Regression [15], and improve the detection performance of these extensions.
The remainder of this paper is organized as follows. Section 2 introduces our RKOF method with a novel kernel function, named the Volcano kernel, and analyzes the special property of the Volcano kernel. Section 3 shows the robustness and computational complexity of RKOF. Section 4 reports the experimental results. Finally, Section 5 concludes the paper.
2 Main Framework
A density-based outlier is detected by comparing its density estimate with its neighborhood density estimate [4]. Hence, we first introduce the notions of the local kernel density estimate of object p, the weighted density estimate of p’s neighborhood. Then, we introduce the notion of the robust kernel-based outlier factor of p, which is used to detect outliers. Besides, we analyze the influences of different kernels to the performance of our method, and propose a novel kernel function named the Volcano kernel with its special property in outlier detection.
To make this work self-contained, we introduce the notions of the k-distance of an object p and the k-distance neighborhood of p, which are defined in LOF.

Definition 1. Given a data set D, an object p, and any positive integer k, the k-distance(p) is defined as the distance d(p, o) between p and an object o ∈ D, such that:
– for at least k objects o′ ∈ D\{p}, it holds that d(p, o′) ≤ d(p, o);
– for at most k − 1 objects o′ ∈ D\{p}, it holds that d(p, o′) < d(p, o).

Definition 2. Given a data set D, an object p, and any positive integer k, the k-distance neighborhood of p, denoted Nk(p), contains every object whose distance from p is not greater than the k-distance(p), i.e., Nk(p) = {q ∈ D\{p} | d(p, q) ≤ k-distance(p)}, where any such object q is called a k-distance neighbor of p. |Nk(p)| is the number of the k-distance neighbors of p.
2.1 Robust Kernel-Based Outlier Factor (RKOF)
Let p = [x1, x2, x3, ..., xd] be an object in the data set D, where d is the number of attributes and |D| is the number of all the objects in D.

Definition 3. (Local kernel density estimate of object p) The local kernel density estimate of p is defined as

kde(p) = ( Σ_{o∈Nk(p)} h^{−γ} λo^{−γ} K( h^{−1} λo^{−1} (p − o) ) ) / |Nk(p)|

λo = { f(o)/g }^{−α},   log g = |D|^{−1} Σ_{q∈D} log f(q)
where h is the smoothing parameter, γ is the sensitivity parameter, K(x) is the multivariate kernel function and λo is the local bandwidth factor. f (x) is a pilot density estimate that satisfies f (x) > 0 for all the objects, α is the sensitivity parameter that satisfies 0 ≤ α ≤ 1, and g is the geometric mean of f (x). kde(p) is an extension of the variable kernel density estimate [3]. kde(p) not only retains the adaptive kernel window width that is allowed to vary from one object to another, but also is computed locally in the k-distance neighborhood of object p. The parameter γ equals the dimension number d in the original variable kernel density estimate [3]. For the local kernel density estimate, the larger γ, the more sensitive kde(p). However, the high sensitivity of kde(p) is not always a merit for the local outlier detection in high dimensional data. For example, if λo is always very small for all the objects in a sparse and high dimensional data set, (λo )−γ always equals infinity. This makes kde(p) lack of the capacity to discriminate between outliers and normal data. We give γ a default value 2 to obtain a balance between the sensitivity and the robustness.
In this paper, we compute the pilot density function f(x) by the approximate nearest neighbor density estimate according to Equation 1:

f(o) = 1 / k-distance(o)   (2)

Substituting Equation 2 into kde(p) in Definition 3, we obtain Equation 3, where the default values of C and α are 1. In the following experiments, we estimate the local kernel density of object p as follows:

kde(p) = ( Σ_{o∈Nk(p)} ( 1 / (C · k-distance(o)^α)² ) K( (p − o) / (C · k-distance(o)^α) ) ) / |Nk(p)|,   C = h · g^α   (3)
Definition 4. (Weighted density estimate of object p's neighborhood) The weighted density estimate of p's neighborhood is defined as

wde(p) = ( Σ_{o∈Nk(p)} ωo · kde(o) ) / ( Σ_{o∈Nk(p)} ωo ),   ωo = exp( −( k-distance(o)/mink − 1 )² / (2σ²) )
where ωo is the weight of object o in the k-distance neighborhood of object p, σ is the variance with the default value 1, and mink = min_{o∈Nk(p)}(k-distance(o)). In the majority of local density-based methods, the outlier factor is computed as the ratio of the neighborhood's density estimate to the given object's density estimate. The neighborhood's density is generally measured by the average of all the neighbors' local densities in the neighborhood. In this estimation approach, the detection performance is sensitive to the parameter k. The larger the value of k, the larger the scale of the neighborhood. When k is large enough that the majority in the neighborhood are normal objects, outliers have the chance to be detected. In the weighted neighborhood density estimate, the weight of a neighbor object is a monotonically decreasing function of its k-distance. The neighbor object with the smallest k-distance has the largest weight, 1. Compared with the average neighborhood density estimate, the weighted neighborhood density estimate allows outliers to be detected accurately even if the number of outliers in the neighborhood equals the number of normal objects. This means that the interval of acceptable k in the weighted neighborhood density estimate is much larger than that of the average neighborhood density estimate, and our method is more robust to variations of the parameter k.

Definition 5. (Robust kernel-based outlier factor of object p) The robust kernel-based outlier factor of p is defined as

RKOF(p) = wde(p) / kde(p)
where wde(p) is the density estimate of the k-distance neighborhood of p, and kde(p) is the local density estimate of p.
RKOF is computed by dividing the weighted density estimate of the neighborhood of the given object by its local kernel density estimate. The larger the RKOF value, the more probable it is that the given object is an outlier. It is obvious that the smaller the object p's local kernel density, and the larger the weighted density of its neighborhood, the larger its outlier factor.
2.2 Choice of Kernel Functions
In the LOF method, for most objects in a cluster, their outlier factors are approximately equal to 1; for most outliers isolated from the cluster, their outlier factors are much larger than 1 [4]. This property makes it easy to distinguish between outliers and normal objects. The multivariate Gaussian and Epanechnikov kernel functions are commonly used in kernel density estimation, and are defined as follows:

K(x) = (2π)^{−d/2} exp( −||x||² / 2 )   (4)

K(x) = (3/4)^d (1 − ||x||²)  if ||x|| ≤ 1,   K(x) = 0  otherwise   (5)
where ||x|| denotes the norm of a vector x and can be used to compute the distances between objects. Our RKOF method with the Gaussian kernel cannot ensure that the outlier factors of the normal objects in a cluster are approximately equal to 1, so a threshold value for the outlier factors would additionally need to be determined. The Epanechnikov kernel function equals zero when ||x|| is larger than 1. Hence, for most outliers and normal objects lying on the border of clusters, their outlier factors equal infinity. In order to achieve the same property as LOF, we define a novel kernel function called the Volcano kernel as follows:

Definition 6. The Volcano kernel is defined as

K(x) = β  if ||x|| ≤ 1,   K(x) = β g(x)  otherwise
where β assures that K(x) integrates to one, and g(x) is a monotonically decreasing function, lying in the closed interval [0, 1] and equal to zero at infinity. Unless otherwise specified, we use g(x) = e^{−|x|+1} as the default function for our experiments. Fig. 2 shows the curve of the Volcano kernel for univariate data. When ||x|| is not larger than 1, the kernel value equals a constant value β. This ensures that the outlier factors of the objects deep in the cluster are approximately equal to 1. When ||x|| is larger than 1, the kernel value is a monotonically decreasing
Fig. 2. The curve of the Volcano kernel for the univariate data
function of ||x|| and less than 1. This not only makes the outlier factors continuous and finite, but also makes the outlier factors of outliers much larger than 1. Hence, the RKOF method with the Volcano kernel can capture outliers more easily and can sort all the objects according to their RKOF values.
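To make the estimator concrete, the following NumPy sketch combines Equation (3), Definition 4 and Definition 5 with the Volcano kernel, using g(x) = e^{−|x|+1} and the default parameters γ = 2, C = α = σ = 1 from the text; the normalising constant β is dropped because RKOF is a ratio, and all names are illustrative rather than the authors' implementation.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def volcano(u):
    """Volcano kernel of Definition 6 with g(x) = exp(-|x| + 1); beta omitted."""
    return np.where(u <= 1.0, 1.0, np.exp(-u + 1.0))

def rkof(X, k, C=1.0, alpha=1.0, sigma=1.0):
    """Sketch of kde (Eq. 3), wde (Def. 4) and RKOF = wde/kde (Def. 5)."""
    n = len(X)
    dist, idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    k_dist = dist[:, -1]                       # k-distance of every object
    neigh = idx[:, 1:]                         # k-distance neighbourhood (self removed)

    # local kernel density estimate kde(p), Eq. (3) with gamma = 2
    bw = C * k_dist ** alpha                   # adaptive bandwidth per neighbour o
    kde = np.empty(n)
    for p in range(n):
        d = np.linalg.norm(X[p] - X[neigh[p]], axis=1)
        kde[p] = np.mean(volcano(d / bw[neigh[p]]) / bw[neigh[p]] ** 2)

    # weighted neighbourhood density estimate wde(p) and the outlier factor
    scores = np.empty(n)
    for p in range(n):
        o = neigh[p]
        w = np.exp(-((k_dist[o] / k_dist[o].min() - 1.0) ** 2) / (2.0 * sigma ** 2))
        wde = np.sum(w * kde[o]) / np.sum(w)
        scores[p] = wde / kde[p]               # larger score => more outlying
    return scores
```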
3 Robustness and Computation Complexity of RKOF
In this section, we first analyze the robustness of RKOF to the parameter k. Then, we analyze the computational complexity of RKOF in detail. Compared with the average neighborhood density estimate used in LOF, the weighted neighborhood density estimate defined in Definition 4 is more robust to the parameter k, and it substantially helps improve the detection performance. As shown in Theorem 1, if the weighted neighborhood density estimate replaces the average neighborhood density estimate in the computation of outlier factors, any local density-based outlier detection method following the framework of LOF becomes less sensitive to the parameter k.

Theorem 1. Let Nk(p) be the neighborhood of object p, and let p be an outlier in a data set D. Let r be the proportion of outliers in Nk(p). Suppose that these outliers have the same local density estimate (DE) α and k-distance α′ as p. Also suppose that the normal data in Nk(p) have the same local density estimate β and k-distance β′, with α < β and α′ > β′. The Outlier Factor (OF) can be computed by any local density-based outlier detection method that follows the framework of LOF. Then for the average density estimate, it holds that:

OF(p) = r + (1 − r)ρ

For the weighted density estimate, it holds that:

OF(p) = ( (ρ − ω)r − ρ ) / ( (1 − ω)r − 1 )
where ρ = β/α and ω is the weight of the outlier in Nk(p).

Proof: For the average density estimate:

OF(p) = ( Σ_{o∈Nk(p)} DE(o) ) / ( |Nk(p)| · DE(p) ) = ( r|Nk(p)|α + (1 − r)|Nk(p)|β ) / ( |Nk(p)|α ) = r + (1 − r)ρ
Fig. 3. The curves of OF (p) for the average and the weighted density estimates
For the weighted density estimate: let ω_{oi} and ω_{oj} be the weights of the outlier and the normal object, respectively. According to Definition 4, ω_{oi} < 1 and ω_{oj} = 1 because α′ > β′. Replacing ω_{oi} with ω, we have

OF(p) = ( Σ_{o∈Nk(p)} ωo · DE(o) ) / ( DE(p) · Σ_{o∈Nk(p)} ωo ) = ( r|Nk(p)|ωα + (1 − r)|Nk(p)|β ) / ( α( r|Nk(p)|ω + (1 − r)|Nk(p)| ) ) = ( rω + (1 − r)ρ ) / ( rω + (1 − r) ) = ( (ρ − ω)r − ρ ) / ( (1 − ω)r − 1 )

According to Theorem 1, OF(p) is a function of the parameter r while ρ has a fixed value, and r is determined by the parameter k. As shown in Fig. 3, for the average neighborhood density estimate, OF(p) is a monotonically decreasing linear function as r increases. For the weighted neighborhood density estimate, OF(p) is a quadratic curve in r. When r ∈ [0, 1], OF(p) of the average method is always much less than that of the weighted method. Fig. 3 shows that OF(p) of the weighted method is larger than τ% of the maximum of OF(p) when r ∈ [0, τ]. τ depends on ρ and the weights of the outliers in the neighborhood of p. More importantly, OF(p) is approximately a constant in the interval [0, τ]. This property indicates that the weighted method makes the local outlier detection more robust to the variations of the parameter k. Since RKOF shares the same framework with LOF, RKOF has the same computational complexity as LOF. To compute the RKOF values with the parameter k, the RKOF algorithm includes two steps. In the first step, the k-distance neighbors of each object are found, with their distances to the given object computed in the data set D of n objects. The computational complexity of this step is O(n log n) using the index technology for k-nn queries, which has been used in LOF [4]. In the second step, the kde(p), wde(p), and RKOF(p) values are computed by scanning the whole data set. Since both kde(p) and wde(p) are computed in the k-distance neighborhood of the given object, the computational complexity of this step is O(nk). Hence, the total
computational complexity of the RKOF algorithm is O(n log n + nk). Clearly, the larger k is, the more the running time is consumed.
4 Experiments
In this section, we evaluate the outlier detection capability of RKOF based on different kernel functions and compare RKOF with the state-of-the-art outlier detection methods on several synthetic and real data sets.
(a) Synthetic-1
(b) Synthetic-2
Fig. 4. The distributions of the synthetic data sets
4.1 Synthetic Data
As shown in Fig. 4, the Synthetic-1 data set consists of 1500 normal objects and 16 outliers with two attributes. The normal objects are distributed in three Gaussian clusters, each containing 500 normal objects with the same variance. Fifteen outliers lie in a dense Gaussian cluster, and the other outlier is isolated from the others. The Synthetic-2 data set consists of 500 normal objects distributed uniformly in the annular region, 500 normal objects in a Gaussian cluster, and 20 outliers in two Gaussian clusters. Table 1 exhibits the outlier detection results of LOF and RKOF on the Synthetic-1 data set, where σ is the parameter of the weight in RKOF. The top-16 objects are the sixteen objects that have the largest outlier factors in the synthetic data set. Obviously, if all top-16 objects are outliers, the

Table 1. Outlier detection for the Synthetic-1 data set

     Number of outliers in the top-16 objects (coverage)
k    LOF          RKOF(σ = 0.1)  RKOF(σ = 1)
26   1 (6.25%)    15 (93.75%)    15 (93.75%)
27   2 (12.5%)    16 (100%)      15 (93.75%)
30   4 (25%)      16 (100%)      15 (93.75%)
31   5 (31.25%)   16 (100%)      16 (100%)
59   15 (93.75%)  16 (100%)      16 (100%)
60   16 (100%)    16 (100%)      16 (100%)
70   16 (100%)    16 (100%)      16 (100%)
Fig. 5. The best performances of RKOF (k = 14) and LOF (k = 20) on the Synthetic-2 data (Top-20)
Coverage is the ratio of the number of detected outliers to the 16 total outliers. RKOF(σ = 0.1) can identify all the outliers when k ≥ 27, and RKOF(σ = 1) can detect all the outliers when k ≥ 31. Clearly, the parameter σ directly relates to the sensitivity of the outlier detection for RKOF. LOF is unable to identify all the outliers until k = 60. Table 1 indicates that the usable k interval of RKOF is larger than that of LOF, which means that RKOF is less sensitive to the parameter k.

As shown in Fig. 5, RKOF with k = 14 captures all the outliers in the top-20 objects. LOF obtains its best performance with k = 20, with a detection rate of 85%. Unlike RKOF, LOF cannot detect all the outliers whatever the value of k is. It is obvious that the annular cluster and the Gaussian cluster pose an obstacle to the choice of k. This result indicates that RKOF is better adapted to complex data sets than LOF.

4.2 Real Data
We compare RKOF with several state-of-the-art methods, including LOF [4], LDF [6], LPF [7], Feature Bagging [10], Active Learning [11], Bagging [12], and Boosting [13], on the real data sets. The performance of RKOF with the Gaussian, Epanechnikov, and Volcano kernels is also compared. In the real data sets, the features of the original data include discrete and continuous features. All the data are processed using standard text processing techniques following the original steps of the methods [7,11,10]. These real data sets consist of the KDD Cup 1999, the Mammography data set, the Ann-thyroid data set, and the Shuttle data set, all of which can be downloaded from the UCI database except the Mammography data set (kindly provided by Professor Nitesh V. Chawla). The KDD Cup 1999 is a general data set condensed for intrusion detection research; 60593 normal records and 228 U2R attack records labeled as outliers are combined as the KDD outlier data set. All the records are described by 34 continuous features and 7 categorical features. The Mammography data set includes 10923 records labeled 1 as normal data and another 260 records labeled 2 as outliers; all the records consist of 6 continuous features. The Ann-thyroid data set consists of 73 records labeled 1 as outliers and 3178 records labeled 3 as normal data.
Table 2. The AUC values and the running time (in parentheses) for RKOF and the compared methods on the real data sets, with k-nn queries implemented by the k-d tree method [17]. Since LPF has higher complexity and cannot complete on these data sets in a reasonable time, accurate running times for LPF are not given in this table.
Method               | KDD             | Mammography   | Ann-thyroid   | Shuttle (average)
RKOF (Volcano)       | 0.962 (1918.1s) | 0.871 (15.8s) | 0.970 (4.9s)  | 0.990 (36.4s)
RKOF (Gaussian)      | 0.961 (2095.2s) | 0.870 (19.8s) | 0.970 (5.2s)  | 0.990 (36.9s)
RKOF (Epanechnikov)  | 0.944 (2363.7s) | 0.855 (48.2s) | 0.965 (13.2s) | 0.993 (36.7s)
LOF                  | 0.610 (2160.1s) | 0.640 (28.8s) | 0.869 (5.9s)  | 0.852 (42.0s)
LDF                  | 0.941 (2214.9s) | 0.824 (36.4s) | 0.943 (7.2s)  | 0.962 (37.1s)
LPF                  | 0.98 (2363.7s)  | 0.87 (48.2s)  | 0.97 (13.2s)  | 0.992 (42.0s)
Bagging              | 0.61 (±0.25)    | 0.74 (±0.07)  | 0.98 (±0.01)  | 0.985 (±0.031)
Boosting             | 0.51 (±0.004)   | 0.56 (±0.02)  | 0.64          | 0.784 (±0.13)
Feature Bagging      | 0.74 (±0.1)     | 0.80 (±0.1)   | 0.869         | 0.839
Active Learning      | 0.94 (±0.04)    | 0.81 (±0.03)  | 0.97 (±0.01)  | 0.999 (±0.0006)
There are 21 attributes, of which 15 are binary and 6 are continuous. The Shuttle data set consists of 11478 records with label 1, 13 records with label 2, 39 records with label 3, 809 records with label 5, 4 records with label 6, and 2 records with label 7. We divide this data set into 5 subsets: label 2, 3, 5, 6, 7 records vs. label 1 records, where the label 1 records are normal and the others are outliers.

All the compared outlier detection methods are evaluated using ROC curves and AUC values. The ROC curve represents the trade-off between the detection rate (y-axis) and the false alarm rate (x-axis). The AUC value is the area under the ROC curve; clearly, the larger the AUC value, the better the outlier detection method.

The AUC values for RKOF with different kernels and all other compared methods are given in Table 2. Also shown in Table 2 are the running times for RKOF with different kernels as well as those of the other three local density-based methods; since the AUC values for the other compared methods are obtained directly from their publications, the running times for these methods are not available and thus are not included in this table. From Table 2, we see that the RKOF methods using different kernels achieve similar AUC values on all the data sets, especially the Volcano and Gaussian kernels. The k values with the best detection performance for all three kernels on all the data sets are shown in Fig. 6(a). Clearly, the k values for the Volcano kernel are always smaller than those of the other kernels, and the k values for the Epanechnikov kernel are the largest among the three kernels. This experiment supports one of the contributions of this work: the proposed Volcano kernel achieves the least computation time among the existing kernels.
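The ROC/AUC evaluation described above can be computed directly from outlier scores, as in the following small sketch; the labels and scores in the example are hypothetical.

```python
# Minimal AUC computation: rank objects by outlier score and integrate the
# detection rate against the false alarm rate. Ties are handled only approximately.
import numpy as np

def auc(labels, scores):
    # labels: 1 for outlier, 0 for normal; scores: larger = more outlying.
    order = np.argsort(-np.asarray(scores))
    y = np.asarray(labels)[order]
    tpr = np.cumsum(y) / max(y.sum(), 1)            # detection rate
    fpr = np.cumsum(1 - y) / max((1 - y).sum(), 1)  # false alarm rate
    return float(np.trapz(tpr, fpr))                # area under the ROC curve

print(auc([1, 0, 1, 0, 0], [0.9, 0.8, 0.7, 0.3, 0.1]))  # ~0.83
```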
Fig. 6. (a) The k values with the best performance for different kernels in RKOF. (b) ROC curves for RKOF based on the Volcano kernel on the KDD and the Mammography data sets.
Fig. 7. AUC values of RKOF based on the Volcano kernel with different k values for (a) the KDD and (b) the Mammography data sets
This indicates that different kernels used in RKOF do not significantly influence the detection performance, but they dramatically change the minimal k value with acceptable performance and consequently the running time. Fig. 6(b) shows the ROC curves of RKOF based on the Volcano kernel for the KDD data set (k = 320) and the Mammography data set (k = 110). Fig. 7 shows the AUC values of RKOF based on the Volcano kernel with different k values for the KDD and Mammography data sets. The AUC values for the KDD data set are larger than 0.941 when k varies from 280 to 700; the AUC values for the Mammography data set are larger than 0.824 when k changes from 40 to 460. Clearly, the detection performance of RKOF for any k in these intervals is better than that of the other compared methods except LPF. For the Mammography data set, RKOF is more effective than the other compared methods with k = 110, compared with k = 11183 for LPF. For the KDD data set, RKOF achieves the second best performance with k = 320. The best AUC value is achieved by LPF, but that AUC value is obtained with k = 13000. The complexity of RKOF is O(n log n + nk), compared with O(nd log n + ndk) for LPF, where d is the dimensionality of the data. It is clear that under the same circumstances LPF takes much longer than RKOF, while the AUC value of RKOF is very close to this best value. For the Ann-thyroid data set, RKOF
achieves performance very close to the best. The AUC value for the Shuttle data set is the average AUC over all five subsets, where the AUC values of the subsets with label 5, label 6, and label 7 are all approximately equal to 1. RKOF also obtains performance very close to the best for the Shuttle data set. Overall, while there is no winner in all cases, RKOF always achieves the best performance, or comes close to it, on all the data sets with the least running time. In particular, RKOF achieves or approaches the best performance for the KDD and the Mammography data sets, the two largest of the four data sets, with much less running time. This demonstrates the high scalability of the RKOF method in outlier detection. Specifically, in all cases RKOF has less running time than LOF, LDF and LPF. Though running times for the other compared methods are not available, the theoretical complexity analysis makes it clear that they would all take longer than RKOF.
5 Conclusions
We have studied the local outlier detection problem in this paper. We have proposed the RKOF method based on the variable kernel density estimate and the weighted density estimate of the neighborhood of an object, which address the existing disadvantages of LOF and other density-based methods. We have also proposed a novel kernel function, named the Volcano kernel, which is more suitable for outlier detection. Theoretical analysis and empirical evaluations on synthetic and real data sets demonstrate that RKOF is more robust and effective for outlier detection while taking less computation time.
References
1. Rousseeuw, P.J., Leroy, A.M.: Robust Regression and Outlier Detection. John Wiley and Sons, New York (1987)
2. Hawkins, D.: Identification of Outliers. Chapman and Hall, London (1980)
3. Silverman, B.: Density Estimation for Statistics and Data Analysis. Chapman and Hall, London (1986)
4. Breunig, M.M., Kriegel, H.-P., Ng, R.T., Sander, J.: LOF: Identifying density-based local outliers. In: SIGMOD, pp. 93–104 (2000)
5. Papadimitriou, S., Kitagawa, H., Gibbons, P.: LOCI: Fast outlier detection using the local correlation integral. In: ICDE, pp. 315–326 (2003)
6. Latecki, L.J., Lazarevic, A., Pokrajac, D.: Outlier Detection with Kernel Density Functions. In: Perner, P. (ed.) MLDM 2007. LNCS (LNAI), vol. 4571, pp. 61–75. Springer, Heidelberg (2007)
7. Yang, J., Zhong, N., Yao, Y., Wang, J.: Local peculiarity factor and its application in outlier detection. In: KDD, pp. 776–784 (2008)
8. Tang, J., Chen, Z., Fu, A.W.-c., Cheung, D.W.: Enhancing effectiveness of outlier detections for low density patterns. In: Chen, M.-S., Yu, P.S., Liu, B. (eds.) PAKDD 2002. LNCS (LNAI), vol. 2336, pp. 535–548. Springer, Heidelberg (2002)
9. Sun, P., Chawla, S.: On local spatial outliers. In: KDD, pp. 209–216 (2004)
10. Lazarevic, A., Kumar, V.: Feature bagging for outlier detection. In: KDD, pp. 157–166 (2005)
11. Abe, N., Zadrozny, B., Langford, J.: Outlier detection by active learning. In: KDD, pp. 504–509 (2006)
12. Breiman, L.: Bagging predictors. Machine Learning 24(2), 123–140 (1996)
13. Freund, Y., Schapire, R.: A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. Syst. Sci. 55(1), 113–139 (1997)
14. Jin, W., Tung, A., Ha, J.: Mining top-n local outliers in large databases. In: KDD, pp. 293–298 (2001)
15. Gao, J., Hu, W., Li, W., Zhang, Z.M., Wu, O.: Local Outlier Detection Based on Kernel Regression. In: ICPR, pp. 585–588 (2010)
16. Barnett, V., Lewis, T.: Outliers in Statistical Data. John Wiley, New York (1994)
17. Bentley, J.L.: Multidimensional binary search trees used for associative searching. Communications of the ACM 18(9), 509–517 (1975)
Chinese Categorization and Novelty Mining Flora S. Tsai and Yi Zhang Nanyang Technological University, School of Electrical & Electronic Engineering, Singapore
[email protected]
Abstract. The categorization and novelty mining of chronologically ordered documents is an important data mining problem. This paper focuses on the entire process of Chinese novelty mining, from preprocessing and categorization to the actual detection of novel information, which has rarely been studied. First, preprocessing techniques for detecting novel Chinese text are discussed and compared. Next, we investigate the categorization and novelty mining performance on English and Chinese sentences and also discuss the novelty mining performance based on the retrieval results. Moreover, we propose new novelty mining evaluation measures, Novelty-Precision, Novelty-Recall, Novelty-F Score, and Sensitivity; the last of these measures the sensitivity of the novelty mining system to incorrectly classified sentences. The results indicate that Chinese novelty mining at the sentence level is similar to English if the sentences are perfectly categorized. Using our new evaluation measures of Novelty-Precision, Novelty-Recall, Novelty-F Score, and Sensitivity, we can more fairly assess how the performance of novelty mining is influenced by the retrieval results.

Keywords: novelty mining, novelty detection, categorization, retrieval, Chinese, preprocessing, information retrieval.
1 Introduction
The overabundance of information leads to the proliferation of useless and redundant content. Novelty mining (NM) is able to detect useful and novel information from a chronologically ordered list of relevant documents or sentences. Although techniques such as dimensionality reduction [15,18], probabilistic models [16,17], and classification [12] can be used to reduce the data size, novelty mining techniques are preferred since they allow users to quickly get useful information by filtering away the redundant content. The process of novelty mining consists of three main steps, (i) preprocessing, (ii) categorization, and (iii) novelty detection. This paper focuses on all three steps of novelty mining, which has rarely been explored. In the first step, text sentences are preprocessed by removing stop words, stemming words to their root form, and tagging the Parts-of-Speech (POS). In the second step, each incoming sentence is classified into its relevant topic bin. In the final step, novelty
detection searches through the time sequence of sentences and retrieves only those with "novel" information. This paper examines the link between categorization and novelty mining. In this task, we need to identify all novel Chinese text given groups of relevant sentences. Moreover, we also discuss the sentence categorization and novelty mining performance based on the retrieval results.

The main contributions of this work are: the investigation of preprocessing techniques for detecting novel Chinese text; the discussion of the POS filtering rule for selecting words to represent a sentence; several experiments comparing the novelty mining performance between Chinese and English; the discovery that the novelty mining performance on Chinese can be as good as that on English if we can increase the preprocessing precision on Chinese text; the application of a mixed novelty metric that effectively improves Chinese novelty mining performance; and a set of new novelty mining evaluation measures, Novelty-Precision, Novelty-Recall, Novelty-F Score, and Sensitivity, which help users objectively evaluate the novelty mining results.

The rest of this paper is organized as follows. The first section gives a brief overview of related work on detecting novel documents and sentences. The next section introduces the details of the preprocessing steps for English and Chinese. Next, we describe the categorization algorithm and the mixed metric technique, which is applied in Chinese novelty mining. Traditional evaluation measures are described and new evaluation measures for novelty mining are then proposed. Next, the experimental results are reported on the effect of preprocessing rules on Chinese novelty mining, Chinese novelty mining using the mixed metric, categorization in English and Chinese, and novelty mining based on categorization using the old and newly proposed evaluation measures. The final section summarizes the research contributions and findings.
2 Related Work
Traditional sentence categorization methods use queries from topic information to evaluate the similarity between an incoming sentence and the topic [1]. Then, each sentence is placed into its category according to that similarity. However, using queries from the topic information cannot guarantee satisfactory results since these queries provide only limited information. Later works have emphasized how to expand the query so as to optimize the retrieval results [2]. The initial query, which is usually short, can be expanded based on explicit user feedback or implicit pseudo-feedback in the target collections and on external resources, such as Wikipedia, search engines, etc. [2]. Moreover, machine learning algorithms have been applied to sentence categorization: these first transform sentences, which are typically strings of characters, into a representation suitable for the learning algorithms, and then different classifiers are chosen to categorize the sentences into their relevant topics. Initial studies of novelty mining focused on the detection of novel documents. A document which is very similar to any of its history documents is regarded as a redundant document. To serve users better, novel information at the sentence level can be further highlighted. Therefore, later studies focused on detecting
novel sentences, such as those reported in the TREC Novelty Tracks [11], which compared various novelty metrics [19,21] and integrated different natural language techniques [7,14,20,22]. Studies of novelty mining have been conducted for the English and Malay languages [4,6,8,24]. Novelty mining studies on the Chinese language have focused on topic detection and tracking, which identifies and collects relevant stories on certain topics from a stream of information [25]. However, to the best of our knowledge, few studies have been reported on the entire process of Chinese novelty mining, from preprocessing and categorization to the actual detection of novel information, which is the focus of this paper.
3 Preprocessing for English and Chinese

3.1 English
English preprocessing first removes all stop words, such as conjunctions, prepositions, and articles. After removing stop words, word stemming is performed, which reduces inflected words to their primitive root forms.
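The English preprocessing step can be sketched as follows, using NLTK as one possible toolkit; the paper does not name a particular stop-word list or stemmer, so these are assumptions for illustration.

```python
# A small sketch of English preprocessing: stop-word removal then stemming.
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("stopwords", quiet=True)
_stop = set(stopwords.words("english"))
_stem = PorterStemmer()

def preprocess_english(sentence):
    tokens = sentence.lower().split()
    return [_stem.stem(t) for t in tokens if t.isalpha() and t not in _stop]

# preprocess_english("There is a picture hanging on the wall")
# -> ['pictur', 'hang', 'wall']
```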
3.2 Chinese
Chinese preprocessing first needs to perform lexical analysis since there are no obvious boundaries between Chinese words. Chinese word segmentation is a very challenging problem because of the difficulties in defining what constitutes a Chinese word [3]. Furthermore, there are no white spaces between Chinese words or expressions and there are many ambiguities in the Chinese language. For example, '主板和服务器' ('mainboard and server' in English) might be segmented as '主板/和/服务器' ('mainboard/and/server') or as '主板/和服/务/器' ('mainboard/kimono/task/utensil'). This ambiguity is a great challenge for Chinese word segmentation. Moreover, since there are no obvious derived words in Chinese, word stemming cannot be performed. To reduce the noise from Chinese word segmentation and obtain a better word list for a sentence, we first apply word segmentation to the Chinese text and then utilize Part-of-Speech (POS) tagging to select the meaningful candidate words. We used ICTCLAS for word segmentation and POS tagging because it achieves higher precision than other Chinese POS tagging software [23]. Two different rules were used to select the candidate words of a sentence.
– Rule1: Non-meaningful words were removed, such as pronouns ('r' in the Chinese POS tagging criteria [9]), auxiliary words ('u'), tone words ('y'), conjunctions ('c'), prepositions ('p') and punctuation ('w').
– Rule2: Fewer types of words were selected to represent a sentence: nouns (including 'n' for common nouns, 'nr' for person names, 'ns' for location names, 'nt' for organization names, 'nz' for other proper nouns), verbs ('v'), adjectives ('a') and adverbs ('d').
In the following example, the Chinese sentence " " means "There is a picture on the wall". After POS filtering using Rule1, the following words are kept: ('n'), ('v'), ('v'), ('m' measure word), ('q' quantifier), ('n'). After POS filtering using Rule2, the following words remain: ('n'), ('v'), ('v'), ('n'). By using Rule2, we can remove more unimportant words.

4 Categorization
From the output of the preprocessing steps on English and Chinese languages, we obtain bags of English and Chinese words. The corresponding term sentence matrix (TSM) can be constructed by counting the term frequency (TF) of each word. Therefore, each sentence can be conveniently represented by a vector where the TF value of each word is considered as one feature. Retrieving relevant sentences is traditionally based on computing the similarity between the representations of the topic and the sentences. The famous Rocchio algorithm [10] is adopted to categorize the sentences to their topics. The Rocchio algorithm is popular for two reasons. First, it is computationally efficient for online learning. Secondly, compared to many other algorithms, it works well empirically, especially at the beginning stage of adaptive filtering where the number of training examples is very small.
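A minimal sketch of the TF term-sentence representation and a Rocchio-style assignment to the nearest topic centroid is given below. Function and parameter names are ours, negative (non-relevant) centroids are omitted, and the 0.3 threshold simply mirrors the categorization threshold used later in the experiments; this is an illustration of the idea rather than the exact system.

```python
# Term-sentence matrix (TF features) plus Rocchio-style topic assignment:
# each topic is the centroid of its example/query vectors; a sentence is
# assigned to the most similar centroid if the similarity exceeds a threshold.
import numpy as np
from collections import Counter

def tf_vector(tokens, vocab):
    counts = Counter(tokens)
    return np.array([counts[w] for w in vocab], dtype=float)

def rocchio_assign(sentence_tokens, topic_examples, vocab, threshold=0.3):
    v = tf_vector(sentence_tokens, vocab)
    best_topic, best_sim = None, threshold
    for topic, examples in topic_examples.items():
        centroid = np.mean([tf_vector(e, vocab) for e in examples], axis=0)
        denom = np.linalg.norm(v) * np.linalg.norm(centroid)
        sim = float(v @ centroid / denom) if denom > 0 else 0.0
        if sim > best_sim:
            best_topic, best_sim = topic, sim
    return best_topic   # None means "not relevant to any known topic"
```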
5 Novelty Mining
From the output of preprocessing, a bag of words is obtained, from which the corresponding term-sentence matrix (TSM) can be constructed by counting the term frequency (TF) of each word. The novelty mining system compares the incoming sentence to its history sentences in this vector space. Since the novelty mining process is the same for English and Chinese, a novelty mining system designed for English can also be applied to Chinese. The novelty of a sentence can be quantitatively measured by a novelty metric and represented by a novelty score N. The final decision on whether a sentence is novel depends on whether the novelty score falls above a threshold. A sentence that is predicted as "novel" is placed into the history list of sentences.
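The overall loop can be sketched as follows; the threshold value is illustrative and novelty_score stands for any metric of Section 5.1.

```python
# Sentence-level novelty mining loop: score each incoming sentence against the
# history of previously delivered novel sentences; add it to the history only
# if its novelty score exceeds the threshold. novelty_score(s, history) is
# assumed to return a value in [0, 1] and 1.0 when the history is empty.
def novelty_mining(sentences, novelty_score, threshold=0.55):
    history = []                               # sentences already judged novel
    for s in sentences:                        # chronological order
        if novelty_score(s, history) >= threshold:
            history.append(s)
    return history                             # the detected novel sentences
```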
5.1 Mixed Metric on Chinese Novelty Mining
Novelty metrics can be divided into two types according to the effect of sentence ordering: symmetric and asymmetric [21], as summarized in Table 1. In order to leverage the strengths of both symmetric and asymmetric metrics, we utilize a new technique for measuring novelty by a mixture of both types of novelty metrics [13]. The goal of the mixed metric is to integrate the merits of both types of metrics and hence generalize better over different topics. Two major issues in constructing a mixed metric are (i) the scaling problem, i.e., ensuring that the different component metrics are comparable and consistent, and (ii) the combining strategy, which defines the way of fusing the outputs of the different component metrics.
Table 1. Symmetric vs. Asymmetric Metrics

                      | Symmetric metric                              | Asymmetric metric
Definitions           | A metric M yields the same result regardless  | A metric M yields different results based on
                      | of the ordering of two sentences,             | the ordering of two sentences,
                      | i.e. M(si, sj) = M(sj, si).                   | i.e. M(si, sj) ≠ M(sj, si).
Typical metrics in NM | Cosine similarity, Jaccard similarity         | New word count, Overlap
By normalizing the metrics, the novelty scores from all novelty metrics range from 0 (i.e. redundant) to 1 (i.e. totally novel). Therefore, the metrics are both comparable and consistent because they have the same range of values. For the combining strategy, we adopt a new technique for measuring the novelty score N(s_t) of the current sentence s_t by combining the two types of metrics, as shown in Equation (1):

$$N(s_t) = \alpha N_{sym}(s_t) + (1-\alpha) N_{asym}(s_t) \qquad (1)$$

where N_sym is the novelty score using the symmetric metric, N_asym is the novelty score using the asymmetric metric, and α is the combining parameter ranging from 0 to 1. The larger the value of α, the heavier the weight for the symmetric metric. The new word count novelty metric is a popular asymmetric metric, which was proposed for sentence-level novelty mining [1]. The idea of the new word count novelty metric is to assign to the incoming sentence the count of the new words that have not appeared in its history sentences, as defined in Equation (2):

$$newWord(s_t) = |W(s_t)| - \left|W(s_t) \cap \bigcup_{i=1}^{t-1} W(s_i)\right| \qquad (2)$$

where W(s_i) is the set of words in the sentence s_i. The values of the new word count novelty metric for an incoming sentence are non-negative integers such as 0, 1, 2, etc. To normalize the novelty scores into the range of 0 to 1, the new word count novelty metric can be normalized by the total number of words in the incoming sentence s_t as below:

$$N_{newWord}(d_t) = 1 - \frac{\left|W(d_t) \cap \bigcup_{i=1}^{t-1} W(d_i)\right|}{|W(d_t)|} \qquad (3)$$

where the denominator |W(d_t)| is the word count of d_t. This normalized metric, N_newWord, has a range of values from 0 (i.e. no new words) to 1 (i.e. 100% new words). In the following experiments using the mixed metric, α is set to 0.75. We chose the cosine metric as the symmetric metric, the new word count defined in Equation (2) as the asymmetric metric, and TF as the term weighting function.
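A minimal sketch of this blended score is shown below, assuming TF vectors and token lists as inputs; it implements one plausible reading of Eqs. (1) and (3) (cosine novelty as one minus the maximum similarity to the history), with α = 0.75 as in the experiments.

```python
# Mixed novelty metric: alpha * symmetric (cosine) + (1 - alpha) * asymmetric
# (normalized new word count).
import numpy as np

def cosine_novelty(v, history_vectors):
    if not history_vectors:
        return 1.0
    sims = [float(v @ h / (np.linalg.norm(v) * np.linalg.norm(h) + 1e-12))
            for h in history_vectors]
    return 1.0 - max(sims)                 # novel = dissimilar to all history

def new_word_novelty(words, history_words):
    seen = set().union(*history_words) if history_words else set()
    return 1.0 - len(set(words) & seen) / max(len(set(words)), 1)

def mixed_novelty(v, words, history_vectors, history_words, alpha=0.75):
    return alpha * cosine_novelty(v, history_vectors) \
         + (1 - alpha) * new_word_novelty(words, history_words)
```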
5.2 Evaluation Measures
Precision (P), recall (R), F Score (F) and precision-recall (PR) curves are used to evaluate the performance of novelty mining [1]. The larger the area under the PR curve, the better the algorithm. We drew the standard F Score contours [11], which indicate the F Score values when setting precision and recall from 0 to 1 with a step of 0.1; these contours facilitate comparing F Scores along the PR curves. Based on the P and R of all topics, the average P and average R are obtained by calculating the arithmetic mean of these scores over all topics. Then, the average F Score (F) is obtained as the harmonic mean of the average P and average R.
5.3 Novelty Evaluation Measures
Although Precision, Recall, and F Score measure the novelty mining performance well when sentences are correctly categorized, they cannot objectively measure the novelty mining performance when there are errors in the categorization. In order to measure the novelty mining performance objectively, we propose a set of new evaluation measures called Novelty Precision (N-Precision), Novelty Recall (N-Recall) and Novelty F Score (N-F Score). They are calculated only on the sentences correctly categorized by our novelty mining system instead of on all task-relevant sentences; we remove the incorrectly categorized sentences before the novelty mining evaluation.

$$\text{N-precision} = \frac{NN^+}{NN^+ + NR^+} \qquad (4)$$

$$\text{N-recall} = \frac{NN^+}{NN^+ + NN^-} \qquad (5)$$

$$\text{N-F} = \frac{2 \times \text{N-precision} \times \text{N-recall}}{\text{N-precision} + \text{N-recall}} \qquad (6)$$

where NR+, NR−, NN+ and NN− correspond to the numbers of sentences that fall into each category (see Table 2). N-precision, N-recall and N-F Score do not consider the novelty mining performance on the sentences that are wrongly categorized to a topic. In order to better measure the novelty mining result for this part, we introduce a further measure called Sensitivity (defined in Equation 7), which indicates whether the novelty mining system is sensitive to the irrelevant sentences.

Table 2. Categories for novelty mining evaluation based on categorization

              | Correctly categorized and non-novel | Correctly categorized and novel
Delivered     | NR+                                 | NN+
Not delivered | NR−                                 | NN−
If Sensitivity is high, most wrongly categorized (irrelevant) sentences are predicted as novel, which produces noise that prevents readers from finding the truly novel information.

$$\text{Sensitivity} = \frac{N}{IC} \qquad (7)$$

where IC is the number of sentences that are incorrectly categorized by the novelty mining system, and N is the number of those wrongly categorized sentences that are predicted as novel by our system.
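These measures follow directly from the counts of Table 2; a small sketch (with variable names of our choosing) is given below.

```python
# Novelty evaluation measures (Eqs. 4-7) computed from the Table 2 counts,
# the number of incorrectly categorized sentences (ic), and the number of
# those that were nevertheless delivered as novel (n_wrong_delivered).
def novelty_measures(nn_plus, nn_minus, nr_plus, ic, n_wrong_delivered):
    n_precision = nn_plus / (nn_plus + nr_plus)                     # Eq. (4)
    n_recall = nn_plus / (nn_plus + nn_minus)                       # Eq. (5)
    n_f = 2 * n_precision * n_recall / (n_precision + n_recall)     # Eq. (6)
    sensitivity = n_wrong_delivered / ic                            # Eq. (7)
    return n_precision, n_recall, n_f, sensitivity
```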
6 Experiments and Results

6.1 Dataset
The public dataset from the TREC 2004 Novelty Track [11] is selected as our experimental dataset for sentence-level novelty mining. The TREC 2004 Novelty Track data is developed from the AQUAINT collection; both relevant and novel sentences are selected by TREC's assessors. Since this dataset is originally in English, we first translated the data into Chinese. During this process, we investigated issues of machine translation vs. manually corrected translation. After comparing the results of novelty mining on the machine-translated sentences and the human-corrected sentences, we found only a slight difference (<2%) in precision and F Score. This indicated that machine translation was sufficient for Chinese novelty mining.
6.2 Effect of Preprocessing Rules on Chinese Novelty Mining
In the first experimental study, the focus was on novelty mining rather than relevant sentence categorization. Therefore, our first experiments started from all given relevant Chinese text, from which the novel text should be identified. For the Chinese dataset, we first segmented the sentences into words and then performed POS filtering to acquire the candidate words for the vector space. Based on the vectors of the Chinese text, we calculated the similarities between sentences and predicted the novelty of each sentence in the Chinese and English datasets. An incoming Chinese/English sentence will be compared with all the system-delivered 1000 novel sentences. We used threshold values between 0.05 and 0.95 with an equal step of 0.10, and evaluated the Chinese/English novel text detection performance over this series of novelty score thresholds. We adopted two different rules to select the candidate words representing a sentence and investigated the influence of POS filtering on detecting novel Chinese text: Rule1 removes only some non-meaningful words, including pronouns, auxiliary words, tone words, conjunctions, prepositions, and punctuation, while Rule2 selects fewer kinds of words to represent a sentence.
Based on our experiments, we learn that the Chinese novelty mining performance is better when choosing the stricter rule (Rule2). Thus, POS filtering is necessary for Chinese because just removing some non-meaningful words (like stop words) may not be sufficient. POS filtering removes the less meaningful words so that each vector can be better represented. Rule2, which keeps only nouns, verbs, adjectives and adverbs, produces better results for novelty mining. Therefore, the remaining experiments used Rule2 for preprocessing the Chinese text.
6.3 Chinese Novelty Mining Using Mixed Metric
In this section, we compare the Chinese novelty mining performance after applying the mixed metric at the sentence level. Novelty mining using cosine similarity with the TF term weighting function is compared to results using the mixed metric, which blends the cosine similarity with the new word count. Setting the novelty threshold from 0.05 to 0.95 with a step of 0.1, we can draw the PR curves for sentence-level novelty mining. Figure 1 shows the novelty mining results using the mixed metric at the sentence level. From Figure 1, we learn that the Chinese novelty mining performance improves after applying the mixed metric because it effectively utilizes the strengths of both the symmetric and the asymmetric metric.
Fig. 1. PR curves for sentence-level novelty mining on Chinese using mixed metric on TREC 2004. The grey dashed lines show contours at intervals of 0.1 points of F .
6.4 Categorization in English and Chinese
We also conducted experiments to compare the categorization performance on Chinese with that on English on the TREC 2004 Novelty Track. The topic information in the TREC 2004 Novelty Track data comprises title, description and narrative; we used the topic title and topic description to construct the initial query. Each sentence is compared with the queries from each topic, and when the relevance score of a sentence is greater than the categorization threshold, it is categorized as relevant to that topic. From our experiments, we observe that the categorization performance using Rocchio on Chinese is lower than that on English. This may be due to the influence of the higher preprocessing error rate on the results of Chinese categorization. Li [5] also mentioned that the results of Chinese text categorization for small categories were much worse than those for English. Reducing the noise in the feature vectors of the Chinese text, which requires better preprocessing of Chinese text, may lead to better results.
6.5 Novelty Mining Based on Categorization
Based on the categorization results, we performed sentence-level novelty mining using our new evaluation measures of Novelty-Precision, Novelty-Recall, and Novelty-F Score. We chose the categorization results at categorization threshold θc = 0.3 and then compared the novelty mining performance on Chinese and English, using the mixed metric and the TF-ISF term weighting function. Table 3 and Table 4 show the novelty mining performance using the old evaluation measures and our newly proposed measures on English and Chinese, respectively. From Table 3 and Table 4, we learn that N-Precision is close to the precision of novelty mining on both English and Chinese when the sentences are perfectly categorized. In addition, Sensitivity explains why there is a big difference between these two groups of results: most wrongly categorized sentences are labeled as novel by our system. It is noticeable that our novelty mining system is sensitive to the irrelevant sentences (Sensitivity ≥ 70%) and is more sensitive to the irrelevant sentences on Chinese than on English, which is consistent with the results.

Figure 2 shows the sentence-level novelty mining N-PRF curves on Chinese and English based on the categorization results. The novelty score thresholds were chosen between 0 and 1 with a step of 0.10. From Figure 2, it is noticeable that the novelty mining performance on the categorization results differs between Chinese and English. Chinese cannot achieve the same performance as English: the novelty mining performance on the

Table 3. Comparison of Novelty Mining Performance on English

Original NM on TREC perfect categorization | Precision 0.479    | Recall 0.911    | F Score 0.615
NM on categorization using new evaluation  | N-Precision 0.4655 | N-Recall 0.7491 | N-F Score 0.5469 | Sensitivity 0.7002

Table 4. Comparison of Novelty Mining Performance on Chinese

Original NM on TREC perfect categorization | Precision 0.467    | Recall 0.889    | F Score 0.599
NM on categorization using new evaluation  | N-Precision 0.3551 | N-Recall 0.6545 | N-F Score 0.4414 | Sensitivity 0.8493
Fig. 2. Sentence-level novelty mining on categorization results: comparison between Chinese and English using PRF curve
categorization results is degraded because not all the relevant sentences are correctly categorized. The assessors judge the novelty of each sentence only on the correct relevant sentences; therefore, if the categorization of a sentence is incorrect, the subsequent novelty mining performance will be badly influenced.
7 Conclusion
This paper studied the entire process of preprocessing, categorization and novelty mining for detecting novel Chinese text, which was insufficiently addressed in previous studies. We described the Chinese preprocessing steps when choosing different Part-of-Speech (POS) filtering rules. We compared the novelty mining performance between Chinese and English and found that the novelty mining performance on Chinese can be as good as that on English by increasing the preprocessing precision on Chinese text.
Then we applied a mixed novelty metric that effectively improved the Chinese novelty mining performance at the sentence level. Next, we compared the performance of categorization in English and Chinese, and found that Chinese categorization was influenced by the noise introduced in preprocessing. Finally, we discussed the categorization and novelty mining performance based on the retrieval results. In order to evaluate the novelty mining performance objectively, we proposed a set of new novelty mining evaluation measures: Novelty-Precision, Novelty-Recall, Novelty-F Score, and Sensitivity. The new evaluation measures can more fairly assess how the performance of novelty mining is influenced by the categorization results.
References
1. Allan, J., Wade, C., Bolivar, A.: Retrieval and novelty detection at the sentence level. In: SIGIR 2003: Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 314–321 (2003)
2. Diaz, F., Metzler, D.: Improving the estimation of relevance models using large external corpora. In: SIGIR 2006, Seattle, USA, pp. 154–161 (2006)
3. Gao, J., Li, M., Wu, A., Huang, C.-N.: Chinese word segmentation and named entity recognition: A pragmatic approach. Computational Linguistics 31(4), 531–574 (2005)
4. Kwee, A.T., Tsai, F.S., Tang, W.: Sentence-level novelty detection in English and Malay. In: Theeramunkong, T., Kijsirikul, B., Cercone, N., Ho, T.-B. (eds.) PAKDD 2009. LNCS, vol. 5476, pp. 40–51. Springer, Heidelberg (2009)
5. Li, Y., Taylor, J.S.: The SVM with uneven margins and Chinese document categorisation. In: Proceedings of the 17th Pacific Asia Conference on Language, Information and Computation, pp. 216–227 (2003)
6. Liang, H., Tsai, F.S., Kwee, A.T.: Detecting novel business blogs. In: ICICS 2009 - Conference Proceedings of the 7th International Conference on Information, Communications and Signal Processing (2009)
7. Ng, K.W., Tsai, F.S., Chen, L., Goh, K.C.: Novelty detection for text documents using named entity recognition. In: 2007 6th International Conference on Information, Communications and Signal Processing, ICICS (2007)
8. Ong, C.L., Kwee, A., Tsai, F.: Database optimization for novelty detection. In: ICICS 2009 - Conference Proceedings of the 7th International Conference on Information, Communications and Signal Processing (2009)
9. PKU and CAS, Chinese POS tagging criterion (1999), http://icl.pku.edu.cn/icl_groups/corpus/addition.htm
10. Rocchio, J.: Relevance feedback in information retrieval. In: The SMART Retrieval System: Experiments in Automatic Document Processing, pp. 313–323 (1971)
11. Soboroff, I.: Overview of the TREC 2004 Novelty Track. In: Proceedings of TREC 2004 - the 13th Text Retrieval Conference, pp. 1–16 (2004)
12. Tan, R., Tsai, F.S.: Authorship identification for online text. In: International Conference on Cyberworlds, pp. 155–162 (2010)
13. Tang, W., Tsai, F.S., Chen, L.: Blended metrics for novel sentence mining. Expert Syst. Appl. 37(7), 5172–5177 (2010)
14. Tsai, F.S.: Review of techniques for intelligent novelty mining. Information Technology Journal 9(6), 1255–1261 (2010)
15. Tsai, F.S.: Dimensionality reduction techniques for blog visualization. Expert Systems With Applications 38(3), 2766–2773 (2011)
16. Tsai, F.S.: A tag-topic model for blog mining. Expert Systems With Applications 38(5), 5330–5335 (2011)
17. Tsai, F.S., Chan, K.L.: Detecting cyber security threats in weblogs using probabilistic models. In: Yang, C.C., Zeng, D., Chau, M., Chang, K., Yang, Q., Cheng, X., Wang, J., Wang, F.-Y., Chen, H. (eds.) PAISI 2007. LNCS, vol. 4430, pp. 46–57. Springer, Heidelberg (2007)
18. Tsai, F.S., Chan, K.L.: Dimensionality reduction techniques for data exploration. In: 2007 6th International Conference on Information, Communications and Signal Processing, ICICS 2007, pp. 1568–1572 (2007)
19. Tsai, F.S., Chan, K.L.: Redundancy and novelty mining in the business blogosphere. The Learning Organization 17(6), 490–499 (2010)
20. Tsai, F.S., Chan, K.L.: An intelligent system for sentence retrieval and novelty mining. International Journal of Knowledge Engineering and Data Mining 1(3), 235–253 (2011)
21. Tsai, F.S., Tang, W., Chan, K.L.: Evaluation of metrics for sentence-level novelty mining. Information Sciences 180(12), 2359–2374 (2010)
22. Tsai, F.S., Zhang, Y.: D2S: Document-to-sentence framework for novelty detection. Knowledge and Information Systems (2011)
23. Zhang, H.-P., Liu, Q., Cheng, X.-Q., Zhang, H., Yu, H.-K.: Chinese lexical analysis using hierarchical hidden markov model. In: Second SIGHAN Workshop Affiliated with 41th ACL, pp. 63–70 (2003)
24. Zhang, Y., Tsai, F.S.: Combining named entities and tags for novel sentence detection. In: Proceedings of the WSDM 2009 ACM Workshop on Exploiting Semantic Annotations in Information Retrieval, ESAIR 2009, pp. 30–34 (2009)
25. Zheng, W., Zhang, Y., Zou, B., Hong, Y., Liu, T.: Research of Chinese topic tracking based on relevance model (2008)
Finding Rare Classes: Adapting Generative and Discriminative Models in Active Learning Timothy M. Hospedales, Shaogang Gong, and Tao Xiang Queen Mary University of London, UK, E1 4NS {tmh,sgg,txiang}@eecs.qmul.ac.uk
Abstract. Discovering rare categories and classifying new instances of them is an important data mining issue in many fields, but fully supervised learning of a rare class classifier is prohibitively costly. There has therefore been increasing interest both in active discovery: to identify new classes quickly, and active learning: to train classifiers with minimal supervision. Very few studies have attempted to jointly solve these two inter-related tasks which occur together in practice. Optimizing both rare class discovery and classification simultaneously with active learning is challenging because discovery and classification have conflicting requirements in query criteria. In this paper we address these issues with two contributions: a unified active learning model to jointly discover new categories and learn to classify them; and a classifier combination algorithm that switches generative and discriminative classifiers as learning progresses. Extensive evaluation on several standard datasets demonstrates the superiority of our approach over existing methods.
1 Introduction
Many real life problems are characterized by data distributed between vast yet uninteresting background classes, and small rare classes of interesting instances which should be identified. In astronomy, the vast majority of sky survey image content is due to well understood phenomena, and only 0.001% of data is of interest for astronomers to study [12]. In financial transaction monitoring, most are ordinary but a few unusual ones indicate fraud and regulators would like to find future instances. Computer network intrusion detection exhibits vast amounts of normal user traffic, and a very few examples of malicious attacks [16]. Finally, in computer vision based security surveillance of public spaces, observed activities are almost always people going about everyday behaviours, but very rarely may be a dangerous or malicious activity of interest [19]. All of these classification problems share two interesting properties: highly unbalanced frequencies – the vast majority of data occurs in one or more background classes, while the instances of interest for classification are much rarer; and unbalanced prior supervision – the majority classes are typically known a priori, while the rare classes are not. Classifying rare event instances rather than merely detecting any rare event is crucial because different classes may warrant different responses, for example due to different severity levels. In order to discover and learn to
classify the interesting rare classes, exhaustive labeling of a large dataset would be required to ensure sufficient rare class coverage. However this is prohibitively expensive when generating each label requires significant time of a human expert. Active learning strategies might be used to discover or train a classifier with minimal label cost, but this is complicated by the dependence of classifier learning on discovery: one needs examples of each class to train a classifier. The problem of joint discovery and classification has received little attention despite its importance and broad relevance. The only existing attempt to address this is based on simply applying schemes for discovery and classifier learning sequentially or in fixed iteration [16]. Methods which treat discovery and classification independently perform poorly due to making inefficient use of data (e.g., spending time on classifier learning is useless if the right classes have not been discovered and vice-versa). Achieving the optimal balance is critical, but non-trivial given the conflict between discovery and learning criteria. To address this, we build a generative-discriminative model pair [11,4] for computing discovery and learning query criteria, and adaptively balance their use based on joint discovery and classification performance. Depending on the actual supervision cost and sparsity of rare class examples, the quantity of labeled data varies. Given the nature of data dependence in generative and discriminative models [11], the ideal classifier also varies. As a second contribution, we therefore address robustness to label quantity and introduce a classifier switching algorithm to optimize performance as data is accumulated. The result is a framework which significantly and consistently outperforms existing methods at the important task of discovery and classification of rare classes.

Related Work. A common unsupervised approach to rare class detection is outlier detection: building an unconditional model of the data and flagging unlikely instances. This has a few serious limitations: it does not classify; it fails with non-separable data, where interesting classes are embedded in the majority distribution; and it does not exploit any supervision about flagged outliers, limiting its accuracy – especially in distinguishing rare classes from noise. Iterative active learning approaches are often used to learn a classifier with minimal supervision [14]. Much of the active learning literature is concerned with the relative merits of different query criteria. For example, querying points that: are most uncertain [14]; reduce the version space [17]; or reduce direct approximations of the generalization error [13]. Different criteria may be suited to different datasets, e.g., uncertainty criteria are good to refine decision boundaries, but can be fatal if the classes are non-separable (the most uncertain points may be hopeless) or highly multi-modal. This has led to attempts to select dataset specific criteria online [2]. All these approaches rely on classifiers, and do not generally apply to scenarios in which the target classes are themselves unknown. Recently, active learning has been applied to discovering rare classes using e.g., likelihood [12] or gradient [9] criteria. Solving discovery and classification problems together with active learning is challenging because for a single dataset, good discovery and classification criteria are often completely different. Consider the toy scenarios in Figure 1.
Here the color indicates the true class, and the
symbol indicates the estimated class based on two initial labeled points (large symbols). The black line indicates the initial decision boundary. In Figure 1(a) all classes are known but the decision boundary needs refining. Likelihood sampling (most unlikely point under the learned model) inefficiently builds a model of the whole space (choosing first the points labeled L), while uncertainty sampling selects points closest to the boundary (U symbols), leading to efficient refinement. In Figure 1(b) only two classes are known. Uncertainty inefficiently queries around the known decision boundary (choosing first the points U) without discovering the new classes above. In contrast, these are the first places queried by likelihood sampling (L symbols). Evidently, single-criterion approaches are insufficient. Moreover, multiple criteria may be necessary for a single dataset at different stages of learning, e.g., likelihood to detect new classes and uncertainty to learn to classify them. A simple but inefficient approach [16] is to simply iterate over criteria in fixed proportion. In contrast, our innovation is to adapt criteria online so as to select the right strategy at each stage of learning, which can dramatically increase efficiency. Typically, “exploration” is automatically preferred while there are easily discoverable classes, and “exploitation” to refine decision boundaries when most classes have been discovered. This ultimately results in better rare class detection performance than single objective, or non-adaptive methods [16].
Fig. 1. Sample Problems
Finally, there is the issue of what base classifier to use in the active learning algorithm of choice. One can categorize classifiers into two broad categories: generative and discriminative. Discriminative models directly learn p(y|x) for class y and data x. Generative models learn p(x, y) and compute p(y|x) via Bayes rule. The importance of this for active learning is that for a given generativediscriminative pair (in the sense of equivalent parametric form – such as naive Bayes & logistic regression), generative classifiers typically perform better with few training examples, while discriminative models are better asymptotically [11]. The ideal classifier is therefore likely to be completely different early and late in the active learning process. An automatic way to select the right classifier online as more labels are obtained is therefore key. Existing active learning work focuses on single generative [13] or discriminative [17] classifiers. We introduce a novel algorithm to switch classifiers online as the active learning process progresses in order to get the best of both worlds.
2 Adaptive Active Learning

2.1 Active Learning
In this paper we deal with pool-based uncertainty sampling and likelihood sampling because of their computational efficiency and clearly complementary nature. Our method can nevertheless be easily generalized to other criteria. We consider a classification problem starting with many unlabeled instances U = (x_1, .., x_n) and a small set of labeled instances L = ((x_1, y_1), .., (x_m, y_m)). L does not include the full set of possible labels Y in advance. We wish to learn the posterior conditional distribution p(y|x) so as to accurately classify the data in U. Active learning proceeds by iteratively: i) training a classifier C on L; ii) using a query function Q(C, L, U) → i* to select an unlabeled instance i* to be labeled; and iii) removing x_{i*} from U and adding (x_{i*}, y_{i*}) to L.

Query Criteria. Perhaps the most commonly applied query criteria are uncertainty sampling and variants [14]. The intuition is that if the current classification of a point is highly uncertain, it should be informative to label. Uncertainty is typically quantified by posterior entropy, which for binary classification reduces to selecting the point whose posterior is closest to p(y|x) = 0.5. The posterior p(y|x) of every point in U is evaluated and the uncertain points queried,

$$p_u(i) \propto \exp\left(-\beta \sum_{y_i} p(y_i|x_i) \log p(y_i|x_i)\right). \qquad (1)$$
Rather than selecting a single maximum, we exploit the fact that a normalized degree of preference p_u(i) for every point i can be expressed by putting the entropy into a Gibbs function (1). For non-probabilistic SVM classifiers, an approximation to p(y|x) can be derived from the distance to the margin of each point [14]. A complementary query criterion is that of low likelihood p(x|y). Such points are badly explained by the current model, and should therefore be informative to label [12]. This may involve marginalizing over or selecting the most likely class,

$$p_l(i) \propto \exp\left(-\beta \max_{y_i} p(x_i|y_i)\right). \qquad (2)$$
The uncertainty measure in (1) is in spirit discriminative (in focusing on decision boundaries), although p(y|x) can obviously be realized by a generative classifier. In contrast, the likelihood measure in (2) is intrinsically generative, in that it requires a density model of each class y, rather than just the decision boundary. The uncertainty measure is generally unsuitable for finding new classes, as it focuses on known decision boundaries, and the likelihood measure is good at finding new classes, while being poorer at refining decision boundaries between known classes (Figure 1). Note that the likelihood measure can still be useful to improve known-class classification if the classes are multi-modal – it will explore different modes. Our adaptation method will allow it to be used in both ways. Next, we discuss specific parametric forms for our models.
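The two query distributions can be sketched as follows. This is an illustrative reading of (1) and (2) with Gibbs parameter β = 100 (the value reported later); the input arrays and the numerical-stability shifts are our own conventions.

```python
# Query-selection distributions over the unlabeled pool:
# uncertainty (Gibbs over posterior entropy) and low likelihood.
import numpy as np

def uncertainty_distribution(posteriors, beta=100.0):
    # posteriors: (n, C) array of p(y|x) for each unlabeled point.
    entropy = -np.sum(posteriors * np.log(posteriors + 1e-12), axis=1)
    g = np.exp(beta * (entropy - entropy.max()))   # shift only for stability
    return g / g.sum()

def likelihood_distribution(likelihoods, beta=100.0):
    # likelihoods: (n, C) array of p(x|y); low max-likelihood => high preference.
    best = likelihoods.max(axis=1)
    g = np.exp(-beta * (best - best.min()))
    return g / g.sum()

# A criterion k is then drawn from Multi(w) and the query point i* from p_k(i).
```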
2.2 Generative-Discriminative Model Pairs
We use a Gaussian mixture model (GMM) for the generative model and a support vector machine (SVM) for the discriminative model. These were chosen because they may both be incrementally trained (for active learning efficiency), and they are a complementary generative-discriminative pair in that (assuming a radial basis SVM kernel) they have equivalent classes of decision boundaries [4], but are optimized with very different criteria during learning.

Incremental GMM Estimation. For online GMM learning, we use the incremental agglomerative algorithm from [15]. To summarize the procedure, for the first n = 1..N training points observed with the same label y, {x_n, y}_{n}^{N}, we incrementally build a model p(x|y) for y using kernel density estimation with Gaussian kernels N(x_n, Σ) and weights ω_n = 1/n. d is the dimension of the data x.

$$p(x|y) = \frac{1}{(2\pi)^{d/2}|\Sigma|^{1/2}} \sum_{n=1}^{N} \omega_n \exp\left(-\frac{1}{2}(x-x_n)^T \Sigma^{-1} (x-x_n)\right). \qquad (3)$$
To bound the complexity, after some maximal number of Gaussians Nmax is reached, merge two existing Gaussians i and j by moment matching [7].
$$\omega_{(i+j)} = \omega_i + \omega_j, \qquad \mu_{(i+j)} = \frac{\omega_i}{\omega_{(i+j)}}\mu_i + \frac{\omega_j}{\omega_{(i+j)}}\mu_j, \qquad (4)$$

$$\Sigma_{(i+j)} = \frac{\omega_i}{\omega_{(i+j)}}\left(\Sigma_i + (\mu_i - \mu_{(i+j)})(\mu_i - \mu_{(i+j)})^T\right) + \frac{\omega_j}{\omega_{(i+j)}}\left(\Sigma_j + (\mu_j - \mu_{(i+j)})(\mu_j - \mu_{(i+j)})^T\right). \qquad (5)$$
The components to merge are chosen by selecting the pair of Gaussian kernels (G_i, G_j) whose replacement G_{(i+j)} is most similar, in terms of the Kullback-Leibler divergence. Specifically, we minimize the cost C_ij,

$$C_{ij} = \omega_i \, KL(G_i \,||\, G_{(i+j)}) + \omega_j \, KL(G_j \,||\, G_{(i+j)}). \qquad (6)$$
Importantly for iterative active learning online, merging Gaussians and updating the cost matrix requires constant O(N_max) computation every iteration once the initial cost matrix is built. In contrast, learning a GMM with latent variables requires multiple expensive O(n) expectation-maximization iterations [12]. The initial covariance Σ is assumed uniform diagonal, Σ = Iσ², and is estimated a priori by leave-one-out cross validation on the (large) unlabeled set U:

$$\hat{\sigma} = \arg\max_{\sigma} \prod_{n\in U} \sum_{x \neq x_n} \sigma^{-\frac{d}{2}} \exp\left(-\frac{1}{2\sigma^2}(x-x_n)^2\right). \qquad (7)$$
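The moment-matching merge of Eqs. (4)-(5) is simple to implement; a minimal sketch is given below (selection of which pair to merge, via the KL cost of Eq. (6), is not shown).

```python
# Moment-matched merge of two weighted Gaussian kernels, following Eqs. (4)-(5),
# used to keep the incremental GMM below N_max components.
import numpy as np

def merge_gaussians(w_i, mu_i, cov_i, w_j, mu_j, cov_j):
    w = w_i + w_j
    mu = (w_i * mu_i + w_j * mu_j) / w
    cov = (w_i / w) * (cov_i + np.outer(mu_i - mu, mu_i - mu)) \
        + (w_j / w) * (cov_j + np.outer(mu_j - mu, mu_j - mu))
    return w, mu, cov
```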
Given the learned models p(x|y), we can classify ŷ ← f_gmm(x), where

$$f_{gmm}(x) = \arg\max_y p(y|x), \qquad p(y|x) \propto \sum_i w_i \, \mathcal{N}(x; \mu_{i,y}, \Sigma_{i,y})\, p(y). \qquad (8)$$
SVM. We use a standard SVM approach with RBF kernels, treating multi-class classification as a set of 1-vs-1 decisions, for which the decision rule [4] is given (in an equivalent form to (8)) as

$$f_{svm}(x) = \arg\max_y \left( \sum_{v_i \in SV_y} \alpha_{ki} \, \mathcal{N}(x; v_i) + \alpha_{k0} \right), \qquad (9)$$

and p(y|x) can be computed based on the binary posterior estimates [18].

2.3 Combining Active Query Criteria
Given the generative GMM and discriminative SVM models defined in Section 2.2, and their respective likelihood and uncertainty query criteria defined in Section 2.1, our first concern is how to adaptively combine the query criteria online for discovery and classification. Our algorithm involves probabilistically selecting a query criterion Q_k according to some weights w (k ∼ Multi(w)) and then sampling the query point from the distribution i* ∼ p_k(i) ((1) or (2)).¹ The weights w will be adapted based on the discovery and classification performance φ of our active learner at each iteration. In an active learning context, [2] shows that because labels are few and biased, cross-validation is a poor way to assess classification performance, and suggests the unsupervised measure of binary classification entropy (CE) on the unlabeled set U instead. This is especially the case in the rare class context, where there is often only one example of a given class, so cross-validation is not well defined. To overcome this problem, we generalize CE to the multi-class entropy (MCE) of the classifier f(x) and take it as our indication of classification performance,

$$H = -\sum_{y=1}^{n_y} \frac{\sum_i I(f(x_i) = y)}{|U|} \log_{n_y} \frac{\sum_i I(f(x_i) = y)}{|U|}. \qquad (10)$$
Here I is the indicator function that returns 1 if its argument is true, and n_y is the number of classes observed so far. Importantly, we explicitly reward the discovery of new classes to jointly optimize classification and discovery. We define the overall active learning performance φ_t(i) upon querying point i at time t as

φ_t(i) = α I(y_i ∉ L) + (1 − α) \frac{(e^{H_t} − e^{H_{t−1}}) − (1 − e)}{2e − 2}.   (11)

¹ We choose this method because each criterion has very different “reasons” for its preference. An alternative is querying a product or mean [2,5,3] of the criteria. That risks querying a merely moderately unlikely and uncertain point – neither outlying nor on a decision boundary – which is useless for either classification or discovery.
The first right-hand term above rewards discovery of a new class, and the second term rewards an increase in MCE (as an estimate of classification accuracy) after labeling point i at time t. The constants (1 − e) and (2e − 2) ensure the second term lies between 0 and 1. The parameter α is the mixing prior for discovery vs. classification. Given this performance measure, we define an update for the future weight w_{t+1} of each active criterion k,

w_{t+1,k} ∝ λ w_{t,k} + (1 − λ) φ_t(i) \frac{p_k(i)}{p(i)} + ε.   (12)
Here we define an exponential decay (first term) of the weight in favor of (second term) the current performance φ weighted by how strongly criterion k recommended the chosen point i, compared to the joint recommendation p(i) = \sum_k p_k(i). λ is the forgetting factor. The third term ε encourages exploration by diffusing the weights so that every criterion is tried occasionally. In summary, this approach adaptively selects more frequently those criteria that have been successful at discovering new classes and/or increasing MCE, thereby optimizing both discovery and classification accuracy.
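As a concrete illustration of (10)-(12), a minimal numpy sketch might look as follows. The function names are ours, and the explicit renormalization of w is an assumption, since (12) only defines the weights up to proportionality.

import numpy as np

def multi_class_entropy(predictions, n_y):
    # MCE of the classifier's predictions on the unlabeled set U, Eq. (10); log base n_y.
    if n_y < 2:
        return 0.0
    p = np.bincount(predictions, minlength=n_y).astype(float)
    p = p / p.sum()
    p = p[p > 0]
    return float(-np.sum(p * np.log(p) / np.log(n_y)))

def learning_performance(new_class_found, H_t, H_prev, alpha=0.5):
    # Joint discovery/classification reward phi_t(i), Eq. (11).
    e = np.e
    gain = ((np.exp(H_t) - np.exp(H_prev)) - (1.0 - e)) / (2.0 * e - 2.0)
    return alpha * float(new_class_found) + (1.0 - alpha) * gain

def update_criterion_weights(w, phi, p_of_query, lam=0.6, eps=0.01):
    # Eq. (12): p_of_query[k] = p_k(i*), the probability criterion k assigned to the queried point.
    p_joint = p_of_query.sum()              # p(i*) = sum_k p_k(i*)
    w_new = lam * w + (1.0 - lam) * phi * (p_of_query / p_joint) + eps
    return w_new / w_new.sum()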
2.4 Adaptive Selection of Classifiers
As discussed in Section 1, although we broadly expect the generative GMM classifier to have better initial performance, and the discriminative SVM classifier to have better asymptotic performance, the ideal classifier will vary with dataset and active learning iteration. The remaining question is how to combine these classifiers [10] online for best performance given any specific supervision budget. Cross-validation to determine reliability is infeasible because of the lack of data; however, we can again resort to the MCE over the training set U (10). In our experience, MCE is indeed indicative of generalization performance, but relatively crudely and non-linearly so. This makes approaches based on MCE-weighted posterior fusion unreliable. We therefore choose a simpler but more reliable approach which switches the final classifier at the end of each iteration to the one with higher MCE, aiming to perform as well as the better classifier for any label budget. Additionally, the process of multi-class posterior estimation for SVMs [18] requires cross-validation and is inaccurate with limited data. To compute the uncertainty criterion (1) at each iteration, we therefore use the posterior of the classifier determined to be more reliable by MCE. This ensures that uncertainty sampling is as accurate as possible in both low and high data contexts.

Summary. Algorithm 1 summarizes our approach. There are four parameters: Gibbs parameter β, discovery vs. classification prior α, forgetting rate λ and exploring rate ε. None of these were tuned; we set them all crudely to intuitive values for all experiments, β = 100, α = 0.5, λ = 0.6 and ε = 0.01. The GMM and SVM classifiers both have regularization hyperparameters N_max and (C, γ). These were not optimized, but set at standard values N_max = 32, C = 1, γ = 1/d.
Algorithm 1. Integrated Active Learning for Discovery and Classification

Active Learning
Input: Labeled L and unlabeled U data. Classifiers C, query criteria Q_k, weights w.
1. Build unconditional GMM from L ∪ U (3)-(5)
2. Estimate σ by cross-validation (7)
3. Train initial GMM f_gmm and SVM f_svm classifiers on L using σ
Repeat as training budget allows:
1. Compute query criteria p_u(i) (1) and p_l(i) (2)
2. Sample query criterion to use, k ∼ Multi(w)
3. Query point i* ∼ p_k(i), add (x_i*, y_i*) to L
4. Update classifiers f_gmm and f_svm with point i* (8) and (9)
5. Compute multi-class classification entropies H_gmm and H_svm (10)
6. Update query criteria weights w (11) and (12)
7. If H_gmm > H_svm: select classifier f_gmm(x), Else: select f_svm(x)

Testing
Input: Testing samples U*, selected classifier c.
1. Classify x ∈ U* with f_c(x) ((8) or (9))
3 Experiments
Evaluation Procedure. We tested our method on 7 rare class datasets from the UCI repository [1] and on the CASIA gait dataset [20], for which we addressed the image viewpoint recognition problem. We unbalanced the CASIA dataset by sampling training classes in geometric proportion. In each case we labeled one point from the largest class and the goal was to discover and learn to classify the remaining classes. Table 1 summarizes the properties of each dataset. Performance was evaluated at each iteration by: i) the number of distinct classes discovered and ii) the average classification accuracy over all classes. This accuracy measure weights the ability to classify rare classes equally with the majority class despite the fewer rare class points. Moreover, it means that undiscovered rare classes automatically penalize accuracy. Accuracy was evaluated by 2-fold cross-validation, averaged over 25 runs from random initial conditions. Comparative Evaluations. We compared the following methods: S/R: A baseline SVM classifier making random queries. G/G: GMM classification with GMM likelihood criterion (2). S/S: SVM classifier with SVM uncertainty criterion (1). S/GSmix: SVM classifier alternating GMM likelihood and SVM uncertainty queries (corresponding to [16]). S/GSonline: SVM classifier fusing GMM likelihood & SVM uncertainty criteria by the method in [2]. S/GSadapt: SVM classification with our adaptive fusion of GMM likelihood & SVM uncertainty
criteria (10)-(12). GSsw/GSadapt: Our full model including online switching of GMM and SVM classifiers, as detailed in Algorithm 1.

Shuttle (Figure 2(a)). Our methods S/GSadapt (cyan) and GSsw/GSadapt (red) exploit likelihood sampling early for fast discovery, and hence early classification accuracy. (We also outperform the gradient and EM based active discovery models in [9] and [12].) Our adaptive models switch to uncertainty sampling later on, and hence achieve higher asymptotic accuracy than the pure likelihood based G/G method. Figure 2(c) illustrates this process via the query criteria weighting (12) for a typical run. The likelihood criterion discovers a new class early, leading to higher weight (11) and rapid discovery of the remaining classes. After 50 iterations, with no new classes to discover, the uncertainty criterion obtains greater reward (11) and dominates, efficiently refining classification performance.

Thyroid (Figure 2(b)). Our GSsw/GSadapt model (red) is the best overall classifier: it matches the initially superior performance of the G/G likelihood-based model (green), but later achieves the asymptotic performance of the SVM classifier based models. This is because of our classifier switching innovation (Section 2.4). Figure 2(d) illustrates switching via the average (training) classification entropy and (testing) accuracy of the classifiers composing GSsw/GSadapt. The GMM classifier entropy (black dots) is higher than the SVM entropy (blue dots) for the first 25 iterations. This is approximately the period over which the GMM classifier (black line) has better performance than the SVM classifier (blue line), so switching classifiers on entropy allows the pair (green dashes) to always perform as well as the best individual classifier for each iteration.

Glass (Figure 2(e)). GSsw/GSadapt again performs best by switching to match the good initial performance of the GMM classifier and the asymptotic performance of the SVM. Note the dramatic improvement over the SVM models in the first 50 iterations.

Pageblocks (Figure 2(f)). The SVM-based models outperform G/G at most iterations. Our GSsw/GSadapt correctly selects the SVM classifier throughout.

Gait view (Figure 2(g)). The majority class contains outliers, so the likelihood criterion is unusually weak at discovery. Additionally, for this data SVM performance is generally poor, especially in early iterations. GSsw/GSadapt adapts impressively to this dataset in two ways enabled by our contributions: exploiting the uncertainty sampling criterion extensively and switching to predicting using the GMM classifier.

In summary, the G/G method (likelihood criterion) was usually the most efficient at discovering classes, as expected. However, it was usually asymptotically weaker at classifying new instances. This is because generative model mis-specification tends to cost more with increasing amounts of data [11]. S/S (uncertainty criterion) was generally poor at discovery (and hence classification). Alternating between likelihood and uncertainty sampling, S/GSmix (corresponding to [16]) did a fair job of both discovery and classification, but under-performed our adaptive models due to its inflexibility. S/GSonline (corresponding to [2]) was better than random or S/S, but was not the quickest learner. Our first model S/GSadapt, which solely adapted the multiple active query criteria, was competitive at discovery, but sometimes not the best at classification in early phases with very little data – due to exclusively using the discriminative SVM classifier. Finally, by exploiting generative-discriminative classifier switching, our complete GSsw/GSadapt model was generally the best classifier over all stages of learning. Table 2 quantitatively summarizes the performance of the most competitive models for all datasets in terms of area under the classification curve.

Table 1. Dataset properties. Number of items N, classes Nc, dimensions d. Smallest and largest class proportions S/L.

Data       N      d   Nc  S%     L%
Ecoli      336    7   8   1.5%   42%
PageBlock  5473   10  5   .5%    90%
Glass      214    10  6   4%     36%
Covertype  10000  10  7   3.6%   25%
Shuttle    10000  9   7   .01%   78%
Thyroid    3772   22  3   2.5%   92%
KDD99      50000  23  15  .04%   51%
Gait view  2353   25  9   3%     49%

Table 2. Classification performance summary in terms of area under the classification curve.

Data  G/G  S/GSmix  S/GSad  GSsw/GSad
EC    59   60       60      62
PB    53   57       58      59
GL    63   55       57      64
CT    41   39       43      46
SH    40   39       42      43
TH    50   55       56      59
KD    41   23       54      59
GA    38   31       49      57

Fig. 2. (a) Shuttle and (b) Thyroid dataset performance, (c) Shuttle criteria adaptation, (d) Thyroid entropy-based classifier switching, (e) Glass, (f) Pageblocks and (g) Gait view dataset performance. (Axes: classes discovered and average accuracy versus number of labeled points.)
4 Conclusion
Summary. We highlighted active classifier learning with a priori unknown rare classes as an under-studied but broadly relevant and important problem. To solve joint rare class discovery and classification, we proposed a new framework to adapt both active query criteria and classifier. To adaptively switch generative and discriminative classifiers online we introduced MCE; and to adapt query criteria we exploited a joint reward signal of new class discovery and MCE. In adapting to each dataset and online as data is obtained, our model significantly outperformed contemporary alternatives on eight standard datasets. Our approach will be of great practical value for many problems.

Discussion. A related area of research to our present work is that of learning from imbalanced data [8], which aims to learn classifiers for classes with imbalanced distributions while avoiding the pitfall of simply classifying everything as the majority class. One strategy to achieve this is uncertainty based active learning [6], which works because the distribution around the class boundaries is less imbalanced than the whole dataset. Our task is also an imbalanced learning problem, but more general in that the rare classes must also be discovered. We succeed in learning from imbalanced distributions via our use of uncertainty sampling, so in that sense our method generalizes [6]. Although our approach lacks the theoretical bounds of the fusion method in [2], we find it more compelling for various reasons: it addresses a very practical and previously unaddressed problem of learning to discover new classes and find new instances of them by jointly optimizing searching for new classes and refining their decision boundaries. It adapts based on the current state of the learning process, i.e., early on, class finding via likelihood may be more appropriate, and later on boundary refinement via uncertainty. In contrast, [2] solely optimizes classification accuracy and is not directly applicable to discovery. [5] and [3] address the fusion of uncertainty and density (to avoid outliers) criteria for classifier learning (not discovery). [5] adapts between density-weighted and unweighted uncertainty sampling based on their expected future error. This is different to our situation because there is no meaningful notion of future error when an unknown number of classes remain to be discovered. [3] samples from a weighted sum of density and uncertainty criteria. This is less powerful than our approach because it does not adapt online based on the performance of each criterion. Importantly, both [5] and [3] prefer high density points; while for rare class discovery we require the opposite – low likelihood. In comparison to other active rare class discovery work, our framework generalizes [12] (which exclusively uses generative models and likelihood criteria) to using more criteria and adapting between them. [9] focuses on a different active discovery intuition, using local gradient to discover non-separable rare classes. We derived an analogous query criterion based on GMM local gradient. It was generally weaker than likelihood-based discovery (and was hence adapted downward in our framework) for our datasets, so we do not report on it here. Unlike our work here, [5,12,9] all also rely on the very strong assumption that the user at least specifies the number of classes in advance. Finally, the
only other work of which we are aware which addresses both discovery and classification is [16]. This uses a fixed classifier and non-adaptively iterates between discovery and uncertainty criteria (corresponding to our S/GSmix condition). In contrast, our results have shown that our switching classifier and adaptive query criteria provide compelling benefits for discovery and classification.

Future Work. There are various interesting questions for future research, including how to create tighter coupling between the generative and discriminative components [4], and generalizing our ideas to stream-based active learning, which is a more natural setting for some practical problems.

Acknowledgment. This research was funded by the EU FP7 project SAMURAI with grant no. 217899.
References 1. Asuncion, A., Newman, D.: UCI machine learning repository (2007), http://www.ics.uci.edu/ml/ 2. Baram, Y., El-Yaniv, R., Luz, K.: Online choice of active learning algorithms. Journal of Machine Learning Research 5, 255–291 (2004) 3. Cebron, N., Berthold, M.R.: Active learning for object classification: from exploration to exploitation. Data Min. Knowl. Discov. 18(2), 283–299 (2009) 4. Deselaers, T., Heigold, G., Ney, H.: SVMs, gaussian mixtures, and their generative/discriminative fusion. In: ICPR (2008) 5. Donmez, P., Carbonell, J.G., Bennett, P.N.: Dual strategy active learning. In: Kok, J.N., Koronacki, J., Lopez de Mantaras, R., Matwin, S., Mladenič, D., Skowron, A. (eds.) ECML 2007. LNCS (LNAI), vol. 4701, pp. 116–127. Springer, Heidelberg (2007) 6. Ertekin, S., Huang, J., Bottou, L., Giles, L.: Learning on the border: active learning in imbalanced data classification. In: CIKM (2007) 7. Goldberger, J., Roweis, S.: Hierarchical clustering of a mixture model. In: NIPS (2004) 8. He, H., Garcia, E.: Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering 21(9), 1263–1284 (2009) 9. He, J., Carbonell, J.: Nearest-neighbor-based active learning for rare category detection. In: NIPS (2007) 10. Kittler, J., Hatef, M., Duin, R.P.W., Matas, J.: On combining classifiers. IEEE Transactions on Pattern Analysis and Machine Intelligence 20(3), 226–239 (1998) 11. Ng, A., Jordan, M.: On discriminative vs. generative classifiers: A comparison of logistic regression and naive bayes. In: NIPS (2001) 12. Pelleg, D., Moore, A.: Active learning for anomaly and rare-category detection. In: NIPS (2004) 13. Roy, N., McCallum, A.: Toward optimal active learning through sampling estimation of error reduction. In: ICML, pp. 441–448 (2001) 14. Settles, B.: Active learning literature survey. Tech. Rep. 1648, University of wisconsin–Madison (2009) 15. Sillito, R., Fisher, R.: Incremental one-class learning with bounded computational complexity. In: ICANN (2007)
16. Stokes, J.W., Platt, J.C., Kravis, J., Shilman, M.: Aladin: Active learning of anomalies to detect intrusions. Tech. Rep. 2008-24, MSR (2008) 17. Tong, S., Koller, D.: Support vector machine active learning with applications to text classification. In: ICML (2000) 18. Wu, T.F., Lin, C.J., Weng, R.C.: Probability estimates for multi-class classification by pairwise coupling. Journal of Machine Learning Research 5, 975–1005 (2004) 19. Xiang, T., Gong, S.: Video behavior profiling for anomaly detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 30(5), 893–908 (2008) 20. Yu, S., Tan, D., Tan, T.: A framework for evaluating the effect of view angle, clothing and carrying condition on gait recognition. In: ICPR (2006)
Margin-Based Over-Sampling Method for Learning from Imbalanced Datasets

Xiannian Fan, Ke Tang, and Thomas Weise

Nature Inspired Computational and Applications Laboratory, School of Computer Science and Technology, University of Science and Technology of China, Hefei, China, 230027
[email protected],{ketang,tweise}@ustc.edu.cn
Abstract. Learning from imbalanced datasets has drawn more and more attention from both theoretical and practical perspectives. Over-sampling is a popular and simple method for imbalanced learning. In this paper, we show that there is an inherent risk associated with over-sampling algorithms in terms of the large margin principle. We then propose a new synthetic over-sampling method, named Margin-guided Synthetic Over-sampling (MSYN), to reduce this risk. MSYN improves learning with respect to the data distributions, guided by the margin-based rule. An empirical study verifies the efficacy of MSYN.

Keywords: imbalance learning, over-sampling, over-fitting, large margin theory, generalization.
1 Introduction
Learning from imbalanced datasets has received more and more emphasis in recent years. A dataset is imbalanced if its class distributions are skewed. The class imbalance problem is of crucial importance since it is encountered by a large number of real-world applications, such as fraud detection [1], the detection of oil spills in satellite radar images [2], and text classification [3]. In these scenarios, we are usually more interested in the minority class than in the majority class. Traditional data mining algorithms perform poorly on such data because they give equal attention to the minority class and the majority class. One way of solving the imbalanced learning problem is to develop "imbalanced data oriented algorithms" that can perform well on imbalanced datasets. For example, Wu et al. proposed the class boundary alignment algorithm, which modifies the class boundary by changing the kernel function of SVMs [4]. Ensemble methods have been used to improve performance on imbalanced datasets [5]. In 2010, Liu et al. proposed the Class Confidence Proportion Decision Tree (CCPDT) [6]. Furthermore, there are other effective methods such as cost-based learning [7] and one-class learning [8].
Another important way to improve the results of learning from imbalanced data is to modify the class distributions in the training data by over-sampling the minority class or under-sampling the majority class [9]. The simplest sampling methods are Random Over-Sampling (ROS) and Random Under-Sampling (RUS). The former increases the number of minority class instances by duplicating instances of the minority, while the latter randomly removes some instances of the majority class. Sampling with replacement has been shown not to significantly improve the recognition of the minority class [9,10]. Chawla et al. interpret this phenomenon in terms of decision regions in feature space and proposed the Synthetic Minority Over-Sampling Technique (SMOTE) [11]. There are also many other synthetic over-sampling techniques, such as Borderline-SMOTE [12] and ADASYN [13]. To summarize, under-sampling methods can discard useful information from the datasets; over-sampling methods may make the decision regions of the learner smaller and more specific, and thus may cause the learner to over-fit.
2 Related Works
We use A to denote a dataset of n instances A = {a_1, ..., a_n}, where a_i is a real-valued vector of dimension m. Let A_P ⊂ A denote the minority class instances and A_N ⊂ A denote the majority class instances. Over-sampling techniques augment the minority class to balance between the numbers of the majority and minority class instances. The simplest over-sampling method is ROS. However, it may make the decision regions of the majority smaller and more specific, and thus can cause the learner to over-fit [16]. Chawla et al. over-sampled the minority class with their SMOTE method, which generates new synthetic instances along the line between the minority instances and their selected nearest neighbors [11]. Specifically, for the subset A_P, they consider the k-nearest neighbors for each instance a_i ∈ A_P. For some specified integer number k, the k-nearest neighbors are defined as the k elements of A_P whose Euclidean distance to the element a_i under consideration is the smallest. To create a synthetic instance, one of the k-nearest neighbors is randomly
selected and then multiplied by the corresponding feature vector difference with a random number between [0, 1]. Take a two-dimensional problem for example:

a_new = a_i + (a_nn − a_i) × δ,

where a_i ∈ A_P is the minority instance under consideration, a_nn is one of the k-nearest neighbors from the minority class, and δ ∈ [0, 1]. This leads to generating a random instance along the line segment between two specific instances and thus effectively forces the decision region of the minority class to become more general [11]. The advantage of SMOTE is that it makes the decision regions larger and less specific [16] (a minimal code sketch of this generation step is given at the end of this section).

Borderline-SMOTE focuses on the instances on the borderline of each class and the ones nearby. The consideration behind it is: the instances on the borderline (or nearby) are more likely to be misclassified than the ones far from the borderline, and are thus more important for classification. Therefore, Borderline-SMOTE only generates synthetic instances for those minority instances closer to the border, while SMOTE generates synthetic instances for each minority instance. ADASYN uses a density distribution as a criterion to automatically decide the number of synthetic instances that need to be generated for each minority instance. The density distribution is a measurement of the distribution of the weights for different minority class instances according to their level of difficulty in learning. The consideration is similar to the idea of AdaBoost [17]: one should pay more attention to the difficult instances. In summary, either Borderline-SMOTE or ADASYN improves the performance of over-sampling techniques by paying more attention to some specific instances. They, however, did not touch the essential problem of the over-sampling techniques which causes over-fitting.

Different from the previous work, we resort to margins to analyze the problem of over-sampling, since margins offer a theoretic tool to analyze the generalization ability. Margins play an indispensable role in machine learning research. Roughly speaking, margins measure the level of confidence a classifier has with respect to its decision. There are two natural ways of defining the margin with respect to a classifier [14]. One approach is to define the margin as the distance between an instance and the decision boundary induced by the classification rule. Support Vector Machines are based on this definition of margin, which we refer to as the sample margin. An alternative definition of the margin is the hypothesis margin; in this definition the margin is the distance that the classifier can travel without changing the way it labels any of the sample points [14].
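Returning to the SMOTE generation step above, a minimal sketch of it (our own illustrative numpy code, not the reference implementation of [11]) is:

import numpy as np

def smote_sample(a_i, minority, k=5, rng=np.random.default_rng(0)):
    # Create one synthetic instance from minority instance a_i: pick one of its k nearest
    # minority neighbors a_nn and interpolate, a_new = a_i + (a_nn - a_i) * delta.
    # Assumes a_i is itself a row of `minority` (its zero self-distance is skipped below).
    dists = np.linalg.norm(minority - a_i, axis=1)
    neighbors = np.argsort(dists)[1:k + 1]
    a_nn = minority[rng.choice(neighbors)]
    delta = rng.uniform(0.0, 1.0)
    return a_i + (a_nn - a_i) * delta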
3 Large Margin Principle Analysis for Over-Sampling
For prototype-based problems (e.g., the nearest neighbor classifier), the classifier is defined by a set of training points (prototypes) and the decision boundary is the Voronoi tessellation [18]. The sample margin in this case is the distance between the instance and the Voronoi tessellation. Therefore it measures the sensitivity to small changes of the instance position. The hypothesis margin R for this case is the maximal distance such that the following condition holds:
if we draw a sphere with radius R around each prototype, any change of the location of prototypes inside their sphere will not change the assigned labels. Therefore, the hypothesis margin measures the stability to small changes in the prototype locations. See Figure 1 for an illustration. Throughout this paper we will focus on the margins for the Nearest Neighbor rule (NN). For this special case, the following results are proved in [14]: 1. the hypothesis margin lower-bounds the sample margin; 2. it is easy to compute the hypothesis margin of an instance x with respect to a set of instances A by the following formula:

θ_A(x) = \frac{1}{2} \left( ||x − nearestmiss_A(x)|| − ||x − nearesthit_A(x)|| \right),   (1)

where nearesthit_A(x) and nearestmiss_A(x) denote the nearest instance to x in dataset A with the same and a different label, respectively. In the case of the NN, we thus know that the hypothesis margin is easy to calculate, and that if a set of prototypes has a large hypothesis margin then it also has a large sample margin [14].

Fig. 1. Two types of margins in terms of the Nearest Neighbor Rule. The toy problem involves class A and class B. Margins of a new instance (the blue circle), which belongs to class A, are shown. The sample margin (left) is the distance between the new instance and the decision boundary (the Voronoi tessellation). The hypothesis margin (right) is the largest distance the sample points can travel without altering the label of the new instance. In this case it is half the difference between the distance to the nearest miss and the distance to the nearest hit.

Now we consider the over-sampling problem using the large margin principle. When adding a new minority class instance x, we consider the difference of the overall margins for the minority class:

Δ_P(x) = \sum_{a ∈ A_P} \left( θ_{A\a ∪ {x}}(a) − θ_{A\a}(a) \right),   (2)

where A\a denotes the dataset excluding a from the dataset A, and A\a ∪ {x} denotes the union of A\a and {x}.
For any a ∈ A_P, ||a − nearestmiss_{A\a ∪ {x}}(a)|| = ||a − nearestmiss_{A\a}(a)|| and ||a − nearesthit_{A\a ∪ {x}}(a)|| ≤ ||a − nearesthit_{A\a}(a)||. From Eq. (1), it follows that Δ_P(x) ≥ 0. We call Δ_P(x) the margin gain for the minority class. Further, the difference of the overall margins for the majority class is

Δ_N(x) = \sum_{a ∈ A_N} \left( θ_{A\a ∪ {x}}(a) − θ_{A\a}(a) \right).   (3)

For any a ∈ A_N, ||a − nearestmiss_{A\a ∪ {x}}(a)|| ≤ ||a − nearestmiss_{A\a}(a)|| and ||a − nearesthit_{A\a ∪ {x}}(a)|| = ||a − nearesthit_{A\a}(a)||. From Eq. (1), it follows that Δ_N(x) ≤ 0. We call −Δ_N(x) the margin loss for the majority class.

In summary, it is shown that the over-sampling methods are inherently risky from the perspective of the large margin principle. The over-sampling methods, such as SMOTE, will enlarge the nearest-neighbor based margins for the minority class while possibly decreasing the nearest-neighbor based margins for the majority class. Hence, over-sampling will not only bias towards the minority class but may also be detrimental to the majority class. We cannot completely eliminate these effects when adopting over-sampling for imbalance learning, but we can seek methods to optimize the two parts. In the simplest way, one can maximize the margins for the minority class and ignore the margin loss for the majority class, i.e., use the following formula:

f_1 = −Δ_P(x).   (4)

Alternatively, one may also minimize the margin loss for the majority class, which is

f_2 = −Δ_N(x).   (5)

One intuitive method is to seek a good balance between maximizing the margin gain for the minority class and minimizing the margin loss for the majority class. This can be conducted by minimizing Eq. (6):

f_3(x) = \frac{−Δ_N(x)}{Δ_P(x) + ε},  ε > 0,   (6)
where ε is a positive constant that ensures the denominator of Eq. (6) is non-zero.
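To make the criterion concrete, a direct (and deliberately unoptimized) numpy sketch of the hypothesis margin (1) and of the score (6) could look like this. All names are ours, and the labels are assumed to be +1 for the minority class and −1 for the majority class (our convention; the paper only specifies Y = {1, −1} in Algorithm 1 below).

import numpy as np

def hypothesis_margin(x, X, y, label):
    # theta_A(x) for the NN rule, Eq. (1): half of (nearest-miss distance - nearest-hit distance).
    d = np.linalg.norm(X - x, axis=1)
    hit = d[(y == label) & (d > 0)].min()    # nearest instance with the same label
    miss = d[y != label].min()               # nearest instance with a different label
    return 0.5 * (miss - hit)

def margin_change(x_new, X, y, target_label):
    # Sum over instances a of target_label of theta_{A\a + {x_new}}(a) - theta_{A\a}(a), Eqs. (2)-(3).
    X_aug, y_aug = np.vstack([X, x_new]), np.append(y, 1)   # x_new carries the minority label +1
    total = 0.0
    for i in np.where(y == target_label)[0]:
        before = hypothesis_margin(X[i], np.delete(X, i, axis=0), np.delete(y, i), y[i])
        after = hypothesis_margin(X[i], np.delete(X_aug, i, axis=0), np.delete(y_aug, i), y[i])
        total += after - before
    return total

def msyn_score(x_new, X, y, eps=1e-6):
    # Eq. (6): majority margin loss over minority margin gain; smaller values are better.
    gain = margin_change(x_new, X, y, target_label=1)      # Delta_P(x) >= 0
    loss = -margin_change(x_new, X, y, target_label=-1)    # -Delta_N(x) >= 0
    return loss / (gain + eps)

(Each class is assumed to contain at least two instances so that the nearest hit exists.)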
4 The Margin-Guided Synthetic Over-Sampling Algorithm
In this section we apply the above analysis to the over-sampling techniques. Without loss of generality, our algorithm is designed on the basis of SMOTE. The general idea behind it, however, can also be applied to any other over-sampling technique. Based on the analysis in the previous section, Eq. (6) is employed to decide whether a new synthetic instance is good enough to be added into the training dataset. Our new Margin-guided Synthetic Over-Sampling algorithm, MSYN for short, is given in Algorithm 1.

The major focus of MSYN is to use a margin-based guideline to select the synthetic instances. Pressure ∈ N, a natural number, is a parameter for controlling the selection pressure. In order to get (m_N − m_P) new synthetic instances, we first create (m_N − m_P) × Pressure new instances, and then select only the top (m_N − m_P) new instances according to the values of Eq. (6) and discard the rest. This selection process implicitly decides whether an original minority instance is used to create synthetic instances, as well as how many synthetic instances will be generated, which is different from SMOTE, since SMOTE generates the same number of synthetic instances for each original minority instance. Moreover, it is easy to see that the computational complexity of MSYN is O(n²), which is mainly determined by calculating the distance matrix.

Algorithm 1. MSYN
Input: Training set X with n instances (a_i, y_i), i = 1, ..., n, where a_i is an instance in the m-dimensional feature space and y_i ∈ Y = {1, −1} is the class label associated with a_i. Define m_P and m_N as the numbers of minority and majority class instances, respectively (m_P < m_N). BIN is the set of synthetic instances, initialized as empty.
Parameter: Pressure.
1. Calculate the number of synthetic instances that need to be generated for the minority class: G = (m_N − m_P) × Pressure;
2. Calculate the number of synthetic instances that need to be generated for each minority example a_i: g_i = G / m_P;
3. for each minority class instance a_i do
4.   for j ← 1 to g_i do
5.     Randomly choose one minority instance, a_zi, from the k nearest neighbors of a_i;
6.     Generate the synthetic instance a_s using the technique of SMOTE;
7.     Add a_s to BIN;
8. Sort the synthetic instances in BIN according to their values of Eq. (6);
9. return the (m_N − m_P) instances that have the minimum values of Eq. (6).
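Read as code, Algorithm 1 might be realized along the following lines. This is our own sketch under the same label convention as above; `score` is any implementation of Eq. (6), such as the msyn_score function sketched at the end of the previous section.

import numpy as np

def msyn(X, y, score, k=5, pressure=10, rng=np.random.default_rng(0)):
    # Margin-guided Synthetic Over-sampling: generate candidates, score them with Eq. (6),
    # and keep only the (m_N - m_P) best ones.
    minority = X[y == 1]
    m_p, m_n = len(minority), int(np.sum(y == -1))
    need = m_n - m_p
    g_i = (need * pressure) // m_p              # candidates generated per minority instance
    candidates = []
    for a_i in minority:
        d = np.linalg.norm(minority - a_i, axis=1)
        neighbors = np.argsort(d)[1:k + 1]      # k nearest minority neighbors of a_i
        for _ in range(g_i):
            a_nn = minority[rng.choice(neighbors)]
            candidates.append(a_i + (a_nn - a_i) * rng.uniform())   # SMOTE-style interpolation
    candidates = np.array(candidates)
    scores = np.array([score(c, X, y) for c in candidates])
    keep = np.argsort(scores)[:need]            # smallest Eq. (6) values first
    return candidates[keep]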
5 Experiment Study
The Weka’s C4.5 implementation [19] is employed in our experiments. We compare our proposed MSYN with SMOTE [11], ADASYN [13], Borderline-SMOTE [12] and ROS. All experiments were carried out using 10 runs of 10-fold crossvalidation. For MSYN, the parameter P ressure is set to 10 and the ε can be any random positive real number; for other methods, the parameters are set as recommended in the corresponding paper.
Fig. 2. The distribution of the dataset Concentric with noise (Feature 1 versus Feature 2; minority class, majority class and the true boundary are shown).
To evaluate the performance of our approach, experiments on both artificial and real datasets have been performed. The former is used to show the behavior of MSYN on known data distributions, while the latter is used to verify the utility of our method when dealing with real-world problems.

5.1 Synthetic Datasets
This part of our experiments focuses on synthetic data to analyze the characteristics of the proposed MSYN. We used the dataset Concentric from the ELENA project [20]. The Concentric dataset is a two-dimensional uniform concentric circular distribution problem with two classes. The instances of the minority class are uniformly distributed within a circle of radius 0.3 centered on (0.5, 0.5). The points of the majority class are uniformly distributed within a ring centered on (0.5, 0.5) with internal and external radii of 0.3 and 0.5, respectively. In order to investigate the problem of over-fitting to noise, we modify the dataset by randomly flipping the labels of 1% of the instances, as shown in Figure 2.

In order to show the performance of the various synthetic over-sampling techniques, we sketch them in Figure 3. The new synthetic instances created by each over-sampling method, the original majority instances and the corresponding C4.5 decision boundary are drawn. From Figure 3, we can see that MSYN shows good performance in the presence of noise while SMOTE and ADASYN suffer greatly from over-fitting the noise. MSYN generates no noise instances. This can be attributed to the fact that the margin-based Eq. (6) contains information about the neighboring instances, and this information helps to decrease the influence of noise. Both SMOTE and ADASYN generate a large number of noise instances and their decision boundaries are greatly influenced. Borderline-SMOTE generates a small number of noise instances and its decision boundary is slightly influenced.
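For reference, the noisy Concentric data used here can be regenerated roughly as follows; this is our own sketch, and the sample sizes are assumptions since the text does not state them.

import numpy as np

def concentric_with_noise(n_min=200, n_maj=800, flip=0.01, rng=np.random.default_rng(1)):
    # Minority: uniform in a disc of radius 0.3; majority: uniform in the ring 0.3-0.5;
    # both centered at (0.5, 0.5); finally a fraction `flip` of the labels is inverted.
    def uniform_annulus(n, r_in, r_out):
        theta = rng.uniform(0.0, 2.0 * np.pi, n)
        r = np.sqrt(rng.uniform(r_in ** 2, r_out ** 2, n))   # sqrt gives uniform density in area
        return np.column_stack([0.5 + r * np.cos(theta), 0.5 + r * np.sin(theta)])
    X = np.vstack([uniform_annulus(n_min, 0.0, 0.3), uniform_annulus(n_maj, 0.3, 0.5)])
    y = np.array([1] * n_min + [-1] * n_maj)
    flip_idx = rng.choice(len(y), size=int(flip * len(y)), replace=False)
    y[flip_idx] = -y[flip_idx]
    return X, y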
Fig. 3. The synthetic instances and the corresponding C4.5 decision boundaries after processing by SMOTE, MSYN, Borderline-SMOTE and ADASYN, respectively (legend: synthetic minority instances, majority instances, C4.5 boundary, true boundary).
Furthermore, Borderline-SMOTE pays little attention to interior instances and creates only a few synthetic instances.

5.2 Real World Problems
We test the algorithms on ten datasets from the UCI Machine Learning Repository [21]. Information about these datasets is summarized in Table 1, where num is the size of the dataset, attr is the number of features, and min% is the ratio of the number of minority class instances to num. Instead of using the overall classification accuracy, we adopt metrics related to the Receiver Operating Characteristic (ROC) curve [22] to evaluate the compared algorithms, because traditional overall classification accuracy may not be able to provide a comprehensive assessment of the observed learning algorithms in the case of class-imbalanced datasets [3]. Specifically, we use the AUC [22] and F-Measure [23] to evaluate the performance. We apply the Wilcoxon signed rank test with a 95% confidence level on each dataset to see whether the difference between the compared algorithms is statistically significant. Table 2 and Table 3 show the AUC and F-Measure for the datasets, respectively. The results of Table 2 reveal that MSYN wins against SMOTE on nine out of ten datasets, beats ADASYN on seven out of ten datasets, outperforms
ROS on nine out of ten datasets, and wins against Borderline-SMOTE on six out of ten datasets. The results of Table 3 show that MSYN wins against SMOTE on seven out of ten datasets, beats ADASYN on six out of ten datasets, beats ROS on six out of ten datasets, and wins against Borderline-SMOTE on six out of ten datasets. The comparisons reveal that MSYN outperforms the other methods in terms of both AUC and F-measure.

Table 1. Summary of the datasets.

Datasets       num   attr  min%
Abalone        4177  8     9.36%
Contraceptive  1473  9     22.61%
Heart          270   9     29.28%
Hypothyroid    3163  8     34.90%
Ionosphere     351   34    35.90%
Parkinsons     195   22    24.24%
Pima           768   8     34.90%
Spect          367   19    20.65%
Tic-tac-toe    958   9     34.66%
Transfusion    748   4     31.23%

Table 2. Results in terms of AUC in the experiments performed on the real datasets. For SMOTE, ADASYN, ROS and Borderline-SMOTE, if the value is underlined, MSYN has better performance than that method; if the value is starred, MSYN exhibits lower performance compared to that method; if the value is in normal style, the corresponding method does not perform significantly differently from MSYN according to the Wilcoxon signed rank test. The row W/D/L Sig. shows the number of wins, draws and losses of MSYN from the statistical point of view.

Dataset        MSYN    SMOTE   ADASYN   ROS      Borderline-SMOTE
Abalone        0.7504  0.7402  0.7352   0.6708   0.7967*
Contraceptive  0.6660  0.6587  0.6612   0.6055   0.6775*
Heart          0.7909  0.7862  0.7824   0.7608   0.7796
Hypothyroid    0.9737  0.9652  0.9655   0.9574   0.9653
Ionosphere     0.8903  0.8731  0.8773   0.8970*  0.8715
Parkinsons     0.8248  0.8101  0.8298*  0.7798   0.8157
Pima           0.7517  0.7427  0.7550   0.7236   0.7288
Spect          0.7403  0.7108  0.7157   0.6889   0.7436
Tic-tac-toe    0.9497  0.9406  0.9391   0.9396   0.9456
Transfusion    0.7140  0.6870  0.6897   0.6695   0.6991
W/D/L Sig.     N/A     9/1/0   7/2/1    9/0/1    6/2/2
Table 3. Results in terms of F-measure in the experiments performed on the real datasets. For SMOTE, ADASYN, ROS and Borderline-SMOTE, if the value is underlined, MSYN has better performance than that method; if the value is starred, MSYN exhibits lower performance compared to that method; if the value is in normal style, the corresponding method does not perform significantly differently from MSYN according to the Wilcoxon signed rank test. The row W/D/L Sig. shows the number of wins, draws and losses of MSYN from the statistical point of view.

Dataset        MSYN    SMOTE    ADASYN   ROS      Borderline-SMOTE
Abalone        0.2507  0.3266*  0.3289*  0.3479*  0.3154*
Contraceptive  0.3745  0.4034*  0.4118*  0.4133*  0.4142*
Heart          0.7373  0.7305   0.7318   0.7151   0.7223
Hypothyroid    0.8875  0.8412   0.8413   0.8771   0.9054*
Ionosphere     0.8559  0.8365   0.8338   0.8668*  0.8226
Parkinsons     0.7308  0.6513   0.6832   0.6519   0.6719
Pima           0.6452  0.6435   0.6499   0.6298   0.6310
Spect          0.4660  0.4367   0.4206   0.4644   0.4524
Tic-tac-toe    0.8619  0.8465   0.8437   0.8556   0.8604
Transfusion    0.4723  0.4601   0.4507   0.4596   0.4664
W/D/L Sig.     N/A     7/1/2    6/2/2    6/1/3    6/1/3

6 Conclusion and Future Work
This paper gives an analysis of over-sampling techniques from the viewpoint of the large margin principle. It is shown that over-sampling techniques will not only bias towards the minority class but may also bring detrimental effects to the classification of the majority class. This inherent dilemma of over-sampling cannot be entirely eliminated, but only reduced. We propose a new synthetic over-sampling method to strike a balance between the two contradictory objectives. We evaluate our new method on a wide variety of imbalanced datasets using different performance measures and compare it to the established over-sampling methods. The results support our analysis and indicate that the proposed method, MSYN, is indeed superior.

As a new sampling method, MSYN can be further extended along several directions. First of all, we have investigated the performance of MSYN using C4.5. Based on the nearest neighbor margin, MSYN has a bias towards 1-NN. Some strategies, however, can be adopted to approximate the hypothesis margin for other classification rules. For example, we can use the confidence of the classifiers' output to approximate the hypothesis margin. Thus we expect MSYN can be extended to work well with other learning algorithms, such as k-NN and RIPPER [28], but solid empirical study is required to justify this expectation. Besides, ensemble learning algorithms can improve the accuracy and robustness of the learning procedure [25]. It is thus worthwhile to integrate MSYN with ensemble learning algorithms. Such an investigation can be conducted following the methodology employed in the work of SMOTEBoost [5], DataBoost-IM [26], BalanceCascade [27], etc.
Secondly, MSYN can be generalized to multi-class imbalance learning as well. For each minority class i, a straightforward idea is to extend Eq. (6) to

f_i(x) = \frac{−\sum_{j ≠ i} Δ_{i,j}(x)}{Δ_i(x) + ε},  ε > 0,   (7)

where Δ_i(x) denotes the margin gain of minority class i when adding a new minority instance x (x belongs to class i), and −Δ_{i,j}(x) denotes the margin loss for class j when adding a new minority instance x (x belongs to class i). Then we create the synthetic instances for each minority class to make their number equal to the number of instances of the majority class, which has the maximum number of instances. However, this idea is by no means the only one. Extending a technique from binary to multi-class problems is usually non-trivial, and more in-depth investigation is necessary to seek the best strategy.
References 1. Chan, P.K., Stolfo, S.J.: Toward scalable learning with non-uniform class and cost distributions: a case study in credit card fraud detection. In: Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, pp. 164–168 (2001) 2. Kubat, M., Holte, R.C., Matwin, S.: Machine Learning for the Detection of Oil Spills in Satellite Radar Images. Machine Learning 30(2), 195–215 (1998) 3. Weisis, G.M.: Mining with Rarity: A Unifying Framwork. SiGKDD Explorations 6(1), 7–19 (2004) 4. Wu, G., Chang, E.Y.: Class-Boundary Alignment for Imbalanced Dataset Learning. In: Workshop on Learning from Imbalanced Datasets II, ICML, Washington DC (2003) 5. Chawla, N.V., Lazarevic, A., Hall, L.O., Bowyer, K.W.: Smoteboost: Improving Prediction of the Minority Class in Boosting. In: Lavraˇc, N., Gamberger, D., Todorovski, L., Blockeel, H. (eds.) PKDD 2003. LNCS (LNAI), vol. 2838, pp. 107–119. Springer, Heidelberg (2003) 6. Liu, W., Chawla, S., Cieslak, D.A., Chawla, N.V.: A Robust Decision Tree Algorithm for Imbalanced Data Sets. In: SIAM International Conf. on Data Mining (2010) 7. Zhou, Z.H., Liu, X.Y.: Training cost-sensitive neural networks with methods addressing the class imbalance problem. IEEE Transactions on Knowledge and Data Engineering, 63–77 (2006) 8. Raskutti, B., Kowalczyk, A.: Extreme re-balancing for SVMs: a case study. SIGKDD Explorations 6(1), 60–69 (2004) 9. Japkowicz, N.: The Class Imbalance Problem: Significance and Strategies. In: Proceeding of the 2000 International Conf. on Artificial Intelligence (ICAI 2000): Special Track on Inductive Learning, Las Vegas, Nevada (2000) 10. Ling, C., Li, C.: Data Mining for Direct Marketing Problems and Solutions. In: Proceeding of the Fourth International Conf. on Knowledge Discovery and Data Mining, KDD 1998, New York, NY (1998) 11. Chawla, N.V., Hall, L.O., Bowyer, K.W., Kegelmeyer, W.P.: SMOTE: Synthetic Minority Oversampling Technique. Journal of Artificial Intelligence Research 16, 321–357 (2002)
12. Han, H., Wang, W.Y., Mao, B.H.: Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning. Advances in Intelligent Computing, 878–887 (2005) 13. He, H., Bai, Y., Garcia, E.A., Li, S.: ADASYN: Adaptive Synthetic Sampling Approach for Imbalanced Learning. In: Proceeding of International Conf. Neural Networks, pp. 1322–1328 (2008) 14. Crammer, K., Gilad-Bachrach, R., Navot, A., Tishby, N.: Margin analysis of the LVQ algorithm. Advances in Neural Information Processing Systems, 479–486 (2003) 15. Gilad-Bachrach, R., Navot, A., Tishby, N.: Margin based feature selection-theory and algorithms. In: Proceeding of the Twenty-First International Conference on Machine Learning (2004) 16. He, H., Garcia, E.A.: Learning from Imbalance Data. IEEE Transaction on Knowledge and Data Engineering 21(9), 1263–1284 (2009) 17. Freund, Y., Schapire, R.: A desicion-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences 55(1), 119–139 (1997) 18. Bowyer, A.: Computing dirichlet tessellations. The Computer Journal 24(2) (1981) 19. Witten, I.H., Frank, E.: Data mining: practical machine learning tools and techniques with Java implementations. ACM SIGMOD Record 31(1), 76–77 (2002) 20. UCL machine learning group, http://www.dice.ucl.ac.be/mlg/?page=Elena 21. Asuncion, A., Newman, D.: UCI machine learning repository (2007) 22. Bradley, A.: The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition 30(7), 1145–1159 (1997) 23. Van Rijsbergen, C.J.: Information Retrieval. Butterworths, London (1979) 24. Wang, B.X., Japkowicz, N.: Imbalanced Data Set Learning with Synthetic Samples. In: Proc. IRIS Machine Learning Workshop (2004) 25. Dietterich, T.G.: Ensemble methods in machine learning. In: Kittler, J., Roli, F. (eds.) MCS 2000. LNCS, vol. 1857, pp. 1–15. Springer, Heidelberg (2000) 26. Guo, H., Viktor, H.L.: Learning from Imbalanced Data Sets with Boosting and Data Generation: the DataBoost-IM Approach. SIGKDD Explorations: Special issue on Learning from Imbalanced Datasets 6(1), 30–39 (2004) 27. Liu, X.Y., Wu, J., Zhou, Z.H.: Exploratory undersampling for class-imbalance learning. IEEE Transactions on Systems, Man and Cybernetics - Part B: Cybernetics 39(2), 539–550 (2009) 28. Cohen, W.: Fast Effective Rule Induction. In: Proceeding of 12th International Conf. on Machine Learning, Lake Tahoe, CA, pp. 115–123. Morgan Kaufmann, San Francisco (1995)
Improving k Nearest Neighbor with Exemplar Generalization for Imbalanced Classification

Yuxuan Li and Xiuzhen Zhang

School of Computer Science and Information Technology, RMIT University, Melbourne, Australia
{li.yuxuan,xiuzhen.zhang}@rmit.edu.au
Abstract. A k nearest neighbor (kNN) classifier classifies a query instance to the most frequent class of its k nearest neighbors in the training instance space. For imbalanced class distributions, a query instance is often overwhelmed by majority class instances in its neighborhood and is likely to be classified to the majority class. We propose to identify exemplar minority class training instances and generalize them to Gaussian balls as concepts for the minority class. Our k Exemplar-based Nearest Neighbor (kENN) classifier is therefore more sensitive to the minority class. Extensive experiments show that kENN significantly improves the performance of kNN and also outperforms popular re-sampling and cost-sensitive learning strategies for imbalanced classification.

Keywords: Cost-sensitive learning, imbalanced learning, kNN, re-sampling.

1 Introduction
Skewed class distribution is common for many real-world classification problems, including for example, detection of software defects in software development projects [14], identification of oil spills in satellite radar images [11], and detection of fraudulent calls [9]. Typically the intention of classification learning is to achieve accurate classification for each class (especially the rare class) rather than an overall accuracy without distinguishing classes. In this paper our discussions will be focused on the two-class imbalanced classification problem, where the minority class is the positive and the majority class is the negative. Imbalanced class distribution has been reported to impede the performance of many concept learning systems [20]. Many systems adopt the maximum generality bias [10] to induct a classification model, where the concept1 of the least number of conditions (and therefore the most general concept) is chosen to describe a cluster of training instances. In the presence of class imbalance this induction bias tends to over-generalize concepts for the majority class and miss concepts for the minority class. In formulating decision trees [17] for example, 1
The term concept is used in its general sense. Strictly in the context of classification learning it is a subconcept of the complete concept for some class.
induction may stop at a node where the class for the node is decided by the majority of instances under the node and instances of the minority class are ignored. In contrast to most concept learning systems, k nearest neighbor (kNN) classification [6,1,2], or instance-based learning, does not formulate a generalized conceptual model from the training instances at the training stage. Rather, at the classification stage, a simple and intuitive rule is used to make decisions: instances close in the input space are likely to belong to the same class. Typically a kNN classifier classifies a query instance to the class that appears most frequently among its k nearest neighbors. k is a parameter for tuning the classification performance and is typically set to three to seven. Although instance-based learning has been advocated for imbalanced learning [10,19,3], to the best of our knowledge, a large-scale study of applying kNN classification to imbalanced learning has not been reported in the literature. Most research efforts in this area have been on trying to improve its classification efficiency [1,2,21]. Various strategies have been proposed to avoid an exhaustive search over all training instances and to achieve accurate classification.

In the presence of class imbalance, kNN classification also faces challenges to correctly detect the positive instances. For a query instance, if its neighborhood is overwhelmed by negative instances, positive instances are still likely to be ignored in the decision process. Our main idea to mitigate the decision errors is to introduce a training stage to generalize the positive instances from a point to a Gaussian ball in the instance space. Rather than generalizing every positive instance, which may introduce false positives, we propose an algorithm to identify the exemplar positive instances, called pivot positive instances (cf. Section 3), and use them to reliably derive the positive class boundary. Experiments on 12 real-world imbalanced datasets show that our classifier, k Exemplar-based Nearest Neighbor (kENN), is effective and significantly improves the performance of kNN for imbalanced learning. kENN also outperforms the current re-sampling and cost-sensitive learning strategies, namely SMOTE [5] and MetaCost [7], for imbalanced classification.
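For reference, the standard kNN decision rule just described amounts to the following (a minimal numpy sketch of ours):

import numpy as np

def knn_classify(x, X_train, y_train, k=3):
    # Classify x as the most frequent class among its k nearest training instances.
    d = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(d)[:k]
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]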
1.1 Related Work
kNN has been advocated for learning with minority instances [10,19,3] because of its high specificity bias of keeping all minority instances. In [10], the problem of small disjuncts (a small cluster of training instances) was first studied and the maximum specificity bias was shown to reduce errors for small disjuncts. In [19], kNN was used to improve learning from small disjuncts encountered in the C4.5 decision tree system [17]. In [3], kNN was employed to learn from small disjuncts directly; however, learning results were not provided to demonstrate its performance. In contrast to these previous works, we propose to generalize exemplar positive instances to form concepts for the minority class, and then apply kNN directly for imbalanced classification. With the assumption that learning algorithms perform best when classes are evenly distributed, re-sampling training data for an even class distribution has been proposed to tackle imbalanced learning. Kubat and Matwin [12] tried to
under-sample the majority class, while Ling and Li [13] combined over-sampling of the minority class with under-sampling of the majority class. Especially, Chawla and Bowyer [5] proposed the Synthetic Minority Over-sampling Technique (SMOTE) to over-sample the minority class by creating synthetic samples. It was shown that SMOTE over-sampling of the minority class in combination with under-sampling of the majority class could often achieve effective imbalanced learning.

Another popular strategy for tackling the imbalanced distribution problem is cost-sensitive learning [8]. Domingos [7] proposed a re-costing method called MetaCost, which can be applied to general classifiers. The approach makes error-based classifiers cost-sensitive. His experimental results showed that MetaCost reduced costs compared to a cost-blind classifier using C4.5Rules as the baseline. Our experiments (cf. Section 5) show that SMOTE in combination with under-sampling of the majority class, as well as MetaCost, significantly improves the performance of C4.5 for imbalanced learning. However, these strategies somehow do not statistically significantly improve the performance of kNN for class imbalance. This may be partly explained by the fact that kNN makes its classification decision by examining the local neighborhood of query instances, where the global re-sampling and cost-adjustment strategies may not have a pronounced effect.

Fig. 1. An artificial imbalance classification problem (regions P1, P2, P3 and N1 labeled).
2 Main Ideas
Fig. 1 shows an artificial two-class imbalance problem, where positive instances are denoted as “+” and negative instances are denoted as “-”. True class boundaries are represented as solid lines while the decision boundaries by some classification model are represented as dashed lines. Four query instances that indeed belong to the positive class are represented as stars (*). Three subconcepts associated with the positive class are the three regions formed by the solid lines, denoted as P1, P2 and P3 respectively. Subconcept P1 covers a large portion of instances in the positive instance space whereas P2 and P3 correspond to small
Fig. 2. The Voronoi diagram for the subspace of subconcept P3 of Fig. 1: (a) Standard 1NN; (b) Exemplar 1NN.
disjuncts of positive instances. Note that the lack of data for the subconcepts P2 and P3 causes the classification model to learn inappropriate decision boundaries for P2 and P3. As a result, two query instances (denoted by *) that are indeed positive as defined by P2 fall outside the positive decision boundary of the classifier, and similarly for another query instance defined as positive by P3. Given the problem in Fig. 1, we illustrate the challenge faced by a standard kNN classifier using the subspace of instances at the lower right corner. Figure 2(a) shows the Voronoi diagram for subconcept P3 in this subspace, where the positive class boundary decided by standard 1NN is represented as the polygon drawn in bold lines. The 1NN induction strategy, where the class of an instance is decided by the class of its nearest neighbor, results in a class boundary much smaller than the true class boundary (circle). As a result the query instance (denoted by *), which is in fact a positive instance inside the true positive boundary, is predicted as negative by standard 1NN. Obviously, to achieve more accurate prediction, the decision boundary for the positive class should be expanded so that it is closer to the true class boundary. A naive approach to expanding the decision boundary for the positive class is to generalize every positive instance in the training instance space from a point to a Gaussian ball. However, this aggressive approach to expanding the positive boundary will almost certainly introduce false positives. We need a strategy that selectively expands some positive points in the training instance space so that the decision boundary closely approximates the real class boundary without introducing too many false positives. Our main idea for expanding the decision boundary of the positive class while minimizing false positives is based on exemplar positive instances. Exemplar positive instances should be the positive instances that can be generalized to reliably classify more positive instances in independent tests. Intuitively these instances should include the strong positive instances at or close to the center of a disjunct of positive instances in the training instance space, while weak positive instances close to the class boundaries should be excluded. Fig. 2(b) shows the Voronoi diagram after the three positive instances at the center of the disjunct of positive instances have been used to expand the boundary for the positive class. Obviously the decision boundary after adjustment is
much closer to the real class boundary. As a result, the query instance (represented by *) is now enclosed by the boundary decided by the classifier and is correctly predicted as positive.
3
Pivot Positive Instances
Ideally, exemplar positive instances can be reliably generalized to form the subconcept for a disjunct of positive instances with low false positive errors in the space of training instances. We call these exemplar instances pivot positive instances (PPIs) and define them using their neighborhood.

Definition 1. The Gaussian ball B(x, r) centered at an instance x in the training instance space R^n (n is the number of features defining the space) is the set of instances within distance r of x: {y ∈ R^n | distance(x, y) ≤ r}.

Each Gaussian ball defines a positive subconcept, and only those positive instances that can form sufficiently accurate positive subconcepts are pivot positive instances, as defined below.

Definition 2. Given a training instance space R^n and a positive instance x ∈ R^n, let the distance between x and its nearest positive neighbor be e. For a false positive error rate (FP rate) threshold δ, x is a pivot positive instance (PPI) if the subconcept for the Gaussian ball B(x, e) has an FP rate ≤ δ.

For simplicity, the FP rate for the Gaussian ball centered at a positive instance is called the FP rate for the positive instance. To explain the concept of PPI, let us for now assume that the false positive rate for a positive instance is its observed false positive ratio in the training instance space.

Example 1. Consider the positive instances in the subspace highlighted in Fig. 1. Given a false positive rate threshold of 30%, the three positive instances at the center have zero false positives in their Gaussian balls (shown in enlarged form in Fig. 2(b)) and therefore are PPIs. The other two positive instances, however, are not PPIs, as they have observed false positive ratios of 50% (2 out of 4) and 33.3% (1 out of 3) respectively.

The observed false positive ratio for a positive instance in the training instance space is not an accurate description of its performance on independently chosen test instances. We estimate the false positive rate by re-adjusting the observed false positive ratio using a pessimistic estimate. A similar approach has been used to estimate the error rate of decision tree nodes in C4.5 [17,22]. Assume that the number of false positives in a Gaussian ball of N instances follows the binomial distribution B(N, p), where p is the real probability of false positives in the Gaussian ball. For a given confidence level c, whose corresponding z-value can be computed, p can be estimated from N and the observed false positive ratio f as follows [22]:

p = (f + z^2/2N + z * sqrt(f(1 - f)/N + z^2/4N^2)) / (1 + z^2/N)    (1)
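For concreteness, the pessimistic estimate in Equation (1) can be computed in a few lines of Python. The sketch below is our own illustrative reading of the formula; the variable names and the use of scipy.stats.norm to turn the confidence level into a one-sided z-value are assumptions rather than part of the original implementation:

    from math import sqrt
    from scipy.stats import norm

    def pessimistic_fp_rate(f, n, confidence=0.10):
        """Pessimistic (upper-bound) estimate of the FP rate, Equation (1):
        f is the observed FP ratio over n instances in a Gaussian ball."""
        z = norm.ppf(1.0 - confidence)  # e.g. c = 10% gives z of roughly 1.28
        numerator = f + z**2 / (2*n) + z * sqrt(f*(1 - f)/n + z**2 / (4*n**2))
        return numerator / (1 + z**2 / n)

    # A Gaussian ball of 3 instances with 1 observed false positive (f = 1/3):
    print(pessimistic_fp_rate(1/3, 3))

(When a ball has no observed errors, the paper instead uses the bound 1 - c^{1/N}; see footnote 2 below.)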
Algorithm 1. Compute the set of positive pivot points
Input: a) Training set T (|T| is the number of instances in T); b) confidence level c.
Output: The set of pivot positive instances P (with radius r for each Gaussian ball)
1: δ ← FP rate threshold by Equation (1) from c, |T|, and prior negative frequency
2: P ← ∅
3: for each positive instance x ∈ T do
4:   G ← neighbors of x in increasing order of distance to x
5:   for k = 1 to |G| do
6:     if G[k] is a positive instance then
7:       break  {;; G[k] is the nearest positive neighbor of x}
8:     end if
9:   end for
10:  r ← distance(x, G[k])
11:  f ← (k - 1)/(k + 1)  {;; Gaussian ball B(x, r) has k + 1 instances and (k + 1 - 2) FPs}
12:  p ← the FP rate by Equation (1) from c, k and f
13:  if p ≤ δ then
14:    P ← P ∪ {x}  {;; x is a pivot positive instance, and P is the output}
15:  end if
16: end for
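To make Algorithm 1 concrete, a compact Python sketch is given below. It assumes Euclidean distances, assumes the training set contains at least two positive instances, reuses the pessimistic_fp_rate function sketched above, and is our own illustrative rendering rather than the authors' implementation (in particular, passing the ball size k + 1 to Equation (1) is our reading of line 12):

    import numpy as np

    def pivot_positive_instances(X, y, confidence=0.10):
        """Return (index, radius) pairs of pivot positive instances.
        X: (n, d) feature matrix; y: labels with 1 = positive (minority) class."""
        n = len(y)
        neg_freq = np.mean(y == 0)                            # prior negative class frequency
        delta = pessimistic_fp_rate(neg_freq, n, confidence)  # FP rate threshold (line 1)
        pivots = []
        for i in np.where(y == 1)[0]:
            dist = np.linalg.norm(X - X[i], axis=1)
            order = np.argsort(dist)[1:]                      # neighbours by distance, x itself excluded
            k = next(j for j, idx in enumerate(order, 1) if y[idx] == 1)
            radius = dist[order[k - 1]]                       # distance to the nearest positive neighbour
            f = (k - 1) / (k + 1)                             # the ball holds k+1 instances, k-1 of them FPs
            if pessimistic_fp_rate(f, k + 1, confidence) <= delta:
                pivots.append((i, radius))
        return pivots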
The Gaussian ball for a positive instance always contains two positive instances: the reference positive instance and its nearest positive neighbor. The confidence level is a parameter for tuning the performance of PPIs. A high confidence level means the estimated false positive error rate is close to the observed false positive ratio, and thus few false positives are tolerated in identifying PPIs. On very imbalanced data we need to tolerate a large number of false positives to aggressively identify PPIs and achieve high sensitivity for the positives. Our experiments (Section 5.3) confirm this hypothesis. The default confidence level is set to 10%. We set the FP rate threshold for identifying PPIs based on the imbalance level of the training data. The threshold for PPIs is dynamically determined by the prior negative class frequency. If the false positive rate for a positive instance estimated using Equation (1) is not greater than the threshold estimated from the prior negative class frequency, the positive instance is a PPI. Under this setting, a relatively larger number of FP errors are allowed in Gaussian balls for imbalanced data, while fewer errors are allowed for balanced data. In particular, on very balanced data the PPI mechanism is turned off and kENN reverts to standard kNN. For example, on a balanced dataset of 50 positive instances and 50 negative instances, at a confidence level of 10%, the FP rate threshold for PPIs is 56.8% (estimated from the 50% negative class frequency using Equation (1)). A Gaussian ball without any observed FP errors (and containing only 2 positive instances) has an estimated FP rate of 68.4%². As a result no PPIs are identified at the training stage and standard kNN classification is applied.
² Following standard statistics, when there are no observed errors, for N instances at confidence level c the estimated error rate is 1 - c^{1/N}.
4
k Exemplar-Based Nearest Neighbor Classification
We now describe the algorithm that identifies pivot positive instances at the training stage, and how the pivot positive instances are used in nearest neighbor classification. The complete process of computing pivot positive instances from a given set of training instances is given in Algorithm 1. Input to the algorithm are the training instances and a confidence level c. Output of the algorithm are the pivot positive instances together with their radii for generalization to Gaussian balls. In the algorithm, the FP rate threshold δ is first computed using Equation (1) from the confidence level c, the number of training instances |T| and the prior negative class frequency (line 1). The neighbors of each positive instance x are sorted in increasing order of their distance to x (line 4). The loop of lines 3 to 16 computes the PPIs and accumulates them in P, which is output (line 14). Inside the loop, the FP rate p for each positive instance x is computed using Equation (1) from the observed FP ratio f (line 12), and if p ≤ δ, x is identified as a PPI and kept in P (line 14). The main computation of Algorithm 1 lies in computing up to n - 1 nearest neighbors for each positive instance x and sorting them according to their distance to x, where n is the size of the training set (line 4). Algorithm 1 thus has a complexity of O(p · n log n), where p and n are respectively the number of positive instances and of all instances in the training set. Note that p ≪ n for imbalanced datasets, so the algorithm has reasonable time efficiency. At the classification stage, to implement the concept of Gaussian balls for pivot positive instances, the distance of a query instance to its k nearest neighbors is adjusted for all PPIs. Specifically, for a query instance t and a training instance x, the adjusted distance between t and x is defined as:

adjusted_distance(t, x) = distance(t, x) - x.radius   if x is a PPI,
                          distance(t, x)              otherwise,        (2)

where distance(t, x) is the distance between t and x under some metric of standard kNN. With the above equation, the distance between a query instance and a PPI in the training instance space is reduced by the radius of the PPI. As a result, the adjusted distance is conceptually equivalent to the distance of the query instance to the edge of the Gaussian ball centered at the PPI. The adjusted distance function defined in Equation (2) can be used in kNN classification in the presence of class imbalance, and we call the resulting classifier k Exemplar-based Nearest Neighbor (kENN).
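As an illustration of the classification stage, a minimal Python sketch of kENN prediction with the adjusted distance of Equation (2) is shown below; Euclidean distance and simple majority voting are assumed, X_train and y_train are NumPy arrays, and `pivots` is the (index, radius) list produced by the training sketch given after Algorithm 1:

    import numpy as np
    from collections import Counter

    def kenn_predict(X_train, y_train, pivots, x_query, k=3):
        """Classify x_query with kENN: distances to pivot positive instances
        are reduced by their Gaussian-ball radius (Equation (2))."""
        radius = dict(pivots)                          # training index -> ball radius
        dist = np.linalg.norm(X_train - x_query, axis=1)
        for i, r in radius.items():
            dist[i] -= r                               # adjusted distance for PPIs
        neighbours = np.argsort(dist)[:k]
        return Counter(y_train[neighbours]).most_common(1)[0][0]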
5
Experiments
We conducted experiments to evaluate the performance of kENN. kENN was compared against kNN and the naive approach of generalizing positive subconcepts for kNN (Section 2). kENN was also compared against two popular imbalanced learning strategies SMOTE re-sampling and MetaCost cost-sensitive
Table 1. The experimental datasets, ordered in decreasing level of imbalance

Dataset        size   #attr (num, symb)   classes (pos, neg)   minority (%)
Oil            937    47 (47, 0)          (true, false)        4.38%
Hypo-thyroid   3163   25 (7, 18)          (true, false)        4.77%
PC1            1109   21 (21, 0)          (true, false)        6.94%
Glass          214    9 (9, 0)            (3, other)           7.94%
Satimage       6435   36 (36, 0)          (4, other)           9.73%
CM1            498    21 (21, 0)          (true, false)        9.84%
New-thyroid    215    5 (5, 0)            (3, other)           13.95%
KC1            2109   21 (21, 0)          (true, false)        15.46%
SPECT F        267    44 (44, 0)          (0, 1)               20.60%
Hepatitis      155    19 (6, 13)          (1, 2)               20.65%
Vehicle        846    18 (18, 0)          (van, other)         23.52%
German         1000   20 (7, 13)          (2, 1)               30.00%
learning, using kNN (IBk in WEKA) and C4.5 (J48 in WEKA) as the base classifiers. All classifiers were developed based on the WEKA data mining toolkit [22], and are available at http://www.cs.rmit.edu.au/~zhang/ENN. For both kNN and kENN, k was set to 3 by default, and the confidence level of kENN was set to 10%. To increase the sensitivity of C4.5 to the minority class, C4.5 was run with the -M1 option (a minimum of one instance allowed for a leaf node) and without pruning. SMOTE over-sampling combined with under-sampling was applied to 3NN and C4.5, denoted as 3NNSmt+ and C4.5Smt+ respectively. SpreadSubsample was used to under-sample the majority class for a uniform distribution (M=1.0), and then SMOTE was applied to generate an additional 3 times more instances for the minority class. MetaCost was used for cost-sensitive learning with 3NN and C4.5 (denoted as 3NNMeta and C4.5Meta), and the cost of each class was set to the inverse of the class ratio. Table 1 summarizes the 12 real-world imbalanced datasets from various domains used in our experiments, from highly imbalanced (minority 4.38%) to moderately imbalanced (minority 30.00%). The Oil dataset was provided by Robert Holte [11], and the task is to detect oil spills (4.3%) from satellite images. The CM1, KC1 and PC1 datasets were obtained from the NASA IV&V Facility Metrics Data Program (MDP) repository (http://mdp.ivv.nasa.gov/index.html). The task is to predict software defects (around 10% on average) in software modules. The remaining datasets were compiled from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml). In addition to the natural 2-class domains, like thyroid disease diagnosis and Hepatitis, we also constructed four imbalanced datasets by choosing one class as the positive and the remaining classes combined as the negative. The Receiver Operating Characteristic (ROC) curve [18] is becoming widely used to evaluate imbalanced classification. Given a confusion matrix of four types of decisions, True Positive (TP), False Positive (FP), True Negative (TN) and False Negative (FN), ROC curves depict the tradeoff between TP rate = TP/(TP + FN) and FP rate = FP/(FP + TN). Good classifiers can achieve a high TP rate at a low
Table 2. The AUC for kENN, in comparison with other systems. The best result for each dataset is in bold. AUCs with difference <0.005 are considered equivalent.

Dataset        3ENN   Naive  3NN    3NNSmt+  3NNMeta  C4.5   C4.5Smt+  C4.5Meta
Oil            0.811  0.788  0.796  0.797    0.772    0.685  0.771     0.764
Hypo-thyroid   0.846  0.831  0.849  0.901    0.846    0.924  0.948     0.937
PC1            0.806  0.786  0.756  0.755    0.796    0.789  0.728     0.76
Glass          0.749  0.623  0.645  0.707    0.659    0.696  0.69      0.754
Satimage       0.925  0.839  0.918  0.902    0.928    0.767  0.796     0.765
CM1            0.681  0.606  0.637  0.666    0.625    0.607  0.666     0.668
New-thyroid    0.99   0.945  0.939  0.972    0.962    0.927  0.935     0.931
KC1            0.794  0.732  0.759  0.756    0.779    0.64   0.709     0.695
SPECT F        0.767  0.728  0.72   0.725    0.735    0.626  0.724     0.643
Hepatitis      0.783  0.71   0.758  0.772    0.744    0.753  0.713     0.745
Vehicle        0.952  0.945  0.969  0.942    0.956    0.921  0.926     0.929
German         0.714  0.677  0.69   0.686    0.705    0.608  0.649     0.606
Average        0.818  0.768  0.786  0.798    0.792    0.745  0.771     0.766
FP rate. Area Under the ROC Curve (AUC) measures the overall classification performance [4], and a perfect classifier has an AUC of 1.0. All results reported next were obtained from 10-fold cross validation, and two-tailed paired t-tests at the 95% confidence level were used to test statistical significance. The ROC convex hull method provides a visual performance analysis of classification algorithms at different levels of sensitivity [15,16]. In the ROC space, each point of the ROC curve for a classification algorithm corresponds to a classifier. If a point falls on the convex hull of all ROC curves, the corresponding classifier is potentially an optimal classifier; otherwise the classifier is not optimal. Given a classification algorithm, the higher the fraction of its ROC curve points lying on the convex hull, the more likely the algorithm is to produce optimal classifiers. For all results reported next, data points for the ROC curves were generated using the ThresholdCurve module of WEKA; they correspond to the numbers of TPs and FPs that result from setting various thresholds on the probability of the positive class. The AUC values for the ROC curves were obtained using the Mann-Whitney statistic in WEKA. The convex hulls of the ROC curves were computed using the ROCCH package³.
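As a small, self-contained reference for these measures, the Python snippet below computes the per-threshold (FP rate, TP rate) points and the AUC for a set of scored instances; scikit-learn is used here purely for illustration, whereas the paper itself relies on WEKA's ThresholdCurve and Mann-Whitney statistic:

    from sklearn.metrics import roc_curve, roc_auc_score

    # y_true: 1 for positive (minority) instances, 0 for negative ones;
    # y_score: the classifier's estimated probability of the positive class.
    y_true  = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
    y_score = [0.9, 0.4, 0.1, 0.7, 0.3, 0.2, 0.8, 0.6, 0.1, 0.05]

    fpr, tpr, thresholds = roc_curve(y_true, y_score)  # one (FP rate, TP rate) point per threshold
    print("AUC =", roc_auc_score(y_true, y_score))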
5.1 Performance Evaluation Using AUC
Table 2 shows the AUC results for all models. It can be seen that 3ENN is a very competitive model. Compared with the remaining models, 3ENN has the highest average AUC of 0.818 and wins on 9 datasets. In comparison, the average AUC for the Naive method is just 0.768. 3ENN significantly outperforms all of 3NN (p = 0.005), 3NNSmt+ (p = 0.029), 3NNMeta (p = 0.008), C4.5Smt+ (p = 0.014) and C4.5Meta (p = 0.021). This result confirms that our exemplar-based positive concept generalization strategy is very effective for improving the
³ Available at http://home.comcast.net/~tom.fawcett/public_html/ROCCH/
Fig. 3. ROC curves with convex hull on two datasets: (a) New-thyroid and (b) German. The x-axis is the FP rate and the y-axis is the TP rate. Curves are shown for 3ENN, C4.5Smt+, 3NNSmt+ and 3NNMeta; points on the convex hull are highlighted with a large circle.
performance of kNN for imbalanced classification; furthermore, the strategy is more effective than the re-sampling and cost-sensitive learning strategies. It should be noted that C4.5Smt+ and C4.5Meta both demonstrate improvement over C4.5. This shows that re-sampling and cost-sensitive learning strategies are effective for improving the performance of C4.5 for imbalanced classification, which is consistent with previous findings [5,20]. On the other hand, however, 3NNSmt+ and 3NNMeta do not show significant improvement over 3NN. That these strategies are less effective on kNN for class imbalance may be attributed to the fact that kNN adopts a maximum specificity induction strategy. For example, the re-sampling strategy ensures overall class balance; however, this does not necessarily mean that the minority class is well represented in the neighborhood of individual query instances. Because kNN does not form concepts from the overall even class distribution after re-sampling, it may still miss some positive query instances whose neighborhoods under-represent the positive class.
5.2 The ROC Convex Hull Analysis
Table 2 has shown that C4.5Smt+ outperforms C4.5Meta, so for readability we only compare the ROC curves of 3ENN against those of 3NNSmt+, 3NNMeta and C4.5Smt+. Fig. 3 shows the ROC curves of the four models on the New-thyroid and German datasets. From Table 2, 3ENN and 3NNSmt+ have the best AUC results of 0.99 and 0.972 on New-thyroid, which has a relatively high level of imbalance of 13.95%. But as shown in Fig. 3(a), the ROC curves of the four models show very different trends. Notably, more points of 3ENN lie on the convex hull at low FP rates (<10%). Conversely, more points of 3NNSmt+ lie on the convex hull at high FP rates (>50%). In many applications it is desirable to achieve accurate prediction at a low false positive rate, and so 3ENN is obviously a good choice for this purpose. German has a moderate imbalance level of 30%.
Fig. 4. The AUC of 3ENN with varying confidence level. The x-axis is the confidence level (%) and the y-axis is the AUC; curves are shown for the Oil, Glass, KC1 and German datasets.
ROC curves of the four models demonstrate similar trends on German, as shown in Fig. 3(b). Still, at low FP rates, more points from 3ENN lie on the ROC convex hull, which again shows that 3ENN is a strong model.
5.3 The Impact of Confidence Level on kENN
As discussed in Section 3, the confidence level affects kENN's decision of whether to generalize a positive instance to a Gaussian ball. We applied 3ENN to two highly imbalanced datasets and two moderately imbalanced datasets with confidence levels from 1% to 50%. The AUC results are shown in Fig. 4. For the two datasets with high imbalance (Oil 4.38% and Glass 7.94%), AUC is negatively correlated with the confidence level. For example, on Oil, when the confidence level increases from 1% to 50% the AUC decreases from 0.813 to 0.801. However, for the two datasets with moderate imbalance (KC1 15.46% and German 30.00%), AUC is positively correlated with the confidence level. On German, when the confidence level increases from 1% to 50%, the AUC increases from 0.69 to 0.718. The opposite behavior of AUC in relation to the confidence level may be explained as follows: on highly imbalanced data, to predict more positive instances it is desirable to tolerate more false positives in forming Gaussian balls, which is achieved by setting a low confidence level. Such an aggressive strategy increases the sensitivity of kENN to positive instances. On less imbalanced datasets, where there are relatively sufficient positive instances, a high confidence level is desired to ensure a low level of false positives in the positive Gaussian balls.
6
Conclusions
With kNN classification, the class of a query instance is decided by the majority class of its k nearest neighbors. In the presence of class imbalance, a query instance is often classified as belonging to the majority class and as a result many positive (minority class) instances are misclassified. In this paper, we have proposed a training stage where exemplar positive training instances are identified
and generalized into Gaussian balls as concepts for the minority class. When classifying a query instance using its k nearest neighbors, the positive concepts formulated at the training stage ensure that classification is more sensitive to the minority class. Extensive experiments have shown that our strategy significantly improves the performance of kNN and also outperforms popular re-sampling and cost-sensitive learning strategies for imbalanced learning.
References
1. Aha, D.W. (ed.): Lazy learning. Kluwer Academic Publishers, Dordrecht (1997)
2. Aha, D.W., et al.: Instance-based learning algorithms. Machine Learning 6 (1991)
3. Bosch, A., et al.: When small disjuncts abound, try lazy learning: A case study. In: BDCML (1997)
4. Bradley, A.P.: The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition 30 (1997)
5. Chawla, N.V., et al.: SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research 16 (2002)
6. Cover, T., Hart, P.: Nearest neighbor pattern classification. Institute of Electrical and Electronics Engineers Transactions on Information Theory 13 (1967)
7. Domingos, P.: MetaCost: A general method for making classifiers cost-sensitive. In: KDD 1999 (1999)
8. Elkan, C.: The foundations of cost-sensitive learning. In: IJCAI (2001)
9. Fawcett, T., Provost, F.J.: Adaptive fraud detection. Data Mining and Knowledge Discovery 1(3) (1997)
10. Holte, R.C., et al.: Concept learning and the problem of small disjuncts. In: IJCAI 1989 (1989)
11. Kubat, M., et al.: Machine learning for the detection of oil spills in satellite radar images. Machine Learning 30(2-3) (1998)
12. Kubat, M., Matwin, S.: Addressing the curse of imbalanced training sets: One-sided selection. In: ICML 1997 (1997)
13. Ling, C., et al.: Data mining for direct marketing: Problems and solutions. In: KDD 1998 (1998)
14. Menzies, T., et al.: Data mining static code attributes to learn defect predictors. IEEE Transactions on Software Engineering 33 (2007)
15. Provost, F., et al.: The case against accuracy estimation for comparing induction algorithms. In: ICML 1998 (1998)
16. Provost, F.J., Fawcett, T.: Robust classification for imprecise environments. Machine Learning 42(3) (2001)
17. Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann, San Francisco (1993)
18. Swets, J.: Measuring the accuracy of diagnostic systems. Science 240(4857) (1988)
19. Ting, K.: The problem of small disjuncts: its remedy in decision trees. In: Canadian Conference on Artificial Intelligence (1994)
20. Weiss, G.M.: Mining with rarity: a unifying framework. SIGKDD Explorations 6(1) (2004)
21. Wilson, D.R., Martinez, T.R.: Reduction techniques for instance-based learning algorithms. Machine Learning (2000)
22. Witten, I.H., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, San Francisco (2005)
Sample Subset Optimization for Classifying Imbalanced Biological Data
Pengyi Yang 1,2,3, Zili Zhang 4,5,*, Bing B. Zhou 1,3, and Albert Y. Zomaya 1,3
1 School of Information Technologies, University of Sydney, NSW 2006, Australia
2 NICTA, Australian Technology Park, Eveleigh, NSW 2015, Australia
3 Centre for Distributed and High Performance Computing, University of Sydney, NSW 2006, Australia
4 Faculty of Computer and Information Science, Southwest University, CQ 400715, China
5 School of Information Technology, Deakin University, VIC 3217, Australia
[email protected], [email protected]
Abstract. Data in many biological problems are often compounded by imbalanced class distribution; that is, the positive examples may be largely outnumbered by the negative examples. Many classification algorithms such as the support vector machine (SVM) are sensitive to data with imbalanced class distribution and produce suboptimal classification results. It is desirable to compensate for the imbalance effect in model training in order to obtain more accurate classification. In this study, we propose a sample subset optimization technique for classifying biological data with moderate and extremely high imbalanced class distributions. By using this optimization technique with an ensemble of SVMs, we build multiple roughly balanced SVM base classifiers, each trained on an optimized sample subset. The experimental results demonstrate that the ensemble of SVMs created by our sample subset optimization technique can achieve higher area under the ROC curve (AUC) values than popular sampling approaches such as random over-/under-sampling and SMOTE sampling, and than widely used ensemble approaches such as bagging and boosting.
1
Introduction
Modern molecular biology is rapidly advanced by the increasing use of computational techniques. For tasks such as RNA gene prediction [1], promoter recognition [2], splice site identification [3], and the classification of protein localization sites [4], it is often necessary to address the problem of imbalanced class distribution because the datasets extracted from those biological systems are likely to contain a large number of negative examples (referred to as majority class) and a small number of positive examples (referred to as minority class). Many popular classification algorithms such as support vector machine (SVM) have been applied to a large variety of bioinformatics problems including those mentioned above (e.g. refs. [1,3,4]). However, most of these algorithms are sensitive to the
Corresponding author.
imbalanced class distribution and may not perform well if directly applied to imbalanced data [5,6]. Sampling is a popular approach to addressing imbalanced class distribution [7]. Simple methods such as random under-sampling and random over-sampling are routinely applied in many bioinformatics studies [8]. With random under-sampling, the size of the majority class is reduced to compensate for the imbalance, whereas with random over-sampling, the size of the minority class is increased to compensate for the imbalance. Although they are straightforward and computationally efficient, these two methods are prone to either increased noise and duplicated samples or removal of informative samples [9]. A more sophisticated approach known as SMOTE synthesizes "new" samples from the original samples in the dataset [10]. However, many bioinformatics problems present several thousands of samples with a highly imbalanced class distribution, and applying SMOTE will introduce a large number of synthetic samples which may increase the data noise substantially. Alternatively, a cost-metric can be specified to force the classifier to pay more attention to the minority class [11]; this requires choosing a correct cost-metric, which is often unknown a priori. Several recent studies found that ensemble learning could improve the performance of a single classifier in imbalanced data classification [6,12]. In this study, we explore along this direction. In particular, we introduce a sample subset optimization technique for 'intelligent under-sampling' in imbalanced data classification. Using this technique, we designed an ensemble of SVMs specifically for learning from imbalanced biological datasets. This system has several advantages over conventional ones:
– It creates each base classifier using a roughly balanced training subset with built-in intelligent under-sampling. This is important in learning from imbalanced data because it reduces the risk of bias towards one class while neglecting the other.
– The system embraces an ensemble framework in which multiple roughly balanced training subsets are created to train an ensemble of classifiers. Thus, it reduces the risk of removing informative samples from the majority class, which may occur when a simple under-sampling technique is applied.
– As opposed to random sampling, the sample subset optimization technique is applied to identify optimal sample subsets. This may improve the quality of the base classifiers and result in a more accurate ensemble.
– The aforementioned biological problems often present several thousands of training samples. The proposed technique is essentially an under-sampling approach; it avoids the introduction of data noise, and the generated data subsets may be more efficient for classifier training.
The rest of the paper discusses the details of the proposed sample subset optimization technique and the associated ensemble learning system. Section 2 presents the ensemble learning system. Section 3 describes the main idea of sample subset optimization. The base classifier and fitness function of the ensemble system are described in Section 4. Comparisons with typical sampling and ensemble methods are given in Section 5. Section 6 concludes the paper.
2
Ensemble System
Ensemble learning is an effective approach for improving the prediction accuracy of a single classification algorithm. Such an improvement is commonly achieved by using multiple classifiers (known as the base classifiers), each trained on a subset of samples created by random sampling, as used in bagging [13], or by cost-sensitive sampling, as used in boosting [14]. The base classifiers are typically combined using an integration function such as averaging [15] or majority voting [16]. We propose an ensemble learning system specifically designed for imbalanced biological data classification. A schematic representation of the proposed system is shown in Figure 1. It has three main components – sample subset optimization, base classifier, and fitness function. The key to this ensemble system is the application of the sample subset optimization technique (to be described in Section 3). Suppose that a highly imbalanced dataset contains n samples from the majority class and m samples from the minority class, where n ≫ m. The system creates each sample subset by including all m minority samples and selecting a subset of samples from the n majority samples according to an internal optimization procedure. This procedure is conducted to generate multiple optimized sample subsets, each being a roughly balanced subset containing the m minority samples and ni carefully selected majority samples, where ni ≪ n (i = 1...L) and L is the total number of optimized sample subsets. Using those optimized sample subsets, we obtain a group of base classifiers ci (i = 1...L), each trained on its corresponding sample subset {m + ni}. The base classifiers are then combined using majority voting to form an ensemble of classifiers. Algorithm 1 summarizes the procedure. A line starting with "//" in the algorithm is a comment for the line immediately following it.
Fig. 1. A schematic representation of the proposed ensemble system: the training set (m minority and n majority samples) is turned into L optimized, roughly balanced training subsets {m + ni}; each subset trains a base classifier ci, and the base classifiers are combined by majority voting to predict the test set, evaluated by the AUC value.
Algorithm 1. sampleSubsetOptimization
Input: Imbalanced dataset DI
Output: Roughly balanced dataset DB
1: cvSize = 2;
2: cvSets = crossValidate(DI, cvSize);
3: for i = 1 to cvSize do
4:   // obtain the internal training samples
5:   D_i^T = getTrain(cvSets, i);
6:   // obtain the internal test samples
7:   D_i^t = getTest(cvSets, i);
8:   // obtain samples of the minority class
9:   D_i^minor = getMinoritySample(D_i^T);
10:  // obtain samples of the majority class
11:  D_i^major = getMajoritySample(D_i^T);
12:  // select a subset of samples from the majority class
13:  D_i^major = optimizeMajoritySample(D_i^major, D_i^minor, D_i^t);
14:  DB = DB ∪ (D_i^minor ∪ D_i^major);
15: end for
16: return DB;
3
Sample Subset Optimization
The key function in Algorithm 1 is the optimization procedure applied to select a subset of samples from the majority class (Algorithm 1, line 13). The principal idea of the sample subset optimization procedure is to apply a cross validation procedure to form a subset in which each sample is selected according to the internal classification accuracy. In this section, we describe its formulation using a particle swarm optimization (PSO) algorithm [17], and analyze its behavior using a synthetic dataset. The base classifier and the fitness function used for optimization are discussed in Section 4.
3.1 Formulation of Sample Subset Optimization
We formulate the sample subset optimization using a particle swarm optimization algorithm. In particular, a dimension in the particle space is assigned for each sample from the majority class. That is, for n majority samples, a particle is coded as an indicator function set p = {I_x1, I_x2, ..., I_xn}. For each dimension, the indicator function I_xj takes value "1" when the corresponding jth sample xj is included to train a classifier; similarly, a "0" denotes that the corresponding sample is excluded from training. By optimizing a population of L particles p_i (i = 1...L), the velocity v_{i,j}(t) of the ith particle and its position s_{i,j}(t) in the jth dimension of the solution space are updated in each iteration t as follows:

v_{i,j}(t+1) = w · v_{i,j}(t) + c1 r1 · (pbest_{i,j} - s_{i,j}(t)) + c2 r2 · (gbest_{i,j} - s_{i,j}(t))    (1)

s_{i,j}(t+1) = 0 if random() ≥ S(v_{i,j}(t+1));  1 if random() < S(v_{i,j}(t+1))    (2)

S(v_{i,j}(t+1)) = 1 / (1 + e^{-v_{i,j}(t+1)})    (3)

where pbest_{i,j} and gbest_{i,j} are the previous best position and the best position found by informants, respectively; c1, r1, c2 and r2 are the learning rates and social coefficients, and random() is a random number generator with a uniform distribution over [0, 1]. Representing this optimization procedure in pseudocode, we obtain Algorithm 2. Note that the PSO algorithm produces multiple optimized sample subsets in parallel; therefore, by specifying the popSize parameter, we can obtain any number of optimized sample subsets with a single execution of the algorithm.
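A minimal Python sketch of the binary PSO update in Equations (1)-(3) is given below; it corresponds to the velocity and position updates used inside Algorithm 2 (where pbest and gbest are maintained from the fitness values), and the inertia weight w and coefficient values are illustrative assumptions rather than the settings used in the paper:

    import numpy as np

    def pso_update(v, s, pbest, gbest, w=0.7, c1=2.0, c2=2.0):
        """One binary PSO step: velocity update (Eq. 1), sigmoid transfer (Eq. 3)
        and stochastic position update (Eq. 2). Arrays have one row per particle
        and one column per majority-class sample (the indicator functions I_xj)."""
        r1, r2 = np.random.rand(*v.shape), np.random.rand(*v.shape)
        v = w * v + c1 * r1 * (pbest - s) + c2 * r2 * (gbest - s)   # Eq. (1)
        prob = 1.0 / (1.0 + np.exp(-v))                             # Eq. (3)
        s = (np.random.rand(*v.shape) < prob).astype(int)           # Eq. (2)
        return v, s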
Algorithm 2. optimizeMajoritySamples
Input: Majority samples D_major, Minority samples D_minor, Internal test samples D_t
Output: Optimized sample subsets D_major^{p_i} (i = 1...L)
1: popSize = L;
2: initiateParticles(D_major, popSize);
3: for t = 1 to termination do
4:   // go through each particle in the population
5:   for i = 1 to popSize do
6:     // extract the samples according to the indicator function set i
7:     D_major^{p_i} = extractSelectedSamples(p_i, D_major);
8:     D_train^{p_i} = D_major^{p_i} ∪ D_minor;
9:     // train a classifier using selected majority samples and all minority samples
10:    h_i = trainClassifier(D_train^{p_i});
11:    // calculate the fitness of the trained classifier using internal test samples
12:    fitness = calculateFitness(h_i, D_t);
13:    // update velocity (Eq. (1)) and position (Eq. (2)) according to fitness value
14:    v_{i,j}(t) = updateVelocity(v_{i,j}(t), fitness);
15:    s_{i,j}(t) = updatePosition(s_{i,j}(t), fitness);
16:  end for
17: end for
18: return D_major^{p_i} (i = 1...L)
3.2 Analysis of Behavior
We analyze the behavior of sample subset optimization using an imbalanced synthetic dataset. Samples are created with two features each, and both features are generated from the same distribution. Specifically, 20 samples of the majority class are generated from a normal distribution N(5, 1) and 10 samples of the minority class are generated from a normal distribution N(7, 1). In addition, 5 "outlier" samples are introduced into the dataset; they are labeled as majority class but are generated from the normal distribution of the minority class. The class ratio of the data is 25:10.
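The synthetic data just described can be reproduced (up to the random seed) with a few lines of NumPy; the seed and the way the outliers are appended below are our own illustrative choices:

    import numpy as np
    from sklearn.svm import LinearSVC

    rng = np.random.default_rng(0)
    majority = rng.normal(5, 1, size=(20, 2))   # 20 majority samples ~ N(5, 1)
    minority = rng.normal(7, 1, size=(10, 2))   # 10 minority samples ~ N(7, 1)
    outliers = rng.normal(7, 1, size=(5, 2))    # 5 outliers labelled as majority

    X = np.vstack([majority, outliers, minority])
    y = np.array([0] * 25 + [1] * 10)           # class ratio 25:10

    clf = LinearSVC().fit(X, y)                 # a boundary like the one in Fig. 2(a)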
Fig. 2. The green lines are the classification boundary created using a linear SVM with (a) the original dataset and (b) the dataset after optimization. The axes are the two features (Feature 1 and Feature 2); majority and minority samples are shown with different markers.
Figure 2(a) shows the original dataset and the resulting classification boundary of a linear SVM, and Figure 2(b) shows a dataset after applying sample subset optimization and the resulting classification boundary of a linear SVM. Note that this is one of the optimized datasets, used to train one base classifier; our ensemble is the aggregation of multiple base classifiers trained on multiple optimized datasets. It is evident that the class ratio is more balanced after optimization (from 25:10 to 15:10). In addition, 3 of the 5 outlier samples are removed, and 7 redundant majority samples, which have limited effect on the decision boundary of the linear SVM classifier, are removed to correct the imbalanced class distribution.
4
Base Classifier and Fitness Function
We select SVM as the base classifier for building the ensemble system, as SVM is routinely applied to many challenging bioinformatics problems. The design of the fitness function is another important facet of sample subset optimization: it determines the quality of the base classifiers, and thus the performance of the ensemble. The following subsections describe these two components in detail.
4.1 Base Classifier of Support Vector Machine
SVM is a popular classification algorithm which has been widely used in many bioinformatics problems. Among different kernel choices, linear SVM with a soft margin is robust for large scale and high-dimensional dataset classification [18]. Let us denote each sample in the dataset as a vector xi (i = 1...M ) where M is the total number of samples, and yi is the class label of sample xi . Each component in xi is a feature xij (j = 1...N ) interpreted as the jth feature of the ith sample, where N is the dimension of the feature space. In our case, features could be GC-content, dinucleotide values, or other biological markers used to characterize each sample.
A linear SVM with a soft margin is trained by solving the following optimization problem:

min_{w,b,ξ} (1/2)||w||^2 + C Σ_{i=1}^{M} ξ_i
subject to: y_i(⟨w, x_i⟩ + b) ≥ 1 - ξ_i

where w is the weight vector, ξ_i are the slack variables, and b is the bias. The constant C determines the trade-off between maximizing the margin and minimizing the amount of slack. In this study, we utilize the implementation proposed by Hsieh et al. [19], a fast and large-scale linear SVM implementation that is especially well suited as a base classifier for ensemble learning due to its computational efficiency. Notice that classifiers are trained both for sample subset optimization and for composing the ensemble. However, these two procedures are independent of each other, and therefore the classifiers trained for sample subset optimization are not the classifiers used in the ensemble. The purpose of the classifiers trained in the sample subset optimization procedure is to provide fitness feedback on the selected samples, whereas the classifiers used for composing the ensemble are trained on the optimized sample subsets and serve as the base classifiers of the ensemble. To maximize the specificity of the feedback, the same classification algorithm, that is, linear SVM, is used for both procedures.
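As a runnable stand-in for this base classifier, one could use scikit-learn's LinearSVC, which is built on LIBLINEAR and implements the dual coordinate descent method of Hsieh et al. [19]; the toy data below is purely for illustration and is not one of the paper's datasets:

    from sklearn.datasets import make_classification
    from sklearn.svm import LinearSVC

    # Toy imbalanced data for illustration only (roughly 90% negative, 10% positive).
    X, y = make_classification(n_samples=500, n_features=16, weights=[0.9, 0.1], random_state=0)

    # C trades off margin maximisation against the total slack (the sum of the xi_i).
    clf = LinearSVC(C=1.0)
    clf.fit(X, y)
    print(clf.decision_function(X[:5]))  # signed distances to the separating hyperplane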
4.2 Fitness Function
For building a classifier, a subset of samples from the majority class is selected according to an indicator function set p_i (see Section 3.1), and combined with the samples from the minority class to form a training set D_train^{p_i}. The goodness of an indicator function set can be assessed by the performance of the classifier trained with the samples specified by it. For imbalanced data, one effective way to evaluate the performance of the classifier is to use the area under the ROC curve metric [20]. Hence, we devise AUC(h_i(D_train^{p_i}, D_test)) as a component of the fitness function, where D_train^{p_i} denotes the training set generated using p_i and D_test denotes the test data. Function AUC() calculates the AUC value of a classification model h_i(D_a, D_b) which is trained on D_a and evaluated on D_b. Moreover, the size of the subset is also important, because a small training set is likely to result in a poorly trained model with poor generalization. Therefore, the fitness function can be constructed by combining the two components:

fitness(u_i) = w_1 · AUC(h_i(D_train^{p_i}, D_test)) + w_2 · Size(p_i)    (4)

where Size() determines the size of a subset (specified by p_i). Coefficients w_1 and w_2 are empirical constants which can be adjusted to alter the relative importance of each fitness component. The default values are w_1 = 0.8 and w_2 = 0.2 as they work well in a range of datasets.
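A direct transcription of Equation (4) into Python might look like the sketch below; the AUC is computed with scikit-learn for convenience, and normalising the subset size by the majority-class size is our own assumption, since the paper does not spell out how Size(·) is scaled:

    from sklearn.metrics import roc_auc_score

    def fitness(selected, clf, X_test, y_test, n_majority, w1=0.8, w2=0.2):
        """Weighted sum of the classifier's AUC on the internal test data and the
        (scaled) size of the selected majority subset, as in Equation (4)."""
        auc = roc_auc_score(y_test, clf.decision_function(X_test))
        size = sum(selected) / n_majority   # assumed normalisation of Size(.)
        return w1 * auc + w2 * size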
5
Experimental Results
In this section, we first describe the four imbalanced biological datasets used in our experiments. They are generated from several important and diverse biological problems and represent different degrees of imbalanced class distribution. Next we present the performance results of our ensemble algorithm compared with six other algorithms on those datasets.
5.1 Datasets
We evaluated different algorithms using datasets generated for the identification of miRNA, the classification of protein localization sites, and the prediction of promoters (drosophila and human). Specifically, the miRNA identification dataset contains 691 positive samples and 9248 negative samples, and is described by 21 features [21]. The protein localization dataset is generated from the study discussed in [22]; we attempted to differentiate membrane proteins (258) from the rest (1226). The human promoter dataset contains 471 promoter sequences and 5131 coding sequences (CDS) and intron sequences. Compared to the human promoter dataset, the drosophila promoter dataset has a relatively balanced class distribution with 1936 promoter sequences and 2722 CDS and intron sequences. We calculated the 16 dinucleotide features according to [23]. The datasets are summarized and organized according to class ratio in Table 1.

Table 1. Summary of biological datasets used for evaluation

Dataset (short name)             # Samples  # Features  Minority vs. Majority
drosophila promoter (DroProm)    6594       16          0.4156 (≈ 1:2.5)
protein localization (ProtLoc)   1484       8           0.2104 (≈ 1:5)
human promoter (HuProm)          5602       16          0.0918 (≈ 1:10)
miRNA identification (miRNA)     9939       21          0.0747 (≈ 1:13)
5.2 Performance Comparison
The performance of the single classifier of SVM was used as the baseline for all datasets. We compared the single classifier approaches including random undersampling with SVM (RUS-SVM), random over-sampling with SVM (ROS-SVM), SMOTE sampling with SVM (SMOTE-SVM), and the ensemble approaches including boosting with base classifiers of SVM (Boost-SVMs), bagging with base classifiers of SVM (Bag-SVMs), and our sample subset optimization technique with SVM (SSO-SVMs). For the ensemble methods, we tested the ensemble size from 10 to 100 with a step of 10. A 5-fold cross-validation procedure was applied to partition datasets for training and testing, and each algorithm was tested on the same partition to reduce evaluation variance. Among the six tested algorithms, four of them employed the randomization procedure. They are RUS-SVM, ROS-SVM, Bag-SVMs, and SSO-SVMs (note that the Boost-SVMs algorithm uses the
Fig. 3. The comparison of different algorithms for data classification on (a) drosophila promoter, (b) protein localization, (c) human promoter and (d) miRNA identification. The x-axis denotes the ensemble size and the y-axis denotes the AUC value; curves are shown for SSO-SVMs, Bag-SVMs, Boost-SVMs, Single-SVM, ROS-SVM, RUS-SVM and SMOTE-SVM. For those algorithms that use a single classifier, the same AUC value is plotted at different ensemble sizes for the purpose of comparison.
reweighting implementation and is deterministic). For those with the randomization procedure, we repeated the test 10 times, each time with a different random seed. Figure 3 shows the comparison results. It can be seen that in most cases ensemble approaches give higher AUC values than the single classifier approaches. Among the single classifier approaches, random under-sampling, random over-sampling, and SMOTE sampling do improve the classification results when the analyzed dataset has a highly imbalanced class distribution, as in Figure 3(b)(c)(d). However, the improvements become less significant when the imbalance is moderate (drosophila promoter dataset in Figure 3(a)). SMOTE sampling performs better than the random under-sampling and over-sampling approaches in the case of protein localization (Figure 3(b)). However, the performance gain is marginal in the other three datasets (Figure 3(a)(c)(d)). We do not observe a significant performance difference between random under-sampling and
Table 2. The comparison of different algorithms for data classification according to AUC value. The values for ensemble approaches are averaged across different ensemble sizes.

Algorithm     DroProm  ProtLoc  HuProm  miRNA
Single-SVM    0.6584   0.8296   0.5740  0.7542
RUS-SVM       0.6584   0.8850   0.6016  0.7644
ROS-SVM       0.6555   0.8866   0.5986  0.8114
SMOTE-SVM     0.6400   0.8976   0.5961  0.7924
Boost-SVMs    0.7756   0.8852   0.6644  0.8891
Bag-SVMs      0.8507   0.8671   0.7264  0.9198
SSO-SVMs      0.8520   0.9098   0.7718  0.9419
Table 3. P-values from a one-tailed Student's t-test comparing the performance differences

Algorithm                  DroProm     ProtLoc     HuProm      miRNA
SSO-SVMs vs. Single-SVM    2 × 10^-15  4 × 10^-18  1 × 10^-11  1 × 10^-14
SSO-SVMs vs. RUS-SVM       2 × 10^-15  1 × 10^-13  4 × 10^-11  2 × 10^-14
SSO-SVMs vs. ROS-SVM       2 × 10^-15  2 × 10^-13  4 × 10^-11  3 × 10^-13
SSO-SVMs vs. SMOTE-SVM     8 × 10^-16  8 × 10^-11  3 × 10^-11  9 × 10^-14
SSO-SVMs vs. Boost-SVMs    2 × 10^-8   8 × 10^-7   7 × 10^-6   2 × 10^-5
SSO-SVMs vs. Bag-SVMs      6 × 10^-4   7 × 10^-11  1 × 10^-6   2 × 10^-3
random over-sampling, except in the case of miRNA identification (Figure 3(d)), where random over-sampling is relatively better than random under-sampling. For the ensemble approaches, Boost-SVMs performs surprisingly worse than the other two approaches in most cases and its performance fluctuates among different ensemble sizes. This may be caused by its training process, in that the boosting algorithm assigns increasingly more classification weight to the most "difficult" samples in each iteration. However, those "difficult" samples could be outliers and cause a deleterious effect when the classifiers pay too much attention to classifying them while ignoring other more representative samples. In this regard, Bag-SVMs and SSO-SVMs appear to be the better approaches. SSO-SVMs almost always performs the best and generates a much smaller performance variance when different random seeds are used. It is likely that SSO-SVMs can capture the most representative samples from the training set, which gives better generalization on unseen data. We also observe that the improvement is more significant when the datasets have a highly imbalanced class distribution (Figure 3(b)(c)(d)). Table 2 shows the AUC values of both the single classifier and the ensemble approaches. For the ensemble approaches, the AUC value is the average of those given by the ensemble sizes from 10 to 100. The proposed SSO-SVMs performs the best on all four tested datasets. Compared with the baseline of a single SVM, these results account for 10%-20% improvements. To confirm the improvements are statistically significant, we applied a one-tailed Student's t-test and compared SSO-SVMs with the other six methods. Table 3 shows the p-values of the comparisons. On all four datasets, the performance of SSO-SVMs is
significantly better than the other six methods, with a p-value smaller than 0.05. Therefore, we confirmed the effectiveness of the proposed ensemble approach.
6
Conclusion
In this paper we introduced a sample subset optimization technique for sampling optimal sample subsets from training data. We integrated this technique in an ensemble learning framework and created an ensemble of SVMs specifically for imbalanced biological data classification. The proposed algorithm was applied to several bioinformatics tasks with moderate and highly imbalanced class distributions. According to our experimental results, (1) the approaches based on data sampling for a single SVM are generally less effective compared to the ensemble approaches; (2) the proposed sample subset optimization technique appears to be very effective and the ensemble optimized by this technique produced the best classification results in terms of AUC value for all evaluation datasets.
References
1. Meyer, I.M.: A practical guide to the art of RNA gene prediction. Briefings in Bioinformatics 8(6), 396–414 (2007)
2. Zeng, J., Zhu, S., Yan, H.: Towards accurate human promoter recognition: a review of currently used sequence features and classification methods. Briefings in Bioinformatics 10(5), 498–508 (2009)
3. Sonnenburg, S., Schweikert, G., Philips, P., Behr, J., Rätsch, G.: Accurate splice site prediction using support vector machines. BMC Bioinformatics 8(suppl. 10), 7 (2007)
4. Hua, S., Sun, Z.: Support vector machine approach for protein subcellular localization prediction. Bioinformatics 17(8), 721–728 (2001)
5. Akbani, R., Kwek, S., Japkowicz, N.: Applying support vector machines to imbalanced datasets. In: Boulicaut, J.-F., Esposito, F., Giannotti, F., Pedreschi, D. (eds.) ECML 2004. LNCS (LNAI), vol. 3201, pp. 39–50. Springer, Heidelberg (2004)
6. Liu, Y., An, A., Huang, X.: Boosting prediction accuracy on imbalanced datasets with SVM ensembles. In: Ng, W.-K., Kitsuregawa, M., Li, J., Chang, K. (eds.) PAKDD 2006. LNCS (LNAI), vol. 3918, pp. 107–118. Springer, Heidelberg (2006)
7. Japkowicz, N., Stephen, S.: The class imbalance problem: A systematic study. Intelligent Data Analysis 6(5), 429–449 (2002)
8. Batuwita, R., Palade, V.: A New Performance Measure for Class Imbalance Learning. Application to Bioinformatics Problems. In: 2009 International Conference on Machine Learning and Applications, pp. 545–550. IEEE, Los Alamitos (2009)
9. Chawla, N., Japkowicz, N., Kotcz, A.: Editorial: special issue on learning from imbalanced data sets. ACM SIGKDD Explorations Newsletter 6, 1–6 (2004)
10. Chawla, N., Bowyer, K., Hall, L., Kegelmeyer, W.: SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research 16(1), 321–357 (2002)
11. Weiss, G.M.: Mining with rarity: a unifying framework. ACM SIGKDD Explorations Newsletter 6(1), 7–19 (2004)
12. Hido, S., Kashima, H., Takahashi, Y.: Roughly balanced bagging for imbalanced data. Statistical Analysis and Data Mining 2(5-6), 412–426 (2009)
13. Breiman, L.: Bagging predictors. Machine Learning 24(2), 123–140 (1996)
14. Schapire, R.E., Freund, Y., Bartlett, P., Lee, W.S.: Boosting the margin: A new explanation for the effectiveness of voting methods. The Annals of Statistics 26(5), 1651–1686 (1998)
15. Tax, D., Van Breukelen, M., Duin, R.: Combining multiple classifiers by averaging or by multiplying? Pattern Recognition 33(9), 1475–1485 (2000)
16. Lam, L., Suen, S.Y.: Application of majority voting to pattern recognition: an analysis of its behavior and performance. IEEE Transactions on Systems, Man, and Cybernetics, Part A: Systems and Humans 27(5), 553–568 (1997)
17. Poli, R., Kennedy, J., Blackwell, T.: Particle swarm optimization. Swarm Intelligence 1(1), 33–57 (2007)
18. Ben-Hur, A., Ong, C.S., Sonnenburg, S., Schölkopf, B., Rätsch, G.: Support vector machines and kernels for computational biology. PLoS Computational Biology 4(10) (2008)
19. Hsieh, C., Chang, K., Lin, C., Keerthi, S., Sundararajan, S.: A dual coordinate descent method for large-scale linear SVM. In: Proceedings of the 25th International Conference on Machine Learning, pp. 408–415. ACM, New York (2008)
20. Fawcett, T.: An introduction to ROC analysis. Pattern Recognition Letters 27(8), 861–874 (2006)
21. Batuwita, R., Palade, V.: microPred: effective classification of pre-miRNAs for human miRNA gene prediction. Bioinformatics 25(8), 989–995 (2009)
22. Horton, P., Nakai, K.: A probabilistic classification system for predicting the cellular localization sites of proteins. In: Proceedings of the Fourth International Conference on Intelligent Systems for Molecular Biology, pp. 109–115. AAAI Press, Menlo Park (1996)
23. Rani, T.S., Bhavani, S.D., Bapi, R.S.: Analysis of E. coli promoter recognition problem in dinucleotide feature space. Bioinformatics 23(5), 582–588 (2007)
Class Confidence Weighted kNN Algorithms for Imbalanced Data Sets
Wei Liu and Sanjay Chawla
School of Information Technologies, University of Sydney
{wei.liu,sanjay.chawla}@sydney.edu.au
Abstract. In this paper, a novel k-nearest neighbors (kNN) weighting strategy is proposed for handling the problem of class imbalance. When dealing with highly imbalanced data, a salient drawback of existing kNN algorithms is that the class with more frequent samples tends to dominate the neighborhood of a test instance in spite of distance measurements, which leads to suboptimal classification performance on the minority class. To solve this problem, we propose CCW (class confidence weights), which uses the probability of attribute values given class labels to weight prototypes in kNN. The main advantage of CCW is that it is able to correct the inherent bias towards the majority class in existing kNN algorithms under any distance measurement. Theoretical analysis and comprehensive experiments confirm our claims.
1
Introduction
A data set is "imbalanced" if its dependent variable is categorical and the number of instances in one class differs from that in the other class. Learning from imbalanced data sets has been identified as one of the 10 most challenging problems in data mining research [1]. In the literature on solving class imbalance problems, data-oriented methods use sampling techniques to over-sample instances in the minor class or under-sample those in the major class, so that the resulting data is balanced. A typical example is the SMOTE method [2], which increases the number of minor class instances by creating synthetic samples. It has recently been proposed that using different weight degrees on the synthetic samples (so-called safe-level-SMOTE [3]) produces better accuracy than SMOTE. The focus of algorithm-oriented methods has been on extensions and modifications of existing classification algorithms so that they can be more effective in dealing with imbalanced data. For example, modifications of decision tree algorithms have been proposed to improve the standard C4.5, such as HDDT [4] and CCPDT [5]. kNN algorithms have been identified as one of the top ten most influential data mining algorithms [6] for their ability to produce simple but powerful
The first author of this paper acknowledges the financial support of the Capital Markets CRC.
classifiers. The k neighbors that are the closest to a test instance are conventionally called prototypes. In this paper we use the concepts of "prototypes" and "instances" interchangeably. There are several advanced kNN methods proposed in the recent literature. Weinberger et al. [7] learned Mahalanobis distance matrices for kNN classification by using semidefinite programming, a method which they call large margin nearest neighbor (LMNN) classification. Experimental results of LMNN show large improvements over conventional kNN and SVM. Min et al. [8] proposed DNet, which uses a non-linear feature mapping pre-trained with Restricted Boltzmann Machines to achieve the goal of large-margin kNN classification. Recently, a new method, WDkNN, was introduced in [9]; it discovers optimal weights for each instance in the training phase which are taken into account during the test phase. This method has been demonstrated to be superior to other kNN algorithms including LPD [10], PW [11], A-NN [12] and WDNN [13]. In this paper, the model we propose is an algorithm-oriented method and we preserve all original information/distribution of the training data sets. More specifically, the contributions of this paper are as follows:
1. We show that the mechanism of traditional kNN algorithms is equivalent to using only local prior probabilities to predict instances' labels, and from this perspective we illustrate why many existing kNN algorithms have undesirable performance on imbalanced data sets;
2. We propose CCW (class confidence weights), the confidence (likelihood) of a prototype's attribute values given its class label, which transforms prior probabilities to posterior probabilities. We demonstrate that this transformation makes the kNN classification rule analogous to using a likelihood ratio test in the neighborhood;
3. We propose two methods, mixture modeling and Bayesian networks, to efficiently estimate the value of CCW.
The rest of the paper is structured as follows. In Section 2 we review existing kNN algorithms and explain why they are flawed in learning from imbalanced data. We define the CCW weighting strategy and justify its effectiveness in Section 3. CCW is estimated in Section 4. Section 5 reports experiments and Section 6 concludes the paper.
2 Existing kNN Classifiers
Given labeled training data (x_i, y_i) (i = 1, ..., n), where x_i ∈ R^d are feature vectors, d is the number of features and y_i ∈ {c_1, c_2} are binary class labels, the kNN algorithm finds a group of k prototypes from the training set that are the closest to a test instance x_t by a certain distance measure (e.g. Euclidean distance), and estimates the test instance's label according to the predominance of a class in this neighborhood. When there is no weighting (NW) strategy, this majority voting mechanism can be expressed as:
\[ \text{NW:}\quad y_t = \arg\max_{c \in \{c_1, c_2\}} \sum_{x_i \in \phi(x_t)} I(y_i = c) \tag{1} \]
where y_t is a predicted label, I(·) is an indicator function that returns 1 if its condition is true and 0 otherwise, and φ(x_t) denotes the set of k training instances (prototypes) closest to x_t. When the k neighbors vary widely in their distances and closer neighbors are more reliable, the neighbors are weighted by the multiplicative-inverse (MI) or the additive-inverse (AI) of their distances:
\[ \text{MI:}\quad y_t = \arg\max_{c \in \{c_1, c_2\}} \sum_{x_i \in \phi(x_t)} I(y_i = c) \cdot \frac{1}{dist(x_t, x_i)} \tag{2} \]
\[ \text{AI:}\quad y_t = \arg\max_{c \in \{c_1, c_2\}} \sum_{x_i \in \phi(x_t)} I(y_i = c) \cdot \left(1 - \frac{dist(x_t, x_i)}{dist_{max}}\right) \tag{3} \]
where dist(x_t, x_i) represents the distance between the test point x_t and a prototype x_i, and dist_max is the maximum possible distance between two training instances in the feature space, so that dist(x_t, x_i)/dist_max lies in the range [0, 1]. While MI and AI solve the problem of large distance variance among the k neighbors, their effects become insignificant if the neighborhood of a test point is considerably dense and one of the classes (or both) is over-represented by its samples – since in this scenario all of the k neighbors are close to the test point and the difference among their distances is not discriminative [9].
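The three voting rules can be made concrete with a short sketch. The following is a minimal illustration only, not the authors' implementation; the use of NumPy, the function name and the toy data are assumptions made for this example.

```python
import numpy as np

def knn_vote(X_train, y_train, x_test, k=5, scheme="NW", dist_max=None):
    """Predict the label of x_test under the NW (Eq. 1), MI (Eq. 2) or AI (Eq. 3) rule."""
    dists = np.linalg.norm(X_train - x_test, axis=1)   # Euclidean distances to all training points
    nn = np.argsort(dists)[:k]                         # indices of the k nearest prototypes
    scores = {}
    for c in np.unique(y_train):
        in_class = (y_train[nn] == c)                  # I(y_i = c) over the neighbourhood
        if scheme == "NW":                             # simple majority vote
            scores[c] = in_class.sum()
        elif scheme == "MI":                           # multiplicative-inverse distance weights
            scores[c] = (in_class / np.maximum(dists[nn], 1e-12)).sum()
        elif scheme == "AI":                           # additive-inverse distance weights
            scores[c] = (in_class * (1.0 - dists[nn] / dist_max)).sum()
    return max(scores, key=scores.get)

# toy usage
X = np.array([[6.0, 3.0], [6.2, 2.8], [3.0, 6.0], [3.1, 5.9], [2.9, 6.2]])
y = np.array([1, 1, 0, 0, 0])
print(knn_vote(X, y, np.array([5.5, 3.5]), k=3, scheme="MI"))
```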
2.1 Handling Imbalanced Data
Given the definition of the conventional k NN algorithm, we now explain its drawback in dealing with imbalanced data sets. The majority voting in Eq. 1 can be rewritten as the following equivalent maximization problem:
\[
y_t = \arg\max_{c \in \{c_1, c_2\}} \sum_{x_i \in \phi(x_t)} I(y_i = c)
\;\Rightarrow\; \max\Big\{ \sum_{x_i \in \phi(x_t)} I(y_i = c_1),\; \sum_{x_i \in \phi(x_t)} I(y_i = c_2) \Big\}
\]
\[
= \max\Big\{ \frac{\sum_{x_i \in \phi(x_t)} I(y_i = c_1)}{k},\; \frac{\sum_{x_i \in \phi(x_t)} I(y_i = c_2)}{k} \Big\}
= \max\{\, p_t(c_1),\; p_t(c_2) \,\} \tag{4}
\]
where p_t(c_1) and p_t(c_2) represent the proportions of classes c_1 and c_2 appearing in φ(x_t) – the k-neighborhood of x_t. If we integrate this kNN classification rule into Bayes' theorem, treat φ(x_t) as the sample space and treat p_t(c_1) and p_t(c_2) as priors¹ of the two classes in this sample space, Eq. 4 intuitively illustrates that the classification mechanism of kNN is based on finding the class label that has the higher prior value. This suggests that traditional kNN uses only the prior information to estimate class labels, which yields suboptimal classification performance on the minority class when the data set is highly imbalanced. Suppose c_1 is the dominating class label; it is then expected that the inequality p_t(c_1) > p_t(c_2) holds true in most
We note that pt (c1 ) and pt (c2 ) are conditioned (on xt ) in the sample space of the overall training data, but unconditioned in the sample space of φ(xt ).
[Figure 1 comprises four scatter plots: (a) Balanced data full view, (b) Balanced data regional view, (c) Imbalanced data full view, (d) Imbalanced data regional view.]
Fig. 1. Performance of conventional kNN (k = 5) on synthetic data. When data is balanced, all misclassifications of circular points are made on the upper left side of an optimal linear classification boundary; but when data is imbalanced, misclassifications of circular points appear on both sides of the boundary.
regions of the feature space. Especially in the overlap regions of the two class labels, kNN always tends to be biased towards c_1. Moreover, because the dominating class is likely to be over-represented in the overlap regions, "distance weighting" strategies such as MI and AI are ineffective in correcting this bias. Figure 1 shows an example where kNN is performed using the Euclidean distance measure for k = 5. Samples of the positive and negative classes are generated from Gaussian distributions with means [μ_1^pos, μ_2^pos] = [6, 3] and [μ_1^neg, μ_2^neg] = [3, 6] respectively and a common standard deviation I (the identity matrix). The (blue) triangles are samples of the negative/majority class, the (red) unfilled circles are those of the positive/minority class, and the (green) filled circles indicate the positive samples incorrectly classified by the conventional kNN algorithm. The straight line in the middle of the two clusters suggests a classification boundary built by an ideal linear classifier. Figures 1(a) and 1(c) give global overall views of the kNN classifications, while Figures 1(b) and 1(d) are their corresponding "zoom-in" subspaces that focus on a particular misclassified positive sample. Imbalanced data is sampled under the class ratio Pos:Neg = 1:10.
As we can see from Figures 1(a) and 1(b), when data is balanced all of the misclassified positive samples are on the upper left side of the classification boundary, and are always surrounded by only negative samples. But when data is imbalanced (Figures 1(c) and 1(d)), misclassifications of positives appear on both sides of the boundary. This is because the negative class is over-represented and dominates much larger regions than the positive class. The incorrectly classified positive point in Figure 1(d) is surrounded by 4 negative and 1 positive neighbors, with a negative neighbor being the closest prototype to the test point. In this scenario, distance weighting strategies (e.g. MI and AI) cannot help to correct the bias towards the negative class. In the next section, we introduce CCW and explain how it can solve such problems and correct the bias.
3 CCW Weighted kNN
To improve the existing kNN rule, we introduce CCW to capture the probability (confidence) of attribute values given a class label. We define CCW on a training instance i as follows:
\[ w_i^{CCW} = p(x_i \mid y_i), \tag{5} \]
where x_i and y_i represent the attribute vector and the class label of instance i. Then the resulting classification rule integrated with CCW is:
\[ \text{CCW:}\quad y_t = \arg\max_{c \in \{c_1, c_2\}} \sum_{x_i \in \phi(x_t)} I(y_i = c) \cdot w_i^{CCW} \tag{6} \]
and by applying it to the distance weighting schemes MI and AI we obtain:
\[ \text{CCW}_{MI}:\quad y_t = \arg\max_{c \in \{c_1, c_2\}} \sum_{x_i \in \phi(x_t)} I(y_i = c) \cdot \frac{1}{dist(x_t, x_i)} \cdot p(x_i \mid y_i) \tag{7} \]
\[ \text{CCW}_{AI}:\quad y_t = \arg\max_{c \in \{c_1, c_2\}} \sum_{x_i \in \phi(x_t)} I(y_i = c) \cdot \left(1 - \frac{dist(x_t, x_i)}{dist_{max}}\right) \cdot p(x_i \mid y_i) \tag{8} \]
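A minimal sketch of these CCW-weighted rules follows. It assumes the per-instance weights w_ccw[i] ≈ p(x_i | y_i) have already been estimated (Section 4 discusses how); the NumPy-based interface and names are assumptions for this example rather than the authors' code.

```python
import numpy as np

def ccw_knn_vote(X_train, y_train, w_ccw, x_test, k=5, scheme="CCW_MI", dist_max=None):
    """CCW-weighted kNN vote (Eqs. 6-8); w_ccw[i] is an estimate of p(x_i | y_i)."""
    dists = np.linalg.norm(X_train - x_test, axis=1)
    nn = np.argsort(dists)[:k]
    scores = {}
    for c in np.unique(y_train):
        in_class = (y_train[nn] == c).astype(float)
        if scheme == "CCW":             # Eq. (6): class confidence weights only
            d_w = np.ones(k)
        elif scheme == "CCW_MI":        # Eq. (7): combined with multiplicative-inverse distances
            d_w = 1.0 / np.maximum(dists[nn], 1e-12)
        elif scheme == "CCW_AI":        # Eq. (8): combined with additive-inverse distances
            d_w = 1.0 - dists[nn] / dist_max
        scores[c] = (in_class * d_w * w_ccw[nn]).sum()
    return max(scores, key=scores.get)
```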
With the integration of CCW, the maximization problem in Eq. 4 becomes:
\[
y_t = \arg\max_{c \in \{c_1, c_2\}} \sum_{x_i \in \phi(x_t)} I(y_i = c) \cdot p(x_i \mid y_i)
\;\Rightarrow\; \max\Big\{ \frac{\sum_{x_i \in \phi(x_t)} I(y_i = c_1)}{k}\, p(x_i \mid y_i = c_1),\;
\frac{\sum_{x_i \in \phi(x_t)} I(y_i = c_2)}{k}\, p(x_i \mid y_i = c_2) \Big\}
\]
\[
= \max\{\, p_t(c_1)\, p(x_i \mid y_i = c_1)_{x_i \in \phi(x_t)},\; p_t(c_2)\, p(x_i \mid y_i = c_2)_{x_i \in \phi(x_t)} \,\}
= \max\{\, p_t(x_i, c_1)_{x_i \in \phi(x_t)},\; p_t(x_i, c_2)_{x_i \in \phi(x_t)} \,\}
= \max\{\, p_t(c_1 \mid x_i)_{x_i \in \phi(x_t)},\; p_t(c_2 \mid x_i)_{x_i \in \phi(x_t)} \,\} \tag{9}
\]
where p_t(c | x_i)_{x_i ∈ φ(x_t)} represents the probability of x_t belonging to class c given the attribute values of all prototypes in φ(x_t). Comparison of Eq. 4 and Eq. 9 demonstrates that the use of CCW changes the basis of the kNN rule from priors to posteriors: while conventional kNN directly uses the probabilities (proportions) of class labels among the k prototypes, we use conditional probabilities of classes given the values of the k prototypes'
feature vectors. The change from priors to posteriors is easy to understand since CCW behaves just like the notion of likelihood in Bayes' theorem.
3.1 Justification of CCW
Since CCW is equivalent to the notion of likelihood in Bayes' theorem, in this subsection we demonstrate how the rationale of the CCW-based kNN rule can be interpreted through likelihood ratio tests. We assume c_1 is the majority class and define the null hypothesis (H_0) as "x_t belongs to c_1", and the alternative hypothesis (H_1) as "x_t belongs to c_2". Assume that among φ(x_t) the first j neighbors are from c_1 and the other k − j are from c_2. We obtain the likelihood of H_0 (L_0) and of H_1 (L_1) from:
\[ L_0 = \prod_{i=1}^{j} p(x_i \mid y_i = c_1)_{x_i \in \phi(x_t)}, \qquad L_1 = \prod_{i=j+1}^{k} p(x_i \mid y_i = c_2)_{x_i \in \phi(x_t)} \]
Then the likelihood ratio test statistic can be written as:
\[ \Lambda = \frac{L_0}{L_1} = \frac{\prod_{i=1}^{j} p(x_i \mid y_i = c_1)_{x_i \in \phi(x_t)}}{\prod_{i=j+1}^{k} p(x_i \mid y_i = c_2)_{x_i \in \phi(x_t)}} \tag{10} \]
Note that the numerator and the denominator of the fraction in Eq. 10 correspond to the two terms of the maximization problem in Eq. 9. It is essential to ensure that the majority class does not have higher priority than the minority in imbalanced data, so we choose "Λ = 1" as the rejection threshold. Then the mechanism of using Eq. 9 as the kNN classification rule is equivalent to "predict x_t to be c_2 when Λ ≤ 1" (reject H_0), and "predict x_t to be c_1 when Λ > 1" (do not reject H_0).
Example 1. We reuse the example in Figure 1. The size of the triangles/circles is proportional to their CCW weights: the larger the size of a triangle/circle, the greater the weight of that instance; the smaller the size, the lower the weight. In Figure 1(d), the misclassified positive instance has four negative-class neighbors with CCW weights 0.0245, 0.0173, 0.0171 and 0.0139, and one positive-class neighbor of weight 0.1691. Then the total negative-class weight is 0.0728 and the total positive-class weight is 0.1691, and the CCW ratio is 0.0728/0.1691 < 1, which gives a label prediction of the positive (minority) class. So even though the closest prototype to the test instance comes from the wrong class, which also dominates the test instance's neighborhood, a CCW weighted kNN can still correctly classify this actual positive test instance.
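The arithmetic of Example 1 can be re-checked directly; the snippet below simply re-adds the weights quoted in the text.

```python
# Weights taken from Example 1 above.
neg_weights = [0.0245, 0.0173, 0.0171, 0.0139]   # four negative-class neighbours
pos_weights = [0.1691]                            # one positive-class neighbour
ratio = sum(neg_weights) / sum(pos_weights)       # ratio of the summed class weights
print(round(sum(neg_weights), 4), round(ratio, 3), ratio < 1)   # 0.0728 0.431 True -> predict minority class
```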
4 Estimations of CCW Weights
In this section we briefly introduce how we employ mixture modeling and Bayesian networks to estimate CCW weights.
4.1 Mixture Models
In the formulation of mixture models, the training data is assumed to follow a q-component finite mixture distribution with probability density function (pdf):
\[ p(x \mid \theta) = \sum_{m=1}^{q} \alpha_m\, p(x \mid \theta_m) \tag{11} \]
where x is a sample of training data whose pdf is required, α_m represents the mixing probabilities, θ_m defines the m-th component, and θ ≡ {θ_1, ..., θ_q, α_1, ..., α_q} is the complete set of parameters specifying the mixture model. Given training data Ω, the log-likelihood of a q-component mixture distribution is \( \log p(\Omega \mid \theta) = \log \prod_{i=1}^{n} p(x_i \mid \theta) = \sum_{i=1}^{n} \log \sum_{m=1}^{q} \alpha_m\, p(x_i \mid \theta_m) \). The maximum likelihood (ML) estimate θ_ML = arg max_θ log p(Ω|θ) cannot be found analytically, so we use the expectation-maximization (EM) algorithm to solve for it and then apply the estimated θ in Eq. 11 to find the pdf of all instances in the training data set as their corresponding CCW weights.
Example 2. We reuse the example in Figure 1, but now we assume the underlying distribution parameters (i.e. the mean and variance matrices) that generate the two classes of data are unknown. We apply the training samples to ML estimation, solve for θ by the EM algorithm, and then use Eq. 11 to estimate the pdf of the training instances, which are used as their CCW weights. The estimated weights (and their effects) of the neighbors of the originally misclassified positive sample in Figure 1(d) are shown in Example 1.
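As an illustration of the mixture-model route, the following sketch fits a per-class Gaussian mixture with EM and uses the resulting density of each training instance as its CCW weight. The choice of scikit-learn's GaussianMixture, the number of components q, and the function names are assumptions made for this example only, not the authors' implementation.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def ccw_weights_mixture(X, y, q=2, seed=0):
    """Estimate w_i = p(x_i | y_i) by fitting a q-component Gaussian mixture per class with EM."""
    w = np.empty(len(X))
    for c in np.unique(y):
        idx = np.where(y == c)[0]
        gmm = GaussianMixture(n_components=q, random_state=seed).fit(X[idx])
        # score_samples returns log p(x | class mixture); exponentiate to get the density (Eq. 11)
        w[idx] = np.exp(gmm.score_samples(X[idx]))
    return w

# toy usage: two Gaussian classes, loosely following the synthetic example of Figure 1
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([6, 3], 1.0, (30, 2)), rng.normal([3, 6], 1.0, (300, 2))])
y = np.array([1] * 30 + [0] * 300)
print(ccw_weights_mixture(X, y, q=2)[:3])
```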
4.2 Bayesian Networks
While mixture modeling deals with numerical features, Bayesian networks can be used to estimate CCW when feature values are categorical. The task of learning a Bayesian network is to (i) build a directed acyclic graph (DAG) over Ω, and (ii) learn a set of (conditional) probability tables {p(ω | pa(ω)), ω ∈ Ω}, where pa(ω) represents the set of parents of ω in the DAG. From these conditional distributions one can recover the joint probability distribution over Ω by using \( p(\Omega) = \prod_{i=1}^{d+1} p(\omega_i \mid pa(\omega_i)) \). In brief, we learn and build the structure of the DAG by employing the K2 algorithm [14], which in the worst case has an overall time complexity of O(n²), one "n" for the number of features and another "n" for the number of training instances. Then we estimate the conditional probability tables directly from the training data. After obtaining the joint distribution p(Ω), the CCW weight of a training instance i can be easily obtained from \( w_i^{CCW} = p(x_i \mid y_i) \propto p(\Omega)/p(y_i) \), where p(y_i) is the proportion of class y_i among the entire training data.
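The K2-based Bayesian-network estimator itself is not reproduced here. As a deliberately simplified stand-in for categorical data, the sketch below assumes attributes are conditionally independent given the class (a naive Bayes factorisation) when estimating p(x_i | y_i); all names and the toy data are hypothetical.

```python
from collections import Counter, defaultdict

def ccw_weights_categorical(rows, labels):
    """Illustrative estimate of w_i = p(x_i | y_i) for categorical rows, assuming
    attribute independence given the class (a simplification of the Bayesian-network approach)."""
    by_class = defaultdict(list)
    for row, lab in zip(rows, labels):
        by_class[lab].append(row)
    # per-class, per-attribute value frequencies
    freq = {}
    for lab, class_rows in by_class.items():
        n = len(class_rows)
        freq[lab] = [{v: cnt / n for v, cnt in Counter(col).items()}
                     for col in zip(*class_rows)]
    weights = []
    for row, lab in zip(rows, labels):
        w = 1.0
        for a, value in enumerate(row):
            w *= freq[lab][a].get(value, 0.0)
        weights.append(w)
    return weights

# toy usage
rows = [("sunny", "high"), ("sunny", "low"), ("rain", "high")]
labels = ["play", "play", "stop"]
print(ccw_weights_categorical(rows, labels))
```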
5 Experiments and Analysis
In this section, we analyze and compare the performance of CCW-based k NN against existing k NN algorithms, other algorithm-oriented state of the art
Table 1. Details of imbalanced data sets and comparisons of kNN algorithms on weighting strategies for k = 1 Name
#Inst #Att MinClass CovVar 7
KDDCup’09 : Appetency 50000 278 1.76% Churn 50000 278 7.16% Upselling 50000 278 8.12% 8 Agnostic-vs-Prior : Ada.agnostic 4562 48 24.81% Ada.prior 4562 15 24.81% Sylva.agnostic 14395 213 6.15% Sylva.prior 14395 108 6.15% 9 StatLib : BrazilTourism 412 9 3.88% Marketing 364 33 8.52% Backache 180 33 13.89% BioMed 209 9 35.89% Schizo 340 15 47.94% Text Mining [15]: Fbis 2463 2001 1.54% Re0 1504 2887 0.73% Re1 1657 3759 0.78% Tr12 313 5805 9.27% Tr23 204 5833 5.39% UCI [16]: Arrhythmia 452 263 2.88% Balance 625 5 7.84% Cleveland 303 14 45.54% Cmc 1473 10 22.61% Credit 690 16 44.49% Ecoli 336 8 5.95% German 1000 21 30.0% Heart 270 14 44.44% Hepatitis 155 20 20.65% Hungarian 294 13 36.05% Ionosphere 351 34 35.9% Ipums 7019 60 0.81% Pima 768 9 34.9% Primary 339 18 4.13% Average Rank Friedman Tests Friedman Tests
NW
Area Under Precision-Recall Curve MI CCWMI AI CCWAI WDk NN
4653.2 .022(4) .021(5) .028(2) .021(5) .035(1) .023(3) 3669.5 .077(3) .069(5) .077(2) .069(5) .093(1) .074(4) 3506.9 .116(6) .124(4) .169(2) .124(4) .169(1) .166(3) 1157.5 1157.5 11069.1 11069.1
.441(6) .443(4) .672(6) .853(6)
.442(4) .433(5) .745(4) .906(4)
.520(2) .518(3) .790(2) .941(2)
.442(4) .433(5) .745(4) .906(4)
.609(1) .606(1) .797(1) .945(1)
.518(3) .552(2) .774(3) .907(3)
350.4 250.5 93.8 16.6 0.5
.064(6) .106(6) .196(6) .776(6) .562(4)
.111(4) .118(4) .254(4) .831(4) .534(5)
.132(2) .152(1) .318(2) .874(2) .578(3)
.111(4) .118(4) .254(4) .831(4) .534(5)
.187(1) .152(2) .319(1) .887(1) .599(1)
.123(3) .128(3) .307(3) .872(3) .586(2)
2313.3 1460.3 1605.4 207.7 162.3
.082(6) .423(6) .360(1) .450(6) .098(6)
.107(4) .503(5) .315(5) .491(4) .122(4)
.119(2) .561(2) .346(2) .498(1) .136(1)
.107(4) .503(4) .315(5) .491(3) .122(4)
.117(3) .563(1) .346(2) .490(5) .128(3)
.124(1) .559(3) .335(4) .497(2) .134(2)
401.5 444.3 2.4 442.1 8.3 260.7 160.0 3.3 53.4 22.8 27.9 6792.8 70.1 285.3
.083(6) .064(1) .714(6) .299(6) .746(6) .681(4) .407(6) .696(6) .397(6) .640(6) .785(6) .056(6) .505(6) .168(6) 5.18 7E-7 3E-6
.114(4) .063(4) .754(4) .303(5) .751(4) .669(5) .427(4) .758(4) .430(4) .659(4) .874(5) .062(4) .508(4) .222(4) 4.18 8E-6 2E-6
.145(2) .063(4) .831(2) .318(2) .846(2) .743(2) .503(2) .818(2) .555(2) .781(2) .903(2) .087(1) .587(2) .265(1) 1.93 Base –
.114(4) .064(2) .754(4) .305(4) .751(4) .669(5) .427(4) .758(4) .430(4) .659(4) .884(3) .062(5) .508(4) .217(5) 4.03 2E-5 9E-6
.136(3) .064(3) .846(1) .357(1) .867(1) .78(1) .509(1) .826(1) .569(1) .815(1) .911(1) .087(2) .618(1) .224(3) 1.53 – Base
.159(1) .061(6) .760(3) .315(3) .791(3) .707(3) .492(3) .790(3) .531(3) .681(3) .882(4) .078(3) .533(3) .246(2) 2.84 4E-5 2E-4
approaches (i.e. WDkNN², LMNN³, DNet⁴, CCPDT⁵ and HDDT⁶) and data-oriented methods (i.e. safe-level-SMOTE). We note that since WDkNN has been demonstrated (in [9]) to be better than LPD, PW, A-NN and WDNN, in our experiments we include only WDkNN among them. CCPDT and HDDT are pruned by Fisher's exact test (as recommended in [5]). All experiments are carried out using 5×2-fold cross-validation, and the final results are the average of the repeated runs.
² We implement the CCW-based kNNs and WDkNN inside the Weka environment [17].
³ The code is obtained from www.cse.wustl.edu/~kilian/Downloads/LMNN.html
⁴ The code is obtained from www.cs.toronto.edu/~cuty/DNetkNN_code.zip
⁵ The code is obtained from www.cs.usyd.edu.au/~weiliu/CCPDT_src.zip
⁶ The code is obtained from www.nd.edu/~dial/software/hddt.tar.gz
Table 2. Performance of k NN weighting strategies when k = 11 Datasets
MI Appetency .033(8) Churn .101(7) Upselling .219(8) Ada.agnostic .641(9) Ada.prior .645(8) Sylva.agnostic .930(2) Sylva.prior .965(4) BrazilTourism .176(9) Marketing .112(10) Backache .311(7) BioMed .884(5) Schizo .632(6) Fbis .134(10) Re0 .715(3) Re1 .423(7) Tr12 .628(6) Tr23 .127(8) Arrhythmia .160(7) Balance .127(7) Cleveland .889(8) Cmc .346(9) Credit .888(7) Ecoli .943(3) German .535(7) Heart .873(7) Hepatitis .628(6) Hungarian .825(5) Ionosphere .919(4) Ipums .123(8) Pima .645(7) Primary .308(5) Average Rank 6.5 Friedman 2E-7 Friedman 0.011
CCWMI .037(4) .113(2) .243(5) .654(5) .669(2) .926(8) .965(2) .242(1) .157(2) .325(3) .885(3) .632(4) .145(5) .717(1) .484(1) .631(4) .156(3) .214(4) .130(5) .897(2) .383(2) .895(2) .948(1) .541(2) .876(4) .646(1) .832(1) .919(2) .138(4) .667(1) .314(2) 2.78 Base –
AI .036(6) .101(6) .218(9) .646(8) .654(7) .930(3) .965(6) .232(5) .113(9) .307(8) .858(7) .626(7) .135(9) .705(5) .434(6) .624(7) .123(10) .167(6) .145(2) .890(6) .357(7) .887(8) .938(5) .533(8) .873(8) .630(5) .823(7) .916(7) .123(7) .644(8) .271(8) 6.59 1E-6 4E-5
Area Under Precision-Recall Curve CCWAI SMOTE WDk NN LMNN DNet .043(1) .040(3) .036(5) .035(7) .042(2) .115(1) .108(4) .100(8) .107(5) .111(3) .241(6) .288(3) .212(10) .231(7) .264(4) .652(6) .689(3) .636(10) .648(7) .670(4) .668(3) .661(5) .639(9) .657(6) .664(4) .925(9) .928(6) .922(10) .928(4) .926(7) .965(4) .904(10) .974(1) .965(3) .935(9) .241(2) .233(4) .184(8) .209(6) .237(3) .161(1) .124(8) .150(3) .134(5) .142(4) .328(2) .317(6) .330(1) .318(5) .322(4) .844(8) .910(2) .911(1) .884(4) .877(6) .617(8) .561(10) .663(3) .632(5) .589(9) .141(6) .341(3) .136(8) .140(7) .241(4) .709(4) .695(7) .683(8) .716(2) .702(6) .475(4) .479(2) .343(8) .454(5) .477(3) .601(8) .585(10) .735(3) .629(5) .593(9) .156(3) .124(9) .128(7) .141(5) .140(6) .229(3) .083(10) .134(9) .187(5) .156(8) .149(1) .135(4) .091(9) .129(6) .142(3) .897(1) .889(7) .895(3) .893(5) .893(4) .384(1) .358(6) .341(10) .365(5) .371(4) .894(3) .891(5) .903(1) .891(6) .893(4) .941(4) .926(7) .920(8) .945(2) .933(6) .537(4) .536(6) .561(1) .538(3) .537(5) .876(5) .878(2) .883(1) .875(6) .877(3) .645(2) .625(8) .626(7) .637(3) .635(4) .831(2) .819(8) .826(4) .829(3) .825(6) .918(5) .916(7) .956(1) .919(3) .917(6) .140(2) .136(5) .170(1) .130(6) .138(3) .665(2) .657(4) .655(6) .656(5) .661(3) .279(7) .310(4) .347(1) .311(3) .294(6) 3.71 5.59 5.18 4.68 4.78 – 0.1060 0.002 2E-7 0.007 Base 0.1060 0.007 0.007 0.048
CCPDT .024(10) .092(10) .443(1) .723(1) .682(1) .934(1) .946(8) .152(10) .130(6) .227(9) .780(10) .807(2) .363(2) .573(9) .274(9) .946(1) .619(2) .346(2) .092(8) .806(10) .356(8) .871(9) .566(10) .493(9) .828(9) .458(9) .815(9) .894(9) .037(9) .587(10) .170(10) 6.68 0.019 0.019
HDDT .025(9) .099(9) .437(2) .691(2) .605(10) .928(5) .954(7) .199(7) .125(7) .154(10) .812(9) .846(1) .384(1) .540(10) .274(9) .946(1) .699(1) .385(1) .089(10) .846(9) .380(3) .868(10) .584(9) .464(10) .784(10) .413(10) .767(10) .891(10) .020(10) .613(9) .183(9) 6.9 0.007 0.007
We select 31 data sets from KDDCup'09⁷, the agnostic vs. prior competition⁸, StatLib⁹, text mining [15], and the UCI repository [16]. For multiple-label data sets, we keep the smallest label as the positive class, and combine all the other labels as the negative class. Details of the data sets are shown in Table 1. Besides the proportion of the minority class in a data set, we also present the coefficient of variation (CovVar) [18] to measure imbalance. CovVar is defined as the ratio of the standard deviation to the mean of the class counts in a data set. The metric of AUC-PR (area under the precision-recall curve) has been reported in [19] to be better than AUC-ROC (area under the ROC curve) on imbalanced data. A curve dominates in ROC space if and only if it dominates in PR space, and classifiers that are superior in terms of AUC-PR are definitely superior in terms of AUC-ROC, but not vice versa [19]. Hence we use the more informative metric of AUC-PR for classifier comparisons.
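CovVar, as defined above, is straightforward to compute from the class counts; a short sketch follows (whether the population or the sample standard deviation is intended is our assumption; the population form is used here).

```python
import statistics

def cov_var(class_counts):
    """Coefficient of variation of the class counts (CovVar), as described above."""
    return statistics.pstdev(class_counts) / statistics.mean(class_counts)

print(round(cov_var([880, 120]), 3))   # a highly imbalanced two-class example -> 0.76
```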
⁷ http://www.kddcup-orange.com/data.php
⁸ http://www.agnostic.inf.ethz.ch
⁹ http://lib.stat.cmu.edu/
[Figure 2 comprises six line plots of area under the PR curve against data set index, each comparing "Weighted by MI" with "Weighted by CCW": (a) Manhattan (k=1), (b) Euclidean (k=1), (c) Chebyshev (k=1), (d) Manhattan (k=11), (e) Euclidean (k=11), (f) Chebyshev (k=11).]
Fig. 2. Classification improvements from CCW on Manhattan distance (L1 norm), Euclidean distance (L2 norm) and Chebyshev distance (L∞ norm)
5.1 Comparisons among NN Algorithms
In this experiment we compare CCW with existing kNN algorithms using Euclidean distance and k = 1. When k = 1, all kNNs that use the same distance measure obviously make exactly the same prediction on a test instance. However, the effects of the CCW weights generate different probabilities of being positive/negative for each test instance, and hence produce different AUC-PR values. While there are various ways to compare classifiers across multiple data sets, we adopt the strategy proposed by [20] that evaluates classifiers by ranks. In Table 1 the kNN classifiers in comparison are ranked on each data set by the value of their AUC-PR, with a ranking of 1 being the best. We perform Friedman tests on the sequences of ranks between different classifiers. In Friedman tests, p-values that are lower than 0.05 reject, with 95% confidence, the hypothesis that the ranks of the classifiers in comparison are not statistically different. Numbers in parentheses in Table 1 are the ranks of classifiers on each data set, and a sign in the Friedman test rows suggests that the classifiers in comparison are significantly different. As we can see, both CCWMI and CCWAI (the "Base" classifiers) are significantly better than the existing methods NW, MI, AI and WDkNN.
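The rank-based comparison can be illustrated with a short sketch using SciPy's Friedman test; the toy AUC-PR values below are invented for illustration and are not taken from Table 1.

```python
import numpy as np
from scipy.stats import friedmanchisquare, rankdata

# auc_pr[d, m]: AUC-PR of method m on dataset d (toy numbers only)
auc_pr = np.array([[0.62, 0.58, 0.55],
                   [0.71, 0.69, 0.64],
                   [0.55, 0.57, 0.50],
                   [0.80, 0.74, 0.73]])

ranks = np.apply_along_axis(lambda row: rankdata(-row), 1, auc_pr)  # rank 1 = best on each dataset
print("average ranks:", ranks.mean(axis=0))
print("Friedman p-value:", friedmanchisquare(*auc_pr.T).pvalue)
```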
5.2 Comparisons among kNN Algorithms
In this experiment, we compare kNN algorithms for k > 1. Without loss of generality, we set a common value k = 11 for all kNN classifiers. As shown in Table 2, both CCWMI and CCWAI significantly outperform MI, AI, WDkNN, LMNN, DNet, CCPDT and HDDT. In the comparison with over-sampling techniques, we focus on MI equipped with the safe-level-SMOTE [3] method, shown as "SMOTE" in Table 2. The results we obtained from the CCW classifiers are comparable to (better than, but not significantly so at 95% confidence) the over-sampling technique. This observation suggests that by using CCW one can obtain results comparable to the cutting-edge sampling technique, so the extra computational cost of data sampling before training can be saved.
5.3 Effects of Distance Metrics
While in all previous experiments the kNN classifiers are performed under Euclidean distance (L2 norm), in this subsection we provide empirical results that demonstrate the superiority of the CCW methods under other distance metrics such as Manhattan distance (L1 norm) and Chebyshev distance (L∞ norm). Due to page limits, here we only present the comparisons of "CCWMI vs. MI". As we can see from Figure 2, CCWMI can improve MI on all three distance metrics.
6 Conclusions and Future Work
The main focus of this paper is on improving existing kNN algorithms and making them robust to imbalanced data sets. We have shown that conventional kNN algorithms are akin to using only the prior probabilities of the neighborhood of a test instance to estimate its class label, which leads to suboptimal performance when dealing with imbalanced data sets. We have proposed CCW, the likelihood of attribute values given a class label, to weight prototypes before taking them into effect. The use of CCW transforms the original kNN rule from using prior probabilities to their corresponding posteriors. We have shown that this transformation has the ability to correct the inherent bias towards the majority class in existing kNN algorithms. We have applied two methods (mixture modeling and Bayesian networks) to estimate the training instances' CCW weights, and their effectiveness is confirmed by synthetic examples and comprehensive experiments. When learning Bayesian networks, we construct network structures by applying the K2 algorithm, which has an overall time complexity of O(n²). In future work we plan to extend the idea of CCW to multiple-label classification problems. We also plan to explore the use of CCW in other supervised learning algorithms such as support vector machines.
References 1. Yang, Q., Wu, X.: 10 challenging problems in data mining research. International Journal of Information Technology and Decision Making 5(4), 597–604 (2006) 2. Chawla, N., Bowyer, K., Hall, L., Kegelmeyer, W.: SMOTE. Journal of Artificial Intelligence Research 16(1), 321–357 (2002) 3. Bunkhumpornpat, C., Sinapiromsaran, K., Lursinsap, C.: Safe-Level-SMOTE. In: Theeramunkong, T., Kijsirikul, B., Cercone, N., Ho, T.-B. (eds.) PAKDD 2009. LNCS, vol. 5476, pp. 475–482. Springer, Heidelberg (2009) 4. Cieslak, D., Chawla, N.: Learning Decision Trees for Unbalanced Data. In: Daelemans, W., Goethals, B., Morik, K. (eds.) ECML PKDD 2008, Part I. LNCS (LNAI), vol. 5211, pp. 241–256. Springer, Heidelberg (2008) 5. Liu, W., Chawla, S., Cieslak, D., Chawla, N.: A Robust Decision Tree Algorithms for Imbalanced Data Sets. In: Proceedings of the Tenth SIAM International Conference on Data Mining, pp. 766–777 (2010) 6. Wu, X., Kumar, V., Ross Quinlan, J., Ghosh, J., Yang, Q., Motoda, H., McLachlan, G., Ng, A., Liu, B., Yu, P., et al.: Top 10 algorithms in data mining. Knowledge and Information Systems 14(1), 1–37 (2008) 7. Weinberger, K., Saul, L.: Distance metric learning for large margin nearest neighbour classification. The Journal of Machine Learning Research 10, 207–244 (2009) 8. Min, R., Stanley, D.A., Yuan, Z., Bonner, A., Zhang, Z.: A deep non-linear feature mapping for large-margin knn classification. In: Proceedings of the 2009 Ninth IEEE International Conference on Data Mining, pp. 357–366 (2009) 9. Yang, T., Cao, L., Zhang, C.: A Novel Prototype Reduction Method for the K-Nearest Neighbor Algrithms with K ≥ 1. In: Zaki, M.J., Yu, J.X., Ravindran, B., Pudi, V. (eds.) PAKDD 2010. LNCS, vol. 6119, pp. 89–100. Springer, Heidelberg (2010) 10. Paredes, R., Vidal, E.: Learning prototypes and distances. Pattern Recognition 39(2), 180–188 (2006) 11. Paredes, R., Vidal, E.: Learning weighted metrics to minimize nearest-neighbor classification error. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1100–1110 (2006) 12. Wang, J., Neskovic, P., Cooper, L.: Improving nearest neighbor rule with a simple adaptive distance measure. Pattern Recognition Letters 28(2), 207–213 (2007) 13. Jahromi, M.Z., Parvinnia, E., John, R.: A method of learning weighted similarity function to improve the performance of nearest neighbor. Information Sciences 179(17), 2964–2973 (2009) 14. Cooper, G., Herskovits, E.: A Bayesian method for the induction of probablistic networks from data. Machine Learning 9(4), 309–347 (1992) 15. Han, E., Karypis, G.: Centroid-based document classification. In: Zighed, D.A., ˙ Komorowski, J., Zytkow, J.M. (eds.) PKDD 2000. LNCS (LNAI), vol. 1910, pp. 116–123. Springer, Heidelberg (2000) 16. Asuncion, A., Newman, D.: UCI Machine Learning Repository (2007) 17. Witten, I., Frank, E.: Data mining: practical machine learning tools and techniques with Java implementations. ACM SIGMOD Record 31(1), 76–77 (2002) 18. Hendricks, W., Robey, K.: The sampling distribution of the coefficient of variation. The Annals of Mathematical Statistics 7(3), 129–132 (1936) 19. Davis, J., Goadrich, M.: The relationship between precision-recall and roc curves. In: Proceedings of the 23rd International Conference on Machine Learning, pp. 233–240 (2006) 20. Demˇsar, J.: Statistical comparisons of classifiers over multiple data sets. The Journal of Machine Learning Research 7, 1–30 (2006)
Multi-agent Based Classification Using Argumentation from Experience
Maya Wardeh, Frans Coenen, Trevor Bench-Capon, and Adam Wyner
Department of Computer Science, The University of Liverpool, Liverpool L69 3BX, UK
{maya.wardeh,coenen,tbc,A.Z.Wyner}@liverpool.ac.uk
Abstract. An approach to multi-agent classification, using an Argumentation from Experience paradigm, is described, whereby individual agents argue for a given example to be classified with a particular label according to their local data. Arguments are expressed in the form of classification rules which are generated dynamically. The advocated argumentation process has been implemented in the PISA multi-agent framework, which is also described. Experiments indicate that the operation of PISA is comparable with other classification approaches and that it can be utilised for Ordinal Classification and Imbalanced Class problems.
Keywords: Classification, Argumentation, Multi-agent (Data Mining), Classification Association Rules.
1 Introduction
Argumentation is concerned with the dialogical reasoning processes required to arrive at a conclusion given two or more alternative viewpoints. The process of multi-agent argumentation is conceptualised as a discussion, about some issue that requires a solution, between a set of software agents with different points of view, where each agent attempts to persuade the others that its point of view, and the consequent solution, is the correct one. In this paper we propose applying argumentation to facilitate classification. In particular, it is argued that one model of argumentation, Arguing from Experience ([24,23]), is well suited to classification tasks. Arguing from Experience provides a computational model of argument based on inductive reasoning from past experience. The arguments are constructed dynamically using Classification Association Rule Mining (CARM) techniques. The setting is a "debate" about how to classify examples; the generated Classification Association Rules (CARs) provide reasons for and against particular classifications. The proposed model allows a number of agents to draw directly from past examples to find reasons for coming to a decision about the classification of an unseen instance. Agents formulate their arguments in the form of CARs generated from datasets of past examples. Each agent's dataset is considered to
Corresponding author.
encapsulate that agent's experience. The exchange of arguments between agents represents a dialogue which continues until an agent poses an argument for a particular classification that no other agent can refute. The model has been realised in the form of an argumentation framework called PISA: Pooling Information from Several Agents. The promoted argumentation-based approach is thus a multi-agent classification technique [5] that offers a number of practical advantages: (i) dynamic generation of classification rules in a just-in-time manner according to the requirements of each agent, (ii) easy-to-understand explanations, in the form of dialogues, concerning a particular classification, and (iii) application to ordinal classification and imbalanced class problems as well as standard classification. The approach also provides for a natural representation of agent "experience" as a set of records, and of the arguments as CARs. At the same time the advocated approach preserves the privacy of the information each agent knows, therefore it can be used with sensitive data. The rest of this paper is organised as follows. Section 2 provides an overview of the PISA framework. Section 3 details the nature of the Classification Association Rules (CARs) used in PISA. In Section 4 details and an empirical analysis are provided of three different applications of PISA to classification problems: (i) standard classification, (ii) ordinal classification and (iii) the imbalanced class problem. Finally, we conclude with a summary of the main findings and some suggestions for further work.
2 Argumentation-Based Multi Agent Classification: The PISA Framework
The intuition behind PISA is to provide a method whereby agents argue about a classification task. In effect each agent can be viewed as a dynamic classifier. The overall process thus leads to a reasoned consensus obtained through argumentation, rather than some other mechanism such as voting (e.g. [2]). It is suggested that this dialogue process increases the acceptability of the outcome to all parties. In this respect PISA can be said to be an ensemble-like method. Both theoretical and empirical research (e.g. [20]) has demonstrated that a good ensemble is one comprising individual classifiers that are relatively accurate but make their errors on different parts of the input training set. Two of the most-popular ensemble methods are: (i) Bagging [3] and (ii) Boosting [14]. Both techniques rely on varying the data to obtain different training sets for each of the classifiers in the ensemble. PISA is viewed as a bagging-like multi-agent ensemble, whereby the dataset is equally divided amongst a number of participants corresponding to the number of class values in the dataset. Each participant applies the same set of algorithms to mine CARs supporting their advocated class. To this end, each participant can be said to correspond to a single classifier. The argumentation process by which each participant advances moves to support its proposals corresponds to voting methods by which ensemble techniques assign class labels to input cases. But rather than simple voting, PISA applies an argumentation debate (dialogue). PISA also differs from Boosting techniques in that it does not generate a
sequence of classifiers; instead the desired classification is achieved through the collaborative operation of several classifiers. Furthermore, PISA classifies unseen records by (dynamically) producing a limited number of CARs sufficient to reach a decision without the need to produce the full set of CARs. The PISA framework comprises three key elements:
1. Participant Agents. A number of Participant Agents, at least one for each class in the discussion domain, such that each advocates one possible classification.
2. Chairperson. A neutral mediator agent which administers a variety of tasks aimed at facilitating PISA dialogues.
3. Set of CARs. The joint set of CARs exchanged in the course of one PISA dialogue. This set is represented by a central argument structure, called the Argumentation Tree, maintained by the Chairperson. Participant Agents have access to this tree and may use it to influence their choice of move. The agents consult the tree at the beginning of each round and decide which CAR they are going to attack. Full details of this data structure can be found in [24].
Once a dialogue has terminated, the status of the argumentation tree will indicate the 'winning' classification. Note that the dialogues produced by PISA also explain the resulting classifications. This feature is seen as an essential advantage offered by PISA. Each Participant Agent has its own distinct (tabular) local dataset relating to a classification problem (domain). These agents produce reasons for and against classifications by mining CARs from their datasets using a number of CARM algorithms (Section 3). The antecedent of every CAR represents a set of reasons for believing the consequent. In other words, given a CAR, P → c, this should be read as: P are reasons to believe that the case should classify as c. CARs are mined dynamically as required. The dynamic mining provides for four different types of move, each encapsulated by a distinct category of CAR. Each Participant Agent can employ any one of the following types of move to generate arguments: (i) Proposing moves, (ii) Attacking moves, and (iii) Refining moves. The different moves available are discussed further below. Note that each of these moves has a set of legal next moves (see Table 1).
Proposing Moves. There is only one kind of proposing move:
1. Propose Rule: Allows a new CAR, with a confidence higher than a given threshold, to be proposed. All PISA dialogues commence with a Propose Rule move.
Attacking Moves. Moves intended to show that a CAR proposed by some other agent should not be considered decisive with respect to the current instance. Two sub-types are available: (i) Distinguish and (ii) Counter Rule, as follows:
2. Distinguish: Allows an agent to add new attributes (premises) to a previously proposed CAR so that the confidence of the new rule is lower than the confidence threshold, thus rendering the original classification inadmissible.
Table 1. Legal next moves in PISA

Move  Label           Next Move
1     Propose Rule    2, 3
2     Distinguish     4, 1
3     Counter Rule    2, 1
4     Increase Conf   2, 3
3. Counter Rule: Similar to Propose Rule but used to cite a classification other than that advocated by the initial Propose Rule move.
Refining Moves. Moves that enable a CAR to be refined to meet a counter attack. For the purposes of using PISA as a classifier, one refining move is implemented:
4. Increase Confidence: Allows the addition of new attribute(s) to the premise associated with a previously proposed CAR so as to increase the confidence of the rule, thus increasing the confidence that the case should be classified as indicated.
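The legal-next-move relation of Table 1 can be encoded directly as a lookup table; the sketch below is illustrative only and does not reproduce the PISA implementation.

```python
# Legal next moves from Table 1.
LEGAL_NEXT = {
    1: {2, 3},   # Propose Rule   -> Distinguish, Counter Rule
    2: {4, 1},   # Distinguish    -> Increase Conf, Propose Rule
    3: {2, 1},   # Counter Rule   -> Distinguish, Propose Rule
    4: {2, 3},   # Increase Conf  -> Distinguish, Counter Rule
}

def is_legal(previous_move, next_move):
    return next_move in LEGAL_NEXT[previous_move]

print(is_legal(1, 2), is_legal(2, 3))   # True False
```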
3 PISA Dynamic CAR Mining
Having introduced, in the foregoing, the legal moves in PISA dialogues, the realisation of these moves is described in this section. The idea is to mine CARs according to: (i) a desired minimum confidence, (ii) a specified consequent and (iii) a set of candidate attributes for the antecedent (a subset of the attributes represented by the case under discussion). Standard CARM techniques (e.g.[7,16]) tend to generate the complete set of CARs represented in the input data. PISA on the other hand utilises a just in time approach to CARM, directed at generating particular subsets of CARs, and applied such that each agent mines appropriate CARs as needed. The mining process supports two different forms of dynamic ARM request: 1. Find a subset of rules that conform to a given set of constraints. 2. Distinguish a given rule by adding additional attributes. In order to realise the above, each Participant Agent utilises a T-tree [6] to summarise its local dataset. A T-tree is a reverse set enumeration tree structure where nodes are organised using reverse lexicographic ordering, which in turn enables direct indexing according to attribute number; therefore computational efficiency gains are achieved. A further advantage, with respect to PISA, is that the reverse ordering dictates that each sub-tree is rooted at a particular class attribute, and so all the attribute sets pertaining to a given class are contained in a single T-tree branch. This means that any one of the identified dynamic CARM requests need be directed at only one branch of the tree. This reduces the overall processing cost compared to other prefix tree structures (such as FP-Trees [16]). To further enhance the dynamic generation of CARs a set of algorithms that work directly on T-trees were developed. These algorithms were able to mine CARs satisfying different values of support threshold. At the start of the dialogue each player has an empty T-tree and slowly builds a partial
T-tree from their data set, as required, containing only the nodes representing attributes from the case under discussion plus the class attribute. Note that no node pruning, according to some user-specified threshold, takes place, except for nodes that have zero support. Two dynamic CAR retrieval algorithms were developed: (i) Algorithm A, which finds a rule that conforms to a given set of constraints, and (ii) Algorithm B, which distinguishes a given rule by adding additional attributes. Further details of these algorithms can be found in [25].
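Algorithms A and B themselves, and the T-tree structure, are not reproduced here. The following brute-force sketch only illustrates the kind of request Algorithm A answers (find a CAR for a given consequent that meets support and confidence constraints); the record representation and all names are assumptions for this example.

```python
from itertools import combinations

def find_car(records, labels, target_class, candidate_attrs, min_conf, min_sup):
    """Brute-force stand-in for a dynamic CAR request: return the first antecedent
    (smallest subsets of candidate_attrs first) whose CAR 'antecedent -> target_class'
    meets the support and confidence thresholds."""
    n = len(records)
    for size in range(1, len(candidate_attrs) + 1):
        for ante in combinations(candidate_attrs, size):
            covered = [lab for rec, lab in zip(records, labels) if set(ante) <= rec]
            if not covered:
                continue
            support = len(covered) / n
            confidence = covered.count(target_class) / len(covered)
            if support >= min_sup and confidence >= min_conf:
                return ante, confidence
    return None

# toy usage: records are sets of attribute identifiers
records = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"b"}]
labels = ["pos", "neg", "pos", "pos"]
print(find_car(records, labels, "pos", ["a", "b", "c"], min_conf=0.5, min_sup=0.25))
```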
4 Applications of PISA
Arguing from Experience enables PISA agents to undertake a number of different tasks, mainly:
1. Multi-agent Classification: Follows the hypothesis that the described operation of PISA produces at least comparable results to those obtained using traditional classification paradigms.
2. Ordinal Classification: Follows the hypothesis that PISA can be successfully applied to datasets with ordered classes, using a simple agreement strategy.
3. Classifying imbalanced data using dynamic coalitions: Follows the hypothesis that dynamic coalitions between a number of participant agents, representing rare classes, improve the performance of PISA with imbalanced multi-class datasets.
In this section the above applications of PISA are empirically evaluated. For the evaluation we used a number of real-world datasets drawn from the UCI repository [4]. Where appropriate, continuous values were discretised into ranges. The chosen datasets (Table 2) display a variety of characteristics with respect to number of records (R), number of classes (C) and number of attributes (A). Importantly, they include a diverse number of class labels, distributed in a different manner in each dataset (balanced and unbalanced), thus providing the desired variation in the experience assigned to individual PISA participants.
4.1 Application 1: PISA-Based Classification
The first application of PISA is in the context of multi-agent classification based on argumentation. In order to provide an empirical assessment of this application we ran a series of experiments designed to evaluate the hypothesis that PISA produces at least comparable results to those obtained using traditional classification paradigms, in particular ensemble classification methods. The results presented throughout this sub-section, unless otherwise noted, were obtained using Tenfold Cross Validation (TCV). For the purposes of running PISA, each training dataset was equally divided among a number of Participant Agents corresponding to the number of classes in the dataset. Then a number of PISA dialogues were executed to classify the cases in the test sets¹. In order to fully assess its operation, PISA was compared against a range of classification paradigms:
¹ For each evaluation the confidence threshold used by each participant was 50% and the support threshold 1%.
Table 2. Summary of data sets. Columns indicate: domain name, number of records, number of classes, number of attributes and class distribution (approximately balanced or not. Name Hepatitis HorseColic Cylnder Bands Pima (Diabetes) Mushrooms Iris Wine Lymphography Heart Dematology Zoo Glass Ecoli Led7 Chess
R C A Bal Name 155 2 19(56) no Ionosphere 368 2 27(85) no Congressional Voting 540 2 39(124) yes Breast 768 2 9(38) yes Tic-Tac-Toe 8124 2 23(90) yes Adult 150 3 4(19) yes Waveform 178 3 13(68) yes Connect4 148 4 18(59) no Car Evaluation 303 5 22(52) no Nursery 366 6 49(49) no Annealing 101 7 17(42) no Automobile (Auto) 214 7 10(48) no Page Blocks 336 8 8(34) no Solar Flare 3200 10 8(24) yes Pen Digits 28056 18 6(58) no
R C A Bal 351 2 34(157) no 435 2 17(34) yes 699 958 48842 5000 67557 1728 12960 898 205 5473 1389 10992
2 2 2 3 3 4 5 6 7 7 9 10
11(20) 9(29) 14(97) 22(101) 42(120) 7(25) 9(32) 38(73) 26(137) 11(46) 10(39) 17(89)
yes no no yes no no no no no no no yes
Table 3. Summary of the Ensemble Methods used. The implementation of these methods was obtained from [15]. (S=Support, RDT=Random Decision Trees) Ensemble Bagging-C4.5 ADABoost-C4.5
MultiBoostAB-C4.5 DECORATE
Technique Bagging[3]
Base C4.5 (S=1%) ADABoost.M1 C4.5 [14] (S=1%) MultiBoosting C4.5 [26] (S=1%) [19] C4.5 (S=1%)
Ensemble Bagging-RDT ADABoost-RDT MultiBoostABRDT
Technique Bagging[3]
Base RDT (S=1%) ADABoost.M1 RDT [14] (S=1%) MultiBoosting RDT [26] (S=1%)
1. Decision trees: Both C4.5, as implemented in [15], and the Random Decision Tree (RDT) [8] were used.
2. CARM: The TFPC (Total From Partial Classification) algorithm [7] was adopted because this algorithm utilises similar data structures [6] to PISA.
3. Ensemble classifiers: Table 3 summarises the techniques used. We chose to apply Boosting and Bagging, combined with decision trees, because previous work has demonstrated that such a combination is very effective (e.g. [2,20]).
For each of the included methods (and PISA) three values were calculated for each dataset: (i) classification error rate, (ii) Balanced Error Rate (BER), using a confusion matrix obtained from each TCV², and (iii) execution time.
² Balanced Error Rates (BER) were calculated, for each dataset, as follows:
\[ BER = \frac{1}{C} \sum_{i=1}^{C} \frac{F_{c_i}}{F_{c_i} + T_{c_i}} \]
where
C = the number of classes in the dataset, T_ci = the number of cases which are correctly classified as class ci, and F_ci = the number of cases which should have been classified as ci but were classified under a different class label.
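A small sketch of the BER computation from a confusion matrix is given below; the convention that rows are true classes and columns are predictions is our assumption.

```python
import numpy as np

def balanced_error_rate(conf):
    """BER from a C x C confusion matrix whose rows are true classes and columns predictions."""
    conf = np.asarray(conf, dtype=float)
    per_class_err = []
    for i in range(conf.shape[0]):
        t_ci = conf[i, i]                    # cases of class i classified correctly
        f_ci = conf[i].sum() - t_ci          # cases of class i classified as some other class
        per_class_err.append(f_ci / (f_ci + t_ci))
    return sum(per_class_err) / conf.shape[0]

print(balanced_error_rate([[40, 10], [5, 45]]))   # (10/50 + 5/50) / 2 = 0.15
```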
Table 4. Test set error rate (%). Values in bold are the lowest in a given dataset. Dataset Hepatitis Ionosphere HorseColic Congress CylBands Breast Pima TicTacToe Mushrooms Adult Iris Waveform Wine Connect4 Lympho Car Eval Heart Page Bloc Nursery Dematology Annealing Zoo Auto Glass Ecoli Flare Led7 Pen Digit Chess
PISA 13.33 3.33 2.78 1.78 15.00 3.91 14.47 2.84 0.41 14.49 2.67 2.16 1.18 5.01 6.23 4.11 5.05 2.24 6.37 4.96 9.55 9.90 12.00 14.69 5.17 6.09 12.00 2.75 9.13
Bagging C4.5 RDT
Ensembles ADABoost.M1 MultiBoost C4.5 RDT C4.5 RDT
Decision Trees Decorate
TFPC C4.5
RDT 23.23 2.57 3.89 0.00 36.48 5.07 16.18 20.77 0.06 13.09 8.00 2.42 0.00 4.31 25.00 5.90 4.67 6.94 3.72 5.28 1.67 0.00 17.00 29.91 8.79 8.03 24.25 1.08 18.58
18.06 7.69
14.84 6.84
15.48 7.12
21.29 10.83
13.55 6.27
18.71 10.83
16.13 7.41
16.13 8.55
3.01 42.22 5.01 27.21 7.20 0.00
2.31 27.04 4.86 25.26 5.43 0.00
2.08 42.22 4.86 25.26 2.19 0.00
3.01 34.81 4.86 23.83 20.35 0.00
2.08 42.22 4.86 25.13 2.19 0.00
3.01 34.81 4.86 24.87 20.35 0.00
2.77 39.81 5.43 25.66 5.85 0.00
4.16 42.22 4.86 26.69 15.45 0.00
4.67 17.97
5.33 11.98
6.00 21.48
7.33 21.48
6.00 13.62
7.33 11.96
4.67 21.48
4.00 21.48
18.92 4.51 20.07 6.93 2.08 4.10 1.22 7.92 15.12 27.10 13.99 2.48 24.81 4.47
19.59 1.24 19.73 6.93 3.09 3.55 0.67 4.95 15.61 21.50 15.18 3.41 24.16 1.35
14.86 2.43 22.79 7.02 0.38 3.83 0.45 3.96 14.15 22.43 16.37 3.41 24.84 1.58
29.73 6.25 21.09 6.93 3.09 15.30 1.78 19.80 21.46 29.91 24.70 3.41 24.28 2.51
15.54 2.60 19.05 7.02 0.35 3.28 0.56 3.96 15.61 25.23 14.88 3.41 24.91 5.07
29.73 6.25 19.73 6.93 3.09 15.30 1.78 19.80 21.46 29.91 24.70 3.10 24.34 1.87
19.59 4.28 20.41 6.93 1.91 1.64 1.34 6.93 16.10 29.91 13.10 3.10 24.75 2.51
22.97 5.09 19.05 7.02 2.62 6.01 1.56 7.92 18.05 33.18 15.77 2.48 24.84 5.65
18.00 14.29 22.78 9.30 30.37 10.00 25.92 33.68 1.05 19.19 6.00 33.32 25.29 34.17 24.29 30.00 46.67 9.95 22.25 25.00 11.80 8.00 29.00 33.81 37.27 14.74 31.03 18.24 15.73
These three values then provided the criteria for assessing and comparing the classification paradigms. The results are presented in Table 4. From the table it can be seen that PISA performs consistently well, outperforming the other association rule classifier and giving comparable results to the decision tree methods. Additionally, PISA produced results comparable to those produced by the ensemble methods. Moreover, PISA scored an average overall accuracy of 93.60%, higher than that obtained from any of the other methods tested (e.g. Bagging-RDT (89.48%) and RDT (90.24%))³. Table 5 shows the BER for each of the given datasets. From the table it can be seen that PISA produced reasonably good results overall, producing the best result in 14 out of the 39 datasets tested. Table 6 gives the execution times (in milliseconds) for each of the methods. Note that PISA is not the fastest method. However, the recorded performance is by no means the worst (for instance Decorate runs slower than PISA with respect to the majority of the datasets). Additionally, PISA seems to run faster than Bagging and ADABoost with some datasets.
4.2 Application 2: PISA-Based Ordinal Classification
Having established PISA as a classification paradigm, we now explore the application of PISA to ordinal classification. In this form of multi-class classification the set of class labels is finite and ordered, whereas traditional classification paradigms commonly assume that the class values are unordered. For many practical applications class labels do exhibit some form of order (e.g. the weather can be cold, mild,
These accuracies were calculated from Table 4.
Table 5. Test set BER (%). Values in bold are the lowest in a given dataset. Dataset Hepatitis Ionosphere HorseColic Congress CylBands Breast Pima TicTacToe Mushrooms Adult Iris Waveform Wine Connect4 Lympho Car Eval Heart Page Bloc Nursery Dematology Annealing Zoo Auto Glass Ecoli Flare Led7 Pen Digit Chess
PISA 12.00 4.58 2.80 2.35 14.50 4.75 13.94 2.14 0.59 8.80 2.90 3.93 1.42 11.90 15.95 8.21 8.25 9.45 5.47 8.49 16.13 13.23 12.26 16.09 16.18 17.18 11.84 3.47 9.63
Bagging C4.5 RDT
Ensembles ADABoost.M1 MultiBoost C4.5 RDT C4.5 RDT
Decision Trees Decorate
TFPC C4.5
RDT 38.19 2.19 3.71 0.00 34.56 4.71 24.47 22.94 0.06 39.89 7.96 2.39 0.00 5.33 47.11 10.67 9.82 21.48 5.74 3.25 4.43 0.00 13.57 29.56 9.42 7.59 24.36 3.73 16.38
27.41 7.08
20.63 6.63
23.37 6.42
33.69 11.43
19.89 5.31
25.05 11.43
24.60 7.08
23.38 8.17
3.43 46.10 6.03 28.88 6.71 0.00
2.66 24.48 6.20 26.86 5.35 0.00
2.27 46.10 6.20 26.86 2.25 0.00
3.19 35.63 6.20 25.18 22.46 0.00
2.27 46.10 6.07 26.72 2.25 0.00
2.27 35.63 6.20 26.12 22.46 0.00
3.05 40.14 6.71 27.16 5.25 0.00
4.61 18.00
5.29 11.99
5.93 21.51
7.32 21.51
5.93 13.64
7.32 11.97
4.69 21.51
4.69 46.10 6.20 28.34 16.98 0.00 17.75 3.96 21.51
30.12 11.25 9.16 21.46 4.14 4.68 6.76 12.78 11.43 24.57 36.66 12.77 24.56 4.48
9.74 6.77 8.97 22.85 2.28 3.89 3.92 10.71 15.98 19.42 40.18 12.66 24.07 1.51
25.90 4.79 9.93 27.89 1.05 3.93 2.57 10.71 10.55 24.33 41.18 10.91 24.87 1.59
43.43 10.43 7.98 21.46 5.82 19.41 4.31 36.51 18.92 29.56 51.89 12.66 24.31 2.23
38.66 5.29 7.85 27.87 0.76 3.31 3.25 10.71 15.98 23.18 37.92 12.66 25.09 4.92
43.43 10.43 8.97 21.46 5.78 19.41 4.31 36.51 18.92 29.56 51.89 12.62 24.23 1.89
39.35 10.24 9.43 22.85 4.55 1.84 7.16 15.71 12.84 29.56 24.16 12.62 24.92 2.23
35.97 16.55 7.97 27.89 5.98 7.00 6.83 17.50 17.04 37.98 43.35 12.54 24.72 5.57
36.44 13.41 28.63 9.71 32.78 12.89 33.67 47.44 1.04 6.07 33.35 24.05 66.67 16.09 75.00 48.02 19.89 40.10 61.67 33.51 17.14 19.60 48.55 23.23 14.74 31.39 18.38 24.53
warm and hot). Given ordered classes, one is not only concerned to maximise the classification accuracy, but also to minimise the distances between the actual and the predicted classes. The problem of ordinal classification is often solved by either multi-class classification or regression methods. However, some new approaches, tailored specifically for ordinal classification, have been introduced in the literature (e.g. [13,22]). PISA can be utilised for ordinal classification by means of biased agreement. Agents in PISA have the option to agree with CARs suggested by other agents, by not attacking these rules, even if a valid attack is possible. PISA agents can either agree with all the opponents or with a pre-defined set of opponents that match the class order. For instance, in the weather scenario, agents supporting the decision that the weather is hot agree with those of the opinion that the weather is warm, and vice versa; whereas agents supporting that the weather is cold or mild agree with each other. We refer to the latter form of agreement by the term biased agreement, in which the agents are equipped with a simple list of the class labels that they could agree with (the agreement list). Here, we have two forms of this mode of agreement (a sketch of the first is given after this list):
1. No Attack Biased Agreement (NA-BIA): In which agents consult their agreement list before mining any rules from their local datasets and attempt only to attack/respond to CARs of the following shape: P → Q : ∀q ∈ Q, q ∉ agreement list.
2. Confidence Threshold Biased Agreement (CT-BIA): Here, if the agents fail to attack any CARs that contradict their agreement list, then they try to attack CARs (P → Q : ∃q ∈ Q, q ∈ agreement list) if and only if they fail to mine a matching CAR, with the same or higher confidence, from their own local dataset (P → Q′ : Q′ ⊇ Q).
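A minimal sketch of the NA-BIA attack test described in item 1 is shown below; the function name and the weather-scenario agreement list are illustrative assumptions only.

```python
def should_attack_na_bia(proposed_classes, agreement_list):
    """No Attack Biased Agreement: only attack a CAR P -> Q whose consequent contains
    no class from the agent's agreement list."""
    return not any(q in agreement_list for q in proposed_classes)

# toy usage for the weather scenario described above
hot_agent_agreement = {"warm"}
print(should_attack_na_bia({"warm"}, hot_agent_agreement))   # False: agree, do not attack
print(should_attack_na_bia({"cold"}, hot_agent_agreement))   # True: may attack
```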
Table 6. Test set execution times (milliseconds). Values in bold are the lowest in a given dataset. PISA is compared against ensemble methods (Bagging, AdaBoost.M1 and MultiBoost, each with C4.5 and RDT base classifiers) and decision-tree methods (Decorate, TFPC, C4.5 and RDT) on 29 datasets: Hepatitis, Ionosphere, HorseColic, Congress, CylBands, Breast, Pima, TicTacToe, Mushrooms, Adult, Iris, Waveform, Wine, Connect4, Lympho, Car Eval, Heart, Page Bloc, Nursery, Dermatology, Annealing, Zoo, Auto, Glass, Ecoli, Flare, Pen Digits, Led7, Chess. The PISA column, in dataset order, is: 115, 437, 17, 34, 83, 31, 75, 71, 313, 3019, 42, 1243, 136, 4710, 15, 74, 343, 159, 965, 194, 750, 43, 210, 180, 139, 239, 1345, 78, 2412. (The timing columns of the competitor classifiers could not be realigned from the extracted text.)
Table 7. The application of PISA with datasets from Table 2 with ordered classes

              ER                      BER                      MSE                        MAE
Dataset       PISA  CT-BIA  NA-BIA    PISA   CT-BIA  NA-BIA    PISA   CT-BIA  NA-BIA      PISA  CT-BIA  NA-BIA
Lympho        6.21  4.76    3.38      15.95  20.73   13.94     0.199  0.046   0.015       2.07  1.36    0.84
Car Eval      4.11  5.00    4.03      9.53   10.09   10.61     0.863  1.220   0.708       1.02  1.32    1.01
Page Bloc     2.67  3.64    3.91      13.43  10.42   10.06     1.250  5.164   4.757       0.49  0.78    0.83
Nursery       6.37  6.27    5.83      11.79  13.57   7.88      7.450  7.071   6.725       1.61  1.57    1.46
Dermatology   4.96  7.95    6.87      8.49   8.74    7.53      0.144  0.143   0.100       1.46  1.37    1.24
Zoo           9.90  7.92    6.86      13.23  14.67   12.17     0.223  0.230   0.232       2.26  2.26    1.96
Ecoli         6.03  5.52    4.34      16.81  6.72    6.91      0.008  0.008   0.005       8.23  7.92    4.63
To test the hypothesis that the above approach improves the performance of PISA when applied to ordinal classification, a series of TCV tests was conducted using a number of datasets from Table 2 which have ordered classes. PISA was run using the NA-BIA and CT-BIA strategies, and the results were compared against the use of PISA without any agreement strategy. Additionally, to provide a better comparison, the Mean Squared Error (MSE) and the Mean Absolute Error (MAE) rates for the included datasets and methods were calculated. [11] notes that little attention has been directed at the evaluation of ordinal classification solutions, and that simple measures, such as accuracy, are not sufficient. In [11] a number of evaluation metrics for ordinal classification are compared. As a result, MSE is suggested as the best metric when more (smaller) errors are preferred so as to reduce the number of large errors, while MAE is a good metric if, overall, fewer errors are preferred with more tolerance for large errors. Table 7 provides a summary of the results of the experiments. From the table it can be seen that NA-BIA produces the better results on datasets with ordinal classes.
4.3 Application 3: PISA-Based Solution to the Imbalanced Class Problem
Another application of PISA is the use of dynamic coalitions between different agents to produce better performance in the face of the imbalanced class problem. It has been observed (e.g. [17]) that class imbalance (i.e. a significant difference in class prior probabilities) may produce an important deterioration of the performance achieved by existing learning and classification systems. This situation is often found in real-world data describing an infrequent but important case (e.g. some of the datasets in Table 2). There have been a number of proposed mechanisms for dealing with the class imbalance problem (e.g. [10,21]). [12,17] note a number of different approaches:
1. Changing class distributions: by "upsizing" the small class at random (or focused random), or by "downsizing" the large class at random (or focused random).
2. At the classifier level, by either manipulating classifiers internally, cost-sensitive learning or one-class learning.
3. Specially designed ensemble learning methods.
4. Agent-based remedies such as that proposed in [18], where three agents, each using a different classification paradigm, generate classifiers from a filtered version of the training data. Individual predictions are then combined according to a voting scheme. The intuition is that models generated using different learning biases are more likely to make errors in different ways.
In the following we present a refinement of the basic PISA model which enables PISA to tackle the imbalanced-class problem in multi-class datasets, using dynamic coalitions between agents representing the rare classes. Unlike the biased agreement approach (Sub-section 4.2), a coalition requires mutual agreement among a number of participants, thus a preparation step is necessary. However, for the purposes of this paper we assume that the agents representing the rare classes are in coalition from the start of the dialogue, thus eliminating the need for a preparatory step. The agents in a coalition stop attacking each other, and only attack CARs placed by agents outside the coalition. The objective of such a coalition is to attempt to remove the agents representing the dominant class(es) from the dialogue, or at least to do so for a pre-defined number of rounds. Once the agent in question is removed from the dialogue, the coalition is dismantled and the agents go on attacking each other as in a normal PISA dialogue. In the following we provide an experimental analysis of two coalition techniques (a simple sketch of the dismantling rules is given after the list):
1. Coalition (1): The coalition is dismantled if the agent supporting the dominant class does not participate in the dialogue for two consecutive rounds.
2. Coalition (2): The coalition is dismantled if the agent supporting the dominant class does not participate in the dialogue for two consecutive rounds, and this agent is not allowed to take any further part in the dialogue.
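A minimal sketch of the two dismantling rules is given below; the round-history bookkeeping and the return convention are assumptions made purely for illustration and do not reproduce the actual PISA implementation.

def update_coalition(coalition, dominant_agent, history, strategy, quiet_rounds=2):
    """Dismantle the coalition once the dominant-class agent has been silent for
    `quiet_rounds` consecutive rounds. `history` is assumed to map round number
    (in insertion order) to the set of agents that placed an argument in that round."""
    recent = list(history.values())[-quiet_rounds:]
    silent = len(recent) == quiet_rounds and all(dominant_agent not in r for r in recent)
    if not silent:
        return coalition, False              # keep the coalition intact
    if strategy == 1:                        # Coalition (1): normal dialogue resumes
        return set(), False
    if strategy == 2:                        # Coalition (2): dominant agent is also barred
        return set(), True                   # second flag: exclude the dominant agent
    raise ValueError("unknown coalition strategy")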
Table 8. The application of PISA with imbalanced multi-class datasets from Table 2

              ER                       BER                      G-Mean                     Time (ms)
Dataset       PISA   Coal(1)  Coal(2)  PISA   Coal(1)  Coal(2)  PISA   Coal(1)  Coal(2)    PISA  Coal(1)  Coal(2)
Connect4      5.02   4.18     3.78     11.90  9.68     8.70     87.47  91.00    89.96      4710  5376     5818
Lympho        6.21   5.02     4.03     15.95  11.90    14.64    69.31  92.81    82.60      15    65       55
Car Eval      4.11   3.73     4.22     9.53   7.24     4.47     79.42  92.52    88.40      74    163      158
Heart         5.05   4.95     4.95     8.25   2.54     3.17     84.44  89.97    87.67      343   531      612
Page Bloc     2.24   1.43     1.14     13.43  7.96     9.63     68.17  84.02    85.43      159   207      222
Dermatology   4.96   3.91     3.60     8.49   4.95     4.48     75.79  90.14    84.27      194   119      107
Annealing     9.55   4.24     4.01     16.13  7.72     4.24     63.57  91.52    86.20      750   980      881
Zoo           9.90   8.00     7.00     13.23  8.33     3.92     67.19  85.51    85.42      43    93       85
Auto          12.00  6.37     5.77     12.26  6.53     6.64     79.74  90.87    87.88      210   336      293
Glass         14.69  12.02    5.74     16.09  7.45     5.81     80.12  93.24    93.60      180   178      171
Ecoli         6.03   5.15     5.64     16.18  10.93    3.92     74.16  96.01    87.31      139   86       81
Flare         6.09   7.10     6.86     17.18  5.58     5.15     77.41  95.76    91.21      2393  2291     6267
Chess         9.13   8.47     6.28     9.63   5.91     5.82     76.70  92.22    91.26      2412  3305     3393
To test the hypothesis that the above approaches improve the performance of PISA when applied to imbalanced class datasets, we ran a series of TCV tests using a number of datasets from Table 2 which have imbalanced class distributions. The results were compared against the use of PISA without any coalition strategy. Four measures were used in this comparison: error rate, balanced error rate, time and geometric mean (g-mean)⁴. This last measure was used to quantify the classifier performance in the presence of class imbalance [1]. Table 8 provides the results of the above experiment. From the table it can be seen that both coalition techniques boost the performance of PISA on imbalanced-class datasets, with very little additional cost in time, the overhead being due to the time needed to manage and dismantle the coalitions.
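For reference, the geometric mean defined in footnote 4 can be computed directly from the per-class accuracies; the example values in the sketch below are invented for illustration only.

def g_mean(per_class_accuracy):
    """Geometric mean of the per-class accuracies p_ii (footnote 4)."""
    prod = 1.0
    for p in per_class_accuracy:
        prod *= p
    return prod ** (1.0 / len(per_class_accuracy))

# e.g. three classes recognised with 95%, 80% and 60% accuracy:
print(100 * g_mean([0.95, 0.80, 0.60]))   # roughly 77 (as a percentage)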
5 Conclusions
The PISA Arguing from Experience Framework has been described. PISA allows a collection of agents to conduct a dialogue concerning the classification of an example. The system progresses in a round-by-round manner. During each round agents can elect to propose an argument advocating their own position or attack another agent’s position. The arguments are mined and expressed in the form of CARs, which are viewed as generalisations of the individual agent’s experience. In the context of classification PISA provides for a “distributed” classification mechanism that harnesses all the advantages offered by Multi-agent Systems. The effectiveness of PISA is comparable with that of other classification paradigms. Furthermore the PISA approach to classification can operate with temporally evolving data. We have also demonstrated that PISA can be utilised to produce better performance with imbalanced classes and ordinal classification problems.
⁴ The geometric mean is defined as g-mean = (∏_{i=1}^{C} p_ii)^(1/C), where p_ii is the class accuracy of class i, and C is the number of classes in the dataset.

References
1. Alejo, R., Garcia, V., Sotoca, J., Mollineda, R., Sanchez, J.: Improving the Performance of the RBF Neural Networks with Imbalanced Samples. In: Proc. 9th Intl. Conf. on Artl. Neural Networks, pp. 162–169. Springer, Heidelberg (2007)
2. Bauer, E., Kohavi, R.: An empirical comparison of voting classification algorithms: Bagging, Boosting and variants. J. Machine Learning 36, 105–139 (1999)
3. Breiman, L.: Bagging predictors. J. Machine Learning 24, 123–140 (1996)
4. Blake, C.L., Merz, C.J.: UCI Repository of machine learning databases. University of California (1998), http://www.ics.uci.edu/~mlearn/MLRepository.html
5. Cao, L., Gorodetsky, V., Mitkas, P.: Agent Mining: The Synergy of Agents and Data Mining. IEEE Intelligent Systems 24(3), 64–72 (2009)
6. Coenen, F., Leng, P., Ahmed, S.: Data structure for association rule mining: T-trees and p-trees. IEEE Trans. Knowl. Data Eng. 16(6), 774–778 (2004)
7. Coenen, F., Leng, P.: Obtaining Best Parameter Values for Accurate Classification. In: Proc. ICDM 2005, pp. 597–600. IEEE, Los Alamitos (2005)
8. Coenen, F.: The LUCS-KDD Decision Tree Classifier Software. Dept. of Computer Science, The University of Liverpool, UK (2007), http://www.csc.liv.ac.uk/~frans/KDD/Software/DecisionTrees/decisionTree.html
9. Dietterich, T.: Ensemble methods in machine learning. In: Kittler, J., Roli, F. (eds.) MCS 2000. LNCS, vol. 1857, pp. 1–15. Springer, Heidelberg (2000)
10. Elkan, C.: The Foundations of Cost-Sensitive Learning. In: Proc. IJCAI 2001, vol. 2, pp. 973–978 (2001)
11. Gaudette, L., Japkowicz, N.: Evaluation Methods for Ordinal Classification. In: Gao, Y., Japkowicz, N. (eds.) AI 2009. LNCS, vol. 5549, pp. 207–210. Springer, Heidelberg (2009)
12. Guo, X., Yin, Y., Dong, C., Zhou, G.: On the Class Imbalance Problem. In: Proc. ICNC 2008, pp. 192–201. IEEE, Los Alamitos (2008)
13. Frank, E., Hall, M.: A simple approach to ordinal classification. In: Flach, P.A., De Raedt, L. (eds.) ECML 2001. LNCS (LNAI), vol. 2167, pp. 145–157. Springer, Heidelberg (2001)
14. Freund, Y., Schapire, R.: Experiments with a new boosting algorithm. In: Proc. ICML 1996, pp. 148–156 (1996)
15. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The WEKA Data Mining Software: An Update. SIGKDD Explorations 11(1) (2009)
16. Han, J., Pei, J., Yiwen, Y.: Mining Frequent Patterns Without Candidate Generation. In: Proc. SIGMOD 2000, pp. 1–12. ACM Press, New York (2000)
17. Japkowicz, N., Stephen, S.: The Class Imbalance Problem: A systematic study. Intelligent Data Analysis 6(5), 429–449 (2002)
18. Kotsiantis, S., Pintelas, P.: Mixture of Expert Agents for Handling Imbalanced Data Sets. Annals of Mathematics, Computing & TeleInformatics 1, 46–55 (2003)
19. Melville, P., Mooney, R.: Constructing Diverse Classifier Ensembles Using Artificial Training Examples. In: Proc. IJCAI 2003, pp. 505–510 (2003)
20. Opitz, D., Maclin, R.: Popular Ensemble Methods: An Empirical Study. J. Artif. Intell. Research 11, 169–198 (1999)
21. Philippe, L., Lallich, S., Do, T., Pham, N.: A comparison of different off-centered entropies to deal with class imbalance for decision trees. In: Washio, T., Suzuki, E., Ting, K.M., Inokuchi, A. (eds.) PAKDD 2008. LNCS (LNAI), vol. 5012, pp. 634–643. Springer, Heidelberg (2008)
22. Blaszczynski, J., Slowinski, R., Szelag, M.: Probabilistic Rough Set Approaches to Ordinal Classification with Monotonicity Constraints. In: Hüllermeier, E., Kruse, R., Hoffmann, F. (eds.) IPMU 2010. LNCS, vol. 6178, pp. 99–108. Springer, Heidelberg (2010)
23. Wardeh, M., Bench-Capon, T., Coenen, F.: Multi-Party Argument from Experience. In: McBurney, P., Rahwan, I., Parsons, S., Maudet, N. (eds.) ArgMAS 2009. LNCS, vol. 6057. Springer, Heidelberg (2010)
24. Wardeh, M., Bench-Capon, T., Coenen, F.: Arguments from Experience: The PADUA Protocol. In: Proc. COMMA 2008, Toulouse, France, pp. 405–416. IOS Press, Amsterdam (2008)
25. Wardeh, M., Bench-Capon, T., Coenen, F.: Dynamic Rule Mining for Argumentation Based Systems. In: Proc. 27th SGAI Intl. Conf. on AI (AI 2007), pp. 65–78. Springer, London (2007)
26. Webb, G.: MultiBoosting: A Technique for Combining Boosting and Wagging. J. Machine Learning 40(2), 159–196 (2000)
Agent-Based Subspace Clustering Chao Luo1 , Yanchang Zhao2 , Dan Luo1 , Chengqi Zhang1 , and Wei Cao3 1 Data Sciences and Knowledge Discovery Lab Centre for Quantum Computation and Intelligent Systems Faculty of Engineering & IT, University of Technology, Sydney, Australia {chaoluo,dluo,chengqi}@it.uts.edu.au 2 Data Mining Team, Centrelink, Australia
[email protected] 3 Hefei University of Technology, China
[email protected]
Abstract. This paper presents an agent-based algorithm for discovering subspace clusters in high dimensional data. Each data object is represented by an agent, and the agents move from one local environment to another to find optimal clusters in subspaces. Heuristic rules and objective functions are defined to guide the movements of agents, so that similar agents (data objects) go into one group. The experimental results show that our proposed agent-based subspace clustering algorithm performs better than existing subspace clustering methods on both F1 measure and Entropy. The running time of our algorithm is scalable with the size and dimensionality of the data. Furthermore, an application in stock market surveillance demonstrates its effectiveness in real-world applications.
1 Introduction
As an extension of traditional full-dimensional clustering, subspace clustering seeks to find clusters in subspaces of high-dimensional data. Subspace clustering approaches can provide fast search in different subspaces, so as to find clusters hidden in subspaces of the full dimensional space. The interpretability of the results is highly desirable in data mining applications. As a basic approach, the clustering results should be easily utilized by other methods, such as visualization techniques. In the last decade, subspace clustering has been researched widely. However, there are still some issues in this area. Some subspace clustering methods, such as CLIQUE [3], produce only overlapping clusterings, where one data point can belong to several clusters. This makes the clusters fail to provide a clear description of the data. In addition, most subspace clustering methods generate low quality clusters. In order to obtain high quality subspace clustering, we design a model of Agent-based Clustering on Subspaces (ACS). By simulating the actions and interactions of autonomous agents with a view to assessing their effects on the system as a whole, agent-based subspace clustering can result in far more complex and interesting clusterings. The clusters obtained can provide a natural description of the data. Agent-based subspace clustering is a powerful clustering modeling technique that can be applied to real business problems.
This paper is organized as follows. Section 2 gives the background and related work of subspace clustering. Section 3 presents our model of agent-based subspace clustering. The experimental results and evaluation are given in Section 4. An application on market manipulation is also provided in Section 4. We conclude the research in Section 5.
2 Background and Related Works
The goal of clustering is to group a given set of data points into clusters such that all data within a cluster are similar to each other. However, with the increase of dimensionality, traditional clustering methods are called into question. For example, traditional distance measures fail to differentiate the nearest neighbor from the farthest point in very high-dimensional spaces. Subspace clustering was introduced in order to solve this problem and to identify the similarity of data points in subspaces. According to the search strategy, there are two main approaches to subspace clustering: bottom-up search methods and top-down search methods. The bottom-up search methods, such as CLIQUE [3], ENCLUS [5] and DOC [9], take advantage of the downward closure property of density to reduce the search space by using an APRIORI-style approach. Candidate high-dimensional subspaces are generated from low dimensional subspaces which contain dense units. The search stops when no candidate subspaces are generated. The top-down subspace clustering approaches, such as PROCLUS [1], FINDIT and COSA, start by finding an initial approximation of the clusters with equal weight on all dimensions. Then each dimension is assigned a weight for each cluster. The updated weights are then used in the next iteration to update the clusters. Most top-down methods use sampling techniques to improve their performance. The CLIQUE algorithm [3] is one of the first subspace clustering algorithms. The algorithm combines density- and grid-based clustering. In CLIQUE, grid cells are defined by a fixed grid splitting each dimension into ξ equal-width cells. Dense units are those cells with more than τ data points. CLIQUE uses an apriori-style search technique to find dense units. A cluster in CLIQUE is defined as a connection of dense units. The hyperrectangular clusters are then defined by a Disjunctive Normal Form (DNF) expression. Clusters generated by CLIQUE may be found in overlapping subspaces, which means that each data point can belong to more than one cluster. In order to obtain an effective hard clustering where clusters do not overlap with each other, we utilize agent-based modeling to find subspace clusters. The agent-based clustering approach is able to provide a natural description of a data clustering. Flexibility is another advantage of agent-based clustering. There are two main categories of agent-based clustering methods: multi-agent clustering and biologically inspired clustering. Ogston et al. presented a method of clustering within a fully decentralized multi-agent system [8]. In their system, each object is considered as an agent and agents try to form groups among themselves. Agents are connected in a random network and the agents search in a peer-to-peer fashion for other similar agents. Usually, the network is complex, which limits the use of this approach.
Biologically inspired clustering algorithms have been studied by many researchers. Ant colonies, flocks of birds, swarms of bees, etc., are agent-based models inspired by nature that have been used in many applications [10]. In these methods, the bio-inspired agents can change their environment locally. They have the ability to accomplish tasks that cannot be achieved by a single agent. Although many agent-based clustering methods have been proposed, no work has yet been reported on subspace clustering with an agent-based approach. In the next section, we show how agent-based methods can be applied to subspace clustering.
3 Agent-Based Subspace Clustering
3.1 Problem Statement
Let S = {S1, S2, . . . , Sd} be a set of dimensions, and S = S1 × S2 × . . . × Sd a d-dimensional numerical or categorical space. The input consists of a set of d-dimensional points V = {v1, v2, . . . , vm}, where vi = (vi1, vi2, . . . , vid). vij, the jth component of vi, is drawn from dimension Sj. Table 1 is a simple example of data. The columns are dimensions S = {S1, S2, . . . , S6}, and the rows are data points V = {v1, v2, v3, v4, v5}. In this example, a numerical data set is used to show the discretization process. After discretization, clustering on numerical data is similar to clustering on categorical data.

Table 1. A simple example of data points

       S1   S2   S3   S4   S5   S6
v1      1   14   23   12    4   21
v2      2   12   22   13   13    4
v3      1   23   23   11   12    2
v4     25    4   14   13   11    2
v5     23    2   12    1   23    2

Table 2. Data after discretization

       S1   S2   S3   S4   S5   S6
v1      1    2    3    2    1    3
v2      1    2    3    2    2    1
v3      1    2    3    2    2    1
v4      3    1    2    2    2    1
v5      3    1    2    1    3    1
The expected output is a hard clustering C = {C1, C2, . . . , Ck}. C is a partitioning of the input data set V. C1, C2, . . . , Ck are disjoint clusters, and Σ_i |Ci| = m. Let Ci.dimensions stand for the dimensions of cluster Ci, with Ci.dimensions ⊆ S, ∀Ci ∈ C. Now, the question is what is the best clustering C? Based on the goal of hard clustering, a larger cluster size |Ci| and a higher dimensionality |Ci.dimensions| are preferred. However, there is a conflict between them. For example, assume that one cluster C1 has |C1| = 500 and |C1.dimensions| = 2 while another cluster C2 has |C2| = 100 and |C2.dimensions| = 10. Which cluster is "better" or preferred? In order to balance the two preferred choices, we define a measure to evaluate the quality of a clustering with respect to both the cluster size |Ci| and the dimensionality |Ci.dimensions| as follows:

M(C) = Σ_i (|Ci|)² × |Ci.dimensions|,  ∀Ci ∈ C    (1)
The clusters C with optimized M(C) will have a large data size and a large dimensionality at the same time.
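Equation (1) can be transcribed directly; the sketch below assumes, purely for illustration, that a cluster is represented as a (points, dimensions) pair, and it reproduces the example from the text.

def clustering_quality(clusters):
    """M(C) = sum over clusters of |Ci|^2 * |Ci.dimensions| (Equation 1)."""
    return sum(len(points) ** 2 * len(dims) for points, dims in clusters)

# The example from the text: |C1| = 500 on 2 dimensions versus |C2| = 100 on 10 dimensions.
c1 = (range(500), ("S1", "S2"))
c2 = (range(100), ("S1", "S2", "S3", "S4", "S5", "S6", "S7", "S8", "S9", "S10"))
print(clustering_quality([c1]), clustering_quality([c2]))   # 500000 vs 100000, so C1 scores higher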
3.2 Agent-Based Subspace Clustering
In this section, we describe the design of our agent-based subspace clustering approach and explain how to implement the tasks of subspace clustering defined above. Firstly, we briefly present the model of agent-based subspace clustering. In the agent-based model, there is a set of agents and each agent represents a data point. The agents move from one local environment to another; we call the local environments bins. The movement of agents is guided by heuristic rules. There is a global environment that determines when agents stop moving. In this way, an optimized clustering is obtained as a whole. To sum up, the complex subspace clustering is achieved by using the simple behaviors of agents under the guidance of heuristic rules [4,6]. The key components of agent-based subspace clustering are: agents, the local environment bins and the global environment. In order to explain the details of the agent-based subspace clustering model, we take the data set in Table 1 as a simple example.
– Let A = {a1, a2, . . . , am} represent agents. For example, agents A = {a1, a2, a3, a4, a5} represent the data in Table 1.
– Let B = {B1, B2, . . . , Bn} be a set of bins. The bins B are the local environment of the agents; each bin Bi (Bi ∈ B) contains a number of agents. We refer to Bi.agents as the agents contained by Bi. For each agent aj in Bi.agents, we say that agent aj belongs to bin Bi. Bin Bi has a property Bi.dimensions, which denotes the subspace under which Bi.agents are similar to each other. In this model, we choose CLIQUE as the method to generate the bins B. The first step of CLIQUE is to discretize the numeric values into intervals. Table 2 shows the agents after discretization with ξ = 3, which is the number of levels in each dimension. The intervals on different dimensions form units. CLIQUE first finds all dense units as the basic elements of clusters. Then the connected dense units are treated as final clusters. Figure 1 is an example of the result of CLIQUE with τ = 0.8 and ξ = 3 on the data in Table 1. There are two clusters, on subspaces S4 and S6 respectively. However, this clustering is unable to satisfy the definition of hard clustering. In our model, the groups generated by CLIQUE are treated as bins B and used as the input to our model to generate higher quality clusters.
The global environment is an important component of an agent-based model. We define an objective function as the global environment based on Equation (1). In our model, the local environment bins B are optimized through the movement of the agents A. When the objective function M(B) in Equation (2) reaches its maximal value, agents stop moving. The bins B are then fully optimized and are treated as the final clusters C.

M(B) = Σ_i |Bi.dimensions| × (|Bi.agents|)²,  ∀Bi ∈ B    (2)
Fig. 1. An example of CLIQUE clustering
Some simple rules are defined to make sure that M(B) can be optimized by the movements of the agents A. The movement of agents is a parallel, decentralized process. Initially, each agent ai (ai ∈ A) randomly chooses a bin Bj (Bj ∈ B ∧ ai ∈ Bj) it belongs to as its local environment. In the next loop, agent ai randomly chooses another bin Bk (k ≠ j ∧ Bk ∈ B) as the destination of the movement. ΔM(Bj) and ΔM(Bk) measure the changes in Bj and Bk with respect to the global objective function M(B). move(ai) in Equation (5) indicates the influence of the movement on M(B). If move(ai) is evaluated as positive, the agent will move from its bin Bj to the destination Bk. Otherwise, agent ai stays in Bj.

ΔM(Bj) = ((|Bj.agents| − 1)² × |Bj.dimensions|) − (|Bj.agents|² × |Bj.dimensions|)    (3)
ΔM(Bk) = ((|Bk.agents| + 1)² × |Bk.dimensions|) − (|Bk.agents|² × |Bk.dimensions|)    (4)
move(ai) = ΔM(Bk) + ΔM(Bj)    (5)
When the bins B are generated by CLIQUE, each agent ai may be contained in multiple bins; these are called the preferred bins of ai, denoted ai.bins. Therefore, ai ∈ Bj, ∀Bj ∈ ai.bins. In order to improve the efficiency of movement, we only allow agent ai to move among its preferred bins ai.bins. When the objective function M(B) reaches its maximal value, the movements stop and the final clustering C is obtained. Figure 2 is an example of the final clustering on the example data in Table 1. Two clusters are generated: one cluster contains points v1, v2 and v3 on dimensions S1, S2, S3 and S4, and another cluster contains v4 and v5 on dimensions S1, S2, S3 and S6.
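To make Equations (3)-(5) concrete, the sketch below evaluates a single candidate move; the bin sizes used are invented solely for illustration.

def delta_leave(n_agents, n_dims):
    # Change in M(B) for the bin the agent leaves (Equation 3).
    return (n_agents - 1) ** 2 * n_dims - n_agents ** 2 * n_dims

def delta_join(n_agents, n_dims):
    # Change in M(B) for the destination bin (Equation 4).
    return (n_agents + 1) ** 2 * n_dims - n_agents ** 2 * n_dims

# An agent currently in a bin with 5 agents over 2 dimensions considers a
# destination bin holding 3 agents over 4 dimensions.
gain = delta_join(3, 4) + delta_leave(5, 2)   # Equation 5: move(ai)
print(gain)   # 28 + (-18) = 10 > 0, so the move is carried out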
3.3 Algorithm
The algorithm is composed of the following three steps.
Fig. 2. Example of agent-based clustering
Step 1. Generate the local bins B. In this step, we utilize the CLIQUE algorithm [3] to generate the bins B. The input is the agents A and the parameters ξ and τ. Parameter ξ is used to partition each dimension into intervals of equal length; τ is the parameter used to select dense units. The output of this step is the local bins B.
Step 2. The agents A move among their preferred bins. Each agent ai ∈ A randomly chooses one of its preferred bins bk ∈ B as the destination of a movement. If the movement is positive, it is carried out; otherwise, it is cancelled. This process repeats until the objective function M(B) reaches its maximum.
Step 3. If M(B) reaches its maximal value, the clustering process stops. Each bin in B is treated as a final cluster. The output of this step is the final clusters C.
4 Experiments
4.1 Data and Evaluation Criteria
In the experiments, we compare our ACS algorithm with existing subspace clustering algorithms, which include CLIQUE, DOC, FIRES, P3C, Schism, Subclu, MineClus, and PROCLUS [1]. All these algorithms are implemented in a Weka subspace clustering plugin tool [7]. Table 3 shows the datasets used in our experiments, which are public data sets from the UCI repository. F1 measure and Entropy are chosen to evaluate the algorithms.
– F1 measure considers recall and precision [7]. For each cluster Ti in clustering T, there is a set of mapped found clusters mapped(Ti). Let VTi be the objects of the cluster Ti and Vm(Ti) the union of all objects from the clusters in mapped(Ti). Recall and precision are formalized by:

recall(Ti) = |VTi ∩ Vm(Ti)| / |VTi|    (6)
precision(Ti) = |VTi ∩ Vm(Ti)| / |Vm(Ti)|    (7)

The harmonic mean of precision and recall is the F1 measure. A high F1 measure corresponds to a good cluster quality.
Algorithm 1. Agent-based Subspace Clustering
Input: Agents A = {a1, . . . , am}, parameters ξ, τ
Output: C = {C1, . . . , Cm}
// Step 1
B = CLIQUE(V, ξ, τ)
// Step 2
for all bj in B do
    for all ai in bj.agents do
        insert bj into ai.bins
    end for
end for
for all bj in B do
    bj.agents ← null
end for
M(B) = 0
repeat
    for all ai in A do
        randomly choose a destination bin bk
        if ΔM(ai) > 0 then
            ai moves to bk
        else
            ai stays in bj
        end if
    end for
until M(B) is not increased for a certain number of consecutive loops
// Step 3
for all Bi in B do
    Ci ← Bi
end for
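A minimal executable sketch of Step 2 is given below. It assumes the preferred bins and their subspace dimensionalities are supplied by a CLIQUE-style first step, and it simplifies the stopping criterion to a fixed patience counter; none of these choices are prescribed by the paper.

import random

def acs_step2(agents, preferred, dims, patience=20):
    """agents: iterable of agent ids.
    preferred: dict agent id -> non-empty list of bin ids the agent may occupy.
    dims: dict bin id -> dimensionality of the bin's subspace.
    Returns dict bin id -> set of agents, i.e. a hard clustering."""
    bins = {b: set() for b in dims}
    current = {}
    for a in agents:                          # each agent starts in a random preferred bin
        b = random.choice(preferred[a])
        bins[b].add(a)
        current[a] = b
    stale = 0
    while stale < patience:
        improved = False
        for a in agents:
            src, dst = current[a], random.choice(preferred[a])
            if dst == src:
                continue
            d_src = (len(bins[src]) - 1) ** 2 * dims[src] - len(bins[src]) ** 2 * dims[src]
            d_dst = (len(bins[dst]) + 1) ** 2 * dims[dst] - len(bins[dst]) ** 2 * dims[dst]
            if d_src + d_dst > 0:             # Equation (5): positive influence on M(B)
                bins[src].remove(a)
                bins[dst].add(a)
                current[a] = dst
                improved = True
        stale = 0 if improved else stale + 1
    return {b: members for b, members in bins.items() if members}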
Table 3. Public Data Sets of the UCI repository

Data Name     Attributes Num.   Data Size
Breast (a)    34                198
Diabetes (b)  8                 768
Glass         10                214
Shape         18                160
Pendigits     16                7494
Liver         7                 345

(a) Breast Cancer Wisconsin (Prognostic). (b) Pima Indians Diabetes.
– Entropy measures the homogeneity of the found clusters with respect to the true clusters [7]. Let C be the found clusters and T the true clusters. For each Cj ∈ C, the entropy of Cj is defined as:

E(Cj) = − Σ_{i=1}^{m} p(Ti|Cj) · log(p(Ti|Cj))    (8)
The overall quality of the clustering is obtained as the average over all clusters Cj ∈ C weighted by the number of objects per cluster. By normalizing with the maximal entropy log(m) for m hidden clusters and taking the inverse, the range is between 0 (low quality) and 1 (perfect):

1 − ( Σ_{j=1}^{k} |Cj| · E(Cj) ) / ( log(m) · Σ_{j=1}^{k} |Cj| )    (9)
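As an illustration only, both measures can be computed along the following lines; clusters and true classes are represented here as sets of object identifiers, p(Ti|Cj) is estimated by the overlap fraction, and the cluster-to-cluster mapping procedure of [7] is not reproduced.

import math

def f1_for_true_cluster(true_objs, mapped_objs):
    """Recall, precision and their harmonic mean for one true cluster (Equations 6-7)."""
    inter = len(true_objs & mapped_objs)
    recall = inter / len(true_objs)
    precision = inter / len(mapped_objs) if mapped_objs else 0.0
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)

def entropy_quality(found, true):
    """Normalised inverse entropy of Equations (8)-(9).
    Assumes at least two true clusters and non-empty found clusters."""
    m = len(true)
    total = sum(len(c) for c in found)
    weighted = 0.0
    for c in found:
        e = 0.0
        for t in true:
            p = len(c & t) / len(c)
            if p > 0:
                e -= p * math.log(p)
        weighted += len(c) * e
    return 1 - weighted / (math.log(m) * total)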
4.2 Experimental Results
For a fair evaluation, we show the best results of all algorithms from extensive experiments with various parameter settings for each algorithm. Figures 3-8 show the performance of the algorithms on the breast, diabetes, glass, pendigits, liver and shape data. From the figures, we can see that ACS performs better than the other subspace methods on both the F1 measure and Entropy. For the breast, glass and shape data, ACS has the best performance on F1 measure and Entropy. In particular, ACS has a much better F1 measure than the others. For the diabetes, pendigits and liver data, the performance of ACS ranks among the highest on F1 measure and Entropy. In fact, ACS performs similarly to the first-ranked algorithms in each figure.
Fig. 3. Results on Breast Data
Fig. 4. Results on diabetes data
Fig. 5. Results on glass data
Fig. 6. Results on pendigits data
Fig. 7. Results on liver data
Fig. 8. Results on shape data
Fig. 9. Scalability with dimensionality
Fig. 10. Scalability with data size
Fig. 11. Running time with various parameters
Fig. 12. Results on stock data
Figures 9 and 10 show the time consumed with respect to the dimensionality and the data size. It is obvious that the time consumed by ACS is similar to that of MineClus, CLIQUE and Schism, while STATPC, DOC, FIRES, P3C and PROCLUS consume much more time than the first group. We can conclude that ACS is fast and scalable with the number of dimensions and the data size.
ACS has two parameters: ξ and τ. Figure 11 shows how the running time changes with these parameters. From the figure, we can see that the running time decreases as ξ and τ increase.
4.3 A Case Study
Our technique has been applied to stock market surveillance. In a stock market, the key surveillance function is identifying market anomalies, such as market manipulation, to provide a fair and efficient trading platform. Market manipulation refers to trading actions which aim to interfere with the demand or supply of a given stock to make the price increase or decrease in a particular way. A data set is composed based on the financial model proposed by Aggarwal and Wu [2]. In this model, there are three periods of time that describe a stock market manipulation: pre-manipulation, manipulation and post-manipulation. The stock price rises throughout the manipulation period and then falls in the post-manipulation period. We analyze stock market manipulation in HKEx (Hong Kong Stock Exchange). The market manipulation data are collected from the SFC (Securities and Futures Commission). The trade data are collected from the Yahoo website. The attributes of the daily trade data include: daily price, daily volume, daily volatility, daily highest price and daily lowest price. We also collect the Hong Kong index for the same period of time. The trade days on which no manipulation occurs are treated as normal trade days, while the trade days on which a manipulation happened are treated as abnormal days. We test the performance of ACS and the other subspace clustering algorithms on this dataset to see their performance in reality. From Figure 12, we can see that ACS performs the best on both F1 measure and Entropy. The results show that ACS is a promising approach to clustering data sets in real-world applications.
5 Conclusion and Future Work
This paper presents an agent-based subspace clustering algorithm, in which an agent-based method is used to obtain an optimized subspace clustering by moving agents among local environments. The experimental results show that the proposed technique outperforms existing subspace clustering algorithms. The effectiveness of our technique is also validated by a case study in stock market surveillance. This research can be extended in two ways. One potential direction is utilizing agent-based subspace clustering to identify outliers in high dimensional data. The other is to investigate agent-based methods for semi-supervised subspace clustering.
References
1. Aggarwal, C.C., Wolf, J.L., Yu, P.S., Procopiuc, C., Park, J.S.: Fast algorithms for projected clustering. In: SIGMOD 1999: Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 61–72. ACM, New York (1999)
2. Aggarwal, R.K., Wu, G.: Stock market manipulations. Journal of Business 79(4), 1915–1954 (2006)
3. Agrawal, R., Gehrke, J., Gunopulos, D., Raghavan, P.: Automatic subspace clustering of high dimensional data for data mining applications. SIGMOD Rec. 27(2), 94–105 (1998)
4. Cao, L.: In-depth behavior understanding and use: the behavior informatics approach. Information Science 180(17), 3067–3085 (2010)
5. Cheng, C.-H., Fu, A.W., Zhang, Y.: Entropy-based subspace clustering for mining numerical data. In: KDD 1999: Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 84–93. ACM, New York (1999)
6. Cao, L., Gorodetsky, V., Mitkas, P.A.: Agent mining: The synergy of agents and data mining. IEEE Intelligent Systems 24(3), 64–72 (2009)
7. Müller, E., Günnemann, S., Assent, I., Seidl, T.: Evaluating clustering in subspace projections of high dimensional data. Proc. VLDB Endow. 2(1), 1270–1281 (2009)
8. Ogston, E., Overeinder, B., van Steen, M., Brazier, F.: A method for decentralized clustering in large multi-agent systems. In: AAMAS 2003: Proceedings of the Second International Joint Conference on Autonomous Agents and Multiagent Systems, pp. 789–796. ACM, New York (2003)
9. Procopiuc, C.M., Jones, M., Agarwal, P.K., Murali, T.M.: A monte carlo algorithm for fast projective clustering. In: SIGMOD 2002: Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data, pp. 418–427. ACM, New York (2002)
10. Xu, X., Chen, L., He, P.: A novel ant clustering algorithm based on cellular automata. Web Intelli. and Agent Sys. 5(1), 1–14 (2007)
Evaluating Pattern Set Mining Strategies in a Constraint Programming Framework Tias Guns, Siegfried Nijssen, and Luc De Raedt Katholieke Universiteit Leuven Celestijnenlaan 200A, B-3001 Leuven, Belgium {Tias.Guns,Siegfried.Nijssen,Luc.DeRaedt}@cs.kuleuven.be
Abstract. The pattern mining community has shifted its attention from local pattern mining to pattern set mining. The task of pattern set mining is concerned with finding a set of patterns that satisfies a set of constraints and often also scores best w.r.t. an optimisation criterion. Furthermore, while in local pattern mining the constraints are imposed at the level of individual patterns, in pattern set mining they are also concerned with the overall set of patterns. A wide variety of different pattern set mining techniques is available in the literature. The key contribution of this paper is that it studies, compares and evaluates such search strategies for pattern set mining. The investigation employs concept-learning as a benchmark for pattern set mining and employs a constraint programming framework in which key components of pattern set mining are formulated and implemented. The study leads to novel insights into the strong and weak points of different pattern set mining strategies.
1 Introduction
In the pattern mining literature, the attention has shifted from local to global pattern mining [1,10] or from individual patterns to pattern sets [5]. Local pattern mining is traditionally formulated as the problem of computing Th(L, ϕ, D) = {π ∈ L | ϕ(π, D) is true}, where D is a data set, L a language of patterns, and ϕ a constraint or predicate that has to be satisfied. Local pattern mining does not take into account the relationships between patterns; the constraints are evaluated locally, that is, on every pattern individually, and if the constraints are not restrictive enough, too many patterns are found. On the other hand, in global pattern mining or pattern set mining, one is interested in finding a small set of relevant and non-redundant patterns. Pattern set mining can be formulated as the problem of computing Th(L, ϕ, ψ, D) = {Π ⊆ Th(L, ϕ, D) | ψ(Π, D) is true}, where ψ expresses constraints that have to be satisfied by the overall pattern sets. In many cases a function f is used to evaluate pattern sets and one is then only interested in finding the best pattern set Π, i.e. arg max_{Π∈Th(L,ϕ,ψ,D)} f(Π). Within the data mining and the machine learning literature numerous approaches exist that perform pattern set mining. These approaches employ a wide variety of search strategies. In data mining, the step-wise strategy is common,
in which first all frequent patterns are computed; they are heuristically post-processed to find a single compressed pattern set; examples are KRIMP [16] and CBA [12]. In machine learning, the sequential covering strategy is popular, which repeatedly and heuristically searches for a good pattern or rule and immediately adds this pattern to the current pattern- (or rule-)set; examples are FOIL [14] and CN2 [3]. Only a small number of techniques, such as [5,7,9], search for pattern sets exhaustively, either in a step-wise or in a sequential covering setting. The key contribution of this paper is that we study, evaluate and compare these common search strategies for pattern set mining. As it is infeasible to perform a detailed comparison on all pattern set mining tasks that have been considered in the literature, we shall focus on one prototypical task for pattern set mining: boolean concept-learning. In this task, the aim is to most accurately describe a concept for which positive and negative examples are given. Within this paper we choose to fix the optimisation measure to accuracy; our focus is on the exploration of a wide variety of search strategies for this measure, from greedy to complete and from step-wise to one-step approaches. To be able to obtain a fair and detailed comparison we choose to reformulate the different strategies within the common framework of constraint programming. This choice is motivated by [4,13], who have shown that constraint programming is a very flexible and usable approach for tackling a wide variety of local pattern mining tasks (such as closed frequent itemset mining and discriminative or correlated itemset mining), and recent work [9,7] that has lifted these techniques to finding k-pattern sets under constraints (sets containing exactly k patterns). In [7], a global optimization approach to mining pattern sets has been developed and has been shown to work for concept-learning, rule-learning, redescription mining, conceptual clustering as well as tiling. In the present work, we employ this constraint programming framework to compare different search strategies for pattern set mining, focusing on one mining task in more detail. This paper is organized as follows: in Section 2, we introduce the problem of pattern set mining and its benchmark, concept-learning; in Section 3, we formulate these problems in the framework of constraint programming and introduce various search strategies for pattern set mining; in Section 4, we report on experiments, and finally, in Section 5, we conclude.
2 Pattern Set Mining Task
The benchmark task on which we shall evaluate different pattern set mining strategies is that of finding boolean concepts in the form of k-term DNF expressions. This task is well-known in computational learning theory [8] and is closely related to rule-learning systems such as FOIL [14] and CN2 [3] and data mining systems such as CBA [12] and KRIMP [16]. It is – as we shall now show – a pattern set mining task of the form arg max_{Π∈Th(L,ϕ,ψ,D)} f(Π). In this setting, one is given a set of positive and negative examples, where each example corresponds to a boolean variable assignment to the items in I, the set of possible items. Thus each example is an itemset Ix ⊆ I. Positive examples will belong to the set of transactions T+, negative ones to T−. The
pattern language is the set L = 2^I. Hence each pattern corresponds to an itemset Ip ⊆ I and represents a conjunction of items. The task is then to learn a concept description (a boolean formula) that covers all (or most) of the positive examples and none (or only a few) of the negatives. This can be measured using the accuracy measure, defined as:

accuracy(p, n) = (p + (N − n)) / (P + N)    (1)
where p and n are the number of positive, respectively negative, examples covered, and P and N are the total number of positive, resp. negative, examples present in the database. Concept descriptions are pattern sets, where each pattern set corresponds to a disjunction of patterns (conjunctions). Following [7,15], we shall focus on finding pattern sets that contain exactly k patterns. Thus the pattern sets correspond to k-term DNF formulas. An example is considered covered by the pattern set if the example is a superset of at least one of the itemsets in the pattern set. Thus the task considered is an instance of the pattern set mining task arg max_{Π∈Th(L,ϕ,ψ,D)} f(Π), where f is the accuracy, D = T = T+ ∪ T−, and L = 2^I; ϕ can be instantiated to a minimum support constraint (requiring that each pattern covers a certain number of examples), a minimum accuracy constraint (requiring that each pattern is individually accurate), or to true, a constraint which is always true and allows any pattern to be used which leads to an accurate final set. ψ states that |Π| = k. Finding a good pattern set is often a hard task; many pattern set mining tasks, such as the task of k-term DNF learning, are NP-complete [8]. Hence, there are no straightforward algorithms for solving such tasks in general, giving rise to a wide variety of search algorithms. The pattern set mining techniques they employ can be categorized along two dimensions. Two-Step vs One-Step: in the two step approach, one first mines patterns under local constraints to compute the set Th(L, ϕ, D); afterwards, these patterns are fed into another algorithm that computes arg max_{Π∈Th(L,ϕ,ψ,D)} f(Π) using post-processing. In the one step approach, this strict distinction between these two phases cannot be made. Exact vs Approximate: exact methods provide strong guarantees for finding the optimal pattern set under the given constraints, while approximate methods employ heuristics to find good though not necessarily optimal solutions. In the next section we will consider the instantiations of these settings for the case of concept learning. However, first we will introduce the constraint programming framework within which we will study these instantiations.
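Equation (1) is simple enough to state directly in code; p, n, P and N are as in the text, and the example figures are invented for illustration.

def accuracy(p, n, P, N):
    """Accuracy of a concept covering p positive and n negative examples,
    out of P positives and N negatives in total (Equation 1)."""
    return (p + (N - n)) / (P + N)

# A concept covering 40 of 50 positives and 5 of 50 negatives:
print(accuracy(40, 5, 50, 50))   # 0.85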
3 Constraint Programming Framework
Throughout the remainder of this paper we shall employ the constraint programming framework of [4] for representing and solving pattern set mining problems.
This framework has been shown 1) to allow for the use of a wide range of constraints, 2) to work for both frequent and discriminative pattern mining [13], and 3) to be extendible towards the formulation of k pattern set mining, cf. [7,9]. These other papers provide detailed descriptions of the underlying constraint programming algorithms and technology, including an analysis of the way in which they explore the search tree and a performance analysis. On the other hand, in the present paper – due to space restrictions – we need to focus on the declarative specification of the constraint programming problems; we refer to [4,13,7] for more details on the search strategy of such systems.
3.1 Constraint Programming Notation
Following [4], we assume that we are given a domain of items I and transactions T, and a binary matrix D. A key insight of the work of [4] is that constraint based mining tasks can be formulated as constraint satisfaction problems over the variables in π = (I, T), where a pattern π is represented using the vectors I and T, with a boolean variable Ii and Tt for every item i ∈ I and every transaction t ∈ T. A candidate solution to the constraint satisfaction problem is then one assignment of the variables in π which corresponds to a single itemset. For instance, the pattern represented by π = (<1, 0, 1>, <1, 1, 0, 0, 1>) has items 1 and 3, and covers transactions 1, 2 and 5. Following [7], a pattern set Π of size k simply consists of k such patterns: Π = {π1, . . . , πk}, ∀p = 1, . . . , k : πp = (I^p, T^p). We now discuss the different two-step and one-step pattern set mining approaches.
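The coverage relation behind this encoding can be checked mechanically: given a binary matrix D, the transaction vector T is determined by the item vector I. The tiny matrix below is invented so as to be consistent with the example pattern above; it is illustrative only.

def covered_transactions(I, D):
    """T_t = 1 iff every selected item i (I_i = 1) is present in transaction t (D[t][i] = 1)."""
    return [int(all(D[t][i] >= I[i] for i in range(len(I)))) for t in range(len(D))]

# hypothetical 5 transactions x 3 items database consistent with pi = (<1,0,1>, <1,1,0,0,1>)
D = [[1, 0, 1],
     [1, 1, 1],
     [0, 1, 1],
     [1, 1, 0],
     [1, 0, 1]]
print(covered_transactions([1, 0, 1], D))   # [1, 1, 0, 0, 1]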
3.2 Two-Step Pattern Set Mining
In two-step pattern set mining approaches, one first searches for the set of local patterns Th(L, ϕ, D) that satisfy a set of constraints, and then post-processes these to find the pattern sets in Th(L, ϕ, ψ, D).
Step 1: Local Pattern Mining. Using the above notation one can formulate many local pattern mining problems, such as frequent and discriminative pattern mining. Indeed, consider the following constraints, introduced in [4,13]:

∀t ∈ T : Tt = 1 ↔ Σ_{i∈I} Ii (1 − Dti) = 0    (Coverage)
∀i ∈ I : Ii = 1 ↔ Σ_{t∈T} Tt (1 − Dti) = 0    (Closedness)
∀i ∈ I : Ii = 1 → Σ_{t∈T} Tt Dti ≥ θ    (Min. frequency)
∀i ∈ I : Ii = 1 → accuracy(Σ_{t∈T+} Tt Dti, Σ_{t∈T−} Tt Dti) ≥ θ    (Min. accuracy)
In these constraints, the coverage constraint links the items to the transactions: it states that the transaction set T must be identical to the set of all transactions that are covered by the itemset I. The closedness constraint removes redundancy
by ensuring that an itemset has no superset with the same frequency. It is a well-known property that every non-closed pattern has an equally frequent and accurate closed counterpart. The minimum frequency constraint ensures that itemset I covers at least θ transactions. It can more simply be formulated as Σ_{t∈T} Tt ≥ θ. The above formulation is equivalent, but posted for each item separately (observe that Σ_{t∈T} Tt Dti counts the number of t in column i of binary matrix D for which Tt = 1). This so-called reified formulation results in more effective propagation; cf. [4]. Finally, to mine for all accurate patterns instead of all frequent patterns, the minimum accuracy constraint can be used, which ensures that itemsets have an accuracy of at least θ. The reified formulation again results in more effective propagation [13]. To emulate the two step approaches that are common in data mining [12,16,1], we shall employ two alternatives for the first step: 1) using frequent closed patterns, which are found with the coverage, closedness and minimum frequency constraints; 2) using accurate closed patterns, found with the coverage, closedness and minimum accuracy constraints. Both of these approaches perform the first step in an exact manner. They find the set of all local patterns adhering to the constraints. Step 2: Post-processing the Local Patterns. Once the local patterns have been computed, the two step approach post-processes them in order to arrive at the pattern set. We describe the two main approaches for this. Post-processing by Sequential Covering (Approximate). The simplest approach to the second step is to perform greedy sequential covering, in which one iteratively selects the best local pattern from Th(L, ϕ, D) and removes all of the positive examples that it covers. This continues until the desired number of patterns k has been reached or all positive examples are already covered. This type of approach is most common in data mining systems. Whereas in the first step the set Th(L, ϕ, D) is computed exactly in these methods, the second step is often an iterative loop in which patterns are selected greedily from this set. Post-processing using Complete Search (Exact). Another possibility is to perform a new round of pattern mining as described in [5]. In this case, each previously found pattern in P = Th(L, ϕ, D) can be seen as an item r in a new database; each new item identifies a pattern. One is looking for the set of pattern identifiers P ⊆ P with the highest accuracy. In this case, the set is not a conjunction of items, but a disjunction of patterns, meaning that a transaction is covered if at least one of the patterns r ∈ P covers it. This can be formulated in constraint programming after a transformation of the data matrix D into a matrix M where the rows correspond to the transactions in T and the columns to the patterns in P. Moreover Mtr is 1 if and only if pattern r covers transaction t and 0 otherwise. The solution set is now represented using Π = (P, T), where P is the vector representation of the pattern set, that is, Pr = 1 iff r ∈ P. The formulation of post-processing using complete search is now:
∀t ∈ T : Tt = 1 ↔ Σ_{r∈P} Pr Mtr ≥ 1    (Disj. Coverage)
∀r ∈ P : Pr = 1 → accuracy(Σ_{t∈T+} Ltr, Σ_{t∈T−} Ltr) ≥ θ    (Min. Accuracy)
Σ_{r∈P} Pr = k    (Set Size)
To obtain a reified formulation of the accuracy constraint we here use Ltr = max(Tt, Mtr) = Mtr + (1 − Mtr)Tt. The column for pattern r in this matrix represents the transaction vector if the pattern r would be added to the set P. The first constraint is the disjunctive coverage constraint. The second constraint is the minimum accuracy constraint, posted on each pattern separately and taking the disjunctive coverage into account. Lastly, the set size constraint limits the pattern set to size k. This type of exact two-step approach is relatively new in data mining. Two notable works are [11,5]. In these publications, it was proposed to post-process a set of patterns by using a complete search over subsets of patterns. If an exact pattern mining algorithm is used to compute the initial set of patterns in the first step, this gives a method that is overall exact and offers strong guarantees on the quality of the solution found.
3.3 One-Step Pattern Set Mining
This type of strategy, which is common in machine learning, searches for the pattern set Th(L, ϕ, ψ, D) directly, that is, the computation of Th(L, ϕ, ψ, D) and Th(L, ϕ, D) is integrated or interleaved. This can remove the need to have strong constraints with strict thresholds in ϕ. There are two approaches to this: Iterative Sequential Covering (Approximate). In the iterative sequential covering approach that we investigate here, a beam search is employed (with beam width b) to heuristically find the best pattern set. At each step during the search a local pattern mining algorithm is used to find the top-b patterns (with the highest accuracy) and uses these to compute new candidate pattern sets on its beam, after which it prunes all but the best b pattern sets from its beam. This setting is similar to 2-step sequential covering, only that here, at each iteration, the most accurate pattern is mined for directly, instead of selecting it from a set of previously mined patterns. Mining for the most accurate pattern can be done in a constraint programming setting by doing branch-and-bound search over the accuracy threshold θ. In the experimental section, we shall consider different versions of the approach, corresponding to different sizes of the beam. When b = 1, one often talks about greedy sequential covering. Examples of one-step greedy sequential covering methods are FOIL and CN2; however, they use greedy algorithms to identify the local patterns instead of a branch-and-bound pattern miner. In data mining, the use of branch-and-bound pattern mining algorithms was recently studied for identifying top-b patterns; see for instance [2].
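A stripped-down sketch of the b = 1 case (greedy sequential covering) is given below; find_most_accurate_pattern stands in for the branch-and-bound local miner and, together with the pattern object's covers method, is an assumption of this sketch rather than part of the cited systems.

def sequential_covering(positives, negatives, k, find_most_accurate_pattern):
    """Greedy one-step covering: repeatedly mine the single most accurate pattern
    on the remaining positives and add it to the set (beam width b = 1)."""
    pattern_set, remaining = [], set(positives)
    for _ in range(k):
        if not remaining:
            break
        pattern = find_most_accurate_pattern(remaining, negatives)   # assumed top-1 miner
        if pattern is None:
            break
        pattern_set.append(pattern)
        remaining -= {t for t in remaining if pattern.covers(t)}      # drop covered positives
    return pattern_set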
Global Optimization (Exact). The last option is to specify the problem of finding a pattern set of size k as a global optimization problem. This is possible in a constraint programming framework, thanks to its generic handling of constraints, cf. [7]. The formulation, searching for k patterns πp = (I^p, T^p) directly, is as follows:

∀p ∈ {1, . . . , k} : ∀t ∈ T : Tt^p ↔ Σ_{i∈I} Ii^p (1 − Dti) = 0    (Coverage)    (2)
∀p ∈ {1, . . . , k} : ∀i ∈ I : Ii^p ↔ Σ_{t∈T} Tt^p (1 − Dti) = 0    (Closed)    (3)
T^1 < T^2 < . . . < T^k    (Canonical)    (4)
∀t ∈ T : Bt = [ (Σ_{p∈{1..k}} Tt^p) ≥ 1 ]    (Disj. coverage)    (5)
maximize accuracy(Σ_{t∈T+} Bt, Σ_{t∈T−} Bt)    (Accurate)    (6)
Each pattern has to cover the transactions (Eq. 2) and be closed (Eq. 3). The canonical form constraint in Eq. 4 enforces a fixed lexicographic ordering on the itemsets, thereby avoiding finding equivalent but differently ordered pattern sets. In Eq. 5, the variables Bt are auxiliary variables representing whether transaction t is covered by at least one pattern, corresponding to a disjunctive coverage. The one-step global optimization approaches to pattern set mining are less common; the authors are only aware of [7,9]. One could argue that some iterative pattern mining strategies will find pattern sets that are optimal under certain conditions. For instance, Tree2 [2] can find a pattern set with minimal error on supervised training data; however, it neither provides guarantees on the size of the final pattern set nor provides guarantees under additional constraints.
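The objective of Eq. 6 can be evaluated for a candidate k-pattern set as sketched below; patterns are taken to be sets of item indices and database rows sets of items, so this mirrors the auxiliary variables Bt rather than the CP model itself, and the toy data is invented for illustration.

def pattern_set_accuracy(patterns, pos_rows, neg_rows):
    """accuracy over T+ and T- with Bt = 1 iff some pattern is a subset of row t (Eqs. 5-6)."""
    def covered(row):
        return any(p <= row for p in patterns)        # disjunctive coverage (Eq. 5)
    p = sum(covered(r) for r in pos_rows)
    n = sum(covered(r) for r in neg_rows)
    return (p + (len(neg_rows) - n)) / (len(pos_rows) + len(neg_rows))

# toy data: two patterns, three positive and two negative transactions
pats = [{1, 2}, {4}]
pos = [{1, 2, 3}, {2, 4}, {1, 2}]
neg = [{1, 3}, {3, 5}]
print(pattern_set_accuracy(pats, pos, neg))   # 1.0: all positives covered, no negative covered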
4 Experiments
We now compare the different approaches to boolean concept learning that we presented and answer the following two questions:
– Q1: Under what conditions do the different strategies perform well?
– Q2: What quality/runtime trade-offs do the strategies make?
To measure the quality of a pattern set, we evaluate its accuracy on the dataset. This is an appropriate means of evaluation, as in the boolean concept learning task we consider, the goal is to find a concise description of the training data, rather than a hypothesis that generalizes to an underlying distribution. The experiments were performed using the Gecode-based system proposed by [4] and were run on PCs running Ubuntu 8.04 with Intel(R) Core(TM)2 Quad CPU Q9550 processors and 4GB of RAM. The datasets were taken from the website accompanying this system¹. The datasets were derived from the UCI
¹ http://dtai.cs.kuleuven.be/CP4IM/datasets/
Table 1. Data properties and number of patterns found for different constraints and thresholds. 25M+ denotes that more than 25 million patterns were found.

                    Mushroom   Vote     Hepatitis   German-credit   Austr.-credit   Kr-vs-kp
Transactions        8124       435      137         1000            653             3196
Items               119        48       68          112             125             73
Class distr.        52%        61%      81%         70%             55%             52%
Total patterns      221524     227032   3788342     25M+            25M+            25M+
Pattern poor/rich   poor       poor     poor        rich            rich            rich
frequency ≥ 0.7     12         1        137         132             274             23992
frequency ≥ 0.5     44         13       3351        2031            8237            369415
frequency ≥ 0.3     293        627      93397       34883           257960          25M+
frequency ≥ 0.1     3287       35771    1827264     2080153         24208803        25M+
accuracy ≥ 0.7      197        193      361         2               11009           52573
accuracy ≥ 0.6      757        1509     3459        262             492337          2261427
accuracy ≥ 0.5      11673      9848     31581       6894            25M+            25M+
accuracy ≥ 0.4      221036     105579   221714      228975          25M+            25M+
Machine Learning repository [6] by discretising numeric attributes into eight equal-frequency bins. To obtain reasonably balanced class sizes we used the majority class as the positive class . Experiments were run on many datasets, but we here present the findings on 6 diverse datasets whose basic properties are listed in the top 3 rows of Table 1. 4.1
4.1 Two-Step Pattern Set Mining
The result of a two-step approach obviously depends on the quality of the patterns found in the first step. We start by investigating the feasibility of this first step, and then study the two-step methods as a whole. Step 1: Local Pattern Mining. As indicated in Section 3.2, we employ two alternatives: using frequent closed patterns and using accurate closed patterns. Both methods rely on a threshold to influence the number of patterns found. Table 1 lists the number of patterns found on a number of datasets, for the two alternatives and with different thresholds. Out of practical considerations we stopped the mining process when more than 25 million patterns were found. Using this cut-off, we can distinguish pattern poor data (data having less than 25 million patterns when mining unconstrained) and pattern rich data. In the case of pattern poor data, one can mine using very low or even no thresholds. In the case of pattern rich data, however, one has to use a more stringent threshold in order not be overwhelmed by patterns. Unfortunately, one has to mine with different thresholds to discover how pattern poor or rich an unseen dataset is. Step 2: Post-processing the Local Patterns. We now investigate how the quality of the global pattern sets is influenced by the threshold used in the first step, and how this compares to pattern sets found by 1-step methods that do not have such thresholds.
Fig. 1. Quality & runtime for approx. methods, pattern poor hepatitis dataset. In the left figure, algorithms with identical outcome are grouped together.
Fig. 2. Quality & runtime for approx. methods, pattern rich australian-credit dataset.
Post-processing by Sequential Covering (Approximate). This two-step approach picks the best local pattern from the set of patterns computed in step one. As such, the quality of the pattern set depends on whether the right patterns are in the pre-computed pattern set. We use our generic framework to compare two-step sequential covering to the one-step approach. For pattern poor data for which the set of all patterns can be calculated, such as the mushroom, vote and hepatitis dataset, using all patterns obviously results in the same pattern set as found by the one-step approach. Figure 1 shows the prototypical result for such data: low thresholds lead to good pattern sets, while higher thresholds gradually worsen the solution. For this dataset, starting from K=3, no better pattern set can be found. The same is true for the mushroom dataset, while in the vote dataset the sequential covering method continues to improve for higher K. Also note that in Figure 1 a better solution is found when using patterns with accuracy greater than 40%, compared to patterns with accuracy greater than 50%. This implies that a better pattern set can be found containing a local pattern that has a low accuracy on the whole data. This indicates that using accurate local patterns does not permit putting high thresholds in the first step. With respect to question Q2, we can observe that using a lower threshold comes at the cost of higher runtimes. However, for pattern poor datasets such as the one in Figure 1, these times are still manageable. The remarkable efficiency of the one-step sequential covering method is thanks to recent advances in mining top-k discriminative patterns [13].
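A minimal sketch of the two-step sequential covering post-processing discussed above, assuming the local patterns from step one are given as boolean cover vectors; this is one plausible instantiation for illustration, not the authors' exact implementation.

```python
import numpy as np

def sequential_covering(pattern_covers, y, k):
    """Greedily pick k of the pre-mined patterns: in each round, add the
    pattern whose disjunction with the already chosen ones yields the
    highest accuracy (the Accurate objective of Eq. 6)."""
    n = len(y)
    covered = np.zeros(n, dtype=bool)
    chosen, best_acc = [], 0.0
    for _ in range(k):
        best_p, best_acc = None, -1.0
        for p in range(pattern_covers.shape[0]):
            B = covered | pattern_covers[p]
            acc = ((B & y).sum() + (~B & ~y).sum()) / n
            if acc > best_acc:
                best_p, best_acc = p, acc
        chosen.append(best_p)
        covered |= pattern_covers[best_p]
    return chosen, best_acc
```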
Table 2. Largest K (up to 6) and time to find it for the 2-step complete search method. - indicates that step 1 was aborted because more than 25 million patterns were found, – indicates that step 2 did not manage to finish within the timeout of 6 hours. * indicates that no other method found a better pattern set.
all freq. ≥ 0.7 freq. ≥ 0.5 freq. ≥ 0.3 freq. ≥ 0.1 acc. ≥ 0.7 acc. ≥ 0.6 acc. ≥ 0.5 acc. ≥ 0.4
Mushroom Vote Hepatitis German-cr. Austr.-cr. K sec K sec K sec K sec K sec – – – 6 0.2 only 1 pat 6 0.03 6 2.12 6 0.59 6 2.2 6 0.01 2 2650 2 8163 6 14244 6 14 6 0.89 – – – 2 9477 1 1015 – – – 6 8.6 6 0.12 6 3.05 6 0.01 1 713 *4 6714 5 14205 2 6696 6 104 – – 1 391 1 3169 1 696 – – – – -
Kr-vs-kp K – – – – -
On pattern rich data such as the german-credit, australian-credit and kr-vs-kp datasets, similar behaviour can be observed. The only difference is that one is forced to use more stringent thresholds. Because of this, the pattern set found by the one-step approach can usually not be found by the two-step approaches. Figure 2 exemplifies this for the australian-credit dataset. Using a frequency threshold of 0.1, the same pattern set as for the one-step method is found for up to K=3, but not so for higher K. When using the highest thresholds, there is a risk of finding significantly worse pattern sets. On the kr-vs-kp dataset, significantly worse results were found as well when using high frequency thresholds, while this was not the case for the accuracy threshold. With respect to Q2 we have again observed that lower thresholds lead to higher runtimes for the two-step approaches. Lowering the thresholds further to find even better pattern sets would correspondingly come at the cost of even higher computation times. Post-processing using Complete Search (Exact). When post-processing a collection of patterns using complete search, the size of that collection becomes a determining factor for the success of the method. Table 2 shows the same datasets and threshold values as in Table 1; here the entries show the largest K for which a pattern set could be found, up to K=6, and the time it took. A general trend is that in case many patterns are found in step 1, e.g. more than 100 000, the method is not able to find the optimal solution. With respect to Q1, only for the mushroom dataset did the method find a better pattern set than any other method, when using all accurate patterns with threshold 0.4. For all other pattern sets it found, however, one of the one-step methods found a better solution. Hence, although this method is exact in its second step, it depends on good patterns from its first step. Unfortunately, finding those usually requires using low threshold values, with corresponding disadvantages.
Table 3. Largest K for which the optimal solution was found within 6 hours

            Mushroom  Vote  Hepatitis  German-credit  Australian-credit  Kr-vs-kp
Largest K   K=2       K=4   K=3        K=2            K=2                K=3
Fig. 3. Quality & runtime for 1-step methods, german-credit dataset. In the left figure, algorithms with identical outcome are grouped together.
4.2 One-Step Pattern Set Mining
In this section we compare the different one-step approaches, who need no local pattern constraints and thresholds. We investigate how feasible the one-step exact approach is, as well as how close the greedy sequential covering method brings us to this optimal solution, and whether beam search can close the gap between the two. When comparing the two-step sequential covering approach with the one-step approach, we already remarked that the latter is very efficient, though it might not find the optimal solution. The one-step exact method is guaranteed to find the optimal solution, but has a much higher computational cost. Table 3 below shows up to which K the exact method was able to find the optimal solution within the 6 hours time out. Comparing these results to the two-step exact approach in Table 2, we see that pattern sets can be found without constraints, where the two-step approach failed even with constraints. With respect to Q1 we observed that only for the kr-vs-kp dataset the greedy method, and hence all beam searches with a larger beam, found the same pattern sets as the exact method. For the mushroom and vote dataset, starting from beam width 5, the optimal pattern set was found. For the german-credit and australian-credit, a beam width of size 15 was necessary. The hepatitis dataset was the only dataset for which the complete method was able to find a better pattern set, in this case for K=3, within the timeout of 6 hours. Figure 3 shows a representative figure, in this case for the german-credit dataset: while the greedy method is not capable of finding the optimal pattern set, larger beams successfully find the optimum. For K=6, beam sizes of 15 or 20 lead to a better pattern set than when using a lower beam size. The exact method stands out as being the most time consuming. For beam search methods, larger beams clearly lead to larger runtimes. The runtime only increases slightly
for increasing sizes of K because the beam search is used in a sequential covering loop that shrinks the dataset at each iteration.
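One way to read the beam-search variant described above is the following sketch: each covering iteration grows a single itemset by beam search instead of taking only the single greedy extension. This is our own illustration; the scoring heuristic (accuracy of the grown pattern as a positive predictor) is an assumption, not necessarily the authors' exact criterion.

```python
import numpy as np

def beam_search_pattern(D, y, beam_width):
    """Grow one itemset by beam search on a boolean dataset D with labels y:
    only the beam_width best partial itemsets are extended further."""
    n_items = D.shape[1]

    def acc(items):
        cover = D[:, items].all(axis=1) if items else np.ones(len(y), dtype=bool)
        return ((cover & y).sum() + (~cover & ~y).sum()) / len(y)

    beam = [[]]
    best_items, best_acc = [], acc([])
    while True:
        candidates = {tuple(sorted(b + [i]))
                      for b in beam for i in range(n_items) if i not in b}
        scored = sorted(((acc(list(c)), list(c)) for c in candidates), reverse=True)
        if not scored or scored[0][0] <= best_acc:
            break
        best_acc, best_items = scored[0]
        beam = [c for _, c in scored[:beam_width]]
    return best_items, best_acc
```

Plugged into a sequential covering loop over the shrinking dataset, a larger beam width trades runtime for solution quality, which matches the behaviour reported above.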
5 Conclusions
We compared several methods for finding pattern sets within a common constraint programming framework, where we focused on boolean concept learning as a benchmark. We distinguished one step from two step approaches, as well as exact from approximate ones. Each method has its strong and weak points, but the one step approximate approaches, which iteratively mine for patterns, provided the best trade-off between runtime and accuracy and do not depend on a threshold; additionally, they can easily be improved using a beam search. The exact approaches, perhaps unsurprisingly, do not scale well to larger and pattern-rich datasets. A newly introduced approach for one-step exact pattern set mining however has optimality guarantees and performs better than previously used two-step exact approaches. In future work our study can be extended to consider other problem settings in pattern set mining, as well as other heuristics and evaluation metrics; furthermore, even though we cast all settings in one implementation framework in this paper, a more elaborate study could clarify how this approach compares to the pattern set mining systems in the literature. Acknowledgements. This work was supported by a Postdoc and project “Principles of Patternset Mining” from the Research Foundation—Flanders, as well as a grant from the Agency for Innovation by Science and Technology in Flanders (IWT-Vlaanderen).
References
1. Bringmann, B., Nijssen, S., Tatti, N., Vreeken, J., Zimmermann, A.: Mining sets of patterns. In: Tutorial at ECMLPKDD 2010 (2010)
2. Bringmann, B., Zimmermann, A.: Tree2 - decision trees for tree structured data. In: Jorge, A., Torgo, L., Brazdil, P., Camacho, R., Gama, J. (eds.) PKDD 2005. LNCS (LNAI), vol. 3721, pp. 46–58. Springer, Heidelberg (2005)
3. Clark, P., Niblett, T.: The CN2 induction algorithm. Machine Learning 3, 261–283 (1989)
4. De Raedt, L., Guns, T., Nijssen, S.: Constraint programming for itemset mining. In: KDD, pp. 204–212. ACM, New York (2008)
5. De Raedt, L., Zimmermann, A.: Constraint-based pattern set mining. In: SDM. SIAM, Philadelphia (2007)
6. Frank, A., Asuncion, A.: UCI machine learning repository (2010), http://archive.ics.uci.edu/ml
7. Guns, T., Nijssen, S., De Raedt, L.: k-Pattern set mining under constraints. CW Reports CW596, Department of Computer Science, K.U.Leuven (October 2010), https://lirias.kuleuven.be/handle/123456789/278655
8. Kearns, M.J., Vazirani, U.V.: An introduction to computational learning theory. MIT Press, Cambridge (1994)
9. Khiari, M., Boizumault, P., Crémilleux, B.: Constraint programming for mining n-ary patterns. In: Cohen, D. (ed.) CP 2010. LNCS, vol. 6308, pp. 552–567. Springer, Heidelberg (2010)
10. Knobbe, A., Crémilleux, B., Fürnkranz, J., Scholz, M.: From local patterns to global models: The LeGo approach to data mining. In: Fürnkranz, J., Knobbe, A. (eds.) Proceedings of LeGo 2008, an ECMLPKDD 2008 Workshop (2008)
11. Knobbe, A.J., Ho, E.K.Y.: Pattern teams. In: Fürnkranz, J., Scheffer, T., Spiliopoulou, M. (eds.) PKDD 2006. LNCS (LNAI), vol. 4213, pp. 577–584. Springer, Heidelberg (2006)
12. Liu, B., Hsu, W., Ma, Y.: Integrating classification and association rule mining. In: KDD, pp. 80–86 (1998)
13. Nijssen, S., Guns, T., De Raedt, L.: Correlated itemset mining in ROC space: a constraint programming approach. In: KDD, pp. 647–656. ACM, New York (2009)
14. Quinlan, J.R.: Learning logical definitions from relations. Machine Learning 5, 239–266 (1990)
15. Rückert, U., De Raedt, L.: An experimental evaluation of simplicity in rule learning. Artif. Intell. 172(1), 19–28 (2008)
16. Siebes, A., Vreeken, J., van Leeuwen, M.: Item sets that compress. In: Ghosh, J., Lambert, D., Skillicorn, D.B., Srivastava, J. (eds.) SDM, pp. 395–406. SIAM, Philadelphia (2006)
Asking Generalized Queries with Minimum Cost
Jun Du and Charles X. Ling
Department of Computer Science, The University of Western Ontario, London, Ontario, N6A 5B7, Canada
[email protected],
[email protected]
Abstract. Previous works on active learning usually only ask specific queries. A more natural way is to ask generalized queries with don't-care features. As each such generalized query can often represent a set of specific ones, the answers are usually more helpful in speeding up the learning process. However, despite such advantages of the generalized queries, more expertise (or effort) is usually required for the oracle to provide accurate answers in real-world situations. Therefore, in this paper, we make a more realistic assumption that the more general a query is, the higher querying cost it causes. This consequently yields a trade-off: asking generalized queries can speed up the learning, but usually with high cost; whereas asking specific queries is much cheaper (with low cost), but the learning process might be slowed down. To resolve this issue, we propose two novel active learning algorithms for two scenarios: one to balance the predictive accuracy and the querying cost; and the other to minimize the total cost of misclassification and querying. We demonstrate that our new methods can significantly outperform the existing active learning algorithms in both of these two scenarios.
1 Introduction
Active learning, as an effective learning paradigm to reduce the labeling cost in supervised settings, has been intensively studied in recent years. In most traditional active learning studies, the learner usually regards the specific examples directly as queries, and requests the corresponding labels from the oracle. For instance, given a diabetes patient dataset, the learner usually presents the entire patient example, such as [ID = 7354288, name = John, age = 65, gender = male, weight = 230, blood-type = AB, blood-pressure = 160/90, temperature = 98, ...] (with all the features), to the oracle, and requests the corresponding label whether this patient has diabetes or not. However, in this case, many features (such as ID, name, blood-type, and so on) might be irrelevant to diabetes diagnosis. Not only could queries like this confuse the oracle, but each answer responded from the oracle is also applicable for only one specific example. In many real-world active learning applications, the oracles are often human experts, thus they are usually capable of answering more general queries. For instance, given the same diabetes patient dataset, the learner could ask a generalized query, such as "are men over age 60, weighted between 220 and 240 pounds, likely to have diabetes?", where only three relevant features (gender,
age and weight) are provided. Such a generalized query can often represent a set of specific examples, thus the answer for the query is also applicable to all these examples. For instance, the answer to the above generalized query is applicable for all men over age 60 and weighted between 220 and 240 pounds. This allows the active learner to improve learning more effectively and efficiently. However, although the oracles are indeed capable of answering such generalized queries in many applications, the cost (effort) is often higher. For instance, it is relatively easy (i.e., with low cost) to diagnose whether one specific patient has diabetes or not, with all necessary information provided. However, it is often more difficult (i.e., with higher cost) to provide accurate diabetes diagnoses (accurate probability) for all men over age 60 and weighted between 220 and 240 pounds. In real-world situations, more domain expertise is usually required for the oracles to answer such generalized queries well, thus the cost for asking generalized queries is often higher. Consequently, this yields a trade-off in active learning: on one hand, asking generalized queries can speed up the learning, but usually with high cost; on the other hand, asking specific queries is much cheaper (with low cost), but the learning process might be slowed down. In this paper, we apply a cost-sensitive framework to study generalized queries in active learning. More specifically, we assume that the querying cost is known to be non-uniform, and ask generalized queries in the following two scenarios:
– Scenario 1 (Balancing Acc./Cost Trade-off): We consider only querying cost in this scenario. Thus, instead of tending to achieve high predictive accuracy by asking as few queries as possible (as in traditional active learning), the learning algorithm is required to achieve high predictive accuracy by paying as low a querying cost as possible.
– Scenario 2 (Minimizing Total Cost): In addition to querying cost, we also consider the misclassification cost produced by the learning model in this scenario.¹ Thus, the learning algorithm is required to achieve minimum total cost of querying and misclassification in the learning process.
In particular, we propose a novel method to first construct generalized queries according to two objective functions in the above two scenarios, and then update the training data and the learning model accordingly. Empirical study in a variety of settings shows that the proposed methods can indeed outperform the existing active learning algorithms in simultaneously maximizing the predictive performance and minimizing the querying cost.
2 Related Work
All of the active learning studies make assumptions. Specifically, most of the previous works assume that the oracles can only answer specific queries, and the costs for asking these queries are uniform. Thus, most active learning algorithms
In this paper, we only consider that both the querying cost and the misclassification cost are on the same scale. Extra normalization might be required otherwise.
Table 1. Assumptions in active learning studies

                      Uniform Cost        Non-uniform Cost
Specific Queries      [7,11,12,2,3,9]     [8,6,10]
Generalized Queries   [4]                 This Paper
(such as [7,11,12,2,3,9]) are designed to achieve as high a predictive accuracy as possible by asking a certain number of queries. [4] relaxes the assumption of asking specific queries, and proposes active learning with generalized queries. However, it assumes that the oracles can answer these generalized queries as easily as the specific ones. That is, the costs for asking all the queries are still the same, regardless of the queries being specific or generalized. [8,6,10] relax the assumption of uniform cost, and study active learning in a cost-sensitive framework. However, they limit their research to specific queries, and only consider that the costs for asking those specific ones are different. In this paper, we study generalized queries with cost in active learning. Specifically, we assume that the oracles can answer both specific and generalized queries, but with different costs. This assumption is more flexible, more general, and more applicable to real-world applications. Under this assumption, considering uniform cost for generalized queries (such as [4]) and considering non-uniform costs for specific queries (such as [8,6,10]) can both be regarded as special cases. Table 1 illustrates the different assumptions in active learning studies. As far as we know, this is the first work to propose this more general assumption and design corresponding learning algorithms for active learning.
3 Algorithm for Asking Generalized Queries
In this section, we design an active learning algorithm to ask generalized queries. Roughly speaking, the active learning process can be broken into the following two steps in each learning iteration:
– Step 1: Based on the current training and unlabeled datasets, the learner constructs a generalized query according to a certain objective function.
– Step 2: After obtaining the answer of the generalized query, the learner updates the training dataset, and updates the learning model accordingly.
We will discuss each step in detail in the following subsections.
3.1 Constructing Generalized Queries
In each learning iteration, constructing the generalized queries can be regarded as searching the optimal query in the query space, according to the given objective function. We propose two objective functions for the previous two scenarios, and design an efficient searching strategy to reduce the computation complexity.
Balancing Acc./Cost Trade-off. In Scenario 1, we only consider querying cost, and still use accuracy to measure the predictive performance of the learning model; thus the learning algorithm is required to balance the trade-off between the predictive accuracy and the querying cost. We therefore design an objective function to choose the query that yields the maximum ratio of accuracy improvement to querying cost in each iteration. More formally, Equation 1 shows the objective function for searching the query in iteration t, where q^t denotes the optimal query, Q^t denotes the entire query space, C_Q(q) denotes the querying cost for the current candidate query q, and ΔAcc^t(q) denotes the accuracy improvement produced by q, which can also be represented by subtracting the accuracy in iteration t−1 (denoted by Acc^{t−1}) from the accuracy in iteration t (denoted by Acc^t(q)).²

q^t = argmax_{q∈Q^t} ΔAcc^t(q) / C_Q(q) = argmax_{q∈Q^t} (Acc^t(q) − Acc^{t−1}) / C_Q(q)   (1)
We can see from Equation 1 that estimating ΔAcc^t(q)/C_Q(q) is required to evaluate the candidate query q. As we assume that the querying cost C_Q(q) is known, we only need to separately estimate the accuracies before and after asking q (i.e., Acc^{t−1} and Acc^t(q)). Estimating Acc^{t−1} is rather easy: we simply apply cross-validation or leave-one-out to the current training data, and obtain the desired average accuracy. However, estimating Acc^t(q) is a bit difficult. Note that if we knew the answer of q, the training data could be updated by using exactly the same strategy we will describe in Section 3.2 (Updating Learning Model), and Acc^t(q) could then be easily estimated on the updated training data. However, the answer of q is still unknown at the current stage, so here we apply a simple strategy to optimistically estimate this answer, and then evaluate q accordingly. Specifically, we first assume that the label of q is certainly 1.³ Thus, we update the training data (using the same method as in Section 3.2), and estimate Acc^t(q) accordingly. Then, we assume that the label of q is certainly 0, and again update the training data and estimate Acc^t(q) in the same way. We compare these two estimates of Acc^t(q), and optimistically choose the better (higher) one as the final estimate.
Minimizing Total Cost. In Scenario 2, we consider both the querying and misclassification costs, and require the learning algorithm to achieve minimum total cost in the learning process. However, calculating this total cost of querying and misclassification is a bit tricky. In real-world applications, the learning model constructed on the current training data is often used for future prediction, so the "true" misclassification cost should also be calculated according to the future predicted examples.
² The accuracy improvement (ΔAcc^t(q)) can be negative, when the accuracy after asking the query (Acc^t(q)) is even lower than the one before asking (Acc^{t−1}).
³ We only consider binary classification with labels 0 and 1 here, for better illustration.
We assume that the rough size of such "to-be-predicted" data is known in this paper, due to the following reason. In reality, the quantity of such "to-be-predicted" data directly affects the quantity of resource (effort, cost) that should be spent in constructing the learning model. For instance, if the model will be used only a few times and on only limited unimportant data, it might not be worth spending much resource on model construction; on the other hand, if the model is expected to be extensively used on a large amount of important data, it would be even more beneficial to improve the model performance by spending more resource. In many such real-world situations, in order to determine how much resource should be spent in constructing the model, it is indeed known (or can be estimated) how extensively the model will be used in the future (i.e., the rough quantity of the to-be-predicted data). It is exactly the same in our current scenario of generalized queries. More specifically, if the current learning model will only "play a small role" (i.e., make predictions on only a few examples) in the future, it may not be worth paying a high querying cost to construct a high-performance model. On the other hand, if a large number of examples need to be predicted, it would indeed be worthwhile to acquire more generalized queries (at the expense of high querying cost), such that an accurate model with low misclassification cost could be constructed. This indicates that the number of "to-be-predicted" examples is crucial in minimizing total cost. Therefore, we formalize the total cost after t iterations (denoted by C_T^t) in Equation 2, where C_Q^i denotes the querying cost in the i-th iteration and C_M^t denotes the misclassification cost after t iterations, which can further be calculated as the product of the average misclassification cost⁴ after t iterations (denoted by AvgC_M^t) and the number of future predicted examples (denoted by n).

C_T^t = Σ_{i=1}^{t} C_Q^i + C_M^t = Σ_{i=1}^{t} C_Q^i + AvgC_M^t × n   (2)
To obtain the minimum total cost for the learning model, we greedily choose the query that maximally reduces the total cost in each learning iteration. More formally, Equation 3 shows the objective function for searching the query in iteration t, where all notations are the same as above.

q^t = argmax_{q∈Q^t} (C_T^{t−1} − C_T^t(q)) = argmax_{q∈Q^t} ((AvgC_M^{t−1} − AvgC_M^t(q)) × n − C_Q(q))   (3)
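As an illustration only, the two objective functions can be written as small helpers; all estimates are assumed to be supplied by the cross-validation and optimistic-labelling procedures described in the text.

```python
def score_query_accuracy(acc_before, acc_after, querying_cost):
    # Eq. 1: ratio of (estimated) accuracy improvement to querying cost.
    return (acc_after - acc_before) / querying_cost

def total_cost(query_costs, avg_misclassification_cost, n_future):
    # Eq. 2: accumulated querying cost plus expected misclassification cost
    # over the n_future examples the model is expected to predict.
    return sum(query_costs) + avg_misclassification_cost * n_future

def score_query_total_cost(avg_cost_before, avg_cost_after, n_future, querying_cost):
    # Eq. 3: reduction in total cost achieved by asking this query.
    return (avg_cost_before - avg_cost_after) * n_future - querying_cost
```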
In the current setting, we assume that C_Q and n are both known, thus we need to estimate AvgC_M^{t−1} and AvgC_M^t(q) separately, according to Equation 3. We again adopt a similar strategy as in the previous subsection. Specifically, AvgC_M^{t−1}
Average misclassification cost represents the misclassification cost averaged on each tested examples.
could be directly estimated by cross-validation or leave-one-out on the original training set, and AvgC_M^t(q) can be optimistically estimated by assuming the label of q is certainly 0 and 1 respectively (see Section 3.1 for details).
Searching Strategy. Given the above two objective functions for the two scenarios, the learner is required to search the query space and find the optimal query in each iteration. In most traditional active learning studies, each unlabeled example is directly regarded as a candidate query. Thus, in each iteration, the query space simply contains all the current unlabeled examples, and exhaustive search is usually applied directly. However, when asking generalized queries, each unlabeled example can generate a set of candidate generalized queries, due to the existence of the don't-care features. For instance, given a specific example with d features, there exist C(d,1) generalized queries with one don't-care feature, C(d,2) generalized queries with two don't-care features, and so on. Thus, altogether 2^d corresponding generalized queries could be constructed from each specific example. Therefore, given an unlabeled dataset with l examples, the entire query space would be 2^d · l. This query space is thus quite large (it grows exponentially with the feature dimension), and it is unrealistic to exhaustively evaluate every candidate. Instead, we apply greedy search to find the optimal query in each iteration. Specifically, for each unlabeled example (with d features), we first construct all the generalized queries with only one don't-care feature (i.e., C(d,1) = d queries), and choose the best as the current candidate. Then, based only on this candidate, we continue to construct all the generalized queries with two don't-care features (i.e., C(d−1,1) = d−1 queries), and again only keep the best. The process repeats to greedily increase the number of don't-care features in the query, until no better query can be generated. The last generalized query is thus regarded as the best for the current unlabeled example. We conduct the same procedure on all the unlabeled examples, so we can find the optimal generalized query based on the whole unlabeled set. With such a greedy search strategy, the computation complexity of searching is thus O(d²) with respect to the feature dimension d. This indicates an exponential improvement over the complexity Θ(2^d) of the original exhaustive search. Note that such local greedy search cannot guarantee finding the true optimal generalized query in the entire query space, but the empirical study (see Section 4) will show it still works effectively in most cases.
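The O(d²) greedy search can be sketched as follows (our illustration; `score_fn` stands for either objective above, evaluated with the optimistic label estimates, and is an assumed parameter):

```python
def greedy_generalise(example, score_fn):
    """Greedily turn features of `example` (a dict feature -> value) into
    don't-cares (None) as long as the objective keeps improving; at most
    d + (d-1) + ... candidates are evaluated, i.e. O(d^2)."""
    query = dict(example)
    best_score = score_fn(query)
    while True:
        best_feature, best_gain = None, best_score
        for f, v in query.items():
            if v is None:
                continue
            s = score_fn(dict(query, **{f: None}))
            if s > best_gain:
                best_feature, best_gain = f, s
        if best_feature is None:
            return query, best_score
        query[best_feature] = None
        best_score = best_gain

# The optimal query over the whole pool is then
# max((greedy_generalise(x, score_fn) for x in unlabeled), key=lambda r: r[1]).
```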
3.2 Updating Learning Model
After finding the optimal query in each iteration, the learner will request the corresponding label from the oracle, and update the learning model accordingly. However, the generalized queries often contain don’t-care features, and the labels for such generalized queries are also likely to be uncertain. In this subsection, we study how to update the learning model by appropriately handling such don’t-care features and uncertain answers in the queries.
Roughly speaking, we consider the don’t-care features as missing values, and handle the uncertain labels by taking partial examples in the learning process. More specifically, we simply treat the generalized queries with don’t-care features as specific ones with missing values. As many learning algorithms (such as decision tree based algorithms, most generative models, and so on) have their own mechanisms to naturally handle missing values, this simple strategy can be widely applied. In terms of the uncertain labels of the queries, we handle them by taking partial examples in the learning process. For instance, given a query with an uncertain label (such as, 90% probability as 1 and 10% probability as 0), the learning algorithm simply takes 0.9 part of the example as certainly 1 and 0.1 part as certainly 0. Taking partial examples into learning is often implemented by re-weighting examples, which is also applicable to many popular learning algorithms. This simple strategy can elegantly update the learning model. However, a pitfall of the strategy also occurs. When updating the learning model, the current strategy always regards one generalized query as only one specific example (with missing values). This might significantly degrade the power of the generalized queries. On the other hand, if one generalized query is regarded as too many specific examples, it might also overwhelm the original training data. Therefore, here we regard each generalized query as n (same) examples (with missing values), where n is suggested to be half of the initial training set size by the empirical study. So far, we have proposed a novel method to construct the generalized query and update the learning model in each active learning iteration. In particular, we have designed two objective functions to balance the accuracy/cost trade-off and minimize the total cost of misclassification and querying. In the following section, we will conduct experiments on real-world datasets, to empirically study the performance of the proposed algorithms.
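A possible encoding of this updating step, assuming a base learner that supports instance weights and missing values (our sketch, with illustrative names):

```python
def query_to_weighted_examples(query, p_positive, n_copies):
    """Treat a generalized query (dict feature -> value, None = don't-care,
    handled as a missing value) with an uncertain answer p_positive as
    n_copies weighted examples: a positive part weighted p_positive and a
    negative part weighted (1 - p_positive), per copy."""
    return [(dict(query), 1, n_copies * p_positive),
            (dict(query), 0, n_copies * (1.0 - p_positive))]
```

Here n_copies would be set to half of the initial training set size, following the suggestion above.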
4 Empirical Study
In this section, we empirically study the performance of the proposed algorithms on 15 real-world datasets from the UCI Machine Learning Repository [1], and compare them with the existing active learning algorithms.
4.1 Experimental Configurations
We compare the proposed algorithms with the traditional pool-based active learning (with uncertain sampling) [7] (denoted by “Pool”) and the active learning with generalized queries [4] (denoted by “AGQ”). “Pool” and “AGQ” represent two special cases for querying cost: “Pool” only asks specific queries (with low querying cost), but cannot take advantage of the generalized queries to improve the predictive performance; on the other hand, “AGQ” tends to ask as general as possible queries to promptly improve the predictive performance, but with the expense of high querying cost. We expect that the proposed algorithms (for the two scenarios) can simultaneously maximize the predictive performance and minimize the querying cost, thus outperforming “Pool” and “AGQ”.
Table 2. The 15 UCI datasets used in the experiments

Dataset        Type of Att.  No. of Att.  Class Dist.  Training Size
breast-cancer  nom           9            196/81       1/5
breast-w       num           9            458/241      1/10
colic          nom/num       22           232/136      1/5
credit-a       nom/num       15           307/383      1/20
credit-g       nom/num       20           700/300      1/100
diabetes       num           8            500/268      1/10
heart-statlog  num           13           150/120      1/10
hepatitis      nom/num       19           32/123       1/5
ionosphere     num           33           126/225      1/20
kr-vs-kp       nom           36           1669/1527    1/100
mushroom       nom           22           4208/3916    1/200
sick           nom           27           3541/231     1/200
sonar          num           60           97/111       1/5
tic-tac-toe    nom           9            332/626      1/10
vote           nom           16           267/168      1/20
All of the 15 UCI datasets have binary class and no missing values. Information on these datasets is tabulated in Table 2. Each whole dataset is first split randomly into three disjoint subsets: the training set, the unlabeled set, and the test set. The test set is always 25% of the whole dataset. To make sure that active learning can possibly show improvement when the unlabeled data are labeled and included into the training set, we choose a small training set for each dataset such that the “maximum reduction” of the error rate5 is large enough (greater than 10%). The training sizes of the 15 UCI datasets range from 1/200 to 1/5 of the whole datasets, also listed in Table 2. The unlabeled set is the whole dataset taking away the test set and the training set. In our experiments, we set the querying cost (CQ ) for any specific query as 1, and study the following three cost settings for generalized queries with r don’t-care features, as follows: – CQ = 1 + 0.5 × r: This setting represents a linear growth of CQ with respect to r. For instance, the cost of asking a generalized query with two don’t-care features is (CQ = 1 + 0.5 × 2 = 2), which equals to the cost of asking two specific ones. – CQ = 1 + 0.05 × r: This setting also represents a linear growth of CQ with respect to r. However, the cost of asking generalized queries is rather low in this case. For instance, the cost of asking a generalized query with 20 don’t-care features equals to the cost of asking two specific ones. – CQ = 1 + 0.5 × r2 : This setting represents a non-linear growth of CQ with respect to r. In addition, the cost of asking generalized queries is higher in this case. For instance, the cost of asking a generalized query with only two don’t-care features equals to the cost of asking three specific ones. 5
The “maximum reduction” of the error rate is the error rate on the initial training set R alone (without any benefit of the unlabeled examples) subtracting the error rate on R plus all the unlabeled data in U with correct labels. The “maximum reduction” roughly reflects the upper bound on error reduction that active learning can achieve.
Note that these settings of querying cost are only used here for empirical study; any other types of querying cost could be easily applied without changing the algorithms. As for all the 15 UCI datasets, we have neither true target functions nor human oracles to answer the generalized queries, so we simulate the target functions by constructing learning models on the entire datasets in the experiments. The simulated target function regards each generalized query as a specific example with missing values, and provides the posterior class probability as the answer to the learner. The experiment is repeated 10 times on each dataset (i.e., each dataset is randomly split 10 times), and the experimental results are recorded.
4.2 Results for Balancing Acc./Cost Trade-Off
In Scenario 1, we use accuracy to measure the performance of the learning model. Thus, we use an ensemble of bagged decision trees (implemented in Weka [5]) as the learning algorithm in the experiment. Any other learning algorithms can also be implemented in real-world applications. Figure 1 demonstrates the performance of the proposed algorithm considering only querying cost (denoted by “AGQ-QC”; see Section 3.1), compared with “Pool” and “AGQ” on a typical UCI dataset “breast-cancer”. We can see from the subfigures of Figure 1 that, with all the three querying cost settings, “AGQQC” can always effectively increase the predictive accuracy of the learning model with low querying cost, and outperform “Pool” and “AGQ”. More specifically, in the case that (CQ = 1 + 0.5 × r), “AGQ-QC” significantly outperforms both “Pool” and “AGQ” during the entire learning process. In the case that (CQ = 1 + 0.05 × r), although “AGQ-QC” still outperforms the other two algorithm, it performs similarly to “AGQ”. As the cost of asking generalized queries is rather low in this case, “AGQ-QC” tends to discover as more as possible don’tcare features in the queries, thus producing similar predictive performance as “AGQ”. In the case that (CQ = 1 + 0.5 × r2 ), “AGQ-QC” still significantly outperforms the other algorithms. Note that, In this case, the cost of asking generalized queries is relatively high (i.e., grows quadratically with the number of don’t-care feature), thus “AGQ” tends to discover as few as possible don’t-care features, and consequently behaves similarly to “Pool”.
Fig. 1. Comparison between “AGQ-QC”, “AGQ” and “Pool” on a typical UCI data “breast-cancer”, for balancing acc./cost trade-off
404
J. Du and C.X. Ling
Table 3. Summary of the t-test for balancing acc./cost trade-off
                              AGQ-QC
         C = 1 + 0.5 × r   C = 1 + 0.05 × r   C = 1 + 0.5 × r²
Pool     6/7/2             10/4/1             5/6/4
AGQ      14/0/1            6/7/2              15/0/0
To quantitatively compare the learning curves, we measure the actual values of the accuracies in 10 equal-distance points on the x-axis. The 10 accuracies of one curve are compared with the 10 accuracies of another using the two-tailed, paired t-test with 95% confidence level. The t-test results on all the 15 UCI datasets with all the three querying cost settings are summarized in Table 3. Each entry in the table, w/t/l, means that the algorithm in the corresponding column wins on w, ties on t, and loses on l datasets, compared with the algorithm in the corresponding row. We can observe similar phenomena from Table 3: "AGQ-QC" significantly outperforms "AGQ" when the querying cost is relatively high (C_Q = 1 + 0.5 × r² and C_Q = 1 + 0.5 × r), and significantly outperforms "Pool" when the querying cost is relatively low (C_Q = 1 + 0.05 × r).
4.3 Results for Minimizing Total Cost
In Scenario 2, we use total cost to measure the performance of the learning model. Thus, we use a cost-sensitive algorithm CostSensitiveClassifier based on an ensemble of bagged decision trees (implemented in Weka [5]) as the learning algorithm in the experiments. In addition, we set the false negative (FN) and false positive (FP) costs as 2 and 10 respectively, and we set the number of the future predicted examples as 1000. Still, any other settings can be easily applied without changing the algorithm. Figure 2 demonstrates the performance of the proposed algorithm considering total cost (denoted by “AGQ-TC”), compared with “Pool” and “AGQ” on the same UCI dataset “breast-cancer”. We can see from Figure 2 that “AGQTC” effectively decreases the total cost of the learning model, and significantly outperforms “Pool” and “AGQ” with most querying cost settings. More specifically, we can discover the similar pattern between “AGQ-TC” and “AGQ” as in the previous subsection: When the querying cost is relatively low (such as CQ = 1 + 0.05 × r), “AGQ-TC” and “AGQ” tend to perform similarly; when the querying cost is relatively high (such as CQ = 1 + 0.5 × r2 ), “AGQ-TC” often significantly outperforms “AGQ”. The t-test results on the 15 UCI datasets are summarized in Table 4. It clearly shows that, “AGQ-TC” performs significantly better than “AGQ” on most (or even all) tested datasets, when the querying cost is relatively high (CQ = 1 + 0.5 × r2 and CQ = 1 + 0.5 × r). When compared with “Pool”, “AGQTC” still wins (or at least ties) on a majority of tested datasets, especially when the querying cost is relatively low (CQ = 1+0.05×r). These experimental results clearly indicate that “AGQ-TC” can indeed significantly decrease the total cost, and outperforms “AGQ” and “Pool”.
Fig. 2. Comparison between "AGQ-TC", "AGQ" and "Pool" on a typical UCI data "breast-cancer", for minimizing total cost

Table 4. Summary of the t-test for minimizing total cost

                              AGQ-TC
         C = 1 + 0.5 × r   C = 1 + 0.05 × r   C = 1 + 0.5 × r²
Pool     6/7/2             10/4/1             6/6/3
AGQ      15/0/0            6/6/3              15/0/0

4.4 Approximate Probabilistic Answers
In the previous experiments, we have assumed that the oracle is always capable of providing accurate probabilistic answers for the generalized queries. However, in real-world situations, it is more common that only "approximate probabilistic answers" are provided (especially when the oracles are human experts). We speculate that small perturbations in the probabilistic answers will not dramatically affect the performance of the proposed algorithms. This is because small perturbations in label probabilities only represent light noise, which could be cancelled out in the successive updates of the training set. With a robust base learning algorithm (such as the bagged decision trees), the learner would be insensitive to such small noise. In this subsection, we study this issue experimentally. To simulate the approximate probabilistic answer, we first calculate the exact probabilistic answer from the target model, and then randomly alter it with up to 20% noise. Figure 3 demonstrates the performance of the proposed algorithms with such approximate probabilistic labels (denoted by "AGQ-QC (appr)" and "AGQ-TC (appr)"), compared with "AGQ-QC" and "AGQ-TC", with the setting (C_Q = 1 + 0.5 × r) and on the typical dataset ("breast-cancer").
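The perturbation used to simulate approximate answers can be read, for instance, as the following additive noise; this is only our interpretation of "up to 20% noise".

```python
import random

def perturb_answer(p, max_noise=0.2):
    """Randomly alter the exact probabilistic answer p by up to max_noise,
    clipping the result back into [0, 1]."""
    return min(1.0, max(0.0, p + random.uniform(-max_noise, max_noise)))
```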
(a) Comparison between "AGQ-QC" and "AGQ-QC (appr)" (with up to 20% noise). (b) Comparison between "AGQ-TC" and "AGQ-TC (appr)" (with up to 20% noise).
Fig. 3. Experimental results with approximate probabilistic answer on "breast-cancer"
We can clearly see from these figures that, when only the approximate probabilistic answers are provided by the oracle, the performance of the proposed algorithms are not significantly affected. The similar experimental results can be shown with other settings and on other datasets. This indicates that, the proposed algorithms are rather robust with such more realistic approximate probabilistic answers, thus can be directly deployed in real-world applications.
5 Conclusion
In this paper, we assume that the oracles are capable of answering generalized queries with non-uniform costs, and study active learning with generalized queries in cost-sensitive framework. In particular, we design two objective functions to choose generalized queries in the learning process, so as to either balance the accuracy/cost trade-off or minimize the total cost of misclassification and querying. The empirical study verifies the superiority of the proposed methods over the existing active learning algorithms.
References 1. Asuncion, A., Newman, D.J.: UCI machine learning repository (2007) 2. Baram, Y., El-Yaniv, R., Luz, K.: Online choice of active learning algorithms. Journal of Machine Learning Research 5, 255–291 (2004) 3. Cohn, D.A., Ghahramani, Z., Jordan, M.I.: Active learning with statistical models. Journal of Artificial Intelligence Research 4, 129–145 (1996) 4. Du, J., Ling, C.X.: Active learning with generalized queries. In: Proceedings of the 9th IEEE International Conference on Data Mining, pp. 120–128 (2009) 5. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The weka data mining software: an update. SIGKDD Explorations 11(1), 10–18 (2009) 6. Kapoor, A., Horvitz, E., Basu, S.: Selective supervision: Guiding supervised learning with decision-theoretic active learning. In: Proceedings of International Joint Conference on Artificial Intelligence (IJCAI), pp. 877–882 (2007) 7. Lewis, D.D., Catlett, J.: Heterogeneous uncertainty sampling for supervised learning. In: Proceedings of ICML 1994, 11th International Conference on Machine Learning, pp. 148–156 (1994) 8. Margineantu, D.D.: Active cost-sensitive learning. In: Nineteenth International Joint Conference on Artificial Intelligence (2005) 9. Roy, N., Mccallum, A.: Toward optimal active learning through sampling estimation of error reduction. In: Proc. 18th International Conf. on Machine Learning, pp. 441–448 (2001) 10. Settles, B., Craven, M., Friedland, L.: Active learning with real annotation costs. In: Proceedings of the NIPS Workshop on Cost-Sensitive Learning (2008) 11. Seung, H.S., Opper, M., Sompolinsky, H.: Query by committee. In: Proceedings of the Fifth Annual Workshop on Computational Learning Theory, pp. 287–294 (1992) 12. Tong, S., Koller, D.: Support vector machine active learning with applications to text classification. Journal of Machine Learning Research 2, 45–66 (2002)
Ranking Individuals and Groups by Influence Propagation
Pei Li¹, Jeffrey Xu Yu², Hongyan Liu³, Jun He¹, and Xiaoyong Du¹
¹ Renmin University of China, Beijing, China
  {lp,hejun,duyong}@ruc.edu.cn
² The Chinese University of Hong Kong, Hong Kong, China
  [email protected]
³ Tsinghua University, Beijing, China
  [email protected]
Abstract. Ranking the centrality of a node within a graph is a fundamental problem in network analysis. Traditional centrality measures based on degree, betweenness, or closeness miss to capture the structural context of a node, which is caught by eigenvector centrality (EVC) measures. As a variant of EVC, PageRank is effective to model and measure the importance of web pages in the web graph, but it is problematic to apply it to other link-based ranking problems. In this paper, we propose a new influence propagation model to describe the propagation of predefined importance over individual nodes and groups accompanied with random walk paths, and we propose new IPRank algorithm for ranking both individuals and groups. We also allow users to define specific decay functions that provide flexibility to measure link-based centrality on different kinds of networks. We conducted testing using synthetic and real datasets, and experimental results show the effectiveness of our method.
1 Introduction
Ranking the centrality (or importance) of nodes within a graph is a fundamental problem in network analysis. Recently, the online social networking sites, such as Facebook and MySpace, provide users with a platform to make people connected. Learning and mining on these large-scale social networks attract attentions of many researchers in the literature [1]. In retrospect, Freeman [2] reviewed and evaluated the methods about centrality measures, and categorized them into three conceptual foundations: degree, betweenness, and closeness. Accompanied with eigenvector centrality (EVC) proposed by Bonacich [3], these four measures dominate the empirical usage. The first three methods measure the centrality by simply calculating the edge degree or the mean or fraction of geodesic paths [4], and treat every node equally. In this paper, we focus on EVC, which ranks the centrality of a node v by considering the centrality of nodes that surround v. In the literature, most of link analysis approaches focus on the link structures and ignore the intrinsic characteristics of nodes over a graph. However, in many networks, nodes also contain important information, such as the page content in a web graph. Simply overlooking these predefined importance may facilitate the
Table 1. Notations

G(V, E)   A directed and weighted graph with group feature
I(i)      The set of in-coming neighbors of node i
O(i)      The set of out-going neighbors of node i
w(i, j)   The weight of edge (i, j)
T         The transition matrix of graph G
|X|       The size of set X
‖X‖       The sum of vector X or matrix X
Z         Predefined importance (or initial influence) of all nodes
Z_i(a)    The influence received by node a on the i-th iteration/step
K         The maximum iterations/steps in IP model
R         The final ranking vector of nodes
R_j(i)    The ranking of node i on the j-th iteration/step
GR        The final ranking vector of groups
usage of link spam. We believe the intrinsic characteristics of nodes also affect link-based ranking significantly. The main contributions of this work are summarized below. First, we discuss the problems with the current EVC approaches, for example, PageRank, which ignores the intrinsic impacts of nodes on the ranking. Second, we propose a new Influence Propagation model, called IP model, which propagates the user-defined importance over nodes in a graph by random walking. We allow users to specify decay functions to control how the influence propagates over nodes. It is worth noting that most of EVC approaches only use an exponential function, that is not proper in many cases which we will address later. Third, we give algorithms to rank an individual node and all nodes in a graph efficiently. Fourth, we discuss how to rank a group (a set of nodes) regarding the centrality using both inner and outer structural information. The remainder of the paper is organized as follows. Section 2 gives the motivation of our work. Section 3 discuss our new influence propagation model, and ranking algorithms for individual nodes and groups. We conducted extensive performance studies and report our findings in Section 4. The related work is given in Section 5 and we conclude in Section 6. The notations used in this paper are summarized in Table 1.
2 The Motivation
In this section, first, we discuss our motivation to propose a new influence model, and explain why PageRank is not applicable in some cases. Second, we give our intuitions on how to rank the centrality for a set of nodes. Why Not PageRank: As a typical variant of EVC [3], PageRank [5] models the behavior of a random surfer, who clicks some hyperlink in the current page with probability c, and periodically jumps to a random page because he "gets bored" with probability (1 − c). Let T be a transition matrix for a directed graph. For the p-th row and q-th column element of T, T_{p,q} = 0 if (p, q) ∉ E, and
(a) [directed network over nodes a, b, c, d, e]
(b)
Initial Importance R0          Normalized IPRank Scores (%)
[0.2, 0.2, 0.2, 0.2, 0.2]      [14.9, 10.0, 33.4, 24.3, 17.4]
[0.3, 1.0, 0.2, 0.2, 0.8]      [15.5, 14.2, 30.5, 21.2, 18.6]
[0.8, 1.0, 0.2, 0.2, 0.8]      [17.8, 13.8, 30.4, 20.5, 17.5]
PageRank scores: [0.149, 0.100, 0.334, 0.243, 0.174]
Fig. 1. (a) A simple directed network in which every node has a predefined importance. (b) PageRank scores and normalized IPRank scores corresponding to different predefined importance Z. The decay function is set to f(k) = 0.8^k.
T_{p,q} = w(p, q) / Σ_{i∈O(p)} w(p, i) otherwise, where w(p, q) is the weight of edge (p, q). The matrix form of PageRank can be written below:

R = c R T + (1 − c) U   (1)
Here, U corresponds to the distribution vector of web pages that a random surfer periodically jumps to, and ‖U‖ = 1 holds. Based on Eq. (1), PageRank scores can be iteratively computed by R_k = c R_{k−1} T + (1 − c) U. The solution R is a steady probability distribution with ‖R‖ = 1, and is decided by T and U only. It is important to note that the initial importance R_0 of all nodes in PageRank is ignored (refer to Eq. (1)). In other words, R_0 is not propagated in PageRank. As shown in Fig. 1, for the graph shown in Fig. 1(a), the PageRank scores for a, b, c, d, and e are 0.149, 0.100, 0.334, 0.243, and 0.174, respectively, regardless of any given initial importance R_0. However, in many real applications, the initial importance R_0 plays a significant role and greatly influences the resulting R. In addition, simply applying PageRank to measure centrality in general may result in unexpected results, because PageRank is originally designed to bring order to the web graph. For example, to model the "word-of-mouth" effect in social networks [6], where people are likely to be influenced by their friends, the behavior of "random jumping" used in PageRank is not reasonable, since the influence only occurs between two directly connected persons. Motivated by propagating the initial predefined importance of nodes and randomly jumping, we claim that PageRank is not applicable for link-based ranking in all possible cases. In this paper, we propose a more general and customizable model for link-based ranking. We propose a new Influence Propagation (IP) model and IPRank to rank nodes and groups, based on their structural contexts in the graph and predefined importance.
Group Ranking: In this paper, a group is a set of nodes in a graph. We categorize group centrality measures into two types. The first type exploits the inner information of a group. Two simple approaches to rank a group are either to sum or to average the centrality scores of the nodes in the group. However, summing is obviously problematic because larger groups tend to obtain higher scores. Averaging is unacceptable in some cases where a group with only one but high-score node beats another group with a large number of nodes. The second type employs the information outside a group. [7] analyzed this problem and proposed a measure based on the number of nodes outside a group that are connected to
(a) Group 1
(b) Group 2
(c) Group 3
Fig. 2. Three groups with the same degrees connected to outside nodes. ((a) and (b) are altered from Fig. 4.2.1 in [7].)
members of this group. More explicitly, let C be a group, and N(C) be the set of all nodes that are not in C but are neighbors of a member of C. [7] normalizes and computes group degree centrality by |N(C)| / (|V| − |C|), where |V| is the number of nodes in the graph. Clearly, this method measures group centrality from the view of nodes outside this group. However, given two large groups A and B where |A| > |B|, |N(A)| > |N(B)| is more likely to hold and |V| − |A| < |V| − |B| holds, making it easier for larger groups to obtain a higher degree centrality. Moreover, this method ignores the centrality scores of nodes in groups. In this work, we investigate how to combine the inner and outer structural context of a specific group. Some intuitions are given below. Consider Fig. 2. First, regarding the outer structural context, Group 2 should have a higher score than Group 1, because Group 2 has a larger span of neighbors. This intuition is drawn from real-world networks such as friendship networks, where a group with more contacts outside the group has a higher ranking. Second, regarding the inner structure of a group, both Group 2 and Group 3 have the same outside neighbors, but the inner structure of Group 3 is more compact and cohesive, so Group 3 gets a higher score than Group 2.
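For reference, the group degree centrality of [7] discussed above can be computed as follows for an undirected graph (our sketch using networkx):

```python
import networkx as nx

def group_degree_centrality(G, group):
    """|N(C)| / (|V| - |C|): the fraction of outside nodes that have at
    least one neighbor inside the group C."""
    C = set(group)
    N_C = {u for v in C for u in G.neighbors(v)} - C
    return len(N_C) / (len(G) - len(C))
```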
3 Ranking Nodes and Groups
In our influence propagation model, we consider every node has its own influence that needs to be propagated. This influence represents the predefined importance of a node, such as content, status, or preference. We consider a directed edge-weighted graph G(V, E), where V and E are the sets of nodes and edges respectively. Every node in V has attributes to describe its properties, and the attributes of a node can be used to indicate which groups the node belongs to. We use MA (a) to denote the belonging of the node a to a group A, and call it membership degree. Let Z be a vector to represent the predefined importance of nodes in G based on the attributes, and every element in Z is non-negative. The influence propagates following random walk [8]. Like the existing work [9,6,10,11], in our approach, influence propagation is a process that the incoming influence from in-neighbors of a node a to the node a itself at time t propagates to the out-neighbors of the node a with transition probability and decay effected at next time (t + 1). Regarding decay, we introduce a discrete decay function f (k) to describe the retained influence on the k-th step during the propagation with decay, where k ∈ {1, 2, ..., K} and K is the maximum propagation steps. The most prevalent decay function used in PageRank
is f(k) = c^k where 0 < c < 1. Generally, f(k) is a non-increasing function that satisfies f(k) < 1, and a smaller f(k) results in a smaller maximum number of steps. We allow users to configure f(k) in other forms, such as a linear function, to adapt to different situations. To help assess the maximum number of propagation steps K, a user needs to specify a threshold h that satisfies the following condition:

f(K) ≥ h and f(K + 1) < h,   (2)

which also defines the condition of convergence.
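A small sketch of how K can be obtained from a decay function and the threshold h of Eq. (2) is given below (the helper name and the sample value of h are illustrative, not taken from the paper):

```python
def max_steps(f, h, hard_limit=1000):
    """Return the largest K with f(K) >= h and f(K + 1) < h (Eq. 2).
    Assumes f is non-increasing; hard_limit guards against an unreachable h."""
    k = 0
    while k < hard_limit and f(k + 1) >= h:
        k += 1
    return k

K_exp = max_steps(lambda k: 0.7 ** k, h=0.05)             # exponential decay -> K = 8
K_lin = max_steps(lambda k: max(1 - 0.3 * k, 0), h=0.05)  # a linear decay     -> K = 3
```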
According to this definition, we show a proposition describing the influence propagation along a random walk path (cycles permitted), and define IPRank scores in Definition 1.

Proposition 3.1: For a random path p = ⟨v0, v1, ..., vk⟩ that starts at time 0, the influence Z(0) propagating from v0 to vk is

Z(k) = Z(0) · f(k) · ∏_{i=0}^{k−1} T_{i,i+1}   (3)
Proof Sketch: Let us analyze the case of one-step propagation. For an edge ⟨vi, vj⟩, the influence Z(j) propagating from vi is (f(j)/f(i)) · Z(i) · T_{i,j}. Since the path p can be viewed as a sequence of one-step propagations, Eq. (3) holds. □

Definition 1. The IPRank score of a node in a graph is measured by the influence of this node and the influence propagated in from other nodes.

Like PageRank, the assumption behind IPRank is that the more influence a node receives, the more important this node is. However, IPRank is more general than the mutual-reinforcement-based rankings. First, the initial importance Z of the nodes is taken into consideration: Z is propagated in our method and influences the IPRank scores. Reconsider Fig. 1(a); the IPRank scores differ for different Z, as shown in Fig. 1(b). Second, we allow users to specify a decay function.

3.1 IPRanking Nodes
The key to computing the IPRank score R(v) of a node v is how we collect the influence propagated in from other nodes. Note that after a propagation over k steps, the influence becomes so small that it can be ignored. Therefore, we only need to collect random walk paths that reach the node v within k steps. A possible method is a backward random walk, where the random surfer walks backwards along links starting from v and traverses nodes recursively. All nodes traversed can be viewed as the starting points of such random paths, and the influence propagated along each such path can be assessed by Proposition 3.1. Consider the node a in the graph G in Fig. 1(a) and suppose k = 1. Since we reverse all edges and traverse b and e starting from the node a, two random walk paths on G that reach the node a in one step are collected. We summarize the recursive procedure IPRank-One in Algorithm 1, which computes the IPRank score of a given node.
Algorithm 1. IPRank-One(G, v, Z, T, K)
Input: graph G(V, E), node v, predefined importance Z, transition matrix T, and maximum step K
Output: IPRank score R(v)
1: initialize R(v) = Z(v);
2: PathRecursion(v, v, 1, 0);
3: return R(v);
4: Procedure PathRecursion(v, n, x, y)
5:   y = y + 1;
6:   for every node u in the in-neighbor set of node n in G do
7:     R(v) = R(v) + Z(u) · x · T_{u,n} · f(y);
8:     if y < K then
9:       PathRecursion(v, u, x · T_{u,n}, y);
10:    end if
11:  end for
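A minimal Python transcription of Algorithm 1 is sketched below. It assumes the graph is given as in-neighbor lists and a transition matrix stored as nested dicts; it is only an illustration of the pseudocode above, not the authors' implementation (which was written in Java).

```python
def iprank_one(in_neighbors, T, Z, f, v, K):
    """IPRank score of a single node v (Algorithm 1): collect influence
    along all reverse random-walk paths of length at most K ending at v."""
    score = [Z[v]]                            # R(v) initialised with v's own influence

    def path_recursion(n, x, y):
        y += 1
        for u in in_neighbors.get(n, ()):     # walk one step backwards along an edge u -> n
            score[0] += Z[u] * x * T[u][n] * f(y)
            if y < K:
                path_recursion(u, x * T[u][n], y)

    path_recursion(v, 1.0, 0)
    return score[0]
```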
Furthermore, supposing the average out-degree in graph G is d, Algorithm 1 needs to traverse ∑_{i=1}^{k} d^i nodes and thus collects the same number of random walk paths. The time complexity of IPRank-One is O(d^k), which is acceptable for querying the IPRank scores of one or a few nodes. However, it is obviously inefficient when we need to compute the IPRank scores of all nodes in a graph. Based on our observations, the random walk paths generated by IPRank queries of different nodes contain shared segments, which can be reused to save computational cost. For example, the influence propagations along the paths ⟨a, b, a, c⟩ and ⟨a, b, a⟩ are computed in the IPRank queries for nodes c and a respectively, but they contain the same segment ⟨a, b, a⟩. We develop an algorithm, called IPRank-All, to compute IPRank for all nodes in matrix form. It is motivated by our IP model, where different nodes propagate their influence over different numbers of steps, and works as follows. The initial influence of all nodes is stored in a row vector Z. In the first step, every node propagates influence to its out-neighbors with decay factor f(1). Consider the influence received by a node v with in-neighbor set I(v): the influence received by v is Z1(v) = f(1) · ∑_{u∈I(v)} Z(u) · T_{u,v}. Considering all such nodes, we get Z1 = f(1) · Z·T in matrix form. In the second step, according to our IP model, all elements of Z1 propagate to their out-neighbors, and the influence vector received at the second step is Z2 = f(2) · Z·T². Analogously, the influence vector received at the k-th step can be computed iteratively by

Zk = f(k) · Z·T^k = (f(k)/f(k−1)) · Zk−1 · T   (4)
Recalling Definition 1, the IPRank vector obtained within k steps is as follows:

Rk = ∑_{i=1}^{k} Zi + Z = Z + ∑_{i=1}^{k} f(i) · Z·T^i = Z · (1 + ∑_{i=1}^{k} f(i) · T^i)   (5)
Eq. (4) and Eq. (5) form the main computation of the IPRank-All algorithm. Let Xk = Z·T^k; then Zk can be computed iteratively by applying Zk = f(k) · Xk with Xk = Xk−1 · T.
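A minimal NumPy sketch of this matrix iteration (Eqs. (4)–(5)) is given below, assuming Z is a row vector and T a row-normalized transition matrix; the names are illustrative.

```python
import numpy as np

def iprank_all(T, Z, f, K):
    """IPRank scores of all nodes: R_K = Z + sum_{k=1..K} f(k) * Z * T^k (Eq. 5),
    computed via the iteration X_k = X_{k-1} * T (Eq. 4)."""
    R = Z.astype(float).copy()   # row vector of initial influence
    X = Z.astype(float).copy()   # X_k = Z * T^k
    for k in range(1, K + 1):
        X = X @ T                # one more propagation step
        R += f(k) * X            # add the influence received at step k
    return R
```

With f(k) = c^k and Z = (1 − c)·U, this iteration reduces to the PageRank-style recurrence of Eq. (6) below.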
Algorithm 2. IPRank-All(G, Z, T, h)
Input: graph G(V, E), initial influence vector Z, transition matrix T, and threshold h
Output: IPRank scores R
1: initialize R = Z;
2: for every node v ∈ V do
3:   obtain K according to Eq. (2);
4:   RefineRecursion(v, Z(v), 0, K);
5: end for
6: return R;
7: Procedure RefineRecursion(v, x, y, K)
8:   y = y + 1;
9:   for every node u in the out-neighbor set of node v do
10:    R(u) = R(u) + x · T_{v,u} · f(y);
11:    if y < K then
12:      RefineRecursion(u, x · T_{v,u}, y, K);
13:    end if
14:  end for
So the time complexity of the IPRank-All algorithm is O(KNd), where d is the average in-degree and N is the graph size. For the most popular decay function f(k) = c^k, we get

Rk = Z · (1 + ∑_{i=1}^{k} c^i · T^i) = c · Rk−1 · T + Z   (6)
Hence, Rk can be computed iteratively if the decay function is exponential. Eq. (6) implies a mutual reinforcement of importance, as PageRank does. However, when f(k) is not exponential, we cannot compute Rk iteratively using Eq. (6). In this case, the efficient way to obtain Rk is to compute all Zk iteratively by Eq. (4) and sum them up. The algorithm for IPRank-All is given in Algorithm 2. While the IPRank of Eq. (4) and Eq. (5) propagates one step of all nodes at a time, Algorithm 2 propagates all steps of one node. Some useful propositions about IPRank computation are given below.

Proposition 3.2: The convergence rate of the IPRank scores Rk is decided by the decay function f(k).

Proof Sketch: According to Eq. (5), Rk − Rk−1 = f(k) · Z · T^k. Since each row of T is normalized to one unless all elements in this row are zero, ‖T^k‖ ≤ |V| holds. Hence Proposition 3.2 holds. □

Proposition 3.3: When f(k) = c^k, IPRank is in fact an extension of PageRank, and also a variant of the eigenvector centrality (EVC) measure.
Proof Sketch: Letting Z = (1 − c)·U in Eq. (6), we obtain PageRank as shown in Eq. (1). When k → ∞, Rk = Rk−1, and therefore R = c · R·T + Z. Suppose that X is a |V|-by-|V| matrix with non-zero values only on the diagonal that satisfies R·X = Z; then R = R·(T · c) + R·X = R·(T · c + X). Therefore, R is an eigenvector of (T · c + X). □

3.2 IPRanking Groups
As a set of nodes, the group's structural context consists of links from both the outside and the inside. If we view a group as a big node and apply IPRank on it, we simply get the group centrality measured from the outside of this group, which says "group centrality is the influence propagated in from nodes outside this group". Formally, if we use Z(u, v) to represent the influence Z(u) propagating from node u to node v (no matter via how many steps), we rank a group A from the viewpoint of the outside structure as

GR_out = ∑_{v∈A} M_A(v) · ∑_{u∉A} Z(u, v)   (7)
M_A(v) is the membership degree. On the other hand, if the nodes in a group are more connected to each other, this group should have a higher centrality. We do not use simple approaches such as summing and averaging, because they ignore the link information between individual nodes in a group. To reduce the effect of the group size, individual nodes with a high centrality should play a more important role, especially when they are highly connected. The IP model is also effective for ranking groups from the viewpoint of the inner structure, by propagating the influence of these high-score individuals via links. That is,

GR_in = ∑_{v∈A} M_A(v) · (Z(v) + ∑_{u∈A} Z(u, v))   (8)
Finally, we combine the rankings from the outer and inner structural contexts to rank groups in the graph G(V, E), as shown below:

GR = ∑_{v∈A} M_A(v) · (Z(v) + ∑_{u∈V} Z(u, v))   (9)
Ranking groups in a graph G is an extension of our IPRank algorithms. The basic idea of our IPRank algorithms is to collect influence propagated in from other nodes. In brief, we show three steps to perform group ranking in a graph G. (i) Set the centrality score R(v) of a node v as initial influence Z(v). (ii) Propagate influence via links by our IPRank algorithm. (iii) Rank groups by IPRank scores and the membership degree according to Eq. (9).
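A sketch of these three steps, reusing the iprank_all sketch above, is given below; the membership mapping and the group dictionary are illustrative assumptions about how M_A(v) and the groups might be stored (nodes are assumed to be integer indices into R).

```python
def group_rank(T, Z, f, K, membership, groups):
    """Group ranking of Eq. (9): propagate influence with IPRank (steps i-ii),
    then aggregate node scores weighted by the membership degree M_A(v) (step iii)."""
    R = iprank_all(T, Z, f, K)                          # IPRank scores of all nodes
    scores = {}
    for name, members in groups.items():
        scores[name] = sum(membership[name][v] * R[v] for v in members)
    return scores
```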
4 Experimental Study
We report our experimental results to confirm the effectiveness of IPRank at both the individual and group levels. We compare IPRank with four other centrality measures on accuracy, using various synthetic datasets and a large real co-authorship network from DBLP. All algorithms were implemented in Java, and all experiments were run on a machine with a 2.8 GHz CPU.
Table 2. (a) Normalized centrality scores of different measures. (b) Normalized IPRank scores while the predefined importance of node b increases step by step.

(a)
CD       = [0.20, 0.10, 0.30, 0.20, 0.20]
CB       = [0.21, 0.21, 0.19, 0.19, 0.21]
CC       = [0.29, 0.00, 0.33, 0.04, 0.33]
PageRank = [0.16, 0.12, 0.32, 0.23, 0.17]
IPRank   = [0.16, 0.21, 0.30, 0.19, 0.14]

(b)
Z                            Normalized IPRank scores (%)
[0.2, 0.0, 0.2, 0.2, 0.2]    [16.2, 5.66, 33.2, 25.8, 19.1]
[0.2, 0.2, 0.2, 0.2, 0.2]    [16.1, 11.6, 31.9, 23.2, 17.2]
[0.2, 0.4, 0.2, 0.2, 0.2]    [16.0, 15.6, 31.1, 21.4, 15.9]
[0.2, 0.6, 0.2, 0.2, 0.2]    [16.0, 18.4, 30.5, 20.2, 14.9]
IPRank vs. Others: A Case Study: In this experiment we evaluate the results produced by IPRank and other centrality measures based on degree, betweenness, closeness, and eigenvector. The comparison was performed on the small graph shown in Fig. 1(a). Note that each centrality measure has many variants, so we adopt the definitions from Wikipedia [12]. For a graph G(V, E), the degree centrality CD(a) of node a is CD(a) = indegree(a)/(|V| − 1). For the betweenness centrality, we define CB(a) = ∑_{s≠a≠t} σ_st(a)/σ_st, where σ_st is the number of shortest paths from s to t and σ_st(a) is the number of such paths that pass through node a. Closeness centrality is a little more complex, because the shortest path between two nodes may not exist in directed graphs. So we adopt the definition in [13], where the closeness centrality is CC(a) = ∑_{t∈V} 2^{−d(a,t)}, in which d(a, t) is the shortest distance from node a to t and d(a, t) = ∞ if t is unreachable. Finally, we use PageRank [5] to measure eigenvector centrality. To make the results comparable, we normalize the centrality scores of the different measures to sum to one and show them in Table 2(a). For IPRank, we set the predefined importance Z = [0.2, 0.8, 0.2, 0.2, 0.2] and decay f(k) = 0.7^k. Intuitively, degree centrality is similar to EVC but only considers direct neighbors. Betweenness and closeness centrality are based on shortest distances and emphasize the "prestige" rather than the "popularity" of a node. Thus, CB(c) < CB(b) and CC(b) = 0, which does not comply with the human intuition of ranking that is mainly based on popularity. In IPRank, we set a higher predefined importance on b, which contributes to a and finally makes R(a) > R(e), contrary to PageRank. Moreover, Table 2(b) shows the normalized IPRank scores as Z(b) increases step by step. Increasing the predefined importance of a node generally results in a higher IPRank score for this node. Decay functions also influence IPRank scores significantly. For example, based on the same predefined importance Z = [0.2, 0.8, 0.2, 0.2, 0.2], we define a new decay function with f(k) = 0.2 for k ≤ 3 and f(k) = 0 for k ≥ 4, replacing f(k) = 0.7^k. We then obtain a new IPRank score vector [0.15, 0.34, 0.22, 0.16, 0.13], which is quite different from the [0.16, 0.21, 0.30, 0.19, 0.14] shown in Table 2(a).

Results on the DBLP Co-Authorship Network: We use the author information of all DBLP¹ conference papers (a total of 745,593) to build a large co-authorship network. This network consists of 534,058 authors (nodes) and 1,589,343 co-author relationships (edges). There are 2,644 different conferences, and each author is associated with a vector showing how many papers he/she contributes to each conference.
¹ http://dblp.uni-trier.de/xml/, last modified in September 2009.
Table 3. (a) Ranking without predefined importance. (b) IPRank on the KDD area. (c) IPRank on the WWW area.

(a) Without predefined importance
Top-10 authors: Wei Li, Wei Wang, Wen Gao, Wei Zhang, Jun Zhang, Chin-Chen Chang, Li Zhang, Lei Wang, Alberto L. S-V, C. C. Jay Kuo
Top-10 conferences: iscas, icra, icip, hicss, hci, chi, wsc, vtc, iccS, icc

(b) KDD area
Top-10 authors: Jiawei Han, Philip S. Yu, Christos Faloutsos, Heikki Mannila, Padhraic Smyth, Bing Liu, Jian Pei, Vipin Kumar, Mohammed Javeed Zaki, Srinivasan Parthasarathy
Top-10 conferences: kdd, icdm, icml, icde, sigmod, nips, aaai, sdm, vldb, www

(c) WWW area
Top-10 authors: Wei-Ying Ma, Zheng Chen, C. Lee Giles, Ravi Kumar, Erik Wilde, Katsumi Tanaka, Yong Yu, Wolfgang Nejdl, Torsten Suel, Andrew Tomkins
Top-10 conferences: www, icde, sigir, semweb, sigmod, cikm, chi, vldb, kdd, aaai
The maximum number of co-authors is 361, by Philip S. Yu, and nearly 54.3% of the authors appear only once in DBLP. We set the number of co-authors as the edge weight. A conference serves as a group, and the membership degree between an author a and a conference C is decided by the ratio of a's papers published in C. The IPRank algorithm follows a human intuition: if authors a and b have the same number of co-authors but a's co-authors are more important, then according to our IP model, a receives more influence from its neighbors and earns a higher centrality than b. Besides, we consider that the decay of influence propagation via co-authorships should not be exponential, since an author means a lot to his co-authors but little to authors several hops away. In this experiment, we set the decay function f(k) = 1 − 0.3k for k ≤ 3 and f(k) = 0 for k ≥ 4. To illustrate the necessity of predefined importance, we first consider every author equally important and show the corresponding top-10 authors and conferences in Table 3(a). Some authors rank high only because they have many co-authors, and larger conferences obtain higher rankings. Second, we bias the ranking to a specific area by predefining importance for authors. In Table 3(b), authors who published papers in KDD are given a higher predefined importance, and we obtain the top-10 centrality rankings of authors and conferences in the Knowledge Discovery and Data Mining area. Third, we bias IPRank to the WWW area by giving higher predefined importance to authors who published in WWW, and show the results in Table 3(c). We can see that IPRank with predefined importance produces reasonable results. The experiments show that IPRank-All takes only 0.91 seconds to complete all three iterations.

Efficiency and Convergence Rate: PageRank does not provide a way to compute the score of only one node. In contrast, IPRank-One can do this without accuracy loss; an advantage is that if we only need the IPRank scores of a few nodes, IPRank-One is more efficient than IPRank-All. We execute experiments on a random graph with 1M nodes and 3M edges. IPRank-All takes 3.65s to perform all iterations, whereas IPRank-One needs only 0.01s to answer an IPRank query for one node. IPRank-All+ provides a more accurate measure
Fig. 3. (a) Time cost (s) of the traversal increasing with steps K. (b) Time cost (ms) of IPRank-All as the node size increases (up to 0.8M nodes). (c) Convergence rate (precision per iteration) of IPRank-All on the DBLP dataset.
than IPRank-All when the decay of some large predefined importance needs more iterations. Both IPRank-One and IPRank-All+ are based on a traversal of the nodes that reach the target node within K steps. Fig. 3(a) shows that the time cost of such a traversal increases rapidly as K increases. We recommend IPRank-One for IPRank queries of a few nodes and IPRank-All+ for more accurate IPRanking; IPRank-All is suitable for most cases. We set |E|/|V| = 5 and let the graph size |V| increase. The time cost of each IPRank-All iteration increases nearly linearly and is acceptable, as shown in Fig. 3(b). We test the convergence rate of IPRank-All on the DBLP co-authorship network with decay f(k) = 0.7^k. The precision at iteration k is defined by averaging Rk(a)/R(a) over every node a. Fig. 3(c) shows that after 10 iterations, the precision error is below 0.01.
5 Related Work
Historically, measuring the centrality of nodes (or individuals) in a network has been widely studied. Freeman [2] reviewed and categorized these methods into three conceptual foundations: degree, betweenness, and closeness. Together with the eigenvector centrality (EVC) proposed by Bonacich [3], these four measures dominate the empirical usage of centrality. A recent summary can be found in [4]. Besides, Tong et al. [14] proposed cTrack to find central objects in a skewed time-evolving bipartite graph, based on random walk with restart. In recent years, the trend of exploiting structural context has become prevalent in network analysis. The crucial intuition behind this trend is that "individuals relatively closer in a network are more likely to have similar characteristics". A typical example is PageRank [5], where page importance flows and is mutually reinforced along hyperlinks. Other examples and applications were explored in recent works such as [10,11,15]. [15] analyzed the propagation of trust and distrust on large networks of people. [11] used a few labeled examples to discriminate irrelevant results by computing proximity from the relevant nodes. Gyöngyi et al. discovered other good pages by propagating the trust of a small set of good pages [10].
Other studies that applied predefined importance to their measures include [16] and [17]. [16] modified PageRank to be topic-sensitive by assigning importance scores to each page with respect to a particular topic. [17] assigned PageRank scores to each page, and measured the similarity between web pages by propagating their own similarity and receiving similarities from other pages. On the other hand, most existing works simply use exponential decay, and there are few studies on applying user-defined decay functions in random walks. Perhaps the most explicit study on decay functions is [18], which discussed three decay (or damping) functions for link-based ranking and showed a linear approximation to PageRank. We are the first to introduce predefined importance and decay functions into EVC under a well-established intuitive model. We categorize group centrality measures into two types, which exploit the inner and outer information of a group, respectively. Approaches that sum or average the centrality scores of the individuals in a group belong to the first type. As an example of the second type, [7] ranked a group C by the nodes that are not in C but are neighbors of a member of C. Besides, there are some studies on quasi-cliques [13,19], which can be viewed as a special kind of group.
6 Conclusion
In this paper, we proposed a new influence propagation model that propagates user-defined importance on nodes to others along random walk paths, with user control provided by allowing users to define decay functions. We proposed new algorithms to measure the centrality of individuals and groups according to the user's view. We tested our approaches using a large real dataset from DBLP, and confirmed their effectiveness and efficiency. Acknowledgement: The work was supported in part by grants of the Research Grants Council of the Hong Kong SAR, China No. 419109, and the National Natural Science Foundation of China No. 70871068, 70890083 and 60873017.
References 1. Zhang, H., Smith, M., Giles, C.L., Yen, J., Foley, H.C.: SNAKDD 2008 social network mining and analysis report. SIGKDD Explorations 10(2), 74–77 (2008) 2. Freeman, L.C.: Centrality in social networks: conceptual clarification. Social Networks 1, 215–239 (1978) 3. Bonacich, P.: Factoring and weighting approaches to status scores and clique identification. Journal of Mathematical Sociology 2(1), 113–120 (1972) 4. Newman, M.: The mathematics of networks. In: Blume, L., Durlauf, S. (eds.) The New Palgrave Encyclopedia of Economics, 2nd edn. Palgrave MacMillan, Basingstoke (2008), http://www-ersonal.umich.edu/~mejn/papers/palgrave.pdf 5. Page, L., Brin, S., Motwani, R., Winograd, T.: The pagerank citation ranking: Bringing order to the web. Technical Report 1999-66, Stanford InfoLab (1999) 6. Kempe, D., Kleinberg, J.M., Tardos, É.: Maximizing the spread of influence through a social network. In: KDD, pp. 137–146 (2003)
7. Everett, M.G., Borgatti, S.P.: Extending centrality. In: Wasserman, S., Faust, K. (eds.) Social network analysis: methods and applications, pp. 58–63. Cambridge University Press, Cambridge (1994) 8. Motwani, R., Raghavan, P.: Randomized Algorithms. Cambridge University Press, Cambridge (1995) 9. Valente, T.: Network Models of the Diffusion of Innovations. Hampton Press, New Jersey (1995) 10. Gyöngyi, Z., Garcia-Molina, H., Pedersen, J.O.: Combating web spam with trustrank. In: VLDB, pp. 576–587 (2004) 11. Sarkar, P., Moore, A.W.: Fast dynamic reranking in large graphs. In: WWW, pp. 31–40 (2009) 12. Centrality in Wikipedia, http://en.wikipedia.org/wiki/Centrality 13. Dangalchev, C.: Mining frequent cross-graph quasi-cliques. Physica A: Statistical Mechanics and its Applications 365(2), 556–564 (2006) 14. Tong, H., Papadimitriou, S., Yu, P.S., Faloutsos, C.: Proximity tracking on time-evolving bipartite graphs. In: SDM, pp. 704–715 (2008) 15. Guha, R.V., Kumar, R., Raghavan, P., Tomkins, A.: Propagation of trust and distrust. In: WWW, pp. 403–412 (2004) 16. Haveliwala, T.H.: Topic-sensitive pagerank. In: WWW, pp. 517–526 (2002) 17. Lin, Z., Lyu, M.R., King, I.: Pagesim: a novel link-based measure of web page similarity. In: WWW, pp. 1019–1020 (2006) 18. Baeza-Yates, R.A., Boldi, P., Castillo, C.: Generalizing pagerank: damping functions for link-based ranking algorithms. In: SIGIR, pp. 308–315 (2006) 19. Jiang, D., Pei, J.: Mining frequent cross-graph quasi-cliques. TKDD 2(4) (2009)
Dynamic Ordering-Based Search Algorithm for Markov Blanket Discovery Yifeng Zeng, Xian He, Yanping Xiang, and Hua Mao Department of Computer Science, Aalborg University, DK-9220 Aalborg, Denmark Department of Computer Science, Uni. of Electronic Sci. and Tech. of China, P.R. China {yfzeng,huamao}@cs.aau.dk,{hexian1987,xiangyanping}@gmail.com
Abstract. Markov blanket discovery plays an important role in both Bayesian network induction and feature selection for classification tasks. In this paper, we propose the Dynamic Ordering-based Search algorithm (DOS) for learning a Markov blanket of a domain variable from statistical conditional independence tests on data. The new algorithm orders conditional independence tests and updates the ordering immediately after a test is completed. Meanwhile, the algorithm exploits the known independence to avoid unnecessary tests by reducing the set of candidate variables. This results in both efficiency and reliability advantages over the existing algorithms. We theoretically analyze the algorithm on its correctness and empirically compare it with the state-of-the-art algorithm. Experiments show that the new algorithm achieves computational savings of around 40% on multiple benchmarks while securing similar or even better accuracy. Keywords: Graphical Models, Markov Blanket, Conditional Independence.
1 Introduction A Bayesian network (BN) [1] is a type of statistical model that efficiently represents the joint probability distribution of a domain. It is a directed acyclic graph where nodes represent the domain variables of a subject matter, and arcs between the nodes describe the probabilistic relationships of the variables. One problem that naturally arises is the learning of such a model from data. Most of the existing algorithms fail to construct a network of hundreds of variables in size. A reasonable strategy for learning a large BN is to first discover the Markov blanket of the variables, and then use it to guide the construction of the full BN [2,3,4,5]. The Markov blanket is indeed an important concept with potential uses in numerous applications. For every variable of interest T, the Markov blanket contains a set of parents, children, and spouses (i.e., parents of common children) of T in a BN [1]. The parents and children reflect the direct causes and direct effects of T respectively, while the spouses represent the direct causes of T's direct effects. Such causal knowledge is essential if domain experts desire to manipulate the data process, e.g., to perform troubleshooting on a faulty device, to test the body's reaction to a medicine, or to study the symptoms of a disease. Furthermore, conditioned on its Markov blanket variables, the variable T is probabilistically independent of all other variables in the domain. Given this important property, the Markov blanket is inextricably connected to
Fig. 1. Markov blanket of the target node T in the BN. It includes the parents and children of T, PC(T) = {C, D, I}, and the spouses, SP(T) = {R, H}.
the feature selection problem. Koller and Sahami [6] showed that the Markov blanket of T is the theoretically optimal set of features to predict T's values. We show an instance of a Markov blanket within a small BN in Fig. 1. The goal of this paper is to identify the Markov blanket of a target variable from data in an efficient and reliable manner. Research on Markov blanket discovery traces back to the Grow-Shrink algorithm (GS) in Margaritis and Thrun's work [7]. The Grow-Shrink algorithm is the first Markov blanket discovery algorithm proved to be correct. Tsamardinos et al. [8,9] proposed several variants of GS, like the incremental association Markov blanket (IAMB) and Interleaved IAMB, that aim at improved speed and reliability. However, these algorithms are still limited in achieving data efficiency. To overcome this limitation, attempts have been made including the Max-Min Parents and Children (MMPC) [10] and HITON-PC [11] algorithms for Markov blanket discovery. Neither of them is shown to be correct. This motivated a new generation of algorithms like the Parent-Child based search of Markov blanket (PCMB) [12] and its improvement, the Iterative PCMB (IPC-MB) [13]. Besides its proved soundness, the IPC-MB inherits the searching strategy of the MMPC and HITON-PC algorithms: it starts by learning both parents and children of the target variable and then proceeds to identify the spouses of the target variable. This results in a Markov blanket from which we are able to differentiate direct causes (effects) from indirect relations to the target variable. The differentiation of Markov blanket variables is rather useful when the Markov blanket is further analyzed to recover the causal structure, e.g., providing a partial order to speed up the learning of the full BN. In a similar vein, we base our new algorithm on the IPC-MB and provide improvements in both time and data efficiency. In this paper, we propose a novel Markov blanket discovery algorithm, called the Dynamic Ordering-based Search (DOS) algorithm. Akin to the existing algorithms, the DOS takes an independence-based search to find a Markov blanket by assuming that the data were generated from a faithful BN modeling the domain. It conducts a series of statistical conditional independence tests toward the goal of identifying the Markov blanket variables (parents and children as well as spouses). Our main contribution in developing the DOS is twofold. Firstly, we arrange the sequence of independence tests by ordering the variables not only in the candidate set, but also in the conditioning sets. We order the candidates using an independence measure such as mutual information [14] or the p-value returned by G² tests [15]. Meanwhile, we order the conditioning variables in terms of the frequency with which the variables enter
into the conditioning set in the previous independence tests. We re-order the variables immediately after an independence test is completed. By ordering both types of variables, we are able to detect true negatives effectively with a small number of conditional independence tests. Secondly, we exploit the known conditional independence tests to remove true negatives from the candidate set as early as possible. By doing so, we need to test only a small number of conditioning sets (generated from the candidate set), thereby improving time efficiency. In addition, we can limit the conditioning set to a small size in the new independence tests, which achieves data efficiency. We further prove the correctness of the new DOS algorithm. Experimental results show the benefit of dynamically ordering independence tests and demonstrate superior performance over the IPC-MB algorithm.
2 Background and Notations In the present paper we use uppercase letters (e.g., X, Y , Z) to denote random variables and boldface uppercase letters (e.g., X, Y, Z) to represent sets of variables. We use U to denote the set of variables in the domain. A “target” variable is denoted as T unless stated otherwise. “Nodes” and “variables” will be used interchangeably. We use I(X, Y |Z) to denote the fact that two nodes X and Y are conditionally independent given the set of nodes Z. Using conditional independence, we may define the Markov blanket of the target variable T , denoted by M B(T ), as follows. Definition 1 (Markov Blanket). The Markov blanket of T is a minimal set of variables conditioned on which all other variables are independent of T , i.e., ∀X ∈ U − {M B(T ) ∪ T } I(X, T |M B(T )). Bayesian network (BN) [1] is a directed acyclic graph G where each node is annotated with a conditional probability distribution (CPD) given any instantiation of its parents. The multiplication of all CPDs constitutes a joint probability distribution P modeling the domain. In a BN, a node is independent of its non-descendants conditioned on its parents. Definition 2 (Faithfulness). A BN G and a joint probability distribution P is faithful to one another iff every conditional independence entailed by the graph G is also present in P [1,15]. A BN is faithful if it is faithful to its corresponding distribution P , i.e., IG (X, Y |Z)=IP (X, Y |Z). A graphical criterion for entailed conditional independence is that of d-separation [1] in a BN. It is defined as follows. Definition 3 (d-separation). Node X is d-separated from node Y conditioned on Z in the graph G if, for all paths between X and Y , either of the following two conditions holds: 1. The connection is serial or diverging and Z is instantiated. 2. The connection is converging, and neither Z nor any of Z’s descendants is instantiated.
Due to the faithfulness assumption and the d-separation criterion, we are able to learn a BN from data generated from the domain. We may utilize statistical tests to establish the conditional independence between variables that is structured in the BN. This motivates the main idea of an independence-based (or constraint-based) search for learning a BN [15]. Most current BN or Markov blanket learning algorithms are based on the following theorem [15]. Theorem 1. If a BN G is faithful to a joint probability distribution P then: 1. Nodes X and Y are adjacent iff X and Y are conditionally dependent given any other set of nodes. 2. For each triplet of nodes X, Y and Z in G such that X and Y are adjacent to Z, but X and Y are not adjacent, X → Z ← Y is a sub-graph of G iff X and Y are dependent conditioned on every other set of nodes that contains Z.
3 DOS Algorithm The Dynamic Ordering-based Search algorithm (DOS) discovers the Markov blanket MB(T) through two procedures. In the first procedure, the algorithm finds a candidate set of parents and children of the target node T, called CPC(T). It starts with the whole set of domain variables and gradually excludes those that are independent of T conditioned on a subset of the remaining set. In the second procedure, the algorithm identifies the spouses of the target node, called SP(T), and removes the false positives from CPC(T). The resulting CPC(T) is the output MB(T). Prior to presenting the DOS algorithm, we introduce three functions. The first function, called Indep(X, T|S), measures the independence between the variable X and the target variable T conditioned on a set of variables S. In our algorithm, we use G² tests to compute the conditional independence and take the p-value (returned by the G² test) as the independence measurement [15]. The smaller the p-value, the higher the dependence. In practice, we compare the p-value to a confidence threshold 1−α. More precisely, we let Indep(X, T|S) be equivalent to the p-value so that we are able to connect the independence measurement to conditional independence, i.e., I(X, T|S) = true iff Indep(X, T|S) ≥ 1 − α. Notice that we assume the independence tests are correct. The second function, called Freq(Y), is a counter that measures how frequently a variable Y enters into the conditioning set S in the previous conditional independence tests Indep(X, T|S). A large Freq(Y) value implies a large probability of d-separating
X from T using Y in the conditioning set S. In general, the variables belonging to PC(T) have a large Freq(Y) value. The third function, called GenSubset(V, k), generates all subsets of size k from the set of variables V in the Banker's sequence [16]. The Banker's sequence is one way of enumerating all subsets of a set. It examines subsets in monotonically increasing order of size; for all subsets of size k, it constructs each subset by sequentially picking k elements from the set. We denote the resulting set by SS, i.e., SS = GenSubset(V, k). Notice that SS contains a set of ordered subsets of identical size. For example, we may have SS = GenSubset({A, B, C}, 2) = {{A, B}, {A, C}, {B, C}}. 3.1 Algorithm Formulation We present the details of the DOS algorithm in Fig. 2. As mentioned above, the new algorithm uses two procedures, called GenCPC and RefCPC respectively, to discover the Markov blanket of the target variable T. It starts with the GenCPC procedure, which aims to find a candidate set of parents and children of T. The GenCPC procedure searches for CPC(T) by shrinking the set of T's adjacent variables, called ADJ(T). The initial ADJ(T) is the whole set of domain variables except T, i.e., ADJ(T) = U − {T}. The procedure then removes a Non-PC (non-parent-and-child) variable from ADJ(T) if the variable is conditionally independent of T given a subset of the adjacent set (lines 7-9). We use G² estimation in the conditional independence tests (line 7), and check the independence of each adjacent variable (line 4) by examining the empty conditioning set (cutsize = 0) first, then all conditioning sets of size 1, later all those of size 2, and so on, until cutsize → |ADJ(T)| (lines 1 and 15). Recall that the number of data instances required for reliable G² tests is exponential in the size of the conditioning set S. Hence the strategy of monotonically increasing the size of S contributes to improved data efficiency. We observe that the plain algorithm needs to iterate over every adjacent variable of T and test conditional independence given, possibly, all subsets of the adjacent set ADJ(T). Clearly, we may speed up the procedure by reducing ADJ(T) as early as possible. In other words, we shall remove Non-PC variables from ADJ(T) as early as possible using effective conditional independence tests. This involves two issues: 1) selection of an adjacent variable (X ∈ ADJ(T)) that is most likely to be a Non-PC variable; 2) selection of a conditioning set S that can effectively d-separate the adjacent variable X from T. We solve the first issue by choosing the variable that has the minimum relevance with T as measured by the p-values (line 4), i.e., X = argmax_{X∈ADJ(T)} Indep(X, T|S). A large p-value (= Indep(X, T|S)) implies a large
probability of claiming conditional independence. The selected variable X is the one that has not been visited and has the largest p-value among all un-visited adjacent variables. Notice that we use the known p-values from the previous independence tests, in which the size of S is one smaller than in the new tests (line 7). We solve the second issue by attaching a counter function Freq(Y) to each variable Y. The function records how often the variable Y (in the conditioning set) d-separates a Non-PC variable from T. We update the counter immediately after an effective test is executed, and order the adjacent variables in descending order of their counters (lines 12-13). We generate the conditioning sets SS, each of which has size cutsize, from ADJ(T) using the GenSubset function (line 5).
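The two selection heuristics just described can be written down compactly, as in the sketch below (illustrative names only: p_value holds the latest known Indep values and freq holds the Freq(Y) counters):

```python
def pick_candidate(adj, p_value):
    # choose the adjacent variable with the largest known p-value,
    # i.e. the one least relevant to T and most likely to be a Non-PC variable
    return max(adj, key=lambda x: p_value.get(x, 0.0))

def order_adj(adj, freq):
    # order ADJ(T) by how often each variable helped d-separate a Non-PC variable from T
    return sorted(adj, key=lambda y: freq.get(y, 0), reverse=True)
```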
Dynamic Ordering-based Search algorithm (DOS)
Input: Data D, Target Variable T, Confidence Level 1−α
Output: Markov Blanket MB(T)

Main Procedure
1: Initialize the adjacent set of T: ADJ(T) = U − {T}
2: Find CPC(T) through GenCPC: CPC(T) = GenCPC(D, T, ADJ(T))
3: Find SP(T) and remove the false positives through RefCPC: MB(T) = RefCPC(D, T, CPC(T))

Sub-Procedure: Generate the CPC(T)
GenCPC(D, T, ADJ(T))
1: Initialize the size of the conditioning set S: cutsize = 0
2: WHILE (|ADJ(T)| > cutsize) DO
3:   Initialize the Non-PC set: NPC(T) = ∅
4:   FOR each X ∈ ADJ(T), choosing X = argmax Indep(X, T|S), DO
5:     Generate the conditioning sets: SS = GenSubset(ADJ(T) − {X}, cutsize)
6:     FOR each S ∈ SS DO
7:       IF (Indep(X, T|S) ≥ 1 − α) THEN
8:         NPC(T) = NPC(T) ∪ X
9:         ADJ(T) = ADJ(T) − NPC(T)
10:        Keep the d-separating set: Sepset(X, T) = S
11:        FOR each Y ∈ S DO
12:          Update Freq(Y)
13:        Order ADJ(T) using Freq(Y) in descending order
14:        Break
15:  cutsize = cutsize + 1
16: Return CPC(T) = ADJ(T)

Sub-Procedure: Refine the CPC(T)
RefCPC(D, T, CPC(T))
1: FOR each X ∈ CPC(T) DO
2:   Find the CPC for X: CPC(X) = GenCPC(D, X, U − {X})
3:   IF T ∉ CPC(X) THEN
4:     Remove the false positive: CPC(T) = CPC(T) − {X}
5:     Continue
6:   FOR each Y ∈ {CPC(X) − CPC(T)} DO
7:     IF (Indep(Y, T|X ∪ Sepset(X, T)) < 1 − α) THEN
8:       Add the spouse Y: SP(T) = SP(T) ∪ {Y}
9: CPC(T) = CPC(T) ∪ SP(T)
10: Return MB(T) = CPC(T)
Fig. 2. The DOS algorithm contains two sub-procedures. The GenCPC procedure finds a candidate set of parents and children of T by efficiently removing Non-PC variables from the set of domain variables, while the RefCPC procedure mainly adds the spouses of T and removes false positives.
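A minimal sketch of the GenSubset function used at line 5 of GenCPC is given below; it enumerates the size-k subsets of the ordered variable list in the Banker's sequence (the implementation via itertools is our illustration, not the authors' code):

```python
from itertools import combinations

def gen_subset(V, k):
    """All subsets of size k of the ordered list V, picked sequentially
    as in the Banker's sequence for a fixed subset size."""
    return [set(c) for c in combinations(V, k)]

# gen_subset(['A', 'B', 'C'], 2) -> [{'A', 'B'}, {'A', 'C'}, {'B', 'C'}]
```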
Since we order the ADJ(T) variables and generate the subsets in the Banker's sequence, the conditioning sets S (∈ SS) selected first have a large probability of being PC(T) or a subset of it. Consequently, we may detect a Non-PC variable within few tests. Once we identify a Non-PC variable, we immediately remove it from ADJ(T) (lines 8-9). The reduced ADJ(T) avoids generating a large number of conditioning sets, as well as large conditioning sets, in the new tests. The GenCPC procedure returns a candidate set of T's parents and children that excludes false negatives. However, it may include false positives. For instance, in Fig. 1, the variable M remains in the output CPC(T) because M is d-separated from T only when conditioned on the set {R, I}. However, the variable R is removed early since it is independent of T given the empty set; hence the tests never condition on both R and I simultaneously. The problem is fixed by checking the symmetric relation between T and T's PC, i.e., T shall be in the PC set of each of T's PC variables and vice versa [2,12]. For example, we may find the candidate set of M's parents and children, CPC(M); if T does not belong to CPC(M), we can safely remove M from CPC(T). We present this solution in the procedure RefCPC. In the procedure RefCPC, we search for the parents and children set of each variable in CPC(T) (line 2). If a candidate PC variable violates the symmetry (i.e., T ∉ CPC(X)), it is removed from CPC(T) (line 4). If T ∈ CPC(X), we know that X is a true PC of T, and CPC(X) may contain T's spouse candidates. A spouse is not within CPC(T), but shares common children with T. We again use G² tests to detect the dependence between the spouse and T, and identify the true spouse set SP(T) (lines 7-9). We refine CPC(T) by removing the false positives and retrieving the spouses, and finally return the true MB(T). 3.2 Theoretical Analysis The new algorithm DOS bases its searching scheme on the state-of-the-art algorithm IPC-MB. It embeds three functions (Indep, Freq and GenSubset) to improve both time and data efficiency. Its correctness rests on the two procedures, namely GenCPC and RefCPC. The procedure GenCPC removes a Non-PC variable X if X is independent of T conditioned on some subset of ADJ(T) − {X}. For the removal of false positives, the algorithm resorts to a check of the symmetric relation between T and each of T's PC variables; this additional check ensures a correct PC set of T. Besides removing the false positives, the procedure RefCPC adds T's spouses to complete MB(T). Its correctness lies in the inference that a spouse Y is not a candidate of T's PC, but is dependent on T conditioned on their common children. We conclude the correctness of the DOS algorithm below; a more technical proof is found in [13]. Theorem 3 (Correctness). The Markov blanket MB(T) returned by the DOS algorithm is correct and complete given two assumptions: 1) the data D are faithful to a BN; and 2) the independence tests are correct. The primary complexity of the DOS algorithm is due to the procedure GenCPC in Fig. 2. As in the performance evaluation of BN learning algorithms, the complexity is measured in the number of conditional independence tests executed [15].
The procedure needs to calculate the independence function Indep(X, T|S) for each domain variable given, in the worst case, all subsets of ADJ(T). Hence the number of tests is bounded by O(|U| · 2^{|ADJ(T)|}). Our strategy of selecting both the candidate variable X and the conditioning set S quickly reduces ADJ(T) by removing Non-PC variables, and in most cases tests only the subsets of PC(T). Ideally, we may expect the complexity to be in the order of O(|U| · 2^{|PC(T)|}). This is a significant reduction in complexity, since |PC(T)| ≪ |ADJ(T)| in most cases.
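The conditional independence tests counted here are G² tests on discrete data. A minimal sketch of such a test is given below; it assumes the data are in a pandas DataFrame of discrete variables, and the helper name and signature are illustrative rather than the authors' implementation.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2

def g2_test(data, x, y, s):
    """G^2 test of independence between columns x and y given the columns in s.
    Returns (G2 statistic, degrees of freedom, p-value)."""
    g2 = 0.0
    groups = [(None, data)] if not s else data.groupby(list(s))
    for _, block in groups:                           # one contingency table per configuration of S
        table = pd.crosstab(block[x], block[y]).to_numpy().astype(float)
        n = table.sum()
        if n == 0:
            continue
        expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
        mask = table > 0                              # only cells with positive observed counts
        g2 += 2.0 * np.sum(table[mask] * np.log(table[mask] / expected[mask]))
    dof = (data[x].nunique() - 1) * (data[y].nunique() - 1)
    for col in s:
        dof *= data[col].nunique()
    p_value = chi2.sf(g2, max(dof, 1))
    return g2, dof, p_value
```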
4 Experimental Results We evaluate the performance of the DOS algorithm on three benchmark networks and compare it with the state-of-the-art algorithm IPC-MB. To the best of our knowledge, the IPC-MB is the best algorithm for Markov blanket discovery in the current literature. Both algorithms are implemented in Java and the experiments are run on a Windows XP platform with a Pentium(R) Dual-Core (2.60 GHz) CPU and 2G memory. We describe the networks used in Table 1. The networks range from 20+ to 50+ variables in the domain and differ in connectivity, measured by both in/out-degree and PC numbers. They provide useful tools in a wide range of practical applications and have been proposed as benchmarks for evaluating both BN and Markov blanket learning algorithms [2]. For each of the networks we randomly sample data from the probability distribution of the network. We use both the DOS and IPC-MB algorithms to reconstruct the Markov blanket of every variable from the data. We compare the algorithms in terms of speed, measured by both the running times and the number of conditional independence (CI) tests executed, and accuracy, measured by both precision and recall. Precision is the ratio of true positives in the output (returned by the algorithms), while recall is the ratio of returned true positives in the true MB(T). In addition, we use a combined measure that is the proximity of the algorithm's precision and recall to perfect precision and recall, expressed as the Euclidean distance: Distance = sqrt((1 − precision)² + (1 − recall)²). The smaller the distance, the closer the algorithm output is to the true Markov blanket. For a single experiment on a particular dataset we ran the algorithms using all variables in a network as targets and computed the average values for each measurement. For a particular dataset size we randomly generated 10 sets and measured the average performance of each algorithm. We set α = 0.05. Table 2 reports the experimental results for datasets of different sizes. Each entry in the table shows average and standard deviation values over 10 datasets of a particular size.

Table 1. Bayesian networks used in the experiments
Network     |U|   Max In/Out-degree   Min/Max |PC|
Insurance   27    3/7                 1/9
ALARM       37    4/5                 1/6
Hailfinder  56    4/16                1/17
Table 2. Speed and accuracy comparison between the DOS and IPC-MB algorithms. The "–" rows give the speed Reduction and the accuracy Improvement of DOS over IPC-MB.

Network    Insts.  Algs.    Times (sec.)  # CI tests     Precision   Recall      Distance
Insurance  300     IPC-MB   82±6          10631±1090     0.75±0.05   0.24±0.04   0.86±0.04
                   DOS      49±4          6003±421       0.87±0.03   0.31±0.03   0.74±0.03
                   –        40.24%        43.53%         16.00%      29.17%      13.95%
Insurance  500     IPC-MB   97±14         15308±2400     0.84±0.05   0.30±0.04   0.76±0.04
                   DOS      55±7          8288±1045      0.90±0.03   0.37±0.03   0.67±0.04
                   –        43.30%        46.25%         7.14%       23.33%      11.84%
Insurance  1000    IPC-MB   79±12         11605±902      0.93±0.05   0.36±0.04   0.66±0.03
                   DOS      47±2          6535±327       0.95±0.04   0.42±0.03   0.60±0.03
                   –        41.77%        43.61%         2.15%       16.66%      9.09%
Insurance  2000    IPC-MB   143±18        15537±1491     0.97±0.05   0.46±0.03   0.54±0.03
                   DOS      85±9          8988±734       0.98±0.04   0.51±0.03   0.50±0.02
                   –        40.56%        42.15%         1.03%       10.86%      7.41%
ALARM      500     IPC-MB   78±6          11329±678      0.76±0.06   0.44±0.03   0.65±0.06
                   DOS      52±2          7209±235       0.80±0.03   0.49±0.02   0.59±0.04
                   –        33.33%        36.37%         5.26%       11.36%      9.23%
ALARM      1000    IPC-MB   115±5         15143±811      0.78±0.06   0.55±0.04   0.55±0.05
                   DOS      71±6          8862±98        0.83±0.01   0.58±0.02   0.49±0.01
                   –        38.26%        41.48%         6.41%       9.09%       10.91%
ALARM      2000    IPC-MB   183±17        19538±886      0.89±0.03   0.67±0.01   0.39±0.04
                   DOS      110±6         10812±429      0.91±0.01   0.68±0.01   0.36±0.02
                   –        38.89%        44.66%         2.25%       1.49%       7.69%
ALARM      5000    IPC-MB   416±20        24781±897      0.98±0.01   0.85±0.02   0.15±0.02
                   DOS      234±13        12406±238      0.99±0.01   0.87±0.02   0.13±0.02
                   –        43.75%        49.94%         1.02%       2.35%       13.33%
Hailfinder 500     IPC-MB   63±6          9952±142       0.85±0.01   0.38±0.03   0.63±0.03
                   DOS      51±4          7852±86        0.88±0.01   0.41±0.02   0.59±0.02
                   –        19.05%        21.10%         3.53%       7.89%       6.64%
Hailfinder 1000    IPC-MB   88±6          12046±327      0.91±0.02   0.48±0.03   0.53±0.04
                   DOS      63±3          8363±274       0.94±0.02   0.50±0.03   0.50±0.03
                   –        28.41%        30.57%         3.30%       4.17%       5.66%
Hailfinder 2000    IPC-MB   144±20        15269±486      0.94±0.02   0.55±0.02   0.50±0.01
                   DOS      98±7          10056±217      0.95±0.03   0.57±0.02   0.48±0.01
                   –        31.94%        34.14%         1.05%       3.63%       4.00%
Hailfinder 5000    IPC-MB   255±10        20152±524      0.98±0.02   0.67±0.01   0.37±0.01
                   DOS      148±4         11327±301      0.98±0.02   0.67±0.01   0.37±0.01
                   –        41.96%        43.79%         0.00%       0.00%       0.00%
In the table, "Insts." refers to data instances and "Algs." to the two algorithms. For the speed comparison, "# CI tests" denotes the total number of conditional independence tests; the Reduction rows show the percentage by which the DOS algorithm reduces the times and the number of CI tests relative to the IPC-MB algorithm. For the accuracy comparison, the Improvement rows give the improvement of the DOS algorithm over the IPC-MB algorithm in terms of the accuracy measures precision, recall and distance.
The middle part of Table 2 shows the speed comparison between the DOS and IPC-MB algorithms over four dataset sizes on three networks. The DOS algorithm runs much faster than the IPC-MB for discovering the Markov blanket. This results from a significant reduction in the required CI tests in the DOS algorithm. As Table 2 shows, the DOS requires on average 40% fewer CI tests than the IPC-MB. In some cases (like the ALARM network with 5000 data instances) the reduction is up to 49.94%. The improved time efficiency is mainly due to our ordering strategy, which enables the DOS algorithm to quickly spot true negatives and reduce T's adjacent set, thereby avoiding unnecessary CI tests. The right part of Table 2 shows the accuracy of both algorithms in discovering the Markov blanket. As expected, both algorithms perform better (smaller distance) with a larger number of data instances. In most cases, the DOS algorithm performs better than the IPC-MB algorithm: it has around 8% improvement in terms of the distance measure. The improvement is mainly due to more true positives found by the DOS algorithm (shown by the larger improvement on the recall measure). More importantly, the DOS demonstrates a larger improvement on the distance for smaller numbers of data instances. For example, on the Insurance network the distance improvement is 13.95% with 300 data instances, while it is 7.41% with 2000 data instances. This implies more reliable CI tests in the DOS algorithm. The significant reduction of CI tests (shown in Table 2) also indicates improved test reliability for the DOS algorithm. The reliability advantage appears because the DOS algorithm always conditions on conditioning sets of small size by removing true negatives as early as possible.
5 Related Work Margaritis and Thrun [7] proposed the first provably correct Markov blanket discovery algorithm - the Grow-Shrink algorithm (GS). As implied by its name, the GS algorithm contains two phases: a growing phase and a shrinking phase. It first attempts to add potential variables into the Markov blanket and then removes false positives in the following phase. As the GS conducts statistical independence tests conditioned on a superset of the Markov blanket, and many false positives may be included in the growing phase, it turns out to be inefficient and cannot be scaled to large applications. However, its soundness makes it a proven basis for subsequent research. The IAMB [8] was proposed to improve the GS on time and data efficiency. It orders the set of variables each time a new variable is included into the Markov blanket in the growing phase. By doing so, the IAMB is able to add fewer false positives in the first phase. However, the independence tests are still conditioned on the whole (possibly large) Markov blanket set, which does not really improve the data efficiency. Moreover, the computation of conditional information values for sorting the variables in each iteration is rather expensive in the IAMB. Yaramakala and Margaritis [17] proposed a new heuristic function to determine the independence tests and order the variables. However, as reported, there is no fundamental difference from the IAMB. Later, several IAMB variants appeared to improve the IAMB's limits on data efficiency, like the Max-Min Parents and Children (MMPC) [10], HITON-PC [11] and
so on. Unfortunately, both algorithms (MMPC and HITON-PC) were proved incorrect [12], but they do introduce a new approach to identifying the Markov blanket: the algorithms find the Markov blanket by searching for T's parents and children first, and then discovering T's spouses. This novel strategy allows independence tests to be conditioned on a subset of T's neighboring or adjacent nodes instead of the whole Markov blanket. Following the same idea as MMPC and HITON-PC, Pena et al. [12] proposed the PCMB to conquer the data efficiency problem of the IAMB. More importantly, the PCMB is theoretically proved correct. Recently, Fu and Desmarais [13] proposed the IPC-MB, which always conducts statistical independence tests conditioned on a minimal set of T's neighbors and thereby improves the PCMB on both time and data efficiency. However, both algorithms need to iterate over a large number of subsets of T's neighboring nodes in most cases, and they do not update the set of neighboring nodes immediately after a true negative is detected. This allows our improvement as presented in this paper.
6 Conclusion and Future Work We presented a new algorithm for Markov blanket discovery, called Dynamic Ordering-based Search (DOS). The DOS algorithm orders conditional independence tests through a strategic selection of both the candidate variable and the conditioning set. The selection is achieved by exploiting the known independence tests to order the variables. By doing so, the new algorithm can efficiently remove true negatives, so that it avoids unnecessary conditional independence tests and the tests condition only on small sets. We analyzed the correctness of the DOS algorithm as well as its complexity in terms of the number of conditional independence tests. Our empirical results show that the DOS algorithm performs much faster and more reliably than the state-of-the-art algorithm IPC-MB. The reliability advantage is more evident with a small number of data instances. A potential research direction is investigating the utility of our ordering scheme in independence-based algorithms for BN learning.
Acknowledgment The first author acknowledges partial support from National Natural Science Foundation of China (No. 60974089 and No. 60975052). Yanping Xiang thanks the support from National Natural Science Foundation of China (No. 60974089).
References 1. Pearl, J.: Probabilistic reasoning in intelligent systems: networks of plausible inference. Morgan Kaufmann Publishers Inc., San Francisco (1988) 2. Tsamardinos, I., Brown, L.E., Aliferis, C.F.: The max-min hill-climbing bayesian network structure learning algorithm. Machine Learning 65(1), 31–78 (2006)
3. Zeng, Y., Poh, K.L.: Block learning bayesian network structure from data. In: Proceedings of the Fourth International Conference on Hybrid Intelligent Systems (HIS 2004), pp. 14–19 (2004) 4. Zeng, Y., Hernandez, J.C.: A decomposition algorithm for learning bayesian network structures from data. In: Washio, T., Suzuki, E., Ting, K.M., Inokuchi, A. (eds.) PAKDD 2008. LNCS (LNAI), vol. 5012, pp. 441–453. Springer, Heidelberg (2008) 5. Zeng, Y., Xiang, Y., Hernandez, J.C., Lin, Y.: Learning local components to understand large bayesian networks. In: Proceedings of The Ninth IEEE International Conference on Data Mining (ICDM), pp. 1076–1081 (2009) 6. Koller, D., Sahami, M.: Toward optimal feature selection. In: Proceedings of the Thirteenth International Conference on Machine Learning, pp. 284–292 (1996) 7. Margaritis, D., Thrun, S.: Bayesian network induction via local neighborhoods. Advances in Neural Information Processing Systems 12, 505–511 (1999) 8. Tsamardinos, I., Aliferis, C.F., Statnikov, A.R.: Algorithms for large scale markov blanket discovery. In: Proceedings of the Sixteenth International Florida Artificial Intelligence Research Society Conference, pp. 376–381 (2003) 9. Tsamardinos, I., Aliferis, C.: Towards principled feature selection: Relevancy, filters and wrappers. In: Proceedings of the Ninth International Workshop on Artificial Intelligence and Statistics (2003) 10. Tsamardinos, I., Aliferis, C., Statnikov, A.: Time and sample efficient discovery of markov blankets and direct causal relations. In: KDD, pp. 673–678 (2003) 11. Aliferis, C., Tsamardinos, I., Statnikov, A.: Hiton: A novel markov blanket algorithm for optimal variable selection. In: Proceedings of American Medical Informatics Association Annual Symposium (2003) 12. Pena, J.M., Nilsson, R., Bjorkegren, J., Tegner, J.: Towards scalable and data efficient learning of markov boundaries. International Journal of Approximate Reasoning 45(2), 211–232 (2007) 13. Fu, S., Desmarais, M.C.: Fast markov blanket discovery algorithm via local learning within single pass. In: Proceedings of the Twenty-First Canadian Conference on Artificial Intelligence, pp. 96–107 (2008) 14. Cover, T.M., Thomas, J.A.: Elements of Information Theory, 2nd edn. Wiley-Interscience, New York (2006) 15. Spirtes, P., Glymour, C., Scheines, R.: Causation, Prediction, and Search. MIT Press, Cambridge (2000) 16. Loughry, J., van Hemert, J., Schoofs, L.: Efficiently enumerating the subsets of a set. Department of Mathematics and Computer Science, University of Antwerp, RUCA, Belgium, pp. 1–10 (2000) 17. Yaramakala, S., Margaritis, D.: Speculative markov blanket discovery for optimal feature selection. In: Proceedings of the Fifth IEEE International Conference on Data Mining, pp. 809–812 (2005)
Mining Association Rules for Label Ranking Cláudio Rebelo de Sá1, Carlos Soares1,2, Alípio Mário Jorge1,3, Paulo Azevedo5, and Joaquim Costa4 1
LIAAD-INESC Porto L.A., Rua de Ceuta 118-6, 4050-190, Porto, Portugal 2 Faculdade de Economia, Universidade do Porto 3 DCC - Faculdade de Ciências, Universidade do Porto 4 DM - Faculdade de Ciências, Universidade do Porto 5 CCTC, Departamento de Informática, Universidade do Minho
[email protected],
[email protected],
[email protected],
[email protected],
[email protected]
Abstract. Recently, a number of learning algorithms have been adapted for label ranking, including instance-based and tree-based methods. In this paper, we propose an adaptation of association rules for label ranking. The adaptation, which is illustrated in this work with the APRIORI algorithm, essentially consists of using variations of the support and confidence measures based on ranking similarity functions that are suitable for label ranking. We also adapt the method to make a prediction from the possibly conflicting consequents of the rules that apply to an example. Although our adaptation starts from a very simple variant of association rules for classification, the results clearly show that the method makes valid predictions. Additionally, they show that it competes well with state-of-the-art label ranking algorithms.
1
Introduction
Label ranking is an increasingly popular topic in the machine learning literature [12,7,25]. Label ranking studies the problem of learning a mapping from instances to rankings over a finite number of predefined labels. It can be considered as a variant of the conventional classification problem [7]. In contrast to a classification setting, where the objective is to assign examples to a specific class, in label ranking we are interested in assigning a complete preference order of the labels to every example. There are two main approaches to the problem of label ranking. Decomposition methods decompose the problem into several simpler problems (e.g., multiple binary problems). Direct methods adapt existing algorithms or develop new ones to treat the rankings as target objects without any transformation. An example of the former is the ranking by pairwise comparisons [12]. Examples of algorithms that were adapted to deal with rankings as the target objects include decision trees [24,7], k-Nearest Neighbor [5,7] and the linear utility transformation [13,9]. This second group of algorithms can be divided into two approaches. The first one contains methods (e.g., [7]) that are based on statistical distributions of rankings,
such as Mallows [17]. The other group of methods is based on measures of similarity or correlation between rankings (e.g., [24,2]). In this paper, we propose an adaptation of association rule mining for label ranking based on similarity measures. Association rule mining is an important and successful task in data mining. Although its original purpose was only descriptive, several adaptations have been proposed for predictive problems. The paper is organized as follows: sections 2 and 3 introduce the label ranking problem and the task of association rule mining, respectively; section 4 describes the measures proposed here; section 5 presents the experimental setup and discusses the results; finally, section 6 concludes this paper.
2
Label Ranking
The formalization of the label ranking problem given here follows the one provided in [7].¹ In classification, given an instance x from the instance space X, the goal is to predict the label (or class) λ to which x belongs, from a pre-defined set L = {λ1, . . . , λk}. In label ranking the goal is to predict the ranking of the labels in L that are associated with x. We assume that the ranking is a total order over L defined on the permutation space Ω. A total order can be seen as a permutation π of the set {1, . . . , k}, such that π(a) is the position of λa in π. Let us also denote π⁻¹ as the result of inverting the order in π. As in classification, we do not assume the existence of a deterministic X → Ω mapping. Instead, every instance is associated with a probability distribution over Ω. This means that, for each x ∈ X, there exists a probability distribution P(·|x) such that, for every π ∈ Ω, P(π|x) is the probability that π is the ranking associated with x. The goal in label ranking is to learn the mapping X → Ω. The training data is a set of instances T = {⟨xi, πi⟩}, i = 1, . . . , n, where xi are the independent variables describing instance i and πi is the corresponding target ranking. As an example, given a scenario where we have financial analysts making predictions about the evolution of volatile markets, it would be advantageous to be able to predict which analysts are more profitable in a certain market context [2]. Moreover, if we could have beforehand the full ordered list of the best analysts, this would certainly increase the chances of making good investments. Given the ranking π̂ predicted by a label ranking model for an instance x, which is, in fact, associated with the true label ranking π, we need to evaluate the accuracy of the prediction. For that, we need a loss function on Ω. One such function is the number of discordant label pairs, D(π, π̂) = #{(i, j) | π(i) > π(j) ∧ π̂(i) < π̂(j)}, which, if normalized to the interval [−1, 1], is equivalent to Kendall's τ coefficient. The latter is a correlation measure, for which τ(π, π) = 1 and τ(π, π⁻¹) = −1. We obtain a loss function by averaging this function over a set of examples. We will use it as the evaluation measure in this paper, as it has been used in recent studies [7]. However, other distance measures could have been used, like Spearman's rank correlation coefficient [22].
¹ An alternative formalization can be found in [25].
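To make the evaluation measure concrete, the following minimal sketch computes the number of discordant pairs and the corresponding normalized Kendall's τ for two rankings given as position vectors. It only illustrates the measure described above; the function names are ours and not part of the original paper.

```python
from itertools import combinations

def discordant_pairs(pi, pi_hat):
    """Count label pairs ranked in opposite order by the two rankings.

    Both rankings are position vectors: pi[a] is the rank of label a.
    """
    return sum(
        1
        for i, j in combinations(range(len(pi)), 2)
        if (pi[i] - pi[j]) * (pi_hat[i] - pi_hat[j]) < 0
    )

def kendall_tau(pi, pi_hat):
    """Normalize the discordant-pair count into [-1, 1] (Kendall's tau, no ties)."""
    k = len(pi)
    n_pairs = k * (k - 1) / 2
    return 1 - 2 * discordant_pairs(pi, pi_hat) / n_pairs

# pi = (1, 3, 2) means: first label ranked 1st, second label 3rd, third label 2nd.
print(kendall_tau((1, 3, 2), (1, 3, 2)))   # 1.0  (identical rankings)
print(kendall_tau((1, 2, 3), (3, 2, 1)))   # -1.0 (reversed ranking)
```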
3
Association Rules Mining
An association rule (AR) is an implication A → C, where A ∩ C = ∅ and A, C ⊆ desc(X), with desc(X) being the set of descriptors of the instances in X, typically pairs ⟨attribute, value⟩. We also denote by desc(xi) the set of descriptors of instance xi. Association rules are typically characterized by two measures, support and confidence. The rule A → C has support sup in T if sup% of the cases in T contain both A and C. Additionally, it has confidence conf in T if conf% of the cases in T that contain A also contain C. The original method for the induction of AR is the APRIORI algorithm, proposed in 1994 [1]. APRIORI identifies all AR that have support and confidence higher than a given minimal support threshold (minsup) and a minimal confidence threshold (minconf), respectively. Thus, the model generated is a set of AR of the form A → C, where A, C ⊆ desc(X), sup(A → C) ≥ minsup and conf(A → C) ≥ minconf. For a more detailed description see [1]. Despite the usefulness and simplicity of APRIORI, it runs a time-consuming candidate generation process and needs space and memory proportional to the number of possible combinations in the database. Additionally, it requires multiple scans of the database and typically generates a very large number of rules. Because of this, many improvements have been proposed to address these issues, such as hashing [19], dynamic itemset counting [6], parallel and distributed mining [20], and the integration of mining with relational database systems [23]. Association rules were originally proposed for descriptive purposes. However, they have been adapted for predictive tasks such as classification (e.g., [18]). Given that label ranking is a predictive task, we describe some useful notation from an adaptation of AR for classification in Section 3.2.
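As a brief illustration of these two measures, the sketch below computes support and confidence for a rule over a small transaction table in which each descriptor is an ⟨attribute, value⟩ pair. The helper names and the toy data are hypothetical and not taken from the paper.

```python
def support(antecedent, consequent, transactions):
    """Fraction of transactions containing both the antecedent and the consequent."""
    covered = [t for t in transactions if antecedent <= t and consequent <= t]
    return len(covered) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Among transactions containing the antecedent, the fraction also containing the consequent."""
    with_a = [t for t in transactions if antecedent <= t]
    both = [t for t in with_a if consequent <= t]
    return len(both) / len(with_a) if with_a else 0.0

# Each transaction is a set of (attribute, value) descriptors.
T = [
    {("A1", "L"), ("A2", "XL"), ("A3", "S")},
    {("A1", "XXL"), ("A2", "XS"), ("A3", "S")},
    {("A1", "L"), ("A2", "XL"), ("A3", "XS")},
]
A = {("A1", "L")}
C = {("A2", "XL")}
print(support(A, C, T), confidence(A, C, T))  # 0.666..., 1.0
```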
3.1 Pruning
AR algorithms typically generate a large number of rules (possibly tens of thousands), some of which represent only small variations of others. This is known as the rule explosion problem [4]. It is due to the fact that the algorithm may find rules whose confidence can be marginally improved by adding further conditions to the antecedent. Pruning methods are usually employed to reduce the amount of rules without reducing the quality of the model. A common pruning method is based on the improvement that a refined rule yields in comparison to the original one [4]. The improvement of a rule is defined as the smallest difference between the confidence of the rule and the confidence of all sub-rules sharing the same consequent. More formally, for a rule A → C,

imp(A → C) = min{ conf(A → C) − conf(A′ → C) : A′ ⊂ A }
As an example, if one defines minImp = 0.1%, the rule A1 → C will be kept if and only if conf(A1 → C) − conf(A → C) ≥ 0.001, where A ⊂ A1.
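The improvement measure can be sketched directly from the definition above: a refined rule is kept only if its confidence exceeds that of every sub-rule by at least minImp. The code below is illustrative only; `conf` is assumed to be a lookup over already mined rules, which the paper does not specify.

```python
from itertools import combinations

def improvement(antecedent, consequent, conf):
    """imp(A -> C): confidence of A -> C minus the best confidence of any sub-rule A' -> C, A' ⊂ A."""
    sub_confs = [
        conf(frozenset(sub), consequent)
        for r in range(len(antecedent))                 # all proper subsets, including the empty antecedent
        for sub in combinations(list(antecedent), r)
    ]
    return conf(frozenset(antecedent), consequent) - max(sub_confs)

def keep_rule(antecedent, consequent, conf, min_imp=0.001):
    """Improvement-based pruning: keep the rule only if it improves on all sub-rules by at least min_imp."""
    return improvement(antecedent, consequent, conf) >= min_imp
```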
3.2 Class Association Rules
Classification Association Rules (CAR) were proposed as part of the Classification Based on AR (CBA) algorithm [18]. A class association rule (CAR) is an implication of the form A → λ, where A ⊆ desc(X) and λ ∈ L is a class label. A rule A → λ holds in T with confidence conf if conf% of the cases in T that contain A are labeled with class λ, and with support sup in T if sup% of the cases in T contain A and are labeled with class λ. CBA takes a tabular data set T = {⟨xi, λi⟩}, where xi is a set of items and λi the corresponding class, and looks for all frequent ruleitems of the form ⟨A, λ⟩, where A is a set of items and λ ∈ L. The algorithm aims to choose a set of high-accuracy rules Rλ to match T. Rλ matches an instance ⟨xi, λi⟩ ∈ T if there is at least one rule A → λ ∈ Rλ with A ⊆ desc(xi), xi ∈ X, and λ ∈ L. If the rules cannot classify all examples, a default class is assigned to them (e.g., the majority class in the training data).
4
Association Rules for Label Ranking
We define a Label Ranking Association Rule (LRAR) as a straightforward adaptation of class association rules (CAR): A → π, where A ⊆ desc(X) and π ∈ Ω. The only difference is that the label λ ∈ L is replaced by a ranking of the labels, π ∈ Ω. As with the prediction made by CBA, when an example matches the rule A → π, the predicted ranking is π. In this regard, we can use the same basic principle of the ruleitem for CARs in LRARs, which here takes the form ⟨A, π⟩, where A is a set of items and π ∈ Ω. This approach has two important problems. First, the number of classes can be extremely large, up to a maximum of k!, where k is the size of the set of labels L. This means that the amount of data required to learn a reasonable mapping X → Ω is too big. The second disadvantage is that this approach does not take into account the differences in nature between label rankings and classes. In classification, two examples either have the same class or not. In this regard, label ranking is more similar to regression than to classification. This property can be used in the induction of prediction models. In regression, a large number of observations with a given target value, say 5.3, increases the probability of observing similar values, say 5.4 or 5.2, but not so much for very different values, say -3.1 or 100.2. A similar reasoning can be made in label ranking. Let us consider the case of a data set in which ranking πa = {A, B, C, D, E} occurs in 1% of the examples. Treating rankings as classes would mean that P(πa) = 0.01. Let us further consider that the rankings πb = {A, B, C, E, D}, πc = {B, A, C, D, E}
and πd = {A, C, B, D, E} occur in 50% of the examples. Taking into account the stochastic nature of these rankings [7], P(πa) = 0.01 seems to underestimate the probability of observing πa. In other words, it is expected that the observation of πb, πc and πd increases the probability of observing πa and vice-versa, because they are similar to each other. This affects even rankings which are not observed in the available data. For example, even though πe = {A, B, D, C, E} is not present in the data set, it would not be entirely unexpected to see it in future data.
4.1 Similarity-Based Support and Confidence
To take this characteristic into account, we can argue that the support of a ranking π increases with the observation of similar rankings and that the variation is proportional to the similarity. Given a measure of similarity between rankings s(πa, πb), we can adapt the concept of support of the rule A → π as follows:

sup_lr(A → π) = ( ∑_{i: A ⊆ desc(xi)} s(πi, π) ) / n
Essentially, what we are doing is assigning a weight to each target ranking πi in the training data that represents its contribution to the probability that π may be observed. Some instances xi ∈ X give a full contribution to the support count (i.e., 1), while others may give a partial or even a null contribution. Any function that measures the similarity between two rankings or permutations can be used, such as Kendall's τ [16] or Spearman's ρ [22]. The function used here is of the form

s(πa, πb) = s′(πa, πb) if s′(πa, πb) ≥ θsup, and 0 otherwise,   (1)

where s′ is a similarity function. This general form assumes that below a given threshold θsup it is not useful to discriminate between different similarity values, as they are all so different from πa. This means that the support of ⟨A, πa⟩ will have contributions from all ruleitems of the form ⟨A, πb⟩ for which s′(πa, πb) ≥ θsup. Again, many functions can be used as s′. The confidence of a rule A → π is obtained simply by replacing the measure of support with the new one:

conf_lr(A → π) = sup_lr(A → π) / sup(A → π)
Given that the loss function that we aim to minimize is known beforehand, it makes sense to use it to measure the similarity between rankings. Therefore, we use Kendall’s τ . In this case, we think that θsup = 0 would be a reasonable value, given that it separates the negative from the positive contributions. Table 1 shows an example of a label ranking dataset represented following this approach.
Table 1. An example of a label ranking dataset to be processed by the APRIORI-LR algorithm

TID  A1   A2  A3  | π1 (1,3,2)  π2 (2,1,3)  π3 (2,3,1)
 1   L    XL  S   |   0.33        0.00        1.00
 2   XXL  XS  S   |   0.00        1.00        0.00
 3   L    XL  XS  |   1.00        0.00        0.33
Algorithm 1. APRIORI-LR - APRIORI for Label Ranking
Require: minsup and minconf
  Ck: candidate ruleitems of size k
  Fk: frequent ruleitems of size k
  T = {⟨xi, πi⟩}: transactions in the database
F1 = {⟨A, π⟩ : #A = 1 ∧ sup_lr(⟨A, π⟩) ≥ minsup}
k = 1
while Fk ≠ ∅ do
    Ck+1 = {cand = ⟨A1 ∪ A2, π⟩ : ⟨A1, π⟩, ⟨A2, π⟩ ∈ Fk, #(A1 ∪ A2) = k + 1}
    Fk+1 = {c : c ∈ Ck+1 ∧ sup_lr(c) ≥ minsup}
    k = k + 1
end while
F = ∪_{i=1..k} Fi
Rπ = {A → π : ⟨A, π⟩ ∈ F ∧ conf_lr(A → π) ≥ minconf}
return Rπ
To clarify the interpretation of the example given in Table 1: the instance (A1 = L, A2 = XL, A3 = S) (TID = 1) contributes 1 to the support count of the ruleitem ⟨{A1 = L, A2 = XL, A3 = S}, π3⟩. The same instance also gives a smaller contribution of 0.33 to the support count of the ruleitem ⟨{A1 = L, A2 = XL, A3 = S}, π1⟩, given their similarity. On the other hand, it gives no contribution to the support count of the ruleitem ⟨{A1 = L, A2 = XL, A3 = S}, π2⟩, since the two rankings are clearly different.
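The following minimal sketch reproduces these contributions on the toy data of Table 1, assuming, as in the text, that s′ is Kendall's τ and θsup = 0. The data structures and helper names are ours; rankings are encoded as position vectors over the three labels.

```python
# Toy data from Table 1: each instance has a target ranking given as a position vector.
instances = [
    ({"A1": "L", "A2": "XL", "A3": "S"},   (2, 3, 1)),  # TID 1, its ranking equals pi3
    ({"A1": "XXL", "A2": "XS", "A3": "S"}, (2, 1, 3)),  # TID 2, equals pi2
    ({"A1": "L", "A2": "XL", "A3": "XS"},  (1, 3, 2)),  # TID 3, equals pi1
]

def s_prime(pa, pb):
    """Kendall's tau between two rankings (position vectors, no ties)."""
    pairs = [(i, j) for i in range(len(pa)) for j in range(i + 1, len(pa))]
    disc = sum(1 for i, j in pairs if (pa[i] - pa[j]) * (pb[i] - pb[j]) < 0)
    return 1 - 2 * disc / len(pairs)

def s(pa, pb, theta_sup=0.0):
    """Thresholded similarity used in the support counts (Eq. 1)."""
    sim = s_prime(pa, pb)
    return sim if sim >= theta_sup else 0.0

def sup_lr(antecedent, pi):
    """Similarity-based support of the LRAR  antecedent -> pi."""
    matched = [r for x, r in instances if antecedent.items() <= x.items()]
    return sum(s(r, pi) for r in matched) / len(instances)

def conf_lr(antecedent, pi):
    """Similarity-based confidence: sup_lr divided by the plain support of the antecedent."""
    sup_a = sum(1 for x, _ in instances if antecedent.items() <= x.items()) / len(instances)
    return sup_lr(antecedent, pi) / sup_a

print(sup_lr({"A1": "L"}, (2, 3, 1)))   # contributions of TIDs 1 and 3
print(conf_lr({"A1": "L"}, (2, 3, 1)))
```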
4.2 APRIORI-LR Algorithm
Using the definitions of support and confidence proposed, the adaptation of any AR learning algorithm for label ranking is simple. However, for illustration purposes, we present an adaptation of the APRIORI algorithm, called APRIORI-LR. Given a training set T = {⟨xi, πi⟩}, i = 1, . . . , n, frequent ruleitems are generated with Algorithm 1 and transformed into LRARs. Let Rπ be the set of all generated label ranking association rules. The algorithm aims to create a set of high-accuracy rules rπ ∈ Rπ to cover T. The classifier has the following format: ⟨rπ1, rπ2, . . . , rπn⟩
However, if these rules are insufficient to rank a given example, a default ranking is used. The default ranking can be the average ranking [5], which is often used for this purpose. This approach has two problems. The first is that it can only predict rankings which were present in the training set (except when no rules apply and the predicted ranking is the default ranking). The second problem is that it solves conflicts between rankings without taking into account the "continuous" nature of rankings, which was illustrated earlier. The problem of generating a single permutation from a set of conflicting rankings has been studied in the context of consensus rankings. It has been shown in [15] that a ranking obtained by ordering the average ranks of the labels across all rankings minimizes the Euclidean distance to all those rankings. In other words, it maximizes the similarity according to Spearman's ρ [22]. Given m rankings πi (i = 1, . . . , m), we aggregate them by computing, for each item j (j = 1, . . . , k),

rj = ( ∑_{i=1}^{m} πi,j ) / m

The predicted ranking π̂ is obtained by ranking the items according to the value of rj. We can take advantage of this in the ranker builder in the following way: the final predicted label ranking is the consensus of all the label rankings in the consequents of the rules rπ triggered by the test example. To implement pruning based on improvement for LR, some adaptation is required as well. Given that the relation between target values is different from classification, as discussed in Section 4.1, we limit the comparison to rules whose consequents π, π′ are sufficiently similar, i.e., s′(π, π′) ≥ θimp.
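A small sketch of this aggregation step is given below: the consequents of the triggered rules are combined by average ranks and re-ranked to produce π̂. It assumes rankings are position vectors over the same label set and breaks ties by label index, a detail the paper does not prescribe.

```python
def consensus_ranking(rankings):
    """Aggregate conflicting rankings by average rank (rankings are position vectors over k labels)."""
    k = len(rankings[0])
    avg = [sum(r[j] for r in rankings) / len(rankings) for j in range(k)]
    order = sorted(range(k), key=lambda j: avg[j])   # labels sorted by average rank, ties by index
    consensus = [0] * k
    for position, label in enumerate(order, start=1):
        consensus[label] = position
    return consensus

# Two conflicting consequents over labels {A, B, C}:
print(consensus_ranking([(1, 2, 3), (2, 1, 3)]))  # [1, 2, 3]
```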
4.3 Parameter Tuning
Due to the intrinsic nature of each dataset, or even of the pre-processing methods used to prepare the data (e.g., the discretization method), the maximum minsup/minconf values needed to obtain a rule set Rπ that matches all, or at least most, of the examples may vary significantly. We used a greedy method to define the minimum confidence. As stated earlier, a rule set Rπ matches an example if at least one rule (A → π) ∈ Rπ has A ⊆ desc(xi), xi ∈ X. Our goal is thus to obtain a rule set Rπ that maximizes the number of examples that are matched, here defined as M. Additionally, we want the best rules, i.e., the rules with the highest confidence values. The parameter tuning method (Algorithm 2) determines the minconf that obtains a rule set according to those criteria. To set the step value, we consider that, on the one hand, a suitable minconf must be found as soon as possible; on the other hand, this very same value should be as high as possible. Therefore, 5% seems a reasonable step value. The ideal value for minsup is as close to 1% as possible. However, in some datasets, namely those with a larger number of attributes, frequent ruleitem
Table 2. Summary of the datasets

Dataset      type  #examples  #labels  #attributes
authorship    A       841        4         70
bodyfat       B       252        7          7
calhousing    B     20640        4          4
cpu-small     B      8192        5          6
elevators     B     16599        9          9
fried         B     40769        5          9
glass         A       214        6          9
housing       B       506        6          6
iris          A       150        3          4
pendigits     A     10992       10         16
segment       A      2310        7         18
stock         B       950        5          5
vehicle       A       846        4         18
vowel         A       528       11         10
wine          A       178        3         13
wisconsin     B       194       16         16
Algorithm 2. Parameter tuning algorithm
minconf = 100%
minsup = 1
while M < 100% do
    minconf = minconf − 5%
    Run Algorithm 1 with (minsup, minconf) and determine M
end while
return minconf
generation can be a very time consuming task. In this case, minsup must be set to a value larger than 1%. In this work, one such example is authorship, which has 70 attributes. This procedure has the important advantage that it does not take into account the accuracy of the rule sets generated, thus reducing the risk of over-fitting.
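The greedy search of Algorithm 2 can be sketched as follows. Here `run_apriori_lr` and `coverage` are hypothetical helpers standing in for Algorithm 1 and for the computation of M; the 5% step and the 1% minimum support follow the text.

```python
def tune_minconf(train, run_apriori_lr, coverage, minsup=1, step=5):
    """Greedy tuning (Algorithm 2): lower minconf in 5% steps until the rule set matches all examples."""
    minconf = 100
    rules = []
    while minconf > 0:
        minconf -= step
        rules = run_apriori_lr(train, minsup=minsup, minconf=minconf)
        if coverage(rules, train) >= 1.0:   # M = 100%
            return minconf, rules
    return 0, rules
```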
5
Experimental Results
The data sets in this work were taken from the KEBI Data Repository at the Philipps University of Marburg [7] (Table 2). Continuous variables were discretized with two distinct methods: (1) the recursive minimum entropy partitioning criterion [11] with the minimum description length (MDL) as stopping rule, motivated by [10], and (2) equal-width bins. The evaluation measure is Kendall's τ, and the performance of the method was estimated using ten-fold cross-validation. The performance of APRIORI-LR is compared with a baseline method, the default ranking (explained earlier), and with RPC [14]. For the generation of frequent ruleitems we used CAREN [3]. The base learner used in RPC is logistic regression, with the default configuration of the Logit function from the stats package of the R programming language [21]. Additionally, we compare the performance of our algorithm with the results obtained with constraint classification (CC), instance-based label ranking (IBLR) and ranking trees (LRT), which were presented in [7]. We note that we did not run experiments with these methods and simply compared our results with
the published results of the other methods. Thus, they were probably obtained with different partitions of the data and cannot be compared directly. However, they provide some indication of the quality of our method when compared to the state-of-the-art. The value θimp was set to 0 in all experiments. This option may not be as intuitive as it is for θsup. However, since the focus of this work is the reduction of the number of generated rules, this value is suitable.
5.1 Results
Table 3 shows that the method obtains results with both discretization methods that are clearly better than the ones obtained by the baseline method. This means that APRIORI-LR is identifying valid patterns that can predict label rankings. Table 4 presents the results obtained with pruned rules, using the same minsup and minconf values as in the previous experiments, and compares them to RPC with logistic regression as the base learner. Rd represents the percentage reduction in the number of rules achieved by pruning. The results clearly show that the minImp constraint, set to 0.00 and 0.01, succeeded in reducing the number of rules. However, there was no improvement in accuracy, although it also did not decrease. Further tests are required to understand how this parameter affects the accuracy of the models. Finally, Table 5 compares APRIORI-LR with state-of-the-art methods based on published results [7]. Given that the methods were not compared under the same conditions, this simply gives us a rough idea of the quality of the method proposed here. It indicates that, despite the simplicity of the adaptation, APRIORI-LR is a competitive method. We expect that the results can be significantly improved, for instance, by implementing more complex pruning methods.

Table 3. Results obtained with minimum entropy discretization and with equal width discretization with 3 bins for each attribute
             --------- Minimum entropy ---------------    --------- Equal width (3 bins) -----------
Dataset       τ    τbaseline minsup minconf #rules   M      τ    τbaseline minsup minconf #rules   M
authorship  .608   .568       20     60      3717  100%    NA
bodyfat     .059  -.064        1     15      3289   98%   .161  -.064        1     25     16222  100%
calhousing  .291   .048        1     35       221   97%   .139   .048        1     20       889  100%
cpu-small   .439   .234        1     35      2774  100%   .279   .234        1     30      1559  100%
elevators   .643   .289        1     60      1864   98%   .623   .289        1     60     18160  100%
fried       .774  -.005        1     35      1959   97%   .676  -.005        1     35     14493  100%
glass       .871   .684        1     85       485   99%   .794   .684        1     75     11385  100%
housing     .758   .058        1     60      2547   96%   .577   .058        1     45      5027  100%
iris        .960   .089        1     90       115  100%   .883   .089        1     80        69  100%
pendigits    NA     -          -      -         -    -    .684   .451       10     75     18590   90%
segment     .829   .372        4     85      4949   96%   .496   .372       35     75      4688   49%
stock       .890   .070        1     75      1606  100%   .836   .070        1     65      1168  100%
vehicle     .774   .179        7     80     10480   99%   .675   .179       15     80      6662   83%
vowel       .680   .195        1     70     21419   99%   .709   .195        1     70    143882  100%
wine        .844   .329       15     95      5960  100%   .910   .329        1     95    165263  100%
wisconsin   .031  -.031        1      0      1224   92%   .280  -.031        5     20    404773  100%
Table 4. Comparison of APRIORI-LR with RPC
             ------------ Minimum entropy -------------    ---------- Equal width (3 bins) ----------
             --------- APRIORI-LR --------      RPC        --------- APRIORI-LR --------      RPC
Dataset       τ    mImp=0  Rd(%) mImp=0.01   Log. R.        τ    mImp=0  Rd(%) mImp=0.01   Log. R.
authorship  0.608  0.634   -40    0.637       0.900         NA    NA      NA    NA          0.905
bodyfat     0.059  0.057   -20    0.058       0.264        0.161  0.156   -98   0.156       0.175
calhousing  0.291  0.299   -54    0.300       0.227        0.139  0.112   -83   0.110       0.132
cpu-small   0.439  0.421   -91    0.418       0.446        0.279  0.271   -97   0.271       0.286
elevators   0.643  0.647   -93    0.651       0.650        0.623  0.620   -98   0.621       0.621
fried       0.774  0.731   -71    0.730       0.827        0.676  0.674   -93   0.676       0.671
glass       0.871  0.834   -83    0.833       0.898        0.794  0.767   -98   0.776       0.846
housing     0.758  0.753   -84    0.753       0.648        0.577  0.559   -96   0.562       0.552
iris        0.960  0.961   -75    0.961       0.862        0.883  0.876   -63   0.881       0.756
pendigits    NA     NA      NA     NA          NA          0.684  0.682   -96   0.685       0.879
segment     0.829  0.828   -89    0.828       0.935        0.496  0.496  -100   0.500       0.878
stock       0.890  0.875   -75    0.874       0.795        0.836  0.822   -88   0.822       0.675
vehicle     0.774  0.781   -91    0.775       0.841        0.675  0.675   -97   0.674       0.820
vowel       0.680  0.686   -91    0.685       0.670        0.709  0.718   -97   0.721       0.571
wine        0.844  0.871   -96    0.884       0.925        0.910  0.877   -99   0.875       0.892
wisconsin   0.031  0.030    -8    0.031       0.612        0.280  0.286   -99   0.293       0.478
Table 5. Comparison of APRIORI-LR with state-of-the-art methods

             APRIORI-LR
Dataset      EW      ME       CC      IBLR    LRT
authorship    NA    0.608    0.920   0.936   0.882
bodyfat     0.161   0.059    0.281   0.248   0.117
calhousing  0.139   0.291    0.250   0.351   0.324
cpu-small   0.279   0.439    0.475   0.506   0.447
elevators   0.623   0.643    0.768   0.733   0.760
fried       0.676   0.774    0.999   0.935   0.890
glass       0.794   0.871    0.846   0.865   0.883
housing     0.577   0.758    0.660   0.745   0.797
iris        0.883   0.960    0.836   0.966   0.947
pendigits   0.684    NA      0.903   0.944   0.935
segment     0.496   0.829    0.914   0.959   0.949
stock       0.836   0.890    0.737   0.927   0.895
vehicle     0.675   0.774    0.855   0.862   0.827
vowel       0.709   0.680    0.623   0.900   0.794
wine        0.910   0.844    0.933   0.949   0.882
wisconsin   0.280   0.031    0.629   0.506   0.343

6
Conclusions
In this paper we present a simple adaptation of an association rule algorithm for label ranking. This adaptation essentially consists of 1) ensuring that rules have label rankings in their consequent, 2) using variations of the support and confidence measures that are suitable for label ranking and 3) generating the model with parameters selected by a simple greedy algorithm. The results clearly show that this is a viable label ranking method. It outperforms a simple baseline and competes well with RPC, which means that, despite its simplicity, it is inducing useful patterns. Additionally, the results obtained indicate that the choice of the discretization method and the number of bins per attribute play an important role in the accuracy of the models. The tests indicate that the supervised discretization method (minimum entropy) gives better results than equal-width partitioning. This is, however, not the main focus of this work. Improvement-based pruning was successfully implemented and substantially reduced the number of rules. This plays an important role in generating models with higher interpretability.
The new framework proposed in this work, based on distance functions, is consistent with the classical concepts underlying association rules. Furthermore, although it was developed in the context of the label ranking task, it can also be adapted for other tasks such as regression and classification. In fact, Classification Association Rules can be regarded as a special case of distance-based AR, where the distance function is the 0-1 loss. This work uncovered several possibilities that could be studied further in order to improve the algorithm's performance. They include: improving the prediction generation method; implementing better pruning methods; developing a discretization method that is suitable for label ranking; and better parameter selection. For evaluation, we have used a measure that is typically used in label ranking. However, it is important to give more importance to higher ranks than to lower ones, which can be done, for instance, with the weighted rank correlation coefficient [8]. Additionally, it is essential to test the methods on real label ranking problems. The KEBI datasets are adapted from UCI classification problems. We plan to test our methods on other problems, including algorithm selection and predicting the rankings of financial analysts [2]. In terms of real-world applications, the method can be adapted to rank analysts based on their past performance, and also radios based on users' preferences.
Acknowledgments This work was partially supported by project Rank! (PTDC/EIA/81178/2006) from FCT and Palco AdI project Palco3.0 financed by QREN and Fundo Europeu de Desenvolvimento Regional (FEDER). We thank the anonymous referees for useful comments.
References 1. Agrawal, R., Srikant, R.: Fast algorithms for mining association rules in large databases. In: VLDB, pp. 487–499 (1994) 2. Aiguzhinov, A., Soares, C., Serra, A.P.: A similarity-based adaptation of naive bayes for label ranking: Application to the metalearning problem of algorithm recommendation. In: Pfahringer, B., Holmes, G., Hoffmann, A. (eds.) DS 2010. LNCS, vol. 6332, pp. 16–26. Springer, Heidelberg (2010) 3. Azevedo, P.J., Jorge, A.M.: Ensembles of jittered association rule classifiers. Data Min. Knowl. Discov. 21(1), 91–129 (2010) 4. Bayardo, R., Agrawal, R., Gunopulos, D.: Constraint-based rule mining in large, dense databases. Data Mining and Knowledge Discovery 4(2), 217–240 (2000) 5. Brazdil, P., Soares, C., Costa, J.: Ranking Learning Algorithms: Using IBL and Meta-Learning on Accuracy and Time Results. Machine Learning 50(3), 251–277 (2003)
6. Brin, S., Motwani, R., Ullman, J.D., Tsur, S.: Dynamic itemset counting and implication rules for market basket data. In: Proceedings of the 1997 ACM SIGMOD international conference on Management of data - SIGMOD 1997, pp. 255–264 (1997) 7. Cheng, W., Hühn, J., Hüllermeier, E.: Decision tree and instance-based learning for label ranking. In: ICML 2009: Proceedings of the 26th Annual International Conference on Machine Learning, pp. 161–168. ACM, New York (2009) 8. Pinto da Costa, J., Soares, C.: A weighted rank measure of correlation. Australian & New Zealand Journal of Statistics 47(4), 515–529 (2005) 9. Dekel, O., Manning, C.D., Singer, Y.: Log-linear models for label ranking. Advances in Neural Information Processing Systems (2003) 10. Dougherty, J., Kohavi, R., Sahami, M.: Supervised and unsupervised discretization of continuous features. In: Machine Learning - International Workshop Then Conference, pp. 194–202 (1995) 11. Fayyad, U.M., Irani, K.B.: Multi-interval discretization of continuous-valued attributes for classification learning. In: IJCAI, pp. 1022–1029 (1993) 12. Fürnkranz, J., Hüllermeier, E.: Preference learning. KI 19(1), 60 (2005) 13. Har-Peled, S., Roth, D., Zimak, D.: Constraint classification: A new approach to multiclass classification. In: Cesa-Bianchi, N., Numao, M., Reischuk, R. (eds.) ALT 2002. LNCS (LNAI), vol. 2533, pp. 365–379. Springer, Heidelberg (2002) 14. Hüllermeier, E., Fürnkranz, J., Cheng, W., Brinker, K.: Label ranking by learning pairwise preferences. Artif. Intell. 172(16-17), 1897–1916 (2008) 15. Kemeny, J., Snell, J.: Mathematical Models in the Social Sciences. MIT Press, Cambridge (1972) 16. Kendall, M., Gibbons, J.: Rank correlation methods. Griffin, London (1970) 17. Lebanon, G., Lafferty, J.D.: Conditional Models on the Ranking Poset. In: NIPS, pp. 415–422 (2002) 18. Liu, B., Hsu, W., Ma, Y.: Integrating classification and association rule mining. In: Knowledge Discovery and Data Mining, pp. 80–86 (1998) 19. Park, J.S., Chen, M.S., Yu, P.S.: An effective hash-based algorithm for mining association rules. ACM SIGMOD Record 24(2), 175–186 (1995) 20. Park, J.S., Chen, M.S., Yu, P.S.: Efficient parallel and data mining for association rules. In: CIKM, pp. 31–36 (1995) 21. R Development Core Team: R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria (2010), http://www.R-project.org ISBN 3-900051-07-0 22. Spearman, C.: The proof and measurement of association between two things. American Journal of Psychology 15, 72–101 (1904) 23. Thomas, S., Sarawagi, S.: Mining generalized association rules and sequential patterns using sql queries. In: KDD, pp. 344–348 (1998) 24. Todorovski, L., Blockeel, H., Džeroski, S.: Ranking with Predictive Clustering Trees. In: Elomaa, T., Mannila, H., Toivonen, H. (eds.) ECML 2002. LNCS (LNAI), vol. 2430, pp. 444–455. Springer, Heidelberg (2002) 25. Vembu, S., Gärtner, T.: Label Ranking Algorithms: A Survey. In: Fürnkranz, J., Hüllermeier, E. (eds.) Preference Learning. Springer, Heidelberg (2010)
Tracing Evolving Clusters by Subspace and Value Similarity Stephan Günnemann1, Hardy Kremer1, Charlotte Laufkötter2, and Thomas Seidl1 1
2
Data Management and Data Exploration Group RWTH Aachen University, Germany {guennemann,kremer,seidl}@cs.rwth-aachen.de Institute of Biogeochemistry and Pollutant Dynamics ETH Zürich, Switzerland
[email protected]
Abstract. Cluster tracing algorithms are used to mine temporal evolutions of clusters. Generally, clusters represent groups of objects with similar values. In a temporal context like tracing, similar values correspond to similar behavior in one snapshot in time. Each cluster can be interpreted as a behavior type and cluster tracing corresponds to tracking similar behaviors over time. Existing tracing approaches are designed for datasets satisfying two specific conditions: The clusters appear in all attributes, i.e. fullspace clusters, and the data objects have unique identifiers. These identifiers are used for tracking clusters by measuring the number of objects two clusters have in common, i.e. clusters are traced based on similar object sets. These conditions, however, are strict: First, in complex data, clusters are often hidden in individual subsets of the dimensions. Second, mapping clusters based on similar objects sets does not reflect the idea of tracing similar behavior types over time, because similar behavior can even be represented by clusters having no objects in common. A tracing method based on similar object values is needed. In this paper, we introduce a novel approach that traces subspace clusters based on object value similarity. Neither subspace tracing nor tracing by object value similarity has been done before.
1
Introduction
Temporal properties of patterns and their analysis are under active research [5]. A well-known type of pattern is the cluster, corresponding to a similarity-based grouping of data objects. A good example of clusters is customer groups. Clusters can change in the course of time, and understanding this evolution can be used to guide future decisions [5], e.g. predicting whether a specific customer behavior will occur. The evolution can be mined by cluster tracing algorithms that find mappings between clusters of consecutive time steps [8,13,14]. The existing algorithms have a severe limitation: Clusters are mapped if the corresponding object sets are similar, i.e. the algorithms check whether the possibly matching clusters have a certain fraction of objects in common; they are
unable to map clusters with different objects, even if the objects have similar attribute values. Our novel method, however, maps clusters only if their corresponding object values are similar, independently of object identities. That is, we trace similar behavior types, which is a fundamentally different concept. This is a relevant scenario, as the following two examples illustrate. Consider scientific data of the earth's surface with the attributes temperature and smoke degree. The latter correlates with forest fire probability. The attribute values are recorded over several months. In this dataset, at some point in time a high smoke degree and high temperatures occur in the northern hemisphere; six months later the same phenomenon occurs in the southern hemisphere, as the seasons on the hemispheres are shifted half-yearly to each other. Another example is the customer behavior of people in different countries. Often it is similar, but shifted in time. For example, the customer behavior in Europe is similar to the behavior in North America, but only some months later. Obviously, a cluster tracing algorithm should detect these phenomena; however, existing methods do not, since the observed populations, i.e. the environment and the people respectively, stay at the same place, and thus there are no shared objects between clusters; only the behavior migrates. With today's complex data, patterns are often hidden in different subsets of the dimensions; for detecting these clusters with locally relevant dimensions, subspace clustering was introduced. However, even though many temporal data sets are of this kind, e.g. gridded scientific data, subspace clustering has never been used in a cluster tracing scenario. The existing cluster tracing methods can only cope with fullspace clusters, and thus cannot exploit the information mined by subspace clustering algorithms. Our novel tracing method measures the subspace similarity of clusters and thus handles subspace clusters by design. In summary, we introduce a method for tracing behavior types in temporal data; the types are represented by clusters. The decision of which clusters of consecutive time steps are mapped to each other is based on a novel distance function that tackles the challenges of object value similarity and subspace similarity. Our approach can handle the following developments: emerging or disappearing behavior, as well as distinct behaviors that converge into uniform behavior and uniform behavior that diverges into distinct behaviors. By using subspaces, we enable the following evolutions: Behavior can gain or lose characteristics; i.e., the representing subspace clusters can gain or lose dimensions over time, and clusters that have different relevant dimensions can be similar. Varying behavior can be detected; that is, to some extent the values of the representing clusters can change. Clusterings of three time steps are illustrated in Fig. 1. The upper part shows the objects; the lower part abstracts from the objects and illustrates possible clusterings of the datasets and tracings between the corresponding clusters. Note that the three time steps do not share objects, i.e. each time step corresponds to a different database from the same attribute domain {d1, d2}; to illustrate the different objects, we used varying object symbols. An example of behavior that gains characteristics is the mapping of cluster C1,1 to C2,1, i.e. the cluster gains
Fig. 1. Top: databases of consecutive time steps; bottom: possible clusterings and exemplary cluster tracings
Fig. 2. Example of a mapping graph, illustrating four time steps
one dimension. Varying behavior is illustrated by the mapping from C1,2 to C2,2 ; the values of the cluster have changed. If the databases were spatial, this could be interpreted as a movement. A behavior divergence can be seen from time step t + 1 to t + 2: the single cluster C2,1 is mapped to the two clusters C3,1 and C3,2 .
2
Related Work
Several temporal aspects of data are regarded in the literature [5]. In stream clustering scenarios, clusters are adapted to reflect changes in the observed data, i.e. the distribution of incoming objects changes [2]. A special case of stream clustering is for moving objects [10], focusing on spatial attributes. Stream clustering in general, however, gives no information about the actual cluster evolution over time [5]. For this, cluster tracing algorithms were introduced [8,13,14]; they rely on mapping clusters of consecutive time steps. These tracing methods map clusters if the corresponding object sets are similar, i.e. they are based on shared objects. We, in contrast, map clusters only if their corresponding object values are similar, independently of shared objects. That is, we trace similar types of behavior, which is a fundamentally different concept. Clustering of trajectories [7,15] can be seen as an even more limited variant of cluster tracing with similar object sets, as trajectory clusters have constant object sets that do not change over time. The work in [1] analyzes multidimensional temporal data based on dense regions that can be interpreted as clusters. The approach is designed to detect substantial changes of dense regions; however, tracing of evolving clusters that slightly change their position or subspace is not possible, especially when several time steps are observed. A further limitation of existing cluster tracing algorithms is that they can only cope with fullspace clusters. Fullspace clustering models use all dimensions in the data space [6]. For finding clusters hidden in individual dimensions, subspace clustering was introduced [4]. An overview of different subspace clustering
approaches can be found in [9], and the differences between subspace clustering approaches are evaluated in [11]. Until now, subspace clusters were only applied in streaming scenarios [3], but never in cluster tracing scenarios; deciding whether subspace clusters of varying dimensionalities are similar is a challenging issue. Our algorithm is designed for this purpose.
3
A Novel Tracing Model
Our main objective is to trace behavior types and their developments over time. First, some basic notations: For each time step t ∈ {1, . . . , T} of our temporal data we have a D-dimensional database DBt ⊆ R^D. We assume the data to be normalized between [0, 1]. A subspace cluster Ct,i = (Ot,i, St,i) at time step t is a set of objects Ot,i ⊆ DBt along with a set of relevant dimensions St,i ⊆ {1, . . . , D}. The objects are similar within these relevant dimensions. The set of all subspace clusters {Ct,1, . . . , Ct,k} at time step t is denoted the subspace clustering Clust, and each included subspace cluster represents a behavior type (e.g. a person group).
3.1 Tracing of Behavior Types
In this section, we determine whether a typical behavior in time step t continues in t + 1. Customer behavior, for example, can be imitated by another population in the next time step. Other kinds of temporal developments are the disappearance of a behavior or a split-up into different behaviors. We have to identify these temporal developments for effective behavior tracing. Formally, we need a mapping function that maps each cluster at a given time step to a set of its successors in the next time step; we denote these successors as temporal continuations. Two clusters Ct,i and Ct+1,j are mapped if they are identified as similar behaviors. We use a cluster distance function, introduced in Sec. 3.2, to measure these similarities. If the distance is small enough, the mapping is performed. Definition 1. Mapping function. Given a distance function dist for two clusters, the mapping function Mt : Clust → P(Clust+1 ) that maps a cluster to its temporal continuations is defined by Mt (Ct,i ) = {Ct+1,j | dist(Ct,i , Ct+1,j ) < τ } A cluster can be mapped to zero, one, or several clusters (1:n), and several mappings to the same cluster are possible (m:1), enabling detection of disappearance or convergence of behaviors. We describe pairs of mapped clusters by a binary relation: Rt = {(Ct,i , Ct+1,j ) | Ct+1,j ∈ Mt(Ct,i )} ⊆ Clust × Clust+1 . Each tuple corresponds to one cluster mapping, i.e. for a behavior type in t we have a similar one in the next time step t + 1. These mappings and the clusters can be represented by a mapping graph. Reconsider that it is possible to map a behavior to several behaviors in the next time step (cf. Fig. 1, t+1 → t+2). These mappings, however, are not equally important. We represent this by edge weights within the mapping graph; the weights indicate the strength of the temporal continuation. We measure similarity based on distances, and thus small weights denote a strong continuation and high weights reflect a weaker one. Formally,
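The mapping function and the edge set of the mapping graph can be sketched as follows; `dist` stands for the overall cluster distance introduced in Section 3.2, and clusters are treated as opaque objects. This is an illustrative sketch under these assumptions, not the authors' implementation.

```python
def map_clusters(clusters_t, clusters_t1, dist, tau):
    """Mapping function M_t (Definition 1): each cluster at time t is mapped to all
    clusters at time t+1 whose distance is below the threshold tau."""
    return {c: [c1 for c1 in clusters_t1 if dist(c, c1) < tau] for c in clusters_t}

def mapping_edges(clusterings, dist, tau):
    """Collect the weighted edges R_t of the mapping graph over all consecutive time steps."""
    edges = []
    for t in range(len(clusterings) - 1):
        mapping = map_clusters(clusterings[t], clusterings[t + 1], dist, tau)
        for c, successors in mapping.items():
            edges.extend((c, c1, dist(c, c1)) for c1 in successors)
    return edges
```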
Definition 2. Mapping graph. A mapping graph G = (V, E, w) is a directed and weighted graph with the following properties:
– Nodes represent clusters, i.e. V = ⋃_{t=1,...,T} Clust
– Edges represent cluster mappings, i.e. E = ⋃_{t=1,...,T−1} Rt
– Edge weights indicate the strength of the temporal continuations, i.e. ∀(Ci, Cj) ∈ E : w(Ci, Cj) = dist(Ci, Cj)
a behavior disappears, if outdegree(C) = 0 a behavior emerges, if indegree(C) = 0 a behavior diverges, if outdegree(C) > 1 different behaviors converge into a single behavior, if indegree(C) > 1
These categories show whether a behavior appears in similar ways in the subsequent time step. Since the characteristics of a behavior can naturally change over time, we also trace single behaviors over several time steps, denoted as an evolving cluster and described by a path in the mapping graph. Evolving clusters are correctly identified if specific evolution criteria are accounted in our distance function. These are presented in the following section. 3.2
Cluster Distance Measure
Our objective is to identify similar behaviors. Technically, a distance measure is needed that determines the similarity of two given clusters. Keep in mind that measuring the similarity simply based on the fraction of shared objects cannot satisfy our objective, since even totally different populations can show a similar behavior in consecutive time steps. We have to distinguish two kinds of evolution: A cluster can gain or lose characteristics, i.e. the relevant dimensions of a subspace cluster can evolve, and within the relevant dimensions the values can change. Our distance function has to reflect both aspects for effective similarity measurement of evolving clusters. Similarity based on subspaces. Each cluster represents a behavior type, and because we are considering subspace clusters, the characteristics of a behavior are restricted to a subset of the dimensions. If a behavior remains stable over time, its subspace remains also unchanged. The relevant dimensions of the underlying clusters are identical. Consider the clusters Ct,i = (Ot,i , St,i ) and Ct+1,j = (Ot+1,j , St+1,j ) of time steps t and t + 1: the represented behaviors are very similar if the dimensions St,i are also included in St+1,j . However, a behavior can lose some of its characteristics. In Fig. 1, for example, the dimension d1 is no longer relevant in time step t+2 for the behavior depicted on the bottom. Accordingly, a distance measure is reasonable if behavior types
Tracing Evolving Clusters by Subspace and Value Similarity
449
are considered to be similar even if they lose some relevant dimensions. That is, |S ∩S | the smaller the term 1 − t,i|St,it+1,j , the more similar are the clusters. | This formula alone, however, would prevent an information gain: If a cluster Ct,i evolves to Ct+1,j by spanning more relevant dimensions, this would not be assessed positively. We would get the same distance for a cluster with the same shared dimensions as Ct,i and without additional relevant dimensions as Ct+1,j . Since more dimensions mean more information, we do consider this. |S \St,i | Consequently, the smaller the term 1 − t+1,j |St+1,j | , the more similar the clusters. Usually it is more important for tracing that we retain relevant dimensions. Few shared dimensions and many new ones normally do not indicate similar behavior. Thus, we need a trade-off between retained dimensions and new (gained) dimensions. This is achieved by a linear combination of the two introduced terms: Definition 4. Distance w.r.t. subspaces. The similarity w.r.t. to subspaces between two clusters Ct,i = (Ot,i , St,i ) and Ct+1,j = (Ot+1,j , St+1,j ) is defined by
|St+1,j \St,i | S(Ct,i , Ct+1,j ) = α · 1 − |St+1,j |
|St,i ∩ St+1,j | + (1 − α) · 1 − |St,i |
with the trade-off factor α ∈ [0, 1]. By choosing α 1 − α we achieve that the similarity between two behaviors is primarily rated based on their shared dimensions. Similarity based on statistical characteristics. Besides the subspace similarity, the actual values within these dimensions are important. E.g., although two clusters share a dimension like ’income’, they can differ in their values extremely (high vs. low income); these behaviors should not be mapped. A small change in the values, however, is possible for evolving behaviors. For a spatial dimension, this change would correspond to a slight cluster movement. Given a cluster C = (O, S), we denote the set of values in dimension d with v(C, d) = {o[d] | o ∈ O}. The similarity between two clusters Ct,i and Ct+1,j is thus achieved by analyzing the corresponding sets v(Ct,i , d) and v(Ct+1,j , d). By deducing two normal distribution Xd and Yd with means μx , μy and variances σx , σy from the two sets, the similarity can be measured by the information theoretic Kullback-Leibler divergence (KL). Informally, we calculate the expected number of bits required to encode a new distribution of values at time step t + 1 (Yd ) given the original distribution of the values at time step t (Xd ). Formally, KL(Yd Xd ) = ln(
σy2 + (μy − μx )2 σx 1 )+ − =: KL(Ct,i , Ct+1,j , d) 2 σy 2σx 2
By using the KL, we do not just account for the absolute deviation of the means, but we have also the advantage of including the variances. A behavior with a high variance in a single dimension allows a higher evolution of the means for successive similar behaviors. A small variance of the values, however, only permits a smaller deviation of the means.
W
^GG`
W
^GG`
&
FRUH ^G«G`
&
^G«G`
^G«G`
^G«G`
S. G¨ unnemann et al.
^G«G`
450
&
FRUH ^G«G`
W
^GG`
W
& ^GG`
Fig. 3. Core dimensions in a 7-dim. space. (left: databases; right: clusters)
We use the KL for the similarity per dimension, and the overall similarity is attained by cumulating over several dimensions. Apparently, we just have to use dimensions that are in the intersection of both clusters. The remaining dimensions are non-relevant for at least one cluster and hence are already penalized by our subspace distance function. Our first approach for computing the similarity based on statistical characteristics is V (Ct,i , Ct+1,j , I) = ( KL(Ct,i , Ct+1,j , d))/|I| (1) d∈I
with I = St,i ∩ St+1,j for averaging. In a perfect scenario this distance is a good way to trace behaviors. In practice, however, we face the following problem: Consider Fig. 3 (note the 7-dim. space) with the cluster C1,2 at time step t and the cluster C2,2 with the same relevant dimensions in t+1. However, C2,2 is shifted in dimensions d1 and d2 ; the distance function proposed above (Eq. 1) would determine a very high value and hence the behaviors would not be mapped. A large part {d3 , ..., d7 } of the shared relevant dimensions {d1 , ..., d7 }, however, show nearly the same characteristics in both clusters. The core of the behaviors is completely identical, and thus a mapping is reasonable; as illustrated by the mapping in the right part of Fig. 3. Consider another example: the core of the customer behaviors of North Americans and Europeans is identical; however, North Americans and Europeans have further typical characteristics like their favorite sport (baseball vs. soccer). These additional, non-core, dimensions provide us with further informations about the single clusters at their current time step. They are mainly induced by the individual populations. For the continuation of the behavior, however, these dimensions are not important. Note that non-core dimensions are a different concept than non-relevant ones; non-core dimensions are shared relevant ones with differing values. In other words, there are two different kinds of relevant dimensions: one for subspace clusters and one for tracing of subspace clusters. An effective distance function between clusters has to identify the core of the behaviors and incorporate it into the distance. We achieve this by using a subset Core ⊆ St,i ∩ St+1,j for comparing the values in Eq. 1 instead of the whole intersection. Unfortunately, this subset is not known in advance, and it is not reasonable to exclude dimensions from the distance calculation by a fixed threshold if the corresponding dissimilarity is too large. Thus, we develop a variant to automatically determine the core. We choose the ’best’ core among all possible cores for the given two clusters. That is, for each possible core we determine the
Tracing Evolving Clusters by Subspace and Value Similarity
451
distance w.r.t. their value distributions, and we additionally penalize dimensions not included in the core. The core with the smallest overall distance is selected, i.e. we trade off the size of the core against the value V (Ct,i , Ct+1,j , Core): Definition 5. The core-based distance function w.r.t. values for two clusters Ct,i = (Ot,i , St,i ) and Ct+1,j = (Ot+1,j , St+1,j ) is defined by |N onCore| V (Ct,i , Ct+1,j ) = min β· +(1−β)·V (Ct,i , Ct+1,j , Core) Core⊆St,i ∩St+1,j |St,i ∩ St+1,j | ∧|Core|>0
with the penalty factor β ∈ [0, 1] for dimensions N onCore = (St,i ∩St+1,j )\Core. By selecting a smaller core, the first part of the distance formula enlarges. The second part, however, gains the possibility of determining a smaller value. The core must comprise at least one dimension; otherwise, we could map two clusters even if they have no dimensions with similar characteristics. Overall distance function. To correctly identify the evolving clusters in our temporal data we have to consider evolutions in the relevant dimensions as well as in the value distributions. Thus, we have to use both distance measures simultaneously. Again, we require that two potentially mapped clusters share at least one dimension; otherwise, these clusters cannot represent similar behaviors. Definition 6. The Overall distance function for clusters Ct,i = (Ot,i , St,i ) and Ct+1,j = (Ot+1,j , St+1,j ) with |St,i ∩ St+1,j | > 0 is defined by dist(Ct,i , Ct+1,j ) = γ · V (Ct,i , Ct+1,j ) + (1 − γ) · S(Ct,i , Ct+1,j ) with γ ∈ [0, 1]. In the case of |St,i ∩ St+1,j | = 0, the distance is set to ∞. 3.3
Clustering for Improved Tracing Quality
Until now, we assume a given clustering per time step such that we can determine the distances and the mapping graph. In general, our tracing model is independent of the used clustering method. However, since there are temporal relations between consecutive time steps, we develop a clustering method whose accuracy is improved by these relations and that avoids totally different clusterings in consecutive time steps. A direct consequence is an improved tracing effectiveness. We adapt the effective cell-based clustering paradigm [12,16,11], where clusters are approximated by hypercubes with at least minSup many objects. The extent of a hypercube is restricted to w in its relevant dimensions. Definition 7. Hypercube and valid subspace cluster. A hypercube HS with the relevant dimensions S is defined by lower and upper bounds HS = [low1 , up1 ] × [low2 , up2 ] × . . . × [lowD , upD ] with upi − lowi ≤ w ∀i ∈ S and lowi = −∞, upi = ∞ ∀i ∈ S. The mean of HS is called mHS . The hypercube HS represents all objects Obj(HS ) ⊆ DB with o ∈ Obj(HS ) ⇔ ∀d ∈ {1, . . . , D} : lowd ≤ o[d] ≤ upd . A subspace cluster C = (O, S) is valid ⇔ ∃HS with Obj(HS ) = O and |Obj(HS )| ≥ minSup.
452
S. G¨ unnemann et al.
We now introduce how temporal relations between time steps can be exploited. Predecessor information. We assume an initial clustering at time step t = 1. (We discuss this later.) Caused by the temporal aspect of the data, clusters at a time step t occur with high probability in t + 1 — not identical, but similar. Given a cluster and the corresponding hypercube HS at time step t, we try to find a cluster at the next time step in a similar region. We use a Monte Carlo approach, i.e. we draw a random point mt+1 ∈ RD that represents the initiator of a new hypercube and that is nearby the mean mHS of HS . After inducing an hypercube by an initiator, the corresponding cluster’s validity is checked. The quantity of initiators is calculated by a formula introduced in [16]. Definition 8. Initiator of a hypercube. A point p ∈ RD , called initiator, together with a width w and a subspace S induces a hypercube HSw (p) defined by ∀d ∈ S : lowd = p[d] − w2 , upd = p[d] + w2 and ∀i ∈ S : lowi = −∞, upi = ∞. Formally, the initiator mt+1 is drawn from the region HS2w (mHS ), permitting a change of the cluster. The new hypercube is then HSw (mt+1 ). With this method we detect changes in the values; however, also the relevant dimensions can change: The initiator mt+1 can induce different hypercubes for different relevant dimensions S. Accordingly, beside the initiator, we have to determine the relevant subspace of the new cluster. The next section discusses both issues. Determining the best cluster. A first approach is to use a quality function [12,16]: μ(HS ) = Obj(HS ) · k |S| . The more objects or the more relevant dimensions are covered by the cluster, the higher is its quality. These objectives are contrary: a trade-off is realized with the parameter k. In time step t + 1 we could choose the subspace S that maximizes μ(HSw (mt+1 )). This method, however, optimizes the quality of each single cluster; it is not intended to find good tracings. Possibly, the distance between each cluster from the previous clustering Clust and our new cluster is large, and we would find no similar behaviors. Our solution is to directly integrate the distance function dist into the quality function. Consequently, we choose the subspace S such that the hypercube HSw (mt+1 ) maximizes our novel distance based quality function. Definition 9. Distance based quality function. Given the hypercube HS in subspace S and a clustering Clust , the distance based quality function is q(HS ) = μ(HS ) · (1 −
min {dist(Ct , CS )})
Ct ∈Clust
where C_S denotes the subspace cluster induced by the hypercube H_S. We enhance the quality of the clustering by selecting a set of possible initiators M from the specified region; this is also important because the direction of a cluster change is not known in advance. From the resulting set of potential clusters, we select the one with the highest quality. Overall, for each cluster C ∈ Clus_t a potential temporal continuation is identified in time step t + 1. Nonetheless, it is also possible that no valid hypercube is identified for a single cluster C ∈ Clus_t. This indicates that a behavior type has disappeared in the current time step.
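As a sketch of this selection step (illustrative Python, not the authors' implementation): `objects_in_hypercube` is an assumed helper returning Obj(H_S^w(p)), `dist` stands for the cluster distance used by the tracing model, and k, w, min_sup are the parameters introduced above.

```python
def best_cluster_for_initiator(initiator, candidate_subspaces, data,
                               prev_clusters, w, k, min_sup, dist):
    """Return the subspace cluster induced by the initiator that maximizes
    q(H_S) = mu(H_S) * (1 - min_{C_t in Clus_t} dist(C_t, C_S)), plus its q value."""
    best, best_q = None, float("-inf")
    for S in candidate_subspaces:
        members = objects_in_hypercube(data, initiator, S, w)  # Obj(H_S^w(initiator)), assumed helper
        if len(members) < min_sup:                             # not a valid cluster (Definition 7)
            continue
        mu = len(members) * k ** len(S)                        # mu(H_S) = |Obj(H_S)| * k^|S|
        closest = min((dist(c, (members, S)) for c in prev_clusters), default=0.0)
        q = mu * (1.0 - closest)
        if q > best_q:
            best_q, best = q, (frozenset(members), frozenset(S))
    return best, best_q
```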
Uncovered objects and the initial clustering. When behavior emerges or disappears, some objects of the current time step are not part of any identified cluster: if we denote the set of clusters generated so far by Clus_{t+1}, the set

Remain_{t+1} := DB_{t+1} \ ∪_{C_i = (O_i, S_i) ∈ Clus_{t+1}} O_i

can still contain objects and therefore clusters. Especially for the initial clustering at time step t = 1 we have no predecessor information and hence Clus_1 = ∅. To discover as many patterns as possible, we have to check whether the objects within Remain_{t+1} induce novel clusters. We draw a set of initiators M ⊆ Remain_{t+1}, where each m ∈ M induces a set of hypercubes H_S^w(m) in different subspaces. Finally, we choose the hypercube that maximizes our quality function. If this hypercube is a valid cluster, we add it to Clus_{t+1}, and thus Remain_{t+1} is reduced. This procedure is repeated until no valid cluster is identified or the set Remain_{t+1} is empty; a sketch of this loop is given below. Note that our method has the advantage of generating overlapping clusters.
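The loop over the uncovered objects referenced above could be sketched as follows; it reuses `best_cluster_for_initiator` from the previous sketch and assumes a helper `candidate_subspaces_for` that enumerates the subspaces to try for a drawn initiator.

```python
import random

def cover_remaining_objects(data, clusters, w, k, min_sup, dist, n_initiators=30):
    """Repeatedly draw initiators from Remain and add the best valid cluster,
    until no valid cluster is found or every object is covered (sketch)."""
    while True:
        covered = set().union(*(m for m, _ in clusters)) if clusters else set()
        remain = [i for i in range(len(data)) if i not in covered]
        if not remain:
            break
        best, best_q = None, float("-inf")
        for idx in random.sample(remain, min(n_initiators, len(remain))):
            cand, q = best_cluster_for_initiator(data[idx], candidate_subspaces_for(data[idx]),
                                                 data, clusters, w, k, min_sup, dist)
            if cand is not None and q > best_q:
                best, best_q = cand, q
        if best is None or best in clusters:   # no new valid hypercube: stop
            break
        clusters.append(best)
    return clusters
```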
4 Experiments
Setup. We use real world and synthetic data for evaluation. The real world data are scientific grid data reflecting oceanographic characteristics such as temperature and salinity of the oceans (provided by the Alfred Wegener Institute for Polar and Marine Research, Germany). They contain 20 time steps, 8 dimensions, and 71,430 objects. The synthetic data cover 24 time steps and 20 dimensions. On average, each time step contains 10 clusters with 5-15 relevant dimensions. We hide developments (emerge, converge, diverge, or disappear) and evolutions (subspace and value changes) within the data. In our experiments we concentrate on the quality of our approach. For synthetic data, the correct mappings between the clusters are given. Based on the detected mappings we calculate precision and recall values: we check whether all, but only, the true mappings between clusters are detected. For the tracing quality we use the F1 value, the harmonic mean of recall and precision. Our approach tackles the problem of tracing clusters with varying subspaces and is based on object-value similarity. Even if we constrained our approach to handle only full-space clusters as existing solutions do, such a comparison would only be possible by artificially adding object ids to the data (to be used by these solutions). Tracing clusters based on such artificial object ids, however, cannot reflect the ground truth in the data. In summary, comparisons to other approaches are not performed since they would be unfair. We use Opteron 2.3 GHz CPUs and Java 6 (64 bit).

Tracing quality. First, we analyze how the parameters affect the tracing effectiveness. For lack of space, we only present a selection of the experiments. For α, a default value of 0.1 was empirically determined. γ is evaluated in Fig. 4 for three different τ values using synthetic data. By γ we determine the trade-off between subspace similarity and value similarity in our overall distance function. Obviously we want to prevent extreme cases for effective tracing, i.e. subspace similarity with no attribute similarity at all (γ → 0), or vice versa. This is confirmed by the figure, as the tracing quality degrades sharply when γ approaches 0 or 1, for all τ values. As γ = 0.3 enables a good tracing quality for all three τ values, we use this value as default.
Fig. 4. Tracing quality for different γ & τ (tracing quality vs. γ, the trade-off between values & subspaces, for τ = 0.1, 0.3, 0.6)
Fig. 5. Evaluation of the core dimension concept (tracing quality and number of non-core dimensions vs. β, the penalty for non-core dimensions)
Note that with the threshold τ we can directly influence how many cluster mappings are created. τ = 0.1 is a good trade-off and is used as default. With a larger τ the tracing quality worsens: too many mappings are created and we cannot distinguish between meaningful and meaningless mappings. The same is true for τ → 0: no clusters are mapped and thus the clustering quality drops to zero; we therefore excluded plots for τ → 0.

The core dimension concept is evaluated in Fig. 5. We analyze the influence on the tracing quality (left axis) with a varying β on the x-axis; i.e., we change the penalty for non-core dimensions. Note that non-core dimensions are a different concept from non-relevant ones; non-core dimensions are shared relevant dimensions with differing values. The higher the penalty, the more dimensions are included in the dimension core; i.e., more shared dimensions are used for the value-based similarity. In a second curve, we show the absolute number of non-core dimensions (right axis) for the different penalties: the number decreases with higher penalties. In this experiment the exact number of non-core dimensions in the synthetic data is 10. We can draw the following conclusions regarding tracing quality: a forced usage of a full core (β → 1) is a bad choice, as there can be some shared dimensions with different values. By lowering the penalty we allow some dimensions to be excluded from the core and thus we can increase the tracing quality. With β = 0.1 the highest tracing quality is obtained; this is plausible, as the number of non-core dimensions then corresponds to the number present in the data. Too low a penalty, however, results in excluding nearly all dimensions from the core (many non-core dimensions, β → 0) and dropping quality. In the experiments, we use β = 0.1 as default.

Detection of behavior developments. Next we analyze whether our model is able to detect the different behavior developments. Up to now, we used our enhanced clustering method that utilizes the predecessor information and the distance-based quality function. Now, we additionally compare this method with a variant that performs the clustering of each time step independently. In Fig. 6 we use the oceanographic dataset and determine for each time step the number of disappeared behaviors for each clustering method. The experiment indicates that the number of unmapped clusters for the approach without any predecessor or distance information is larger than for our enhanced approach. By transferring the clustering information between the time steps, a larger number of clusters from one time step to the next can be mapped. We map clusters over a longer time period, yielding a more effective tracing of evolving clusters.
Fig. 6. Effects of predecessor information & distance quality function (number of disappeared clusters per time step, with and without predecessor & distance information)
Fig. 7. Cumulated number of evolutions & developments over 24 time steps (calculated vs. intended numbers of occurrences for dimension gain, dimension loss, emerge, disappear, converge, diverge)
Fig. 8. Number of evolutions & developments on real world data; left: cumulated over 20 time steps, right: for each time step
The aim of tracing is not just to map similar clusters but also to identify different kinds of evolution and development. In Fig. 7 we plot the number of clusters that gain or lose dimensions and the four kinds of development cumulated over all time steps. Besides the numbers our approach detects, we show the intended number based on this synthetic data. The first four bars indicate that our approach is able to handle dimension gains or losses; i.e., we enable subspace cluster tracing, which is not considered by other models. The remaining bars show that the developments can also be accurately detected. Overall, the intended transitions are found by our tracing. In Fig. 8 we perform a similar experiment on real world data. We report only the detected number of patterns because exact values are not given. On the left we cumulate over all time steps. Again, our approach traces clusters with varying dimensions. Accordingly, on real world data it is a relevant scenario that subspace clusters lose some of their characteristics, and it is mandatory to use a tracing model that handles these cases. The developments are also identified in this real world data. To show that the effectiveness is not restricted to single time steps, we analyze the detected patterns for each time step individually on the right. Based on the almost constant slopes of all curves, we can see that our approach performs effectively.
5 Conclusion
In this paper, we proposed a model for tracing evolving subspace clusters in high dimensional temporal data. In contrast to existing methods, we trace clusters
based on their behavior; that is, clusters are not mapped based on the fraction of objects they have in common, but on the similarity of their corresponding object values. We enable effective tracing by introducing a novel distance measure that determines the similarity between clusters; this measure comprises subspace and value similarity, reflecting how much a cluster has evolved. In the experimental evaluation we showed that high quality tracings are generated. Acknowledgments. This work has been supported by the UMIC Research Centre, RWTH Aachen University.
References
1. Aggarwal, C.C.: On change diagnosis in evolving data streams. TKDE 17(5), 587–600 (2005)
2. Aggarwal, C.C., Han, J., Wang, J., Yu, P.S.: A framework for clustering evolving data streams. In: VLDB, pp. 81–92 (2003)
3. Aggarwal, C.C., Han, J., Wang, J., Yu, P.S.: A framework for projected clustering of high dimensional data streams. In: VLDB, pp. 852–863 (2004)
4. Agrawal, R., Gehrke, J., Gunopulos, D., Raghavan, P.: Automatic subspace clustering of high dimensional data for data mining applications. In: SIGMOD, pp. 94–105 (1998)
5. Böttcher, M., Höppner, F., Spiliopoulou, M.: On exploiting the power of time in data mining. SIGKDD Explorations 10(2), 3–11 (2008)
6. Ester, M., Kriegel, H.P., Sander, J., Xu, X.: A density-based algorithm for discovering clusters in large spatial databases with noise. In: KDD, pp. 226–231 (1996)
7. Gaffney, S., Smyth, P.: Trajectory clustering with mixtures of regression models. In: KDD, pp. 63–72 (1999)
8. Kalnis, P., Mamoulis, N., Bakiras, S.: On discovering moving clusters in spatio-temporal data. In: Anshelevich, E., Egenhofer, M.J., Hwang, J. (eds.) SSTD 2005. LNCS, vol. 3633, pp. 364–381. Springer, Heidelberg (2005)
9. Kriegel, H.P., Kröger, P., Zimek, A.: Clustering high-dimensional data: A survey on subspace clustering, pattern-based clustering, and correlation clustering. TKDD 3(1), 1–58 (2009)
10. Li, Y., Han, J., Yang, J.: Clustering moving objects. In: KDD, pp. 617–622 (2004)
11. Müller, E., Günnemann, S., Assent, I., Seidl, T.: Evaluating clustering in subspace projections of high dimensional data. In: VLDB, pp. 1270–1281 (2009)
12. Procopiuc, C.M., Jones, M., Agarwal, P.K., Murali, T.M.: A Monte Carlo algorithm for fast projective clustering. In: SIGMOD, pp. 418–427 (2002)
13. Rosswog, J., Ghose, K.: Detecting and tracking spatio-temporal clusters with adaptive history filtering. In: ICDM Workshops, pp. 448–457 (2008)
14. Spiliopoulou, M., Ntoutsi, I., Theodoridis, Y., Schult, R.: MONIC - modeling and monitoring cluster transitions. In: KDD, pp. 706–711 (2006)
15. Vlachos, M., Gunopulos, D., Kollios, G.: Discovering similar multidimensional trajectories. In: ICDE, pp. 673–684 (2002)
16. Yiu, M.L., Mamoulis, N.: Frequent-pattern based iterative projected clustering. In: ICDM, pp. 689–692 (2003)
An IFS-Based Similarity Measure to Index Electroencephalograms
Ghita Berrada and Ander de Keijzer
MIRA - Institute for Biomedical Technology and Technical Medicine, University of Twente, P.O. Box 217, 7500 AE Enschede, The Netherlands
{g.berrada,a.dekeijzer}@utwente.nl
Abstract. EEG is a very useful neurological diagnosis tool, inasmuch as the EEG exam is easy to perform and relatively cheap. However, it generates large amounts of data, not easily interpreted by a clinician. Several methods have been tried to automate the interpretation of EEG recordings. However, their results are hard to compare since they are tested on different datasets. This means a benchmark database of EEG data is required. However, for such a database to be useful, we have to solve the problem of retrieving information from the stored EEGs without having to tag each and every EEG sequence stored in the database (which can be a very time-consuming and error-prone process). In this paper, we present a similarity measure, based on iterated function systems, to index EEGs. Keywords: clustering, indexing, electroencephalograms (EEG), iterated function systems (IFS).
1 Introduction
An electroencephalogram (EEG) captures the brain's electric activity through several electrodes placed on the scalp (usually 21 in the International 10/20 System). The result is a multidimensional time series (19 channels in the International 10/20 System). An EEG signal can be classified into several types of cerebral waves characterised by their frequencies, amplitudes, morphology, stability, topography and reactivity. The interpretation of the sequence of cerebral waves, their localisation and context of occurrence (e.g. eyes-closed EEG or sleep EEG) leads to a diagnosis. The complexity of the sequences of cerebral waves, the non-specificity of EEG recordings (for example, without any context being given, the EEG recording of a chewing artifact can be mistaken for that of a seizure, see Figure 1) and the amount of data generated make the interpretation process difficult, time-consuming and error-prone. Consequently, the interpretation process is being automated, at least in part, through several methods that mostly consist in extracting features from EEGs and applying classification algorithms to the sets
Fig. 1. EEG with a chewing artifact
of extracted features to discriminate between two different patient states (usually the "normal" state and a pathological state). For example, empirical mode decomposition and Fourier-Bessel expansion are used in [13] to discriminate between ictal EEGs (i.e. EEGs of an epileptic seizure) and seizure-free EEGs. The interpretation methods are usually tested on different datasets. To make them comparable, a benchmark database of EEGs is required. Such a database has to be designed so as to be able to handle queries in natural language such as the following sample queries:
1. find EEGs of non-convulsive status epilepticus
2. find EEGs showing rhythms associated with consumption of benzodiazepines and remove all artefacts from them
Obtaining a simple answer to this set of queries would require the EEG dataset to be heavily and precisely annotated and tagged. But what if the annotations are scarce or not available? Furthermore, the whole process of annotating and tagging each and every sequence of the EEG dataset is time-consuming and error-prone. This means that feature extraction techniques are necessary to solve all of these queries, since they can help define a set of clinical features representative of a particular pathology (query 1) or detect particular sets of patterns and process the EEG based on them (query 2). EEG recordings correspond to very diverse conditions (e.g. "normal" state, seizure episodes, Alzheimer's disease). Therefore, a generic method to index EEGs without having to deal with disease-specific features is required (as the number of disease-specific classifiers would otherwise grow exponentially). Generic methods to index time series often rely on the definition of a similarity measure. Some of the similarity measures proposed include a function interpolation step, be it piecewise linear interpolation or interpolation with AR (as in [8], to distinguish between normal EEGs and EEGs originating from the injured brain undergoing transient global ischemia) or ARIMA models, that can be followed by a feature extraction step (e.g. computation of LPC cepstral coefficients from the ARIMA model of the time series as in [9]). However, ARIMA/AR methods assume that the EEG signal is stationary, which is not a valid assumption. In fact, EEG signals can only be considered as stationary during short intervals, especially intervals of normal background activity, but
the stationarity assumption does not hold during episodes of physical or mental activity, such as changes in alertness and wakefulness, during eye blinking and during transitions between various ictal states. Therefore, EEG signals are quasi-stationary. In view of that, we propose a similarity measure based on IFS interpolation to index EEGs in this paper, as fractal interpolation does not assume stationarity of the data and can adequately model complex structures. Moreover, using fractal interpolation makes computing features such as the fractal dimension simple (see theorem 21 for the link between fractal interpolation parameters and fractal dimension) and the fractal dimension of EEGs is known to be a relevant marker for some pathologies such as dementia (see [7]).
2 Background
2.1 Fractal Interpolation
Fractal dimension. Any given time series can be viewed as the observed data generated by an unknown manifold or attractor. One important property of this attractor is its fractal dimension. The fractal dimension of an attractor counts the effective number of degrees of freedom in the dynamical system and therefore quantifies its complexity. It can also be seen as the statistical quantity that indicates how completely a fractal object appears to fill space as one zooms down to finer and finer scales. Another dimension, called the topological dimension or Lebesgue covering dimension, is also defined for any object and a fortiori for the attractor. A space has Lebesgue covering dimension n if for every open cover of that space, there is an open cover that refines it such that the refinement has order at most n + 1. (A covering of a subset S is a collection C of open subsets of X whose union contains all of S; a subset S ⊂ X is open if it is an arbitrary union of open balls in X, where an open ball in a metric space X is a subset of the form B(x_0, ε) = {x ∈ X | d(x, x_0) < ε}, with x_0 a point of X and ε a radius. A refinement of a covering C of S is another covering C′ of S such that each set B in C′ is contained in some set A in C.) For example, the topological dimension of the Euclidean space R^n is n. The attractor of a time series can be fractal (i.e. its fractal dimension is higher than its topological dimension) and is then called a strange attractor. The fractal dimension is generally a non-integer or fractional number. Typically, for a time series, the fractal dimension lies between 1 and 2, since the (topological) dimension of a plane is 2 and that of a line is 1. The fractal dimension has been used to:
– uncover patterns in datasets and cluster data ([10,2,15])
– analyse medical time series ([14,6]) such as EEGs ([1,7])
– determine the number of features to be selected from a dataset for a similarity search while obviating the "dimensionality curse" ([12])

Iterated function systems. We denote as K a compact metric space for which a distance function d is defined and as C(K) the space of continuous functions
on K. We define over K a finite collection of mappings W = {w_i}_{i ∈ [1,n]} and their associated probabilities {p_i}_{i ∈ [1,n]} such that

p_i ≥ 0  and  \sum_{i=1}^{n} p_i = 1.

We also define an operator T on C(K) as (Tf)(x) = \sum_{i=1}^{n} p_i (f ∘ w_i)(x). If T maps C(K) into itself, then the pair (w_i, p_i) is called an iterated function system on (K, d). The condition on T is satisfied for any set of probabilities p_i if the transformations w_i are contracting, in other words, if, for any i, there exists a δ_i < 1 such that d(w_i(x), w_i(y)) ≤ δ_i d(x, y) for all x, y ∈ K. The IFS is also denoted as hyperbolic in this case.

Principle of fractal interpolation. If we define a set of points {(x_i, F_i) ∈ R^2 : i = 0, 1, ..., n} with x_0 < x_1 < ... < x_n, then an interpolation function corresponding to this set of points is a continuous function f : [x_0, x_n] → R such that f(x_i) = F_i for i ∈ [0, n]. In fractal interpolation, the interpolation function is often constructed with n affine maps of the form

w_i \begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} a_i & 0 \\ c_i & d_i \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} + \begin{pmatrix} e_i \\ f_i \end{pmatrix}, \quad i = 1, 2, ..., n,

where d_i is constrained to satisfy −1 ≤ d_i ≤ 1. Furthermore, we have the following constraints:

w_i \begin{pmatrix} x_0 \\ y_0 \end{pmatrix} = \begin{pmatrix} x_{i-1} \\ y_{i-1} \end{pmatrix}  and  w_i \begin{pmatrix} x_n \\ y_n \end{pmatrix} = \begin{pmatrix} x_i \\ y_i \end{pmatrix}.

After determining the contraction parameter d_i, we can estimate the four remaining parameters (namely a_i, c_i, e_i, f_i):

a_i = \frac{x_i − x_{i−1}}{x_n − x_0}    (1)
c_i = \frac{y_i − y_{i−1}}{x_n − x_0} − d_i \frac{y_n − y_0}{x_n − x_0}    (2)
e_i = \frac{x_n x_{i−1} − x_0 x_i}{x_n − x_0}    (3)
f_i = \frac{x_n y_{i−1} − x_0 y_i}{x_n − x_0} − d_i \frac{x_n y_0 − x_0 y_n}{x_n − x_0}    (4)
d_i can be determined using the geometrical approach given in [11]. Let t be a time series with end-points (x_0, y_0) and (x_n, y_n), and (x_p, y_p) and (x_q, y_q) two consecutive interpolation points, so that the map parameters desired are those defined for w_p. We also define α as the maximum height of the entire function measured from the line connecting the end-points (x_0, y_0) and (x_n, y_n), and β as the maximum height of the curve measured from the line connecting (x_p, y_p) and (x_q, y_q). α and β are positive (respectively negative) if the maximum value is reached above the line (respectively below the line). The contraction factor d_p is then defined as β/α. This procedure is also valid when the contraction factor is computed for an interval instead of for the whole function; the end-points are then taken to be the end-points of the interval. For more details on fractal interpolation, see [3,11].
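An illustrative Python sketch of these computations (our own reading of equations (1)-(4) and of the geometric estimate of d_p; the sampling conventions and function names are assumptions):

```python
def affine_map_params(x0, y0, xn, yn, x_prev, y_prev, x_i, y_i, d_i):
    """Compute a_i, c_i, e_i, f_i of the affine map w_i (equations (1)-(4)),
    given the section end-points, the two interpolation points of the current
    interval and the contraction factor d_i."""
    denom = xn - x0
    a = (x_i - x_prev) / denom
    c = (y_i - y_prev) / denom - d_i * (yn - y0) / denom
    e = (xn * x_prev - x0 * x_i) / denom
    f = (xn * y_prev - x0 * y_i) / denom - d_i * (xn * y0 - x0 * yn) / denom
    return a, c, e, f

def contraction_factor(series, p, q):
    """Geometric estimate of d_p = beta/alpha: signed maximum deviation of the
    curve from the chord over samples p..q, divided by the same quantity over
    the whole series (series: list of (x, y) samples)."""
    def signed_max_height(points):
        (xa, ya), (xb, yb) = points[0], points[-1]
        heights = [y - (ya + (yb - ya) * (x - xa) / (xb - xa)) for x, y in points]
        return max(heights, key=abs)   # keeps the above/below sign
    alpha = signed_max_height(series)
    beta = signed_max_height(series[p:q + 1])
    return beta / alpha if alpha != 0 else 0.0
```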
Estimation of the fractal dimension from a fractal interpolation. The theorem that links the fractal interpolation function and its fractal dimension is given in [3]. The theorem is as follows:

Theorem 21. Let n be a positive integer greater than 1, {(x_i, F_i) ∈ R^2 : i = 1, 2, ..., n} a set of points and {R^2; w_i, i = 1, 2, ..., n} an IFS associated with the set of points, where

w_i \begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} a_i & 0 \\ c_i & d_i \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} + \begin{pmatrix} e_i \\ f_i \end{pmatrix}  for i = 1, 2, ..., n.

The vertical scaling factors d_i satisfy 0 ≤ d_i < 1 and the constants a_i, c_i, e_i and f_i are defined as in Section 2.1 (in equations (1), (2), (3) and (4)) for i = 1, 2, ..., n. We denote by G the attractor of the IFS, such that G is the graph of a fractal interpolation function associated with the set of points. If \sum_{i=1}^{n} |d_i| > 1 and the interpolation points do not lie on a straight line, then the fractal dimension of G is the unique real solution D of \sum_{i=1}^{n} |d_i| a_i^{D−1} = 1.

2.2 K-Medoid Clustering

An m × m symmetric similarity matrix S can be associated with the EEGs to be indexed (with m being the number of EEGs to index):

S = \begin{pmatrix} d_{11} & d_{12} & \ldots & d_{1m} \\ d_{12} & d_{22} & \ldots & d_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ d_{1m} & d_{2m} & \ldots & d_{mm} \end{pmatrix}    (5)

where d_{nm} is the distance between EEGs n and m.
Given the computed similarity matrix S (defined by equation (5)), we can use the k-medoids algorithm to cluster the EEGs. This algorithm requires the number of clusters k to be known; we describe our choice of the number of clusters below, in Section 2.3. The k-medoids algorithm is similar to k-means and can be applied through the use of the EM algorithm. Initially, k random elements are chosen as representatives of the k clusters. At each iteration, a representative element of a cluster is replaced by a randomly chosen non-representative element of the cluster if the selected criterion (e.g. mean-squared error) is improved by this choice. The data points are then reassigned to their closest cluster, given the new cluster representative elements. The iterations stop when no reassignment is possible. We use the PyCluster function kmedoids described in [5] for our k-medoids clustering.
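As an illustration, clustering the EEGs from a precomputed matrix with the PyCluster k-medoids routine could look as follows (a sketch assuming the Pycluster package referenced as [5] is available; since the similarity values in this paper already behave like normalized distances, the matrix is passed directly as the distance argument):

```python
import numpy as np
import Pycluster  # the PyCluster library referenced in [5]

def cluster_eegs(similarity, k, npass=100):
    """Cluster EEGs by k-medoids from an (m, m) symmetric matrix with values
    in [0, 1], where values close to 1 mean 'far apart'."""
    distance = np.asarray(similarity, dtype=float)
    clusterid, error, nfound = Pycluster.kmedoids(distance, nclusters=k, npass=npass)
    return clusterid  # entry i is the index of the medoid EEG of i's cluster
```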
2.3 Choice of Number of Clusters
The number of clusters in the dataset is estimated based on the similarity matrix obtained following the steps in section 3 and using the method described in [4]. The method described in [4] takes the similarity matrix and outputs a vector called envelope intensity associated to the similarity matrix. The number of distinct regions in the plot of the envelope intensity versus the index gives an estimation of the number of clusters. For details on how the envelope intensity vector is computed, see [4].
3 An IFS-Based Similarity Measure
3.1 Fractal Interpolation Step
We interpolate each channel of each EEG (except the annotations channel) using piecewise fractal interpolation. For this purpose, we split each EEG channel into windows and then estimate the IFS for each window. This implies that a few parameters, namely the window size and therefore the embedding dimension, have to be determined before estimating the piecewise fractal interpolation function for each channel. The embedding dimension is determined thanks to Takens' theorem, which states that, for the attractor of a time series to be reconstructed correctly (i.e. the same information content is found in the state (latent) and observation spaces), the embedding dimension m must satisfy m > 2D + 1, where D is the dimension of the attractor, in other words its fractal dimension. Since the fractal dimension of a time series is between 1 and 2, we can get a satisfactory embedding dimension as long as m > 2·2 + 1, i.e. m > 5. We therefore choose an embedding dimension equal to 6, and we choose the lag τ between different elements of the delay vector to be equal to the average duration of an EEG data record, i.e. 1 s. Therefore, we split our EEGs into (non-overlapping) windows of 6 seconds. A standard 20-minute EEG (which therefore contains about 1200 data records of 1 second) would then be split into about 200 windows of 6 seconds. Each window is subdivided into intervals of one second each, and the end-points of these intervals are taken as interpolation points. This means there are 7 interpolation points per window: the starting point p0 of the window, the point one second away from p0, the point two seconds away from p0, the point three seconds away from p0, the point four seconds away from p0, the point five seconds away from p0 and the last point of the window. The algorithm (inspired from [11]) to compute the fractal interpolation function per window is as follows:
1. Choose, as an initial point, the starting point of the interval considered (the first interval considered is the interval corresponding to the first second of the window).
2. Choose, as the end point of the interval considered, the next interpolation point.
3. Compute the contraction factor d for the interval considered.
4. If |d| > 1 go to 2, otherwise go to 5.
5. Form the map w_i associated with the interval considered. In other words, compute the a, c, e and f parameters associated with the interval (see equations (1)-(4)). Apply the map to the entire window (i.e. the six-second window) to yield w_i(x, y) for all points (x, y) in the window.
6. Compute and store the distance between the original values of the time series on the interval considered (i.e. the interval constructed in steps 2 and 3) and the values given by w_i on that interval. A possible distance is the Euclidean distance.
7. Go to 2 until the end of the window is reached.
8. Store the interpolation points and contraction factor which yield the minimum distance between the original values on the interval and the values yielded by the computed map, under the influence of each individual map in steps 5 and 6.
9. Repeat steps 1 to 8 for each window of the EEG channel.
10. Apply steps 1 to 9 to all EEG channels.
A simplified sketch of this per-window loop is given after this list.
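The sketch below is illustrative Python under our own simplifying assumptions; it reuses `contraction_factor` and `affine_map_params` from the sketch in Section 2.1 and, unlike steps 6-8, keeps the first interval with a valid contraction factor rather than the minimum-distance one:

```python
def interpolate_window(window, interp_idx):
    """Fit one piecewise fractal interpolation to a 6-second window.
    window: list of (x, y) samples; interp_idx: indices of the interpolation
    points (one per second plus the window end). Returns one
    (start, end, d, a, c, e, f) tuple per accepted interval."""
    x0, y0 = window[0]
    xn, yn = window[-1]
    maps, start = [], 0
    for end in interp_idx[1:]:
        d = contraction_factor(window, start, end)
        if abs(d) > 1:          # invalid contraction factor: extend the interval
            continue
        xp, yp = window[start]
        xi, yi = window[end]
        a, c, e, f = affine_map_params(x0, y0, xn, yn, xp, yp, xi, yi, d)
        maps.append((start, end, d, a, c, e, f))
        start = end             # next interval begins at the current end point
    return maps
```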
3.2 Fractal Dimensions Estimation
After this fractal interpolation step, each window of each signal is represented by 5 parameters instead of signal frequency × window duration points. The dimension of the analysed time series is therefore reduced in this step. For a standard 20-minute EEG containing 23 signals of frequency 250 Hz, this amounts to representing each signal with 1000 values instead of 50000, and the whole EEG with 23000 values instead of 1150000, thus reducing the number of signal values by almost 98%. This dimension reduction may be exploited in future work to compress EEGs and store compressed representations of EEGs in the database instead of raw EEGs, as whole EEGs can be reconstructed from their fractal interpolations. Further work needs to be done on the compression of EEG data using fractal interpolation and on the loss of information that may result from this compression. Then, for each EEG channel and for each window, we compute the fractal dimension thanks to Theorem 21. The equation of Theorem 21 is solved heuristically for each 6-second interval of each EEG signal using a bisection algorithm. As we know that the fractal dimension of a time series is between 1 and 2, we search for a root of the equation of Theorem 21 in the interval [1, 2] and split the search interval in half at each iteration until the value of the root is approached within an ε-margin (ε being the admissible error on the desired root; we choose ε = 0.0001 in our experiments). Therefore, for each EEG channel, we have the same number of computed fractal dimensions as the number of windows. This feature extraction step (fractal dimension computation) further reduces the dimensionality of the analysed time series. In fact, the number of values representing the time series is divided by 5 in this step. This leads to representing a standard 20-minute EEG containing 23 signals of frequency 250 Hz by 4600 values instead of the initial 1150000 points.
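The bisection step can be sketched as follows (illustrative Python; d and a are the contraction factors and a_i parameters of one window's maps, and eps plays the role of ε):

```python
def fractal_dimension(d, a, eps=1e-4):
    """Solve sum_i |d_i| * a_i**(D-1) = 1 for D in [1, 2] by bisection
    (Theorem 21). d, a: sequences of the window's map parameters."""
    def g(D):
        return sum(abs(di) * ai ** (D - 1) for di, ai in zip(d, a)) - 1.0
    lo, hi = 1.0, 2.0
    while hi - lo > eps:
        mid = 0.5 * (lo + hi)
        # g decreases with D when 0 < a_i < 1, so keep the half containing the sign change
        if g(lo) * g(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)
```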
3.3 Similarity Matrix Computation
We only compare EEGs that have at least a subset of identical channels (i.e. channels with the same labels). When two EEGs have no channels (except the annotations channel) in common, the similarity measure between them is set to 1 (the farther (resp. closer) two EEGs are, the higher (resp. lower), i.e. the closer to 1 (resp. to 0), the similarity measure). If, for the two EEGs compared, the matching pairs of feature vectors (i.e. vectors made of
the fractal dimensions computed for each signal) do not have the same dimension, then the vector of higher dimension is approximated by a histogram and the M most frequent values according to the histogram (M being the dimension of the shorter vector) are taken as representatives of that vector; the distance between the two feature vectors is then approximated by the distance between the shorter feature vector and the vector formed with the M most frequent values of the longer vector. The similarity measure between two EEGs is given by:

\frac{1}{N} \sum_{i=1}^{N} \frac{d(ch_i^{EEG_1}, ch_i^{EEG_2}) − d_{min}}{d_{max} − d_{min}}

where N is the number of EEG channels, d(ch_i^{EEG_1}, ch_i^{EEG_2}) the distance between the fractal dimensions extracted from channels with the same label in the two EEGs compared, and d_{min} and d_{max} respectively the minimum and maximum distances between two EEGs in the analysed set. We choose as metrics (d) the Euclidean distance and the normalized mutual information.
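A sketch of this comparison (our own illustration; each EEG is assumed to be a dict mapping channel labels to vectors of per-window fractal dimensions, and `dist` is the chosen metric d):

```python
import numpy as np

def channel_distance(v1, v2, dist):
    """Distance between two per-channel fractal-dimension vectors; if lengths
    differ, the longer one is summarized by its M most frequent histogram values."""
    a, b = (v1, v2) if len(v1) <= len(v2) else (v2, v1)
    if len(a) != len(b):
        counts, edges = np.histogram(b, bins=len(b))
        centers = 0.5 * (edges[:-1] + edges[1:])
        b = centers[np.argsort(counts)[::-1][:len(a)]]   # M most frequent values
    return dist(np.asarray(a), np.asarray(b))

def eeg_similarity(eeg1, eeg2, dist, d_min, d_max):
    """Normalized average channel distance over shared channel labels;
    returns 1.0 if the two EEGs share no channels (assumes d_max > d_min)."""
    shared = (set(eeg1) & set(eeg2)) - {"annotations"}
    if not shared:
        return 1.0
    dists = [channel_distance(eeg1[ch], eeg2[ch], dist) for ch in sorted(shared)]
    return float(np.mean([(d - d_min) / (d_max - d_min) for d in dists]))
```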
4 Description of the Dataset and Experiments
We interpolate (with fractal interpolation, as described in Section 3) 476 EEGs (unprocessed and unnormalised) whose durations range from 1 minute 50 seconds to 5 hours 21 minutes and whose sizes are between 1133 KB and 138 MB. All signals in all these files have a frequency of 250 Hz. Of the files used, 260 have a duration between 15 and 30 minutes (54.6%), which is the most frequent duration range for EEGs, 40 files (8.4%) a duration below 15 minutes, and 176 files (37%) a duration above 30 minutes. Moreover, 386 files contain 23 signals (81.1%), 63 files 20 signals (13.2%), 13 files 19 signals (2.7%), 7 files 25 signals (1.5%), 3 files 28 signals (0.6%), 1 file 12 signals (0.2%), 2 files 13 signals (0.4%) and 1 file 2 signals (0.2%). The experiments were run on an openSuSe 10.3 (x86-64) server (kernel version 2.6.22.5-31) with 32 GB RAM and an Intel Quad-Core Xeon processor. The files for which the diagnosis conclusion is either unknown or known to be abnormal without any further details are not considered in the distance computation and clustering steps described in Section 3. This means that the distance computation and clustering steps are performed on a subset of 328 files of the original 476 files. The similarity matrix obtained is a 328 × 328 matrix. The files contained in the subset chosen for clustering can be separated into 4 classes: normal EEG (195 files, i.e. 59.5%), EEG of epilepsy (64 files, i.e. 19.5%), EEG of encephalopathy (31 files, i.e. 9.5%) and EEG of brain damage (vascular damage, infarct, or ischemia; 34 files, i.e. 10.4%). Figure 2 shows the plot of the envelope intensity versus the index for the Euclidean-distance-based similarity measure and the plot of the envelope intensity versus the index for the mutual-information-based similarity measure. The plot for the Euclidean-distance-based similarity matrix exhibits 2 distinct regions whereas the plot for the mutual-information-based similarity matrix exhibits 4 distinct regions. We therefore cluster the data first into 2 different clusters using the Euclidean-based similarity matrix and then into 4 clusters using the mutual-information-based matrix.
Fig. 2. Envelope intensity of the dissimilarity matrices: (a) Euclidean distance-based matrix, (b) mutual information-based matrix
As we can see, the mutual-information-based measure yields the correct number of clusters while the Euclidean-distance-based similarity measure is not spread enough to yield the correct number of clusters. We compare the performance of the IFS-based similarity measure with an autoregressive (AR)-based similarity measure inspired from [9]:
– An AR model is fitted to each of the signals of each of the EEG files considered (at this stage 476). The order of the AR model is selected using the AIC criterion; it is equal to 4 for our dataset.
– The LPC cepstrum coefficients are computed from the AR model fitted to each signal, using the formulas given in [9]. The number of coefficients selected is the greatest common divisor (PGCD) of the numbers of points of all signals from all files.
– The Euclidean distance, as well as the mutual information, between the computed cepstral coefficients are computed in the same way as the fractal-dimension-based distances for the subset of 328 files for which the diagnoses are known. The resulting similarity matrices (328 × 328 matrices) are used to perform k-medoid clustering.
Finally, we use the similarity matrices to cluster the EEGs (see Section 3.3).
5 Results
Figure 3 illustrates the relation between the duration of the EEG and the time it takes to interpolate EEGs. It shows that the increase of the fractal interpolation time with respect to the interpolated EEG’s duration is less than linear.
Fig. 3. Execution times of the fractal interpolation as a function of the EEG duration, compared to the AR modelling of the EEGs. The red triangles represent the fractal interpolation execution times, the blue crosses the AR modelling execution times, and the black stars the fit of the measured fractal interpolation execution times with the function 1.14145161064·(1 − exp(−(0.5x)^2.0)) + 275.735500586·(1 − exp(−(0.000274218988011·x)^2.12063087537)) using the Levenberg-Marquardt algorithm
In comparison, AR modelling execution times increase almost linearly with the EEG duration. Therefore, fractal interpolation is a scalable method and is more scalable than AR modelling. In particular, the execution times for files of durations between 15 and 30 minutes are between 8.8 seconds and 131.7 seconds, that is, execution times 6.8 to 204.5 times lower than the duration of the original EEGs. Furthermore, the method does not impose any condition on the signals to be compared, as it handles the cases where the EEGs to be compared have no or few common channels and have signals of different lengths. Moreover, fractal interpolation does not require model selection as AR modelling does, which considerably speeds up EEG interpolation. In addition, with our dataset, the computation of the Euclidean distance between the cepstrum coefficients calculated from the EEGs' AR models leads to a matrix of NaN values (the same happens when the mutual information is used instead of the Euclidean distance; all programs are written in Python 2.6): the AR modelling method is therefore less stable than the fractal interpolation-based method. Table 1 summarises the clustering results for all similarity matrices. The low sensitivity obtained for the abnormal EEGs (epilepsy, encephalopathy, brain damage) can be explained by the following reasons:
Table 1. Specificity and sensitivity of the EEG clusterings

                Specificity      Sensitivity
normal EEG      0.312            0.770833333333
abnormal EEG    0.770833333333   0.312

                Specificity      Sensitivity
normal EEG      0.297752808989   0.657534246575
epilepsy        0.65564738292    0.183006535948
encephalopathy  0.838709677419   0.051724137931
brain damage    0.818713450292   0.114285714286
– most of the misclassified abnormal EEGs represent mild forms of the pathology considered, so their deviation from a normal EEG is minimal;
– most of the misclassified abnormal EEGs (in particular for epilepsy and brain damage) exhibit abnormalities on only a restricted number of channels (localised versions of the pathologies considered). The similarity measures, giving equal weight to all channels, are not sensitive enough to abnormalities affecting one channel. In future work, we will explore the influence of weights on the clustering performance.
About 76% of the normal EEGs are well classified. The remaining EEGs are misclassified because they exhibit artifacts, age-specific patterns and/or sleep-specific patterns that distort the EEGs significantly enough to make them seem abnormal. Filtering artifacts before computing the similarity measures and incorporating metadata knowledge into the similarity measure would improve the clustering results.
6 Conclusion
In this paper, we considered the problem of defining a similarity measure for EEGs that would be generic enough to cluster EEGs without having to build an exponential number of disease-specific classifiers. We use fractal interpolation followed by fractal dimension computation to define a similarity measure. Not only does the fractal interpolation provide a very compact representation of EEGs (which may be used later on to compress EEGs) but it also yields execution times that grow less than linearly with the EEG duration and is therefore a highly scalable method. It is a method that can compare EEGs of different lengths containing at least a common subset of channels. It also overcomes several of the shortcomings of an AR modelling-based measure as it doesn’t require model selection and is more stable and scalable than AR modelling-based measures. Furthermore, the mutual-information based measure is more sensitive to the correct number of clusters than the Euclidean distance-based one. In future work, we will explore other entropy-based measures. It was also shown that the shortcomings of the similarity measure when it comes to clustering abnormal EEGs can be overcome through pre-processing the EEGs before interpolation to remove artifacts, tuning the weight parameters in the measure to account for small localised abnormalities and incorporating qualitative metadata knowledge to the measure. All those solutions constitute future work.
References
1. Accardo, A., Affinito, M., Carrozzi, M., Bouquet, F.: Use of the fractal dimension for the analysis of electroencephalographic time series. Biological Cybernetics 77(5), 339–350 (1997)
2. Barbará, D., Chen, P.: Using the fractal dimension to cluster datasets. In: KDD, pp. 260–264 (2000)
3. Barnsley, M.: Fractals everywhere. Academic Press Professional, Inc., San Diego (1988)
4. Climescu-Haulica, A.: How to Choose the Number of Clusters: The Cramer Multiplicity Solution. In: Decker, R., Lenz, H.J. (eds.) Advances in Data Analysis, Proceedings of the 30th Annual Conference of the Gesellschaft für Klassifikation e.V., Freie Universität Berlin, March 8-10. Studies in Classification, Data Analysis, and Knowledge Organization, pp. 15–22. Springer, Heidelberg (2006)
5. De Hoon, M., Imoto, S., Nolan, J., Miyano, S.: Open source clustering software. Bioinformatics 20, 1453–1454 (2004), http://portal.acm.org/citation.cfm?id=1092875.1092876
6. Eke, A., Herman, P., Kocsis, L., Kozak, L.: Fractal characterization of complexity in temporal physiological signals. Physiological Measurement 23(1), R1–R38 (2002)
7. Goh, C., Hamadicharef, B., Henderson, G.T., Ifeachor, E.C.: Comparison of Fractal Dimension Algorithms for the Computation of EEG Biomarkers for Dementia. In: Proceedings of the 2nd International Conference on Computational Intelligence in Medicine and Healthcare (CIMED 2005), Costa da Caparica, Lisbon, Portugal, June 29-July 1 (2005)
8. Hao, L., Ghodadra, R., Thakor, N.V.: Quantification of Brain Injury by EEG Cepstral Distance during Transient Global Ischemia. In: Proceedings - 19th International Conference - IEEE/EMBS, Chicago, IL, USA, October 30-November 2 (1997)
9. Kalpakis, K., Gada, D., Puttagunta, V.: Distance Measures for Effective Clustering of ARIMA Time-Series. In: ICDM 2001: Proceedings of the 2001 IEEE International Conference on Data Mining, pp. 273–280. IEEE Computer Society, Washington, DC (2001)
10. Lin, G., Chen, L.: A Grid and Fractal Dimension-Based Data Stream Clustering Algorithm. In: International Symposium on Information Science and Engineering, vol. 1, pp. 66–70 (2008)
11. Mazel, D.S., Hayes, M.H.: Fractal modeling of time-series data. In: Conference Record of the Twenty-Third Asilomar Conference on Signals, Systems and Computers, pp. 182–186 (1989)
12. Malcok, M., Aslandogan, Y.A., Yesildirek, A.: Fractal dimension and similarity search in high-dimensional spatial databases. In: IRI, pp. 380–384 (2006)
13. Pachori, R.B.: Discrimination between ictal and seizure-free EEG signals using empirical mode decomposition. Res. Let. Signal Proc. 2008, 1–5 (2008)
14. Sarkar, M., Leong, T.Y.: Characterization of medical time series using fuzzy similarity-based fractal dimensions. Artificial Intelligence in Medicine 27(2), 201–222 (2003)
15. Yan, G., Li, Z.: Using cluster similarity to detect natural cluster hierarchies. In: FSKD (2), pp. 291–295 (2007)
DISC: Data-Intensive Similarity Measure for Categorical Data
Aditya Desai, Himanshu Singh, and Vikram Pudi
International Institute of Information Technology-Hyderabad, Hyderabad, India
{aditya.desai,himanshu.singh}@research.iiit.ac.in, [email protected]
http://iiit.ac.in
Abstract. The concept of similarity is fundamentally important in almost every scientific field. Clustering, distance-based outlier detection, classification, regression and search are major data mining techniques which compute the similarities between instances and hence the choice of a particular similarity measure can turn out to be a major cause of success or failure of the algorithm. The notion of similarity or distance for categorical data is not as straightforward as for continuous data and hence, is a major challenge. This is due to the fact that different values taken by a categorical attribute are not inherently ordered and hence a notion of direct comparison between two categorical values is not possible. In addition, the notion of similarity can differ depending on the particular domain, dataset, or task at hand. In this paper we present a new similarity measure for categorical data DISC - Data-Intensive Similarity Measure for Categorical Data. DISC captures the semantics of the data without any help from domain expert for defining the similarity. In addition to these, it is generic and simple to implement. These desirable features make it a very attractive alternative to existing approaches. Our experimental study compares it with 14 other similarity measures on 24 standard real datasets, out of which 12 are used for classification and 12 for regression, and shows that it is more accurate than all its competitors. Keywords: Categorical similarity measures, cosine similarity, knowledge discovery, classification, regression.
1 Introduction
The concept of similarity is fundamentally important in almost every scientific field. Clustering, distance-based outlier detection, classification, regression and search are major data mining techniques which compute the similarities between instances, and hence the choice of a particular similarity measure can turn out to be a major cause of success or failure of the algorithm. For these tasks, the choice of a similarity measure can be as important as the choice of data representation or feature selection. Most algorithms typically treat the similarity computation as an orthogonal step and can make use of any measure. Similarity measures can be broadly divided into two categories: similarity measures for continuous data and for categorical data.
The notion of a similarity measure for continuous data is straightforward due to the inherent numerical ordering. The Minkowski distance and its special case, the Euclidean distance, are the two most widely used distance measures for continuous data. However, the notion of similarity or distance for categorical data is not as straightforward as for continuous data and hence is a major challenge. This is due to the fact that the different values that a categorical attribute takes are not inherently ordered and hence a notion of direct comparison between two categorical values is not possible. In addition, the notion of similarity can differ depending on the particular domain, dataset, or task at hand. Although there is no inherent ordering in categorical data, there are other factors, like co-occurrence statistics, that can be effectively used to define what should be considered more similar and vice-versa. This observation has motivated researchers to come up with data-driven similarity measures for categorical attributes. Such measures take into account the frequency distribution of different attribute values in a given data set, but most of these algorithms fail to capture any other feature of the dataset apart from this frequency distribution. One solution to the problem is to build a common repository of similarity measures for all commonly occurring concepts. As an example, let the similarity values for the concept "color" be determined. Now, consider 3 colors red, pink and black, and the two following domains:
– Domain I: The domain is, say, determining the response of the cones of the eye to a color; then it is obvious that the cones behave largely similarly to red and pink as compared to black. Hence the similarity between red and pink must be high compared to the similarity between red and black or pink and black.
– Domain II: Consider another domain, for example car sales data. In such a domain, it may be known that pink cars are extremely rare as compared to red and black cars, and hence the similarity between red and black must be larger than that between red and pink or black and pink in this case.
Thus, the notion of similarity varies from one domain to another and hence the assignment of similarity must involve a thorough understanding of the domain. Ideally, the similarity notion is defined by a domain expert who understands the domain concepts well. However, in many applications domain expertise is not available and the users don't understand the interconnections between objects well enough to formulate exact definitions of similarity or distance. In the absence of domain expertise it is conceptually very hard to come up with a domain-independent solution for similarity. This makes it necessary to define a similarity measure based on latent knowledge available from the data instead of a fit-to-all measure, and is the major motivation for this paper. In this paper we present a new similarity measure for categorical data, DISC (Data-Intensive Similarity Measure for Categorical Data). DISC captures the semantics of the data without any help from domain experts for defining similarity. It achieves this by capturing the relationships that are inherent in the data itself, thus making the similarity measure "data-intensive". In addition, it is generic and simple to implement.
The remainder of the paper is organized as follows. In Section 2 we discuss related work, and in Section 3 the problem formulation. We present the DISC algorithm in Section 4, followed by the experimental evaluation and results in Section 5. Finally, in Section 6, we summarize the conclusions of our study and identify future work.
1.1 Key Contributions
– Introducing a notion of similarity between two values of a categorical attribute based on co-occurrence statistics.
– Defining a valid similarity measure for capturing such a notion which can be used out-of-the-box for any generic domain.
– Experimentally validating that such a similarity measure provides a significant improvement in accuracy when applied to classification and regression on a wide array of dataset domains. The experimental validation is especially significant since it demonstrates a reasonably large improvement in accuracy by changing only the similarity measure while keeping the algorithm and its parameters constant.
2 Related Work
Determining similarity measures for categorical data is a much studied field as there is no explicit notion of ordering among categorical values. Sneath and Sokal were among the first to put together and discuss many of the categorical similarity measures and discuss this in detail in their book [2] on numerical taxonomy. The specific problem of clustering categorical data has been actively studied. There are several books [3,4,5] on cluster analysis that discuss the problem of determining similarity between categorical attributes. The problem has also been studied recently in [17,18]. However, most of these approaches do not offer solutions to the problem discussed in this paper, and the usual recommendation is to “binarize” the data and then use similarity measures designed for binary attributes. Most work has been carried out on development of clustering algorithms and not similarity functions. Hence these works are only marginally or peripherally related to our work. Wilson and Martinez [6] performed a detailed study of heterogeneous distance functions (for categorical and continuous attributes) for instance based learning. The measures in their study are based upon a supervised approach where each data instance has class information in addition to a set of categorical/continuous attributes. There have been a number of new data mining techniques for categorical data that have been proposed recently. Some of them use notions of similarity which are neighborhood-based [7,8,9], or incorporate the similarity computation into the learning algorithm[10,11]. These measures are useful to compute the neighborhood of a point and neighborhood-based measures but not for calculating similarity between a pair of data instances. In the area of information retrieval, Jones et al. [12] and Noreault et. al [13] have studied several similarity measures.
Another comparative empirical evaluation for determining similarity between fuzzy sets was performed by Zwick et al. [14], followed by several others [15,16]. In our experiments we have compared our approach with the methods discussed in [1], which provides a recent exhaustive comparison of similarity measures for categorical data.
3 Problem Formulation
In this section we discuss the necessary conditions for a valid similarity measure. Later, in Section 4.5, we describe how DISC satisfies these requirements and prove the validity of our algorithm. The following conditions need to hold for a distance metric d to be valid, where d(x, y) is the distance between x and y:
1. d(x, y) ≥ 0
2. d(x, y) = 0 if and only if x = y
3. d(x, y) = d(y, x)
4. d(x, z) ≤ d(x, y) + d(y, z)
In order to come up with conditions for a valid similarity measure we use sim = 1/(1 + dist), a distance-similarity mapping used in [1]. Based on this mapping we come up with the following definitions for valid similarity measures:
1. 0 ≤ Sim(x, y) ≤ 1
2. Sim(x, y) = 1 if and only if x = y
3. Sim(x, y) = Sim(y, x)
4. 1/Sim(x, y) + 1/Sim(y, z) ≥ 1 + 1/Sim(x, z)
where Sim(x, y) is the similarity between x and y.
4 DISC Algorithm
In this section we present the DISC algorithm. First, in Section 4.1, we present the motivation for our algorithm, followed by the data structure description in Section 4.2 and a brief overview of the algorithm in Section 4.3. We then describe the algorithm for similarity matrix computation in Section 4.4. Finally, in Section 4.5, we validate our similarity measure.
4.1 Motivation and Design
As can be seen from the related work, current similarity (distance) measures for categorical data only examine the number of distinct categories and their counts, without looking at co-occurrence statistics with other dimensions in the data. Thus, there is a high possibility that the latent information that comes along is lost during the process of assigning similarities. Consider the example in Table 1: let there be a 3-column dataset where the Brand of a car and its Color are independent variables and the Price of the car is a dependent variable. Now there are three brands a, b, c with average prices 49.33, 32.33, 45.66. It can be intuitively said, based on the available information, that the similarity between a and c is greater than that between the categories a and b. This is true in
Table 1. Illustration

Brand  Color  Price
a      red    50
a      green  48
a      blue   50
b      red    32
b      green  30
b      blue   35
c      red    47
c      green  45
c      blue   45
real life, where a, c, b may represent low-end, medium-end and high-end cars, and hence the similarity between a low-end and a medium-end car will be more than the similarity between a low-end and a high-end car. Now the other independent variable is Color. The average prices corresponding to the three colors, namely red, green and blue, are 43, 41 and 43.33. As can be seen, there is a small difference in their prices, which is in line with the fact that the cost of a car is very loosely related to its color. It is important to note that a notion of similarity for categorical variables has a cognitive component to it and as such each one is debatable. However, the above explained notion of similarity is the one that best exploits the latent information for assigning similarity and will hence give predictors of high accuracy. This claim is validated by the experimental results. Extracting these underlying semantics by studying co-occurrence data forms the motivation for the algorithm presented in this section.
4.2 Data Structure Description
We first construct a data structure called the categorical information table (CI). The function of the CI table is to provide a quick lookup for information related to the co-occurrence statistics. Thus, for the above example, CI[Brand:a][Color:red] = 1, as Brand:a co-occurs with Color:red in only a single instance. For a categorical-numeric pair, e.g. CI[Brand:a][Price] = 49.33, the value is the mean value of the attribute Price for instances whose value for Brand is a. Now, for every value v of each categorical attribute A_k a representative point τ(A_k : v) is defined. The representative point is a vector consisting of the means of all attributes other than A_k for instances where attribute A_k takes value v:

τ(A_k : v) = < μ(A_k : v, A_1), ..., μ(A_k : v, A_d) >    (1)

It may be noted that the term μ(A_k : v, A_k) is skipped in the above expression. As there is no standard notion of mean for categorical attributes, we define it as

μ(A_k : v, A_i) = < CI[A_k : v][A_i : v_i1], ..., CI[A_k : v][A_i : v_in] >    (2)
474
A. Desai, H. Singh, and V. Pudi
where domain(Ai ) = {vi1 , . . . , vin } It can thus be seen that, the mean itself is a point in a n-dimensional space having dimensions as vi1 ,. . . ,vin with magnitudes: < CI[Ak : v][Ai : vi1 ], . . . , CI[Ak : v][Ai : vin ] >. Initially all distinct values belonging to the same attribute are conceptually vectors perpendicular to each other and hence the similarity between them is 0. For, the given example, the mean for dimension Color when Brand : a is denoted as μ(Brand : a, Color). As defined above, the mean in a categorical dimension is itself a point in a n-dimensional space and hence, the dimensions of mean for the attribute Color are red, blue, green and hence μ(Brand : a, Color) = {CI[Brand : a][Color : red], CI[Brand : a][Color : blue], CI[Brand : a][Color : green]} Similarly, μ(Brand : a, P rice) = {CI[Brand : a][P rice]} Thus the representative point for the value a of attribute Brand is given by, τ (Brand : a) =< μ(Brand : a, Color), μ(Brand : a, P rice) > 4.3
Algorithm Overview
Initially we calculate the representative points for all values of all attributes. We then initialize similarity in a manner similar to the overlap similarity measure where matches are assigned similarity value 1 and the mismatches are assigned similarity value 0. Using the representative points calculated above, we assign a new similarity between each pair of values v, v belonging to the same attribute Ak as equal to the average of cosine similarity between their means for each dimension. Now the cosine similarity between v and v in dimension Ai is denoted by CS(v : v , Ai ) and is equal to the cosine similarity between vectors μ(Ak : v, Ai ) and μ(Ak : v , Ai ). Thus, similarity between Ak : v and Ak : v is: d l=0,l=i CS(v : v , Al ) d−1 Thus, for the above example, the similarity between Brand:a and Brand:b is the average of cosine similarity between their respective means in dimensions Color rice) and Price. Thus Sim(a, b) is given as: CS(a:b,Color)+CS(a:b,P 2 An iteration is said to have been completed, when similarity between all pairs of values belonging to the same attribute (for all attributes) are computed using the above methodology. These, new values are used for cosine similarity computation in the next iteration. 4.4
DISC Computation
In this section, we describe the DISC algorithm and hence the similarity matrix construction. The similarity matrix construction using DISC is described as follows: 1. The similarity matrix is initialized in a manner similar to overlap similarity measure where ∀i,j,k Sim(vij , vik ) = 1, if vij = vik and Sim(vij , vik ) = 0, if vij = vik
DISC: Data-Intensive Similarity Measure for Categorical Data
475
Table 2. Cosine Similarity computation between vij , vik Similaritym =
1−
|CI[Ai :vij ][Am ]−CI[Ai :vik ][Am ]| M ax[Am ]−M in[Am ]
; if Am is N umeric
CosineP roduct(CI[Ai : vij ][Am ], CI[Ai : vik ][Am ]); if Am is Categorical where CosineP roduct(CI[Ai : vij ][Am ], CI[Ai : vik ][Am ]) is def ined as f ollows : l ] ∗ Sim(vm¯ l , vml ) vml ,v ¯Am CI[Ai : vij ][Am : vml ] ∗ CI[Ai : vik ][Am : Vm¯ ml
N ormalV ector1 ∗ N ormalV ector2
N ormalV ector1 = ( vml ,v ¯Am CI[Ai : vij ][Am : vml ] ∗ CI[Ai : vij ][Am , vm¯l ] ∗ Sim(vml , vm¯l ))1/2 ml N ormalV ector2 = ( vml ,v ¯Am CI[Ai : vik ][Am : vml ] ∗ CI[Ai : vik ][Am , vm¯l ] ∗ Sim(vml , vm¯l ))1/2 ml d 1 Sim(vij , vik ) = d−1 m=1,m=i Similaritym
2. Consider a training dataset to be consisting of n tuples. The value of the feature variable Aj corresponding to the ith tuple is given as T rainij . We construct a data-structure “Categorical Information” which for any categorical value (vil ) of attribute Ai returns number of co-occurrences of value vjk taken by feature variable Aj if Aj is categorical and returns the mean value of feature variable Aj for the corresponding set of instances if it is numeric. Let this data-structure be denoted by CI. The value corresponding to the number of co-occurrences of categorical value vjk when feature variable Ai takes value vil is given by CI[Ai , vil ][Aj , vjk ] when Aj is categorical. Also, when Aj is numeric, CI[Ai , vil ][Aj ] corresponds to the mean of values taken by attribute Aj when Ai takes value vil . 3. The Sim(vij ,vik ) (Similarity between categorical values vij and vik ) is now calculated as the average of the per-attribute cosine similarity between their means (Similaritym ), where the means have a form as described above. The complicated nature of cosine product arises due to the fact that, the transformed space after the first iteration has dimensions which are no longer orthogonal (i.e. Sim(vij , vik ) is no longer 0). 4. The matrix is populated using the above equation for all combinations ∀i,j,k Sim(vij , vik ). To test the effectiveness of the similarity matrix, we plug the similarity values in a classifier (the nearest neighbour classifier in our case) in case of classification and compute its accuracy on a validation set. If the problem domain is regression we plug in the similarity values into a regressor (the nearest neighbour regressor in our case) and compute the corresponding root mean square error. Thus, such an execution of 3 followed by 4 is termed an iteration. 5. The step 3 is iterated on again using the new similarity values until the accuracy parameter stops increasing. The matrix obtained at this iteration is the final similarity matrix that is used for testing. (In case of regression we stop when the root mean square error increases.) In addition, the authors have observed that, most of the improvement takes place in the first iteration and hence in domains like clustering
476
A. Desai, H. Singh, and V. Pudi
(unsupervised) or in domains with tight limits on training time the algorithm can be halted after the first iteration. 4.5
Validity of Similarity Measure
The similarity measure proposed in this paper is basically a mean of cosine similarities derived for individual dimensions in non-orthogonal spaces. The validity of the similarity measure can now be argued as follows: 1. As the similarity measure is a mean of cosine similarities which have a range from 0-1, it is implied that the range of values output by the similarity measure will be between 0-1 thus satisfying the first constraint. 2. For the proposed similarity measure Sim(X, Y ) = 1, if and only if Simk (Xk , Yk ) = 1 for all feature variables Ak . Now, constraint 2 will be violated if X = Y and Sim(X, Y ) = 1. This implies that there exists an Xk , Yk such that Xk = Yk and for which Sim(Xk , Yk ) = 1. Now for Sim(Xk , Yk ) = 1 implies cosine product of CI[Ak : Xk ][Am ] and CI[Ak : Yk ][Am ] is 1 for all Am which implies that CI[Ak : Xk ][Am ], CI[Ak : Yk ][Am ] are parallel and hence can be considered to be equivalent with respect to the training data. 3. As cosine product is commutative, the third property holds implicitly. 4. It may be noted that the resultant similarity is a mean of similarities computed for each dimension. Also, the similarity for each dimension is in essence a cosine product and hence, the triangle inequality holds for each component of the sum. Thus the fourth property is satisfied.
5
Experimental Study
In this section, we describe the pre-processing steps and the datasets used in Section 5.1 followed by experimental results in Section 5.2. Finally in Section 5.3 we provide a discussion on the experimental results. 5.1
Pre-processing and Experimental Settings
For our experiments we used 24 datasets out of which 12 were used for classification and 12 for regression. We compare our approach with the approaches discussed in [1], which provides a recent exhaustive comparison of similarity measures for categorical data. Eleven of the datasets used for classification were purely categorical and one was numeric (Iris). Different methods can be used to handle numeric attributes in datasets like discretizing the numeric variables using the concept of minimum description length [20] or equi-width binning. Another possible way to handle a mixture of attributes is to compute the similarity for continuous and categorical attributes separately, and then do a weighted aggregation. For our experiments we used MDL for discretizing numeric variables for classification datasets. Nine of the datasets used for regression were purely numeric, two (Abalone and Auto Mpg) were mixed while one (Servo) was purely categorical. It may be noted that the datasets used for regression were discretized using equi-width
DISC: Data-Intensive Similarity Measure for Categorical Data
477
binning using the following weka setting: “weka.f ilters.unsupervised.attribute. Discretize − B10 − M − 1.0 − Rf irst − last” The k-Nearest Neighbours (kN N ) was implemented with number of neighbours 10. The weight associated with each neighbour was the similarity between the neighbour and the input tuple. The class with the highest weighted votes was the output class for classification while the output for regression was a weighted sum of the individual responses. The results have been presented for 10-folds cross-validation. Also, for our experiments we used the entire train set as the validation set. The numbers in brackets indicate the rank of DISC versus all other competing similarity measures. For classification, the values indicate the accuracy of the classifier where a high value corresponds to high percentage accuracy and hence such a similarity measure is assigned a better (higher) rank. On the other hand, for regression Root Mean Square Error (RMSE) value has been presented where a comparatively low value indicates lower error and better performance of the predictor and hence such a similarity measure is assigned a better rank. It may be noted that a rank of 1 indicates best performance with the relative performance being poorer for lower ranks. 5.2
Experimental Results
The experimental results for classification and regression are presented in Table 3, 4 and Table 5, 6 respectively. In these tables each row represents competing similarity measure and the column represents different datasets. In Table 3 and 4, each cell represents the accuracy for the corresponding dataset and similarity measure respectively. In Table 5 and 6, each cell represents the root mean square error (RMSE) for the corresponding dataset and similarity measure respectively. 5.3
Discussion of Results
As can be seen from the experimental results, DISC is the best similarity measure for classification for all datasets except Lymphography, Primary Tumor and Hayes Roth Test where it is the third best for the first two and the second best for the last one. On the basis of overall mean accuracy, DISC outperforms the nearest competitor by about 2.87% where we define overall mean accuracy as as the mean of accuracies over all classification datasets considered for our experiments. For regression, DISC is the best performing similarity measure on the basis of Root Mean Square Error (RMSE) for all datasets. For classification datasets like Iris, Primary Tumor and Zoo the algorithm halted after the 1st iteration while for datasets like Balance, Lymphography, Tic-Tac-Toe, Breast Cancer the algorithm halted after the 2nd iteration. Also, for Car-Evaluation, Hayes Roth, Teaching Assistant and Nursery the algorithm halted after the 3rd iteration while it halted after the 4th iteration for Hayes Roth Test. For regression, the number of iterations was less than 5 for all datasets except Compressive Strength for which it was 9. Thus, it can be seen that the number of iterations for all datasets is small. Also, the authors observed that
478
A. Desai, H. Singh, and V. Pudi Table 3. Accuracy for k-NN with k = 10
Dataset Sim. Measure DISC Overlap Eskin IOF OF Lin Lin1 Goodall1 Goodall2 Goodall3 Goodall4 Smirnov Gambaryan Burnaby Anderberg
Balance 90.4(1) 81.92 81.92 79.84 81.92 81.92 81.92 81.92 81.92 81.92 81.92 81.92 81.92 81.92 81.92
Breast Cancer 76.89(1) 75.81 73.28 76.89 74.0 74.72 75.09 74.36 73.28 73.64 74.72 71.48 76.53 70.39 72.2
Car Evaluation 96.46(1) 92.7 91.2 91.03 90.85 92.7 90.85 90.85 90.85 90.85 91.03 90.85 91.03 90.85 90.85
Hayes Roth 77.27(1) 64.39 22.72 63.63 17.42 71.96 18.93 72.72 59.09 39.39 53.78 59.84 53.03 19.69 25.0
Iris 96.66(1) 96.66 96.0 96.0 95.33 95.33 94.0 95.33 96.66 95.33 96.0 94.0 96.0 95.33 94.0
Lymphography 85.13(3) 81.75 79.72 81.75 79.05 84.45 82.43 86.48 81.08 85.13 81.08 85.81 82.43 75.0 80.4
Table 4. Accuracy for k-NN with k = 10 Dataset Primary Hayes Sim. Measure Tumor Roth Test DISC 41.66(3) 89.28(2) Overlap 41.66 82.14 Eskin 41.36 75.0 IOF 38.98 71.42 OF 40.17 60.71 Lin 41.66 67.85 Lin1 42.26 42.85 Goodall1 43.15 89.28 Goodall2 38.09 92.85 Goodall3 41.66 71.42 Goodall4 32.73 82.14 Smirnov 42.55 78.57 Gambaryan 39.58 89.28 Burnaby 3.86 60.71 Anderberg 37.79 53.57
Tic Tac Toe 100.0(1) 92.48 94.46 100.0 84.96 95.82 82.56 97.07 91.54 95.51 96.24 98.74 98.74 83.29 89.14
Zoo
Teaching Nursery Mean Assist. Accuracy
91.08(1) 91.08 90.09 90.09 89.1 90.09 91.08 89.1 88.11 89.1 89.1 89.1 90.09 71.28 90.09
58.94(1) 50.33 50.33 47.01 43.7 56.95 54.96 51.65 52.98 50.99 55.62 54.3 50.33 40.39 50.33
98.41(1) 94.75 94.16 94.16 95.74 96.04 93.54 95.74 95.74 95.74 94.16 95.67 94.16 90.85 95.74
83.51(1) 78.81 74.19 77.57 71.08 79.13 70.87 80.64 78.52 75.89 77.38 78.57 78.59 65.30 71.75
the major bulk of the accuracy improvement is achieved with the first iteration and hence for domains with time constraints in training the algorithm can be halted after the first iteration. The reason for the consistently good performance can be attributed to the fact that a similarity computation is a major component in nearest neighbour classification and regression techniques, and DISC captures similarity accurately and efficiently in a data driven manner.
DISC: Data-Intensive Similarity Measure for Categorical Data
479
Table 5. RMSE for k-NN with k = 10 Dataset DISC Overlap Eskin IOF OF Lin Lin1 Goodall1 Goodall2 Goodall3 Goodall4 Smirnov Gambaryan Burnaby Anderberg
Comp. Strength 4.82(1) 6.3 6.58 6.18 6.62 6.03 7.3 6.66 6.37 6.71 5.98 6.89 6.01 6.63 7.15
Flow
Abalone
Bodyfat
Housing
Whitewine
13.2(1) 15.16 16.0 15.53 14.93 16.12 16.52 14.97 15.09 14.96 15.67 15.5 15.46 15.23 15.16
2.4(1) 2.44 2.45 2.42 2.41 2.4 2.41 2.41 2.43 2.41 2.47 2.4 2.46 2.41 2.42
0.6(1) 0.65 0.66 0.76 0.66 0.63 0.87 0.64 0.66 0.65 0.71 0.67 0.67 0.65 0.67
4.68(1) 5.4 6.0 5.48 5.27 5.3 5.41 5.27 5.33 5.27 6.4 5.17 5.73 5.32 5.84
0.74(1) 0.74 0.77 0.75 0.75 0.74 0.74 0.74 0.75 0.74 0.78 0.74 0.76 0.74 0.75
Table 6. RMSE for k-NN with k = 10 Dataset DISC Overlap Eskin IOF OF Lin Lin1 Goodall1 Goodall2 Goodall3 Goodall4 Smirnov Gambaryan Burnaby Anderberg
Slump 6.79(1) 7.9 8.12 8.11 7.72 8.33 8.42 7.76 7.82 7.75 7.87 8.22 7.8 7.89 7.94
Servo 0.54(1) 0.78 0.77 0.77 0.8 0.76 1.1 0.77 0.81 0.78 0.95 0.78 0.83 0.8 0.9
Redwine 0.65(1) 0.67 0.68 0.68 0.68 0.67 0.68 0.66 0.67 0.67 0.71 0.69 0.69 0.68 0.7
Forest Fires 65.96(1) 67.13 67.49 67.95 67.76 67.16 67.96 67.97 68.64 68.48 70.28 67.07 69.54 67.73 66.63
Concrete 10.29(1) 11.61 11.15 11.36 12.55 10.99 12.16 11.45 12.21 11.52 12.96 11.59 12.38 12.62 12.66
Auto Mpg 2.96(1) 3.58 3.98 3.71 3.3 3.74 3.89 3.5 3.39 3.39 3.92 3.39 3.75 3.28 3.53
The computational complexity for determining the similarity measure is equivalent to the computational complexity of computing cosine similarity for each pair of values belonging to the same categorical attribute. Let the number of pairs of values, the number of tuples, number of attributes and the average number of values per attribute be V , n, d and v respectively. It can be seen that, construction of categorical collection is O(nd). Also, for all pairs of values V, we compute the similarity as the mean of cosine similarity of their representative points for each dimension. This is essentially (v 2 d) for each pair and hence the computationally complexity is O(V v 2 d) and hence the overall complexity
480
A. Desai, H. Singh, and V. Pudi
is O(nd + V v 2 d). Once, the similarity values are computed, using them in any classification, regression or a clustering task is a simple table look up and is hence O(1).
6
Conclusion
In this paper we have presented and evaluated DISC, a similarity measure for categorical data. DISC is data intensive, generic and simple to implement. In addition to these features, it doesn’t require any domain expert’s knowledge. Finally our algorithm was evaluated against 14 competing algorithms on 24 standard real-life datasets, out of which 12 were used for classification and 12 for regression. It outperformed all competing algorithms on almost all datasets. The experimental results are especially significant since it demonstrates a reasonably large improvement in accuracy by changing only the similarity measure while keeping the algorithm and its parameters constant. Apart from classification and regression, similarity computation is a pivotal step in a number of application such as clustering, distance-based outliers detection and search. Future work includes applying our algorithm for these techniques also. We also intend to develop a weighing measure for different dimensions for calculating similarity which will make the algorithm more robust.
References 1. Boriah, S., Chandola, V., Kumar, V.: Similarity Measures for Categorical Data: A Comparative Evaluation. In: Proceedings of SDM 2008. SIAM, Atlanta (2008) 2. Sneath, P.H.A., Sokal, R.R.: Numerical Taxonomy: The Principles and Practice of Numerical Classification. W. H. Freeman and Company, San Francisco (1973) 3. Anderberg, M.R.: Cluster Analysis for Applications. Academic Press, London (1973) 4. Jain, A.K., Dubes, R.C.: Algorithms for Clustering Data. Prentice-Hall, Englewood Cliffs (1988) 5. Hartigan, J.A.: Clustering Algorithms. John Wiley & Sons, New York (1975) 6. Wilson, D.R., Martinez, T.R.: Improved heterogeneous distance functions. J. Artif. Intell. Res. (JAIR) 6, 1–34 (1997) 7. Biberman, Y.: A context similarity measure. In: Bergadano, F., De Raedt, L. (eds.) ECML 1994. LNCS, vol. 784, pp. 49–63. Springer, Heidelberg (1994) 8. Das, G., Mannila, H.: Context-based similarity measures for categorical databases. ˙ In: Zighed, D.A., Komorowski, J., Zytkow, J.M. (eds.) PKDD 2000. LNCS (LNAI), vol. 1910, pp. 201–210. Springer, Heidelberg (2000) 9. Palmer, C.R., Faloutsos, C.: Electricity based external similarity of categorical attributes. In: Whang, K.-Y., Jeon, J., Shim, K., Srivastava, J. (eds.) PAKDD 2003. LNCS (LNAI), vol. 2637, pp. 486–500. Springer, Heidelberg (2003) 10. Huang, Z.: Extensions to the k-means algorithm for clustering large data sets with categorical values. Data Mining and Knowledge Discovery 2(3), 283–304 (1998) 11. Ganti, V., Gehrke, J., Ramakrishnan, R.: CACTUS–clustering categorical data using summaries. In: KDD 1999. ACM Press, New York (1999) 12. Jones, W.P., Furnas, G.W.: Pictures of relevance: a geometric analysis of similarity measures. J. Am. Soc. Inf. Sci. 38(6), 420–442 (1987)
DISC: Data-Intensive Similarity Measure for Categorical Data
481
13. Noreault, T., McGill, M., Koll, M.B.: A performance evaluation of similarity measures, document term weighting schemes and representations in a boolean environment. In: SIGIR 1980: Proceedings of the 3rd Annual ACM Conference on Research and Development in Information Retrieval, Kent, UK, pp. 57–76. Butterworth & Co. (1981) 14. Zwick, R., Carlstein, E., Budescu, D.V.: Measures of similarity among fuzzy concepts: A comparative analysis. International Journal of Approximate Reasoning 1(2), 221–242 (1987) 15. Pappis, C.P., Karacapilidis, N.I.: A comparative assessment of measures of similarity of fuzzy values. Fuzzy Sets and Systems 56(2), 171–174 (1993) 16. Wang, X., De Baets, B., Kerre, E.: A comparative study of similarity measures. Fuzzy Sets and Systems 73(2), 259–268 (1995) 17. Gibson, D., Kleinberg, J.M., Raghavan, P.: Clustering categorical data: An approach based on dynamical systems. VLDB Journal 8(34), 222–236 (2000) 18. Guha, S., Rastogi, R., Shim, K.: ROCK–a robust clusering algorith for categorical attributes. In: Proceedings of IEEE International Conference on Data Engineering (1999) 19. Witten, I.H., Frank, E.: Data Mining: Practical machine learning tools and techniques, 2nd edn. Morgan Kaufmann, San Francisco (2005) 20. Fayyad, U.M., Irani, K.B.: On the handling of continuous-valued attributes in decision tree generation. Machine Learning 8, 87–102 (1992)
ListOPT: Learning to Optimize for XML Ranking Ning Gao1 , Zhi-Hong Deng1 2 , Hang Yu1 , and Jia-Jian Jiang1
1 Key Laboratory of Machine Perception (Ministry of Education), School of Electronic Engineering and Computer Science, Peking University 2 The State Key Lab of Computer Science, Institute of Software, Chinese Academy of Sciences, Beijing 100190, China
Abstract. Many machine learning classification technologies such as boosting, support vector machine or neural networks have been applied to the ranking problem in information retrieval. However, since the purpose of these learning-torank methods is to directly acquire the sorted results based on the features of documents, they are unable to combine and utilize the existing ranking methods proven to be e ective such as BM25 and PageRank. To solve this defect, we conducted a study on learning-to-optimize, which is to construct a learning model or method for optimizing the free parameters in ranking functions. This paper proposes a listwise learning-to-optimize process ListOPT and introduces three alternative di erentiable query-level loss functions. The experimental results on the XML dataset of Wikipedia English show that these approaches can be successfully applied to tuning the parameters used in an existing highly cited ranking function BM25. Furthermore, we found that the formulas with optimized parameters indeed improve the e ectiveness compared with the original ones. Keywords: learning-to-optimize, ranking, BM25, XML.
1 Introduction Search engines have become an indispensable part of life and one of the key issues on search engine is ranking. Given a query, the ranking modules can sort the retrieval documents for maximally satisfying the user’s needs. Traditional ranking methods aim to compute the relevance of a document to a query, according to the factors, term frequencies and links for example. The search result is a ranked list in which the documents are sequenced by their relevance score in descending order. These kinds of methods include the content based functions such as TF*IDF [1] and BM25 [2], and link based functions such as PageRank [3] and HITS [4]. Recently, machine learning technologies have been successfully applied to information retrieval, known and named as “learning-to-rank”. The main procedure of “learningto-rank” is as follow: In learning module, a set of queries is given, and each of the queries is associated with a ground-truth ranking list of documents. The process targets Corresponding author. J.Z. Huang, L. Cao, and J. Srivastava (Eds.): PAKDD 2011, Part II, LNAI 6635, pp. 482–492, 2011. c Springer-Verlag Berlin Heidelberg 2011
ListOPT: Learning to Optimize for XML Ranking
483
at creating a ranking model that can precisely predict the order of documents in the ground-truth list. Many learning-to-rank approaches have been proposed and based on the di erences of their learning samples, these methods can be classified into three categories [5]: pointwise, pairwise and listwise. Taking single document as learning object, the pointwise based methods intent to compute the relevance score of each document with respect to their closeness to the ground-truth. On the other side, pairwise based approaches take the document pair as learning sample, and rephrase the learning problem as classification problem. Lisewise based approaches take a ranked list as learning sample, and measure the di erences between the current result list and the ground-truth list via using a loss function. The learning purpose of listwise methods is to minimize the loss. The experimental results in [5] [11] [12] show that the listwise based methods perform the best among these three kinds of methods. It is worth noting that, from the perspective of ranking, the aforementioned learningto-rank methods belong to the learning based ranking technologies. Here the search results are directly obtained from the learning module, without considering the traditional content based or link based ranking functions. However, there is no evidence to confirm that the learning based methods perform better than all the other classic content based or link based methods. Accordingly, to substitute the other two kinds of ranking technologies with the learning based methods might not be appropriate. We hence consider a learning-to-optimize method ListOPT that can combine and utilize the benefits of learning-to-rank methods and traditional content based methods. Here the ranking method is the extension to the widely known ranking function BM25. Due to previous studies, experiments are conducted on selecting the parameters of BM25 with the best performance, typically after thousands of runs. However, this simple but exhaustive procedure is only applicable to the functions with few free parameters. Besides, whether the best parameter values are in the testing set is also under suspect. To attack this defect, a listwise learning method to optimize the free parameters is introduced. Same as learning-to-rank methods, the key issue of learning-to-optimize method is the definition of loss function. In this paper, we discuss the e ect of three distinct definition of loss in the learning process and the experiments show that all three loss functions converge. The experiments also reveal that the ranking function using tuned parameter set indeed performs better. The primary contributions of this paper include: (1) proposed a learning-to-optimize method which combine and utilize the traditional ranking function BM25 and listwise learning-to-rank method, (2) introduced the definition of three query-level loss functions on the basis of cosine similarity, Euclidean distance and cross entropy, confirmed to converge by experiments, (3) the verified the e ectiveness of the learning-to-optimize approach on a large XML dataset Wikipedia English[6]. The paper is organized as follows. In section 2, we introduce the related work. Section 3 gives the general description on learning-to-optimize approach ListOPT. The definition of the three loss functions are discussed in section 4. Section 5 reports our experimental results. Section 6 is the conclusion and future work.
484
N. Gao et al.
2 Related Work 2.1 Learning-to-Rank In recent years, many machine learning methods were applied to the problem of ranking for information retrieval. The existing learning-to-rank methods fall into three categories, pointwise, pairwise and listwise. The pointwise approaches [7] are firstly proposed, transforming the ranking problem into regression or classification on single candidate documents. On the other side, pairwise approaches, published later, regard the ranking process as a classification of document pairs. For example, given a query Q and an arbitrary document pair P (d1 d2 ) in the data collection, where di means the i-th candidate document, if d1 shows higher relevance than d2 , then object pair P is set as (p) 0, otherwise P is marked as (p) 0. The advantage of pointwise and pairwise approaches is that the existing classification or regression theories can be directly applied. For instance, borrowing support vector machine, boosting and neural network as the classification model leads to the methods of Ranking SVM [8], RankBoost [9] and RankNet [10]. However, the objective of pointwise and pairwise learning methods is to minimize errors in classification of single document or document pairs rather than to minimize errors in ranking of documents. To overcome this drawback of the aforementioned two approaches, listwise methods, such as ListNet [5], RankCosine [11] and ListMLE [12], are proposed. In lisewise approaches, the learning object is the result list and various kinds of loss functions are defined to measure the similarity of the predict result list and the ground-truth result list. ListNet, the first listwise approach proposed by Cao et al., uses the cross entropy as loss function. Qin et al. discussed about another listwise method called RankCosine, where the cosine similarity is defined as loss function. Xia et al. introduced likelihood loss as loss function in the listwise learning-to-rank method ListMLE. 2.2 Ranking Function BM25 In information retrieval, BM25 is a highly cited ranking function used by search engines to rank matching documents according to their relevance to a given search query. It is based on the probabilistic retrieval framework developed in the 1970s and 1980s. Though BM25 is proposed to rank the HTML format documents originally, it was introduced to the area of XML documents ranking in recent years. In the last three years of INEX1 [6] Ad Hoc track 2 [17] [18] [19], all the search engines that perform the best use BM25 as basic ranking function. To improve the performance of BM25, Taylor et al. introduced the pairwise learning-to-rank method RankNet to tune the parameters in BM25, named as RankNet Tuning method [13] in this paper. However, as mentioned in 2.1, the inherent disadvantages of pairwise methods had a pernicious influence on the 1
2
Initiative for the Evaluation of XML retrieval (INEX), a global evaluation platform, is launched in 2002 for organizations from Information Retrieval, Database and other relative research fields to compare the e ectiveness and eÆciency of their XML search engines. In Ad Hoc track, participates are organized to compare the retrieval e ectiveness of their XML search engines.
ListOPT: Learning to Optimize for XML Ranking
485
approach. Experiments in section 5 will compare the e ectiveness of RankNet Tuning with the other methods proposed in this paper.
3 ListOPT: A Learning-to-Optimize Approach In this section, we describe the details in the learning-to-optimize approach. Firstly, we give the formal definition of the ranking function BM25 used in XML retrieval and analyze the parameters in the formula. Then, the training process of the listwise learning-to-optimize approach ListOPT is proposed in 3.2. 3.1 BM25 in XML Retrieval Unlike the HTML retrieval, the searching retrieval results are elements in XML retrieval, the definition of BM25 is thus di erent from the traditional BM25 formula used in HTML ranking. The formal definition is as follow: ps(e Q)
Wt t¾ Q
(k 1) t f (t e)
k (1 b b Wt
log
len(e) avel )
t f (t e)
(1)
Nd n(t)
In the formula, t f (t e) is the frequency of keyword t appeared in element e; Nd is the number of files in the collection; n(t) is the number of files that contains keyword t; len(e) is the length of element e; avel is average length of elements in the collection; Q is a set of keywords; ps(e Q) is the predict relevance score of element e corresponding to query Q; b and k are two free parameters. As observed, the parameters in BM25 fall into three categories: constant parameters, fixed parameters and free parameters. For example, parameters describing the features of data collection like avel and Nd are defined as constant parameters. Given a certain query and a candidate element, t f (t e) and len(e) in the formula are fixed values. This kind of parameters is called fixed parameters. Moreover, free parameters, such as k and b in the function, are set to make the formula more adaptable to various kinds of data collections. Therefore, the ultimate objective of learning-to-optimize approach is to learn the optimal set of free parameters. 3.2 Training Process In training, there is a set of query Q q1 q2 qm . Each query qi is associated with a list of candidate elements E i (ei1 ei2 ein(i) ), where eij denotes the the j-th candidate element to query qi and n(i) is the size of E i . The candidate elements are defined as the elements that contain at least one occurrence of each keyword in the query. Moreover, each candidate elements list E i is associated with a ground-truth list Gi (gi1 gi2 gin(i) ), indicating the relevance score of each elements in E i . Given that the data collection we used only contains information of whether or not the passages in a document are
486
N. Gao et al.
relevant, we apply the F measure cite14 to evaluate the ground truth score. Given a query qi, the ground-truth score of the j-th candidate element is defined as follow: relevant relevant irrelevant relevant recall REL (1 012 ) precision recall i gj 012 precision recall precision
(2)
In the formula, relevant is the length of relevant contents highlighted by user in e, while irrelevant stands for the length of irrelevant parts. REL indicates the total length of relevant contents in the data collection. The general bias parameter is set as 0.1, denoting that the weight of precision is ten times as much as recall. Furthermore, for each query qi , we use the ranking function BM25 mentioned in 3.1 i to get the predict relevant score of each candidate element, recorded in Ri (r1i r2i rn(i) ). i i Then each ground-truth score list G and predicted score list R form a ”instance”. The loss function is defined as the ”distance” between standard results lists Di and search results lists Ri . m
L(Gi Ri )
(3)
i 1
In each training epoch, the ranking function BM25 was used to compute the predicted score Ri . Then the learning module replaced the current free parameters with the new parameters tuned according to the loss between Gi and Ri . Finally the process stops either while reaching the limit cycle index or when the parameters do not change.
4 Loss Functions In this section, three query level loss functions and the corresponding tuning formulas are discussed. Here the three definitions of loss are based on cosine similarity, Euclidean distance and cross entropy respectively. After computing the loss between the groundtruth Gi and the predicted Ri , the two free parameters k and b in BM25 are tuned as formula (4). Especially, and are set to control the learning speed. k k bb
k b
(4)
4.1 Cosine Similarity Widely used in text mining and information retrieval, cosine similarity is a measure of similarity between two vectors by finding the cosine of the angle between them. The definition of the query level loss function based on cosine similarity is: L(G R ) i
i
1 (1 2
n(i) i 1
i j
n(i) i 2 j 1 (g j )
gij rij n(i) i 2 j 1 (r j )
)
(5)
ListOPT: Learning to Optimize for XML Ranking
487
Note that in large data collection, given a query, the amount of relevant documents is regularly much less than the number of irrelevant documents. So that a penalty function i is set to avoid the learning bias on irrelevant documents. Formula (6) is the weight of j relevant documents in learning procedure, while formula (7) is the weight of irrelevant document. The formal definition is as follow:
i j
NRi NIRi NRi i NR NIRi NIRi
if (gij
0)
(6)
if (gij
0)
(7)
Where NRi is the number of relevant elements according to query qi and NIRi is the number of irrelevant ones. After measuring the loss between the ground-truth results and the predicted results, the adjustment range parameters k and b are determined according to the derivatives of k and b: With respect to k: m
k
L(G i Ri ) k
q 1
1 ( ) 2
m
In which: rqj k
Wt
jq (
q 1
n(i) j 1
gqj
n(i) q 2 j 1 (r j )
rqj k
(
n(i) q 2 j 1 (g j )
t f (t e) (t f (t e) k (1 b b
n(i) q j 1 rj
(
gqj )(
n(i) q 2 j 1 (r j )
n(i) q 2 j 1 (g j )
q r j n(i) q r ¡ k j 1 j n(i) q 2 (r ) j 1 j
(8)
) )
n(i) q 2 2 j 1 (g j ) )
t f (t e) (k 1) (1 b b len(e) ) avel 2 (t f (t e) k (1 b b len(e) )) avel
t¾Q
len(e) )) avel
(9)
b analogously: b
m
q 1
1 2
L(G i Ri ) b
( )
m
jq (
q 1
n(i) j 1
n(i) q 2 j 1 (r j )
rqj
gqj b
n(i) q 2 j 1 (g j )
(
n(i) q j 1 rj
(
q g j )(
n(i) q 2 j 1 (r j )
n(i) q 2 j 1 (g j )
q r j n(i) q r ¡ b j 1 j n(i) q 2 (r ) j 1 j
) )
n(i) q 2 2 j 1 (g j ) )
(10)
In which:
q
rj b
Wt t¾ Q
t f (t e) (k 1) k (1 (t f (t e) k (1 b b
len ) avel len(e) 2 )) avel
(11)
4.2 Euclidean Distance The Euclidean distance is also used in the definition of loss function. The circumscription of penalty parameter ij is the same as in formula (6) and (7). Hence, the loss function based on Euclidean distance is defined in formula (12).
488
N. Gao et al.
L(Gi Ri )
n(i)
( ij )2 (rij gij )2
(12)
j 1
The same as cosine similarity loss, we derive the derivatives of the loss function based on Euclidean distance with respect to k and b. The definition of as in formula (9) and formula (11) respectively. With respect to k:
k
m
q 1
L(Gi Ri )
k
q 1
b analogously:
b
m
q 1
i
i
L(G R )
b
n(i) i 2 q j 1 ( j ) (r j
m
q 1
q
are the same
q
(13)
q
g j )2
gj)
n(i) i 2 q j 1 ( j ) (r j
q
rj b
rj k
q
and
gj)
n(i) i 2 q j 1 ( j ) (r j
n(i) i 2 q j 1 ( j ) (r j
m
q
rj k
q
rj b
(14)
q
g j )2
4.3 Cross Entropy L(Gi Ri )
n(i)
i j
rij log(gij)
(15)
j 1
When considering cross entropy as metric, the loss function turns to formula (15). Moreover, the penalty parameter ij in the formula is the same as in formula (6) and (7) and the detailed tuning deflection of k and b is defined in formula (16) and formula (17) respectively. Additionally, the definition of (9) and formula (11). With respect to k:
k
m
q 1
L(Gi Ri )
k
n(i)
m q j
q rj
(
q 1
j 1
q
rj k
q
rj
k
q
rj b
and
are the same as in formula n(i)
1 n(i) j 1
gqj
q gj j 1
q
rj
k
)
(16)
)
(17)
b analogously:
b
m q 1
L(Gi Ri )
b
n(i)
m q j q 1
q rj
( j 1
q
rj
b
n(i)
1
n(i) j 1
gqj
q gj j 1
q
rj
b
5 Experiment In this section, the XML data set used in comparison experiments is first introduced. Then in section 5.2 we compare the e ectiveness of the optimized ranking function BM25 under two evaluation criterions: MAP [15] and NDCG [16]. Additionally
ListOPT: Learning to Optimize for XML Ranking
489
in section 5.3, we focus on testing the association between the number of training queries and the optimizing performance under the criterion of MAP. 5.1 Data Collection The data collection used in the experiments consists of 2,666,190 English XML files from Wikipedia, used by INEX Ad Hoc Track. The total size of these files is 50.7GB. The query set consists of 68 di erent topics from competition topics of Ad Hoc track, in which 40 queries are considered as training queries and others are test queries. Each query in the evaluation system is bound to a standard set of highlighted ”relevant content”, which is recognized manually by the participants of INEX. In the experiments, the training regards these highlighted ”relevant content” as ground truth results. 5.2 E«ect of BM25 Tuning To explore the e ect of the learning-to-optimize method ListOPT, we evaluate the effectiveness of di erent parameter sets. In the comparison experiments, Traditional Set stands for a highly used traditional set: k 2 b 075; RankNet Tuning stands for the tuning method proposed in [10]; cosine similarity, Euclidean distance and cross entropy are the learning-to-optimize methods using cosine similarity, Euclidean distance and cross entropy as loss function respectively. Evaluation System of Ad Hoc is the standard experiment platform in e ectiveness comparison here. We evaluate the searching e ectiveness of the aforementioned five methods in two criterions: MAP and NDCG. In MAP evaluation, we choose interpolated precision at 1% recall (iP[0.01]), precision at 10% recall (iP[0.10]) and MAiP as the evaluation criterions. While in NDCG evaluation, we test the retrieval e ect on NDCG@1 to NDCG@ 10. Figure 1 illustrates the comparison results under MAP measure. As is shown, the three learning-to-optimize methods proposed in this paper perform the best. It might looks confusing since that INEX 2009 Ad Hoc Focused track used to have search engine reaching ip[0.01] 0.63, compared to 0.34 the highest in our plot. However, this e ectiveness is regularly obtained by combining various ranking technologies together, such as two-layer strategy, re-ranking strategy and title tag bias, but not only BM25 itself. Given that our study is focused on BM25, it is therefore pointless to do such a combination. In this condition, the searching e ectiveness scores (ip[0.01]) in this experiment are unable to show the high level as the competitive engines in INEX did. The result presented in figure 2 show that the learning-to-optimize methods are indeed more robust in ranking tasks. The performance of the ranking methods becomes better when more results are returned. From the perspective of users, this phenomenon could be explained by the fact that INEX queries are all information query, meaning that the user’s purpose is to find more relevant information. On contrary, if the query is a navigation query, the user only needs the exact webpage. The first result is hence of highest importance and the evaluation score might decreases accordingly.
490
N. Gao et al.
Fig. 1. E ective Comparison on MAiP
Fig. 2. E ectiveness Comparison on NDCG
5.3 Number of Training Queries Figure 3 shows the relationship between the tuning e ectiveness and the quantity of training queries. In this experiment, the number of training queries changes from 1 to 40. As illustrated, the MAiP score lines all share an obvious ascent during the first several query numbers. After that, the pulsation of the performance keeps in a low level. This situation corresponds with the learning theory: when there are few queries in the training set, the learning is overfitting. With the increasing of query samples, the performance of learning procedure gets better and better till finally the most proper parameters for the data collection are found.
ListOPT: Learning to Optimize for XML Ranking
491
Fig. 3. Number of Training Queries
6 Conclusions and Future Work In this paper, we proposed a learning-to-optimize method ListOPT. ListOPT combines and utilizes the benefits of the listwise learn-to-rank technology and the traditional ranking function BM25. In the process of learning, three query level loss functions based on cosine similarity, Euclidean distance and cross entropy respectively are introduced. The experiments on a XML data set Wikipedia English confirm that the learning-to-optimize method indeed leads to a better parameter set. As future work, we will firstly try to tune Wt in the formula. Then we would like to extend the learning-to-optimize method ListOPT approach to other tuning fields, like tuning the parameters in other ranking functions or ranking methods in the future. In a further, the comparison of ListOPT and some other learning and ranking methods, such as ListNet, XRank, XReal and so on, will be done on benchmark data sets.
Acknowledgement This work was supported by the National High-Tech Research and Development Plan of China under Grant No.2009AA01Z136.
References 1. Carmel, D., Maarek, Y.S., Mandelbrod, M., et al.: Searching XML documents via XML fragments. In: SIGIR, pp. 151–158 (2003) 2. Theobald, M., Schenkel, R., Wiekum, G.: An EÆcient and Versatile Query Engine for TopX Search. In: VLDB, pp. 625–636 (2005) 3. Page, L., Brin, S., Motwani, R., Winograd, T.: The PageRank citation ranking: Bringing order to the web. Technical report, Stanford University (1998) 4. Kleinberg, J.M.: Authoritative sources in a hyperlinked environment. JACM, 604–632 (1998) 5. Cao, Z., Qin, T., Liu, T.Y., Tsai, M.F., Li, H.: Learning to rank: from pairwise approach to listwise approach. In: ICML, pp. 129–136 (2007) 6. INEX, 7. Nallapati, R.: Discriminative models for information retrieval. In: SIGIR, pp. 64–71 (2004)
492
N. Gao et al.
8. Cao, Y., Xu, J., Liu, T., Li, H., Huang, Y., Hon, H.: Adapting ranking SVM to document retrieval. In: SIGIR, pp. 186–193 (2006) 9. Freund, Y., Iyer, R., Schapire, R.E., Singer, Y.: An eÆcient boosting algorithm for combining preferences. JMLR, 933–969 (2003) 10. Burges, C., Shaked, T., Renshaw, E., Lazier, A., Deeds, M., Hamilton, N., Hullender, G.: Learning to Rank using Gradient Descent. In: ICML, pp. 89–96 (2005) 11. Qin, T., Zhang, X.D., Tsai, M.F., Wang, D.S., Liu, T.Y., Li, H.: Query-level loss functions for information retrieval. Information Processing and Management, 838–855 (2007) 12. Xia, F., Liu, T.Y., Wang, J., Zhang, W., Li, H.: Listwise approach to learning to rank: theory and algorithm. In: ICML, pp. 1192–1199 (2008) 13. Taylor, M., Zaragoza, H., Craswell, N., Robertson, S., Burges, C.: Optimisation Methods for Ranking Functions with Multiple Parameters. In: CIKM, pp. 585–593 (2006) 14. van Rijsbergen, C.J.: Information Retrieval. Butterworths, London (1979) 15. Baeza-Yates, R., Ribeiro-Neto, B.: Modern information retrieval (1999) 16. Jarvelin, K., Kekalainen, J.: IR evaluation methods for retrieving highly relevant documents. In: SIGIR, pp. 41–48 (2000) 17. Geva, S., Kamps, J., Lethonen, M., Schenkel, R., Thom, J.A., Trotman, A.: Overview of the INEX 2009 Ad Hoc Track. In: Geva, S., Kamps, J., Trotman, A. (eds.) INEX 2009. LNCS, vol. 6203, pp. 4–25. Springer, Heidelberg (2010) 18. Itakura, K.Y., Clarke, C.L.A.: University of waterloo at INEX 2008: Adhoc, book, and linkthe-wiki tracks. In: Geva, S., Kamps, J., Trotman, A. (eds.) INEX 2008. LNCS, vol. 5631, pp. 132–139. Springer, Heidelberg (2009) 19. Liu, J., Lin, H., Han, B.: Study on Reranking XML Retrieval Elements Based on Combining Strategy and Topics Categorization. In: INEX, pp. 170–176 (2007)
Item Set Mining Based on Cover Similarity Marc Segond and Christian Borgelt European Centre for Soft Computing Calle Gonzalo Guti´errez Quir´ os s/n, E-33600 Mieres (Asturias), Spain {marc.segond,christian.borgelt}@softcomputing.es
Abstract. While in standard frequent item set mining one tries to find item sets the support of which exceeds a user-specified threshold (minimum support) in a database of transactions, we strive to find item sets for which the similarity of their covers (that is, the sets of transactions containing them) exceeds a user-specified threshold. Starting from the generalized Jaccard index we extend our approach to a total of twelve specific similarity measures and a generalized form. We present an efficient mining algorithm that is inspired by the well-known Eclat algorithm and its improvements. By reporting experiments on several benchmark data sets we demonstrate that the runtime penalty incurred by the more complex (but also more informative) item set assessment is bearable and that the approach yields high quality and more useful item sets.
1
Introduction
Frequent item set mining and association rule induction are among the most intensely studied topics in data mining and knowledge discovery in databases. The enormous research efforts devoted to these tasks have led to a variety of sophisticated and efficient algorithms, among the best-known of which are Apriori [1], Eclat [27,28] and FP-growth [13]. However, these approaches, which find item sets whose support exceeds a user-specified minimum in a given transaction database, have the disadvantage that the support does not say much about the actual strength of association of the items in the set: a set of items may be frequent simply because its elements are frequent and thus their frequent co-occurrence can even be expected by chance. As a consequence, the (usually few) interesting item sets drown in a sea of irrelevant ones. In order to improve this situation, we propose in this paper to change the selection criterion, so that fewer irrelevant items sets are produced. For this we draw on the insight that for associated items their covers—that is, the set of transactions containing them—are more similar than for independent items. Starting from the Jaccard index to illustrate this idea, we explore a total of twelve specific similarity measures that can be generalized from pairs of sets (or, equivalently, from pairs of binary vectors) as well as a generalized form. By applying an Eclat-based mining algorithm to standard benchmark data sets and to the 2008/2009 Wikipedia Selection for schools, we demonstrate that the search times are bearable and that high quality item sets are produced. J.Z. Huang, L. Cao, and J. Srivastava (Eds.): PAKDD 2011, Part II, LNAI 6635, pp. 493–505, 2011. c Springer-Verlag Berlin Heidelberg 2011
494
2
M. Segond and C. Borgelt
Frequent Item Set Mining
Frequent item set mining was originally developed for market basket analysis, aiming at finding regularities in the shopping behavior of the customers of supermarkets, mail-order companies and online shops. Formally, we are given a set B of items, called the item base, and a database T of transactions. Each item represents a product, and the item base represents the set of all products on offer. The term item set refers to any subset of the item base B. Each transaction is an item set and represents a set of products that has been bought by an actual customer. Note that two or even more customers may have bought the exact same set of products. Note also that the item base B is usually not given explicitly, but only implicitly as the union of all transactions. We write T = (t1 , . . . , tn ) for a database with n transactions, thus distinguishing equal transactions by their position in the vector. In order to refer to the index set, we introduce the abbreviation INn := {k ∈ IN | k ≤ n} = {1, . . . , n}. Given an item set I ⊆ B and a transaction database T , the cover KT (I) of I w.r.t. T is defined as KT (I) = {k ∈ INn | I ⊆ tk }, that is, as the set of indices of transactions that contain I. The support sT (I) of an item set I ⊆ B is the number of transactions in the database T it is contained in, that is, sT (I) = |KT (I)|. Given a user-specified minimum support smin ∈ IN, an item set I is called frequent in T iff sT (I) ≥ smin . The goal of frequent item set mining is to identify all item sets I ⊆ B that are frequent in a given transaction database T . A standard approach to find all frequent item sets w.r.t. a given database T and a support threshold smin , which is adopted by basically all frequent item set mining algorithms (except those of the Apriori family), is a depth-first search in the subset lattice of the item base B. Viewed properly, this approach can be interpreted as a simple divide-and-conquer scheme. All subproblems that occur in this scheme can be defined by a conditional transaction database and a prefix. The prefix is a set of items that has to be added to all frequent item sets that are discovered in the conditional database, from which all items in the prefix have been removed. Formally, all subproblems are tuples S = (TC , P ), where TC is a conditional transaction database and P ⊆ B is a prefix. The initial problem, with which the recursion is started, is S = (T, ∅), where T is the given transaction database to mine and the prefix is empty. A subproblem S0 = (T0 , P0 ) is processed as follows: Choose an item i ∈ B0 , where B0 is the set of items occurring in T0 . This choice is arbitrary, but usually follows some predefined order of the items. If sT0 (i) ≥ smin , then report the item set P0 ∪ {i} as frequent with the support sT0 (i), and form the subproblem S1 = (T1 , P1 ) with P1 = P0 ∪{i}. The conditional transaction database T1 comprises all transactions in T0 that contain the item i, but with the item i removed. This also implies that transactions that contain no other item than i are entirely removed: no empty transactions are ever kept. If T1 is not empty, process S1 recursively. In any case (that is, regardless of whether sT0 (i) ≥ smin or not), form the subproblem S2 = (T2 , P2 ), where P2 = P0 and the conditional transaction database T2 comprises all transactions in T0 (including those that do not contain the item i), but again with the item i removed. If T2 is not empty, process S2 recursively.
Item Set Mining Based on Cover Similarity
3
495
Jaccard Item Sets
We base our item set mining approach on the similarity of item covers rather than on item set support. In order to measure the similarity of a set of item covers, we start with the Jaccard index [16], which is a well-known statistic for comparing sets. For two arbitrary sets A and B it is defined as J(A, B) = |A ∩ B|/|A ∪ B|. Obviously, J(A, B) is 1 if the sets coincide (i.e. A = B) and 0 if they are disjoint (i.e. A∩B = ∅). For overlapping sets its value lies between 0 and 1. The core idea of using the Jaccard index for item set mining lies in the insight that the covers of (positively) associated items are likely to have a high Jaccard index, while a low Jaccard index indicates independent or even negatively associated items. However, since we consider also item sets with more than two items, we need a generalization to more than two sets (here: item covers). In order to achieve this, we define the carrier LT (I) of an item set I w.r.t. a transaction database T as LT (I) = {k ∈ INn | I ∩ tk = ∅} = {k ∈ INn | ∃i ∈ I : i ∈ tk } = i∈I KT ({i}). The extent rT (I) of an item set I w.r.t. a transaction database T is the size of its carrier, that is, rT (I) = |LT (I)|. Together with the notions of cover and support (see above), we can define the generalized Jaccard index of an item set I w.r.t. a transaction database T as its support divided by its extent, that is, as | KT ({i})| sT (I) |KT (I)| JT (I) = = = i∈I . rT (I) |LT (I)| | i∈I KT ({i})| Clearly, this is a very natural and straightforward generalization of the Jaccard index. Since for an arbitrary item a ∈ B it is obviously KT (I ∪ {a}) ⊆ KT (I) and equally obviously LT (I ∪ {a}) ⊇ LT (I), we have sT (I ∪ {a}) ≤ sT (I) and rT (I ∪ {a}) ≥ rT (I). From these two relations it follows JT (I ∪ {a}) ≤ JT (I) and thus that the generalized Jaccard index w.r.t. a transaction database T over an item base B is an anti-monotone function on the partially ordered set (2B , ⊆). Given a user-specified minimum Jaccard value Jmin , an item set I is called Jaccard-frequent if JT (I) ≥ Jmin . The goal of Jaccard item set mining is to identify all item sets that are Jaccard-frequent in a given transaction database T . Since the generalized Jaccard index is anti-monotone, this task can be addressed with the same basic scheme as the task of frequent item set mining. The only problem to be solved is to find an efficient scheme for computing the extent rT (I).
4
The Eclat Algorithm
Since we will draw on the basic scheme of the well-known Eclat algorithm for mining Jaccard item sets, we briefly review some of its core ideas. Eclat [27] uses a purely vertical representation of conditional transaction databases, that is, it uses lists of transaction indices, which represent the cover of an item or an item set. It then exploits the obvious relation KT (I ∪ {a, b}) = KT (I ∪ {a}) ∩ KT (I ∪ {b}), which allows to extend an item set by an item. This is used in the recursive
496
M. Segond and C. Borgelt
divide-and-conquer scheme described above by intersecting the list of transaction indices associated with the split item with the lists of transaction indices of all items that have not yet been considered in the recursion. An alternative to the intersection approach, which is particularly useful for mining dense transaction databases, relies on so-called difference sets (or diffsets for short) [28]. The diffset DT (a | I) of an item a w.r.t. an item set I and a transaction database T is defined as DT (a | I) = KT (I) − KT (I ∪ {a}). That is, a diffset DT (a | I) lists the indices of all transactions that contain I, but not a. Since sT (I ∪ {a}) = sT (I) − |DT (a | I)|, diffsets are equally effective for finding frequent item sets, provided one can derive a formula that allows to compute diffsets with a larger conditional item set I without going through covers (using the above definition of a diffset). However, this is easily achieved, because DT (b | I ∪ {a}) = DT (b | I) − DT (a | I) [28]. This formula allows to formulate the search entirely with the help of diffsets.
5
The JIM Algorithm (Jaccard Item Set Mining)
The diffset approach as it was reviewed in the previous section can easily be transferred in order to find an efficient scheme for computing the carrier and thus the extent of item sets. To this end we define the extra set E_T(a | I) as

E_T(a | I) = K_T({a}) − ∪_{i∈I} K_T({i}) = {k ∈ ℕ_n | a ∈ t_k ∧ ∀i ∈ I: i ∉ t_k}.

That is, E_T(a | I) is the set of indices of all transactions that contain a, but no item in I, and thus identifies the extra transaction indices that have to be added to the carrier if item a is added to the item set I. For extra sets we have E_T(a | I ∪ {b}) = E_T(a | I) − E_T(b | I), which corresponds to the analogous formula for diffsets reviewed above. This relation is easily verified as follows:

E_T(a | I) − E_T(b | I)
  = {k ∈ ℕ_n | a ∈ t_k ∧ ∀i ∈ I: i ∉ t_k} − {k ∈ ℕ_n | b ∈ t_k ∧ ∀i ∈ I: i ∉ t_k}
  = {k ∈ ℕ_n | a ∈ t_k ∧ ∀i ∈ I: i ∉ t_k ∧ ¬(b ∈ t_k ∧ ∀i ∈ I: i ∉ t_k)}
  = {k ∈ ℕ_n | a ∈ t_k ∧ ∀i ∈ I: i ∉ t_k ∧ (b ∉ t_k ∨ ∃i ∈ I: i ∈ t_k)}
  = {k ∈ ℕ_n | (a ∈ t_k ∧ ∀i ∈ I: i ∉ t_k ∧ b ∉ t_k) ∨ (a ∈ t_k ∧ ∀i ∈ I: i ∉ t_k ∧ ∃i ∈ I: i ∈ t_k)}
    (the second disjunct is always false)
  = {k ∈ ℕ_n | a ∈ t_k ∧ ∀i ∈ I: i ∉ t_k ∧ b ∉ t_k}
  = {k ∈ ℕ_n | a ∈ t_k ∧ ∀i ∈ I ∪ {b}: i ∉ t_k}
  = E_T(a | I ∪ {b}).

In order to see how extra sets can be used to compute the extent of item sets, let I = {i_1, ..., i_m}, with some arbitrary, but fixed order of the items that is indicated by the index. This will be the order in which the items are used as split items in the recursive divide-and-conquer scheme.
Table 1. Quantities in terms of which the considered similarity measures are specified, together with their behavior as functions on the partially ordered set (2^B, ⊆)

  quantity                                     behavior
  n_T                                          constant
  s_T(I) = |K_T(I)| = |∩_{i∈I} K_T({i})|       anti-monotone
  r_T(I) = |L_T(I)| = |∪_{i∈I} K_T({i})|       monotone
  q_T(I) = r_T(I) − s_T(I)                     monotone
  z_T(I) = n_T − r_T(I)                        anti-monotone
It is

L_T(I) = ∪_{k=1}^{m} K_T({i_k}) = ∪_{k=1}^{m} ( K_T({i_k}) − ∪_{l=1}^{k−1} K_T({i_l}) ) = ∪_{k=1}^{m} E_T(i_k | {i_1, ..., i_{k−1}}),

and since the terms of the last union are clearly all disjoint, we have immediately

r_T(I) = Σ_{k=1}^{m} |E_T(i_k | {i_1, ..., i_{k−1}})| = r_T(I − {i_m}) + |E_T(i_m | I − {i_m})|.
Thus we have a simple recursive scheme to compute the extent of an item set from its parent in the search tree (as defined by the divide-and-conquer scheme). The mining algorithm can now easily be implemented as follows: initially we create a vertical representation of the given transaction database. The only difference from the Eclat algorithm is that we have two transaction lists per item i: one represents K_T({i}) and the other E_T(i | ∅), which happens to be equal to K_T({i}). (That is, for the initial transaction database the two lists are identical, which, however, will obviously not be maintained in the recursive processing.) In the recursion the first list for the split item is intersected with the first lists of all other items to form the lists representing the covers of the corresponding pairs. The second list of the split item is subtracted from the second lists of all other items, thus yielding the extra sets of transactions for these items given the split item. From the sizes of the resulting lists the support and the extent of the enlarged item sets and thus their generalized Jaccard index can be computed.
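The following Python sketch is a simplified re-implementation of this scheme for illustration only (the authors' actual implementation is a C program, see Section 7). It keeps a cover and an extra set per item, conditions both on the split item in the recursion, and prunes with the anti-monotone generalized Jaccard index; all function and variable names are our own:

```python
def jim(transactions, smin=1, jmin=0.0):
    """Enumerate item sets with support >= smin and generalized Jaccard index >= jmin."""
    items = sorted({i for t in transactions for i in t})
    cover = {i: {k for k, t in enumerate(transactions) if i in t} for i in items}
    results = []

    def recurse(prefix, prefix_extent, candidates):
        # candidates hold (item, cover of prefix+{item}, extra set E_T(item | prefix))
        for idx, (item, K, E) in enumerate(candidates):
            support = len(K)
            if support < smin:
                continue
            extent = prefix_extent + len(E)      # r_T(I+{a}) = r_T(I) + |E_T(a | I)|
            jaccard = support / extent
            if jaccard < jmin:                   # J_T is anti-monotone, so pruning is safe
                continue
            itemset = prefix + [item]
            results.append((itemset, support, jaccard))
            # condition the remaining candidates on the split item
            conditioned = [(it2, K2 & K, E2 - E)
                           for (it2, K2, E2) in candidates[idx + 1:]]
            recurse(itemset, extent, conditioned)

    recurse([], 0, [(i, cover[i], set(cover[i])) for i in items])
    return results

T = [{"a", "b"}, {"a", "b", "c"}, {"a"}, {"b", "c"}]
for itemset, s, j in jim(T, smin=1, jmin=0.5):
    print(itemset, s, round(j, 3))
```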
6 Other Similarity Measures
Up to now we have focused on the (generalized) Jaccard index to measure the similarity of sets (covers). However, there is a large number of alternatives. Recent extensive overviews for the pairwise case include [5] and [6]. The JIM algorithm (as presented above) allows us to easily compute the quantities listed in Table 1. With these quantities a wide range of similarity measures for sets or binary vectors can be generalized. Exceptions are those measures that refer explicitly to the number of cases in which a vector x is 1 while the other vector y is 0, and distinguish this number from the number
Table 2. Considered similarity measures for sets/binary vectors
Measures derived from the inner product:
  Russel & Rao [21]:                             S_R = s/n = s/(r + z)
  Kulczynski [19]:                               S_K = s/q = s/(r − s)
  Jaccard [16] / Tanimoto [26]:                  S_J = s/(s + q) = s/r
  Dice [8] / Sørensen [25] / Czekanowski [7]:    S_D = 2s/(2s + q) = 2s/(r + s)
  Sokal & Sneath 1 [24,22]:                      S_S = s/(s + 2q) = s/(r + q)

Measures derived from the Hamming distance:
  Sokal & Michener / Hamming [23,15]:            S_M = (s + z)/n = (n − q)/n
  Faith [10]:                                    S_F = (2s + z)/(2n) = (s + z/2)/n
  AZZOO [5], σ ∈ [0, 1]:                         S_Z = (s + σz)/n
  Rogers & Tanimoto [20]:                        S_T = (s + z)/(n + q) = (n − q)/(n + q)
  Sokal & Sneath 2 [24,22]:                      S_N = 2(s + z)/(n + s + z) = (n − q)/(n − q/2)
  Sokal & Sneath 3 [24,22]:                      S_O = (s + z)/q = (n − q)/q
  Baroni-Urbani & Buser [3]:                     S_B = (√(sz) + s)/(√(sz) + r)
of cases in which y is 1 and x is 0. This distinction is difficult to generalize beyond the pairwise case, because the number of possible assignments of zeros and ones to the different vectors, each of which one would have to consider for a generalization, grows exponentially with the number of these vectors (here: covers, and thus: items) and therefore becomes quickly infeasible.

By collecting from [6] similarity measures that are specified in terms of the quantities listed in Table 1, we compiled Table 2. Note that the index T and the argument I are omitted to make the formulas more easily readable. Note also that the Hamann measure S_H = (s + z − q)/n = (n − 2q)/n [14] listed in [6] is equivalent to the Sokal & Michener measure S_M, because S_H + 1 = 2 S_M, and hence omitted. Likewise, the second Baroni-Urbani & Buser measure S_U = (√(sz) + s − q)/(√(sz) + r) listed in [6] is equivalent to the one given in Table 2, because S_U + 1 = 2 S_B. Finally, note that all of the measures listed in Table 2 have range [0, 1] except S_K (Kulczynski) and S_O (Sokal & Sneath 3), which have range [0, ∞).

Table 2 is split into two parts depending on whether the numerator of a measure refers only to the support s or to both the support s and the number z of transactions that do not contain any of the items in the considered set. The former are often referred to as based on the inner product, because in the pairwise case s is the value of the inner (or scalar) product of the binary vectors that are compared. The latter measures (that is, those referring to both s and z) are referred to as based on the Hamming distance, because in the pairwise case q is the Hamming distance of the two vectors and n − q = s + z their Hamming similarity. The decision whether for a given application the term z should be considered in the numerator of a similarity measure or not is difficult. Discussions of this issue for the pairwise case can be found in [22] and [9].
Note that the Russel & Rao measure is simply the normalized support, demonstrating that our framework comprises standard frequent item set mining as a special case. The Sokal & Michener measure is simply the normalized Hamming similarity. The Dice/Sørensen/Czekanowski measure may be defined without the factor 2 in the numerator, changing the range to [0, 0.5]. The Faith measure is equivalent to the AZZOO measure (alter zero zero one one) for σ = 0.5, and the Sokal & Michener measure results for σ = 1. AZZOO is meant to introduce flexibility in how much weight should be placed on z, the number of transactions which lack all items in I (zero zero), relative to s (one one).

All measures listed in Table 2 are anti-monotone on the partially ordered set (2^B, ⊆), where B is the underlying item base. This is obvious if in at least one of the formulas given for a measure the numerator is (a multiple of) a constant or anti-monotone quantity or a (weighted) sum of such quantities, and the denominator is (a multiple of) a constant or monotone quantity or a (weighted) sum of such quantities (see Table 1). This is the case for all but S_D, S_N and S_B. That S_D is anti-monotone can be seen by considering its reciprocal value S_D^{-1} = (2s + q)/(2s) = 1 + q/(2s). Since q is monotone and s is anti-monotone, S_D^{-1} is clearly monotone and thus S_D is anti-monotone. Applying the same approach to S_B, we arrive at S_B^{-1} = (√(sz) + r)/(√(sz) + s) = (√(sz) + s + q)/(√(sz) + s) = 1 + q/(√(sz) + s). Since q is monotone and both s and √(sz) are anti-monotone, S_B^{-1} is clearly monotone and thus S_B is anti-monotone. Finally, S_N can be written as S_N = (2n − 2q)/(2n − q) = 1 − q/(2n − q) = 1 − q/(n + s + z). Since q is monotone, the numerator is monotone, and since n is constant and s and z are anti-monotone, the denominator is anti-monotone. Hence the fraction is monotone and since it is subtracted from 1, S_N is anti-monotone.

Note that all measures in Table 2 can be expressed as

S = (c_0 s + c_1 z + c_2 n + c_3 √(sz)) / (c_4 s + c_5 z + c_6 n + c_7 √(sz))    (1)

by specifying appropriate coefficients c_0, ..., c_7. For example, we obtain S_J for c_0 = c_6 = 1, c_5 = −1 and c_1 = c_2 = c_3 = c_4 = c_7 = 0, since S_J = s/r = s/(n − z). Similarly, we obtain S_O for c_0 = c_1 = c_6 = 1, c_4 = c_5 = −1 and c_2 = c_3 = c_7 = 0, since S_O = (s + z)/q = (s + z)/(n − s − z). This general form allows for a flexible specification of various similarity measures. Note, however, that not all selections of coefficients lead to an anti-monotone measure and hence one has to carefully check this property before using a measure that differs from the pre-specified ones.
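As an illustration (ours, not part of the paper), the general form (1) can be evaluated directly from the quantities s, z and n once the coefficients are fixed; the two calls below reproduce S_J and S_O for a small made-up example with s = 2, z = 1, n = 5 (hence r = 4 and q = 2):

```python
from math import sqrt

def similarity(s, z, n, coeffs):
    """Evaluate the general form (1) from the quantities s, z and n."""
    c0, c1, c2, c3, c4, c5, c6, c7 = coeffs
    numerator = c0 * s + c1 * z + c2 * n + c3 * sqrt(s * z)
    denominator = c4 * s + c5 * z + c6 * n + c7 * sqrt(s * z)
    return numerator / denominator

# Jaccard: c0 = c6 = 1, c5 = -1, all other coefficients 0, i.e. S_J = s / (n - z) = s / r
print(similarity(2, 1, 5, (1, 0, 0, 0, 0, -1, 1, 0)))    # -> 0.5
# Sokal & Sneath 3: c0 = c1 = c6 = 1, c4 = c5 = -1, rest 0, i.e. S_O = (s + z) / q
print(similarity(2, 1, 5, (1, 1, 0, 0, -1, -1, 1, 0)))   # -> 1.5
```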
7 Experiments
We implemented the described item set mining approach as a C program that was derived from an Eclat implementation by adding the second transaction identifier list for computing the extent of item sets. All similarity measures listed in Table 2 are included as well as the general form (1). This implementation has been made publicly available under the GNU Lesser (Library) Public License.¹
¹ See http://www.borgelt.net/jim.html
In a first set of experiments we applied the program to five standard benchmark data sets, which exhibit different characteristics, and compared it to a standard Eclat search. We used BMS-Webview-1 (a web click stream from a leg-care company that no longer exists, which has been used in the KDD cup 2000 [17]), T10I4D100K (an artificial data set generated with IBM's data generator [29]), census (a data set derived from an extract of the US census bureau data of 1994, which was preprocessed by discretizing numeric attributes), chess (a data set listing chess end game positions for king vs. king and rook), and mushroom (a data set describing poisonous and edible mushrooms by different attributes). The first two data sets are available in the FIMI repository [11], the last three in the UCI machine learning repository [2]. The discretization of the numeric attributes in the census data set was done with a shell/gawk script that can be found on the web page given in footnote 1. For the experiments we used an Intel Core 2 Quad Q9650 (3GHz) machine with 8 GB main memory running Ubuntu Linux 10.4 (64 bit) and gcc version 4.4.3.

The goal of these experiments was to determine how much the computation of the carrier/extent of an item set affected the execution time. Therefore we ran the JIM algorithm with Jmin = 0, using only a minimum support threshold. As a consequence, JIM and Eclat always found exactly the same set of frequent item sets and any difference in execution time comes from the additional costs of the carrier/extent computation. In addition, we checked which item order (ascending or descending w.r.t. their frequency) yields the shortest search times. The results are depicted in the diagrams in Figure 1.

We observe that processing the items in increasing order of frequency always works better for Eclat (black and grey curves), as expected. For JIM, however, the best order depends on the data set: on census, BMS-Webview-1 and T10I4D100K descending order is better (red curve is lower than blue), on chess ascending order is better (blue curve is lower than red), while on mushroom it depends on the minimum support which order yields the shorter time (red curve intersects blue).

We interpret these findings as follows: for the support computation (which is all Eclat does) it is better to process the items in ascending order, because this reduces the average length of the transaction id lists. By intersecting with short lists early, the lists processed in the recursion tend to be shorter and thus are processed faster. However, for the extent computation the opposite order is preferable. Since it works on extra sets, it is advantageous to add frequent items as early as possible to the carrier, because this increases the size of the already covered carrier and thus reduces the average length of the extra lists. Therefore, since there are different preferences, it depends on the data set which operation governs the complexity and thus which item order is better. From Figure 1 we conjecture that dense data sets (high fraction of ones in a bit matrix representation), like chess and mushroom, favor ascending order, while sparse data sets, like census, BMS-Webview-1 and T10I4D100K, favor descending order. This is plausible, because in dense data sets intersection lists tend to be long, so it is important to reduce them. In sparse data sets, however, extra lists tend to be long, so here it is more important to focus on them.
[Figure 1 comprises five panels (census, chess, webview1, mushroom, T10I4D100K), each plotting the logarithm of the execution time against the absolute minimum support for jim asc., jim desc., eclat asc., and eclat desc.]
Fig. 1. Logarithms of execution times, measured in seconds, over absolute minimum support for Jaccard item set mining compared to standard Eclat frequent item set mining. Items were processed in ascending or descending order w.r.t. their frequency. Jaccard item set mining was executed with Jmin = 0, thus ensuring that exactly the same item sets are found.
Naturally, the execution times of JIM are always greater than those of the corresponding Eclat runs (with the same order of the items), but the execution times are still bearable. This shows that even if one does not use a similarity measure to prune the search, this additional information can be computed fairly efficiently. However, it should be kept in mind that the idea of the approach is to set a threshold for the similarity measure, which can effectively prune the search, so that the actual execution times found in applications are much lower. In our own practice we basically always achieved execution times that were lower than for the Eclat algorithm (but, of course, with a different output).
Table 3. Jaccard item sets found in the 2008/2009 Wikipedia Selection for schools

  item set                                                 sT      JT
  Reptiles, Insects                                        12    1.0000
  phylum, chordata, animalia                               34    0.7391
  planta, magnoliopsida, magnoliophyta                     14    0.6667
  wind, damag, storm, hurrican, landfal                    23    0.1608
  tournament, doubl, tenni, slam, Grand Slam               10    0.1370
  dinosaur, cretac, superord, sauropsida, dinosauria       10    0.1149
  decai, alpha, fusion, target, excit, dubna               12    0.1121
  conserv, binomi, phylum, concern, animalia, chordata     14    0.1053
In another experiment we used an extract from the 2008/2009 Wikipedia Selection for schools (http://schools-wikipedia.org/), which consisted of 4861 web pages. Each of these web pages was taken as a transaction and processed with standard text processing methods (name detection, stemming, stop word removal etc.) to extract a total of 59330 terms/keywords. The terms occurring on a web page are the items occurring in the corresponding transaction. The resulting data file was then mined for Jaccard item sets with a threshold of Jmin = 0.1. Some examples of found term associations are listed in Table 3. Clearly, there are several term sets with surprisingly high Jaccard indices and thus strongly associated terms. For example, "Reptiles" and "Insects" always appear together (on a total of 12 web pages) and never alone. A closer inspection revealed, however, that this is an artifact of the name detection, which extracts these terms from the Wikipedia category title "Insects, Reptiles and Fish" (but somehow treats "Fish" not as a name, but as a normal word). All other item sets contain normal terms, though (only "Grand Slam" is another name), and are no artifacts of the text processing step. The second item set captures several biology pages, which describe different vertebrates, all of which belong to the phylum "chordata" and the kingdom "animalia". The third set indicates that this selection contains a surprisingly high number of pages referring to magnolias. The remaining item sets show that term sets with five or even six terms can exhibit a quite high Jaccard index, even though they have a fairly low support.

An impression of the filtering power can be obtained by comparing the size of the output to standard frequent item set mining: for smin = 10 there are 83130 frequent item sets and 19394 closed item sets with at least two items. A threshold of Jmin = 0.1 for the (generalized) Jaccard index reduces the output to 5116 (frequent) item sets. From manual inspection, we gathered the impression that the Jaccard item sets contained more meaningful sets and that the Jaccard index was a valuable additional piece of information. It has to be conceded, though, that whether item sets are more "meaningful" or "interesting" is difficult to assess, because this requires an objective measure, which is not available. However, the usefulness of our method is indirectly supported by a successful application of the Jaccard item set mining approach for concept detection, for
which standard frequent item set mining did not yield sufficiently good results. This was carried out in the EU FP7 project BISON (see http://www.bisonet.eu/) and is reported in [18].
8 Conclusions
We introduced the notion of a Jaccard item set as an item set for which the (generalized) Jaccard index of its item covers exceeds a user-specified threshold. In addition, we extended this basic idea to a total of twelve similarity measures for sets or binary vectors, all of which can be generalized in the same way and can be shown to be anti-monotone. By exploiting an idea that is similar to the difference set approach for the well-known Eclat algorithm, we derived an efficient search scheme that is based on forming intersections and differences of sets of transaction indices in order to compute the quantities that are needed to compute the similarity measures. Since it contains standard frequent item set mining as a special case, mining item sets based on cover similarity yields a flexible and versatile framework. Furthermore, the similarity measures provide highly useful additional assessments of found item sets and thus help us to select the interesting ones. By running experiments on standard benchmark data sets we showed that mining item sets based on cover similarity can be done fairly efficiently, and by evaluating the results obtained with a threshold for the cover similarity measure we demonstrated that the output is considerably reduced, while expressive and meaningful item sets are preserved.
Acknowledgements

This work was supported by the European Commission under the 7th Framework Program FP7-ICT-2007-C FET-Open, contract no. BISON-211898.
References
1. Agrawal, R., Srikant, R.: Fast Algorithms for Mining Association Rules. In: Proc. 20th Int. Conf. on Very Large Databases (VLDB 1994), Santiago de Chile, pp. 487–499. Morgan Kaufmann, San Mateo (1994)
2. Asuncion, A., Newman, D.J.: UCI Machine Learning Repository. School of Information and Computer Science, University of California at Irvine, CA, USA (2007), http://www.ics.uci.edu/~mlearn/MLRepository.html
3. Baroni-Urbani, C., Buser, M.W.: Similarity of Binary Data. Systematic Zoology 25(3), 251–259 (1976)
4. Bayardo, R., Goethals, B., Zaki, M.J. (eds.): Proc. Workshop Frequent Item Set Mining Implementations (FIMI 2004), Brighton, UK. CEUR Workshop Proceedings 126, Aachen, Germany (2004), http://www.ceur-ws.org/Vol-126/
5. Cha, S.-H., Tappert, C.C., Yoon, S.: Enhancing Binary Feature Vector Similarity Measures. J. Pattern Recognition Research 1, 63–77 (2006)
6. Choi, S.-S., Cha, S.-H., Tappert, C.C.: A Survey of Binary Similarity and Distance Measures. Journal of Systemics, Cybernetics and Informatics 8(1), 43–48 (2010)
7. Czekanowski, J.: Zarys metod statystycznych w zastosowaniu do antropologii [An Outline of Statistical Methods Applied in Anthropology]. Towarzystwo Naukowe Warszawskie, Warsaw (1913)
8. Dice, L.R.: Measures of the Amount of Ecologic Association between Species. Ecology 26, 297–302 (1945)
9. Dunn, G., Everitt, B.S.: An Introduction to Mathematical Taxonomy. Cambridge University Press, Cambridge (1982)
10. Faith, D.P.: Asymmetric Binary Similarity Measures. Oecologia 57(3), 287–290 (1983)
11. Goethals, B. (ed.): Frequent Item Set Mining Dataset Repository. University of Helsinki, Finland (2004), http://fimi.cs.helsinki.fi/data/
12. Goethals, B., Zaki, M.J. (eds.): Proc. Workshop Frequent Item Set Mining Implementations (FIMI 2003), Melbourne, FL, USA. CEUR Workshop Proceedings 90, Aachen, Germany (2003), http://www.ceur-ws.org/Vol-90/
13. Han, J., Pei, H., Yin, Y.: Mining Frequent Patterns without Candidate Generation. In: Proc. Conf. on the Management of Data (SIGMOD 2000), Dallas, TX, pp. 1–12. ACM Press, New York (2000)
14. Hamann, V.: Merkmalbestand und Verwandtschaftsbeziehungen der Farinosae. Ein Beitrag zum System der Monokotyledonen 2, 639–768 (1961)
15. Hamming, R.V.: Error Detecting and Error Correcting Codes. Bell Systems Tech. Journal 29, 147–160 (1950)
16. Jaccard, P.: Étude comparative de la distribution florale dans une portion des Alpes et des Jura. Bulletin de la Société Vaudoise des Sciences Naturelles 37, 547–579 (1901)
17. Kohavi, R., Bradley, C.E., Frasca, B., Mason, L., Zheng, Z.: KDD-Cup 2000 Organizers' Report: Peeling the Onion. SIGKDD Exploration 2(2), 86–93 (2000)
18. Kötter, T., Berthold, M.R.: Concept Detection. In: Proc. 8th Conf. on Computing and Philosophy (ECAP 2010). University of Munich, Germany (2010)
19. Kulczynski, S.: Classe des Sciences Mathématiques et Naturelles. Bulletin Int. de l'Académie Polonaise des Sciences et des Lettres Série B (Sciences Naturelles) (Supplement II), 57–203 (1927)
20. Rogers, D.J., Tanimoto, T.T.: A Computer Program for Classifying Plants. Science 132, 1115–1118 (1960)
21. Russel, P.F., Rao, T.R.: On Habitat and Association of Species of Anopheline Larvae in South-eastern Madras. J. Malaria Institute 3, 153–178 (1940)
22. Sneath, P.H.A., Sokal, R.R.: Numerical Taxonomy. Freeman Books, San Francisco (1973)
23. Sokal, R.R., Michener, C.D.: A Statistical Method for Evaluating Systematic Relationships. University of Kansas Scientific Bulletin 38, 1409–1438 (1958)
24. Sokal, R.R., Sneath, P.H.A.: Principles of Numerical Taxonomy. Freeman Books, San Francisco (1963)
25. Sørensen, T.: A Method of Establishing Groups of Equal Amplitude in Plant Sociology based on Similarity of Species and its Application to Analyses of the Vegetation on Danish Commons. Biologiske Skrifter / Kongelige Danske Videnskabernes Selskab 5(4), 1–34 (1948)
26. Tanimoto, T.T.: IBM Internal Report, November 17 (1957)
27. Zaki, M.J., Parthasarathy, S., Ogihara, M., Li, W.: New Algorithms for Fast Discovery of Association Rules. In: Proc. 3rd Int. Conf. on Knowledge Discovery and Data Mining (KDD 1997), Newport Beach, CA, pp. 283–296. AAAI Press, Menlo Park (1997)
28. Zaki, M.J., Gouda, K.: Fast Vertical Mining Using Diffsets. In: Proc. 9th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD 2003), Washington, DC, pp. 326–335. ACM Press, New York (2003)
29. Synthetic Data Generation Code for Associations and Sequential Patterns. Intelligent Information Systems, IBM Almaden Research Center, http://www.almaden.ibm.com/software/quest/Resources/index.shtml
Learning to Advertise: How Many Ads Are Enough?

Bo Wang¹, Zhaonan Li², Jie Tang², Kuo Zhang³, Songcan Chen¹, and Liyun Ru³

¹ Department of Computer Science and Engineering, Nanjing University of Aeronautics and Astronautics, China
² Department of Computer Science, Tsinghua University, Beijing, China
³ Sohu Inc. R&D Center, Beijing, China
Abstract. Sponsored advertisement (ad) has already become the major source of revenue for most popular search engines. One fundamental challenge facing all search engines is how to achieve a balance between the number of displayed ads and the potential annoyance to the users. Displaying more ads would improve the chance of the user clicking an ad. However, when the ads are not really relevant to the users' interests, displaying more may annoy them and even "train" them to ignore ads. In this paper, we study the problem of how many ads should be displayed for a given query. We use statistics on real ads click-through data to show the existence of the problem and the possibility of predicting the ideal number. There are two main observations: 1) when the click entropy of a query exceeds a threshold, the CTR of that query will be very near zero; 2) the threshold of click entropy can be automatically determined when the number of removed ads is given. Further, we propose a learning approach to rank the ads and to predict the number of displayed ads for a given query. The experimental results on a commercial search engine dataset validate the effectiveness of the proposed approach.
1 Introduction
Sponsored search places ads on the result pages of web search engines for different queries. All major web search engines (Google, Microsoft, Yahoo!) derive significant revenue from such ads. However, the advertisement problem is often treated as the same problem as traditional web search, i.e., to find the most relevant ads for a given query. One different and also usually ignored problem is "how many ads are enough for a sponsored search". Recently, a few research works have been conducted on this problem [5,6,8,17]. For example, Broder et al. study the problem of "whether to swing", that is, whether to show ads for an incoming query [3]; Zhu et al. propose a method to directly optimize the revenue in sponsored search [22]. In most existing search engines, the problem has been
treated as an engineering issue. For example, some search engines always display a fixed number of ads and others use heuristic rules to determine the number of displayed ads. However, the key question is still open, i.e., how to optimize the number of displayed ads for an incoming query?
Motivation Example. Figure 1 (a) illustrates an example of sponsored search. The query is "house"; the first result is a suggested ad with a yellow background, and the search results are listed in the bottom of the page. Our goal is to predict the number of displayed ads for a given query. The problem is not easy, as it is usually difficult to accurately define the relevance between an ad and the query. We conducted several statistical studies on the log data of a commercial search engine; the procedure has two stages: first, for each query, we obtain all the returned ads by the search engine; second, we use some method to remove several unnecessarily displayed ads (detailed in Section 4). Figure 1 (b) and (c) show the statistical results on a large click-through dataset (the DS BroadMatch dataset in Section 3). The number of "removed ads" refers to the total number of ads cut off in the second stage for all the queries. Figure 1(b) shows how #clicks and Click-Through-Rate (CTR) vary with the number of removed ads. We see that as the number of removed ads increases, #clicks decreases, while CTR clearly increases. This matches our intuition well: displaying more ads will gain more clicks, but if many of them are irrelevant, it will hurt CTR. Figure 1(c) further shows how CTR increases as #clicks decreases. This is very interesting. It is also reported that many clicks on the first displayed ad are made before users realize that it is not the first search result. A basic idea here is that we can remove some displayed ads to achieve a better performance on CTR.
Fig. 1. Motivation example: (a) sponsored search; (b) number of removed ads vs. #clicks and CTR; (c) increase of CTR vs. decline of clicks
Thus, the problem becomes how to predict the number of displayed ads for an incoming query, which is non-trivial and poses two unique challenges:
• Ad ranking. For a given query, a list of related ads will be returned. Ads displayed at the top positions should be more relevant to the query. Thus, the first challenge is how to rank these ads.
• Ad number prediction. After we get the ranked list of ads, it is necessary to answer the question "how many ads should we show?".
Contributions. To address the above two challenges, we propose a learning-based framework. To summarize, our contributions are three-fold:
• We performed a deep analysis of the click-through data and found that when the click entropy of a query exceeds a threshold, the CTR of that query will be very near zero.
• We developed a method to determine the number of displayed ads for a given query by an automatically selected threshold of click entropy.
• We conducted experiments on a commercial search engine and experimental results validate the effectiveness of the proposed approach.
2 Problem Definition
Suppose we have the click-through data collected from a search engine; each record can be represented by a triple {q, adq(p), cq(p)}, where for each query q, adq(p) is the ad at position p returned by the search engine and cq(p) is a binary indicator which is 1 if this ad is clicked for this query, and 0 otherwise. For each ad adq(p), there is an associated feature vector xq(p) extracted from the query-ad pair (q, adq(p)) that can be utilized for learning the ranking model.

Ad Ranking: Given the training data denoted by L = {q, ADq, Cq}q∈Q, in which Q is the query collection, for each q ∈ Q, ADq = {adq(1), ..., adq(nq)} is its related ad list and Cq = {cq(1), ..., cq(nq)} contains the click indicators, where nq is the total number of displayed ads. Similarly, the test data can be denoted by T = {q′, ADq′}q′∈Q′, where Q′ is the query collection of the test set. In this task, we try to learn a ranking function for displaying the query-related ads by relevance. For each query q′ ∈ Q′, the output of this task is the ranked ad list Rq′ = {adq′(i1), ..., adq′(inq′)}, where (i1, ..., inq′) is a permutation of (1, ..., nq′).

Ad Number Prediction: Given the ranked ad list Rq′ for query q′, in this task we try to determine the number of displayed ads k and then display the top-k ads. The output of this task can be denoted by a tuple O = {q′, Rq′^k}q′∈Q′, where Rq′^k are the top-k ads from Rq′.

Our problem is quite different from existing works on advertisement recommendation. Zhu et al. propose a method to directly optimize the revenue in sponsored search [22]. However, they only consider how to maximize the revenue, but ignore the experience of users. Actually, when no ads are relevant to the users' interests, displaying irrelevant ads may lead to many complaints from the users and even train them to ignore ads. Broder et al. study the problem of "whether to swing", that is, whether to show ads for an incoming query [3]. However, they simplify the problem as a binary classification problem, while in most real cases, the problem is more complex and often requires a dynamic number for the displayed ads. Few works have been done on dynamically predicting the number of displayed ads for a given query.
3 Data Insight Analysis
3.1 Data Set
In this paper, we use one month of click-through data collected from the log of a well-known Chinese search engine, Sogou (http://www.sogou.com), the search department of the Sohu company, which is a premier online brand in China and indispensable to the daily life of millions of Chinese users. In the data set, each record consists of the user's query, the ad's keyword, the ad's title, the ad's description, the displayed position of the ad, and the ad's bidding price. For the training dataset DS BroadMatch, the total size is around 3.5GB; it contains about 4 million queries, 60k keywords and 80k ads with 25 million records and 400k clicks. In this dataset, an ad is triggered when there are common words between the keyword and the user's search query. We also have another small training dataset, DS ExactMatch, which is a subset of DS BroadMatch and contains about 28k queries, 29k keywords and 53k ads with 4.4 million records and 150k clicks. In DS ExactMatch, an ad is triggered only when there are exactly matched words between the keywords and the user's search query. For the test set, the total size is about 90MB with 430k records and 1k clicks.
3.2 Position vs. Click-Through Rate (CTR)
Figure 2 illustrates how CTR varies with position on the dataset DS ExactMatch. We can see that clicks mainly fall on the top three positions of the ad list for a query, so clicks are position-dependent.
7
the number of removed ads
0.06 0.05
CTR
0.04 0.03 0.02 0.01 0 1
2
3
4
5
6
7
8
9
x 10
6 5 4 3 2 1 0 0.5
10
1
1.5
2
2.5
3
3.5
4
click entropy of query
Position
Fig. 2. How CTR varies with the positions
Fig. 3. How the number of removed ads varies with the click entropy of a query
3.3 Click Entropy
In this section we conduct several data analyses based on the measure called click entropy. For a given query q, the click entropy is defined as follows [11]:

ClickEntropy(q) = Σ_{ad ∈ P(q)} −P(ad|q) log2 P(ad|q)    (1)

where P(q) is the collection of ads clicked on query q and P(ad|q) = |Clicks(q, ad)| / |Clicks(q)| is the ratio of the number of clicks on ad to the number of clicks on query q.
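For illustration (ours, not part of the paper), the click entropy of Eq. (1) can be computed from a query's click log as follows; the ad identifiers and click lists are made up:

```python
from collections import Counter
from math import log2

def click_entropy(clicked_ads):
    """Click entropy of a query, computed from the list of ads that received its clicks."""
    counts = Counter(clicked_ads)
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())

# all clicks on one ad -> entropy 0; clicks spread over many ads -> higher entropy
print(click_entropy(["ad1", "ad1", "ad1"]))          # 0.0
print(click_entropy(["ad1", "ad2", "ad3", "ad4"]))   # 2.0
```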
Fig. 4. How Max-Clicked-Position varies with the click entropy on the two datasets
A smaller click entropy means that the majority of users agree with each other on a small number of ads, while a larger click entropy indicates a bigger query diversity, that is, many different ads are clicked for the same query.

Click Entropy vs. #Removed ads. Figure 3 shows how the number of removed ads varies with the click entropy of a query on the dataset DS BroadMatch. By this distribution, for a query, if we want to remove a given number of ads, we can automatically obtain the threshold of the click entropy, which can be utilized to help determine the number of displayed ads.

Click Entropy vs. Max-Clicked-Position. For a query, the Max-Clicked-Position is the last position of a clicked ad. Figure 4 shows how the Max-Clicked-Position varies with the click entropy on the two datasets. The observations are as follows:
• As the click entropy increases, the Max-Clicked-Position becomes larger.
• The values of click entropy on the dataset DS ExactMatch are in a smaller range than on DS BroadMatch, which implies that when the query and keywords are exactly matched, the click actions of users are more likely to be consistent.
• On the dataset DS ExactMatch, the clicked positions vary from 1 to 10, while on the dataset DS BroadMatch, the clicked positions only vary from 1 to 4. The intuition behind this observation is that when query and keyword are exactly matched, users will scan all the ads because of the high relevance, but when query and keyword are broadly matched, users will only scan the top four ads and ignore the others. This is very interesting and implies that for a query broadly matched with the ads' keywords, we should display fewer ads than for one exactly matched with the ads' keywords.

Click Entropy vs. QueryCTR. Figure 5 shows how QueryCTR varies with the click entropy of a query. QueryCTR is the ratio of the number of clicks of a query to the number of impressions of this query. We can conclude that when the click entropy of a query is greater than 3, the QueryCTR will be very near zero. This observation is very interesting: since the click entropy of a query accumulates over its clicked ads, we can utilize it to help determine the number of displayed ads for a given query.
Fig. 5. How QueryCTR varies with the click entropy on the two datasets
4 Ad Ranking and Number Prediction
4.1 Basic Idea
We propose a two-stage approach corresponding to the two challenges of our problem. First, we learn a function for predicting CTR based on the click-through data, by which the ads can be ranked. Second, we propose a heuristic method to determine the number of displayed ads based on the click entropy of the query. For a query, the click entropy is the summation of the entropy of each clicked ad, so we consider the ads in a top-down manner; once the addition of one ad causes the click entropy to exceed a predefined threshold, we cut off the remaining ads. In this way, we can automatically determine the number of displayed ads.
4.2 Learning Algorithm
Ad Ranking: In this task, we aim to rank all the related ads of a given query by relevance. Specifically, given each record {q, adq(p), cq(p)} from the click-through data L = {q, ADq, Cq}q∈Q, we first extract its associated feature vector xq(p) from the query-ad pair and thus obtain one training instance {xq(p), cq(p)}. Similarly, we can generate the whole training data L = {xq(p), cq(p)}q∈Q, p=1,...,nq ⊂ R^d × {0, 1} from the click-through data, where d is the number of features. Let (x, c) ∈ L be an instance from the training data, where x ∈ R^d is the feature vector and c ∈ {0, 1} is the associated click indicator. In order to predict the CTR of an ad, we can learn a logistic regression model as follows, whose output is the probability of that ad being clicked:

P(c = 1 | x) = 1 / (1 + exp(−Σ_i w_i x_i))    (2)
where x_i is the i-th feature of x and w_i is the weight for that feature. P(c = 1|x) is the predicted CTR of the ad whose feature vector is x. For training, we can use the maximum likelihood method for parameter learning; for testing, given a query, we use the learnt logistic regression model to predict the CTR of each ad.
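A minimal sketch of this model (our illustration, not the authors' implementation) with a plain gradient-ascent maximum-likelihood fit might look as follows; the feature vectors, learning rate and epoch count are purely illustrative:

```python
import numpy as np

def predict_ctr(w, x):
    """P(c = 1 | x) = 1 / (1 + exp(-w . x)) as in Eq. (2)."""
    return 1.0 / (1.0 + np.exp(-np.dot(w, x)))

def fit_logistic(X, c, lr=0.1, epochs=500):
    """Maximum-likelihood estimation by gradient ascent on the Bernoulli log-likelihood."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w += lr * X.T @ (c - p) / len(c)
    return w

# toy query-ad feature vectors (first column is a bias term) and click indicators
X = np.array([[1.0, 0.9], [1.0, 0.2], [1.0, 0.7], [1.0, 0.1]])
c = np.array([1, 0, 1, 0])
w = fit_logistic(X, c)
print(predict_ctr(w, np.array([1.0, 0.8])))   # predicted CTR for a new query-ad pair
```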
Algorithm 1. Learning to Advertise
Input: Training set: L = {q, ADq, Cq}q∈Q; Test set: T = {q′, ADq′}q′∈Q′; Threshold of click entropy: η
Output: the number of displayed ads k and O = {q′, Rq′}q′∈Q′
Ad ranking:
 1: Learn a function for predicting CTRs from L: P(c = 1|x) = 1 / (1 + exp(−Σ_i w_i x_i))
Ad number prediction:
 2: for q′ ∈ Q′ do
 3:   Rank ADq′ by the predicted CTRs P(c = 1|x)
 4:   Let the number k = 0 and click entropy CE = 0
 5:   while CE ≤ η do
 6:     k = k + 1
 7:     if adq′(k) is predicted to be clicked then
 8:       CE = CE − P(adq′(k)|q′) log2 P(adq′(k)|q′)
 9:     end if
10:   end while
11:   Rq′ = ADq′(1 : k)
12:   Output k and O = {q′, Rq′}q′∈Q′
13: end for
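A compact Python rendering of the ad-number-prediction part of Algorithm 1 (our own simplification: every displayed ad contributes its normalized predicted CTR to the entropy, whereas the algorithm only counts ads predicted to be clicked) could look like this; the CTR values are made up:

```python
from math import log2

def select_ads(ranked_ads, predicted_ctr, eta):
    """Return the top-k ads, where k is the first point at which the accumulated
    click entropy exceeds the threshold eta."""
    total = sum(predicted_ctr[ad] for ad in ranked_ads) or 1.0
    shown, entropy = [], 0.0
    for ad in ranked_ads:                   # ads already ranked by predicted CTR
        if entropy > eta:
            break
        shown.append(ad)
        p = predicted_ctr[ad] / total       # P(ad | q) approximated by normalized CTRs
        if p > 0:
            entropy -= p * log2(p)
    return shown

ctr = {"ad1": 0.20, "ad2": 0.15, "ad3": 0.01, "ad4": 0.005}
ranked = sorted(ctr, key=ctr.get, reverse=True)
print(select_ads(ranked, ctr, eta=1.0))     # -> ['ad1', 'ad2']
```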
Ad Number Prediction: Given a query q, we can incrementally add one ad to the set of displayed ads in a top-down manner, and the clicked ads will contribute to the click entropy. We can repeat this process until the click entropy exceeds a predefined threshold, and then stop. By then, the size of that set is exactly the number of displayed ads for that query. It is also worth noting how to automatically determine the threshold of click entropy. Figure 3 demonstrates that when the click entropy of a query exceeds 3, the QueryCTR of that query will be very near zero. According to this relationship, we can learn a fitting model (e.g. a regression model) from the statistics of the data; then, for a given number of ads to be cut down, we can use the learned model to predict the threshold of click entropy. The method can also be applied to a new query. Based on the learned logistic model, we can first predict the CTR for each ad related to the new query [17], then predict the number of ads based on the click entropy for the new query.
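As a purely hypothetical example of such a fitting model (the numbers below are invented and are not statistics from the paper), a low-degree polynomial can map a desired number of removed ads to a click-entropy threshold:

```python
import numpy as np

# invented summary statistics in the spirit of Figure 3: for several entropy
# thresholds, how many ads would be removed from the displayed lists
entropy_thresholds = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0])
ads_removed = np.array([70000, 52000, 38000, 25000, 14000, 6000])

# fit a curve so that a target number of removed ads can be translated
# into the entropy threshold eta used by Algorithm 1
fit = np.polyfit(ads_removed, entropy_thresholds, deg=2)
print(np.polyval(fit, 30000))   # suggested click-entropy threshold for removing ~30k ads
```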
4.3 Feature Definition
Table 1 lists all the 30 features extracted from the query-ad title pair, query-ad pair, and query-keyword pair, which can be divided into three categories: Relevance-related, CTR-related and Ads-related.

Relevance-related features. The relevance-related features consist of low-level and high-level ones. The low-level features include highlight, TF, TF*IDF
Table 1. Feature definitions between query and ads

Relevance-related:
  1  Highlight of title:   ratio of highlight terms of query within title to the length of title
  2  Highlight of ad:      ratio of highlight terms of query within ad title+description to the length of ad title+description
  3  TF of title:          term frequency between query and the title of ad
  4  TF of ad:             term frequency between query and the title+description of ad
  5  TF of keyword:        term frequency between query and the keyword of ad
  6  TF*IDF of title:      TF*IDF between query and the title of ad
  7  TF*IDF of ad:         TF*IDF between query and the title+description of ad
  8  TF*IDF of keyword:    TF*IDF between query and the keyword of ad
  9  Overlap of title:     1 if query terms appear in the title of ad; 0 otherwise
 10  Overlap of ad:        1 if query terms appear in the title+description of ad; 0 otherwise
 11  Overlap of keyword:   1 if query terms appear in the keywords of ad; 0 otherwise
 12  cos sim of title:     cosine similarity between query and the title of ad
 13  cos sim of ad:        cosine similarity between query and the title+description of ad
 14  cos sim of keyword:   cosine similarity between query and the keywords of ad
 15  BM25 of title:        BM25 value between query and the title of ad
 16  BM25 of ad:           BM25 value between query and the title+description of ad
 17  BM25 of keyword:      BM25 value between query and the keywords of ad
 18  LMIR of title:        LMIR value between query and the title of ad
 19  LMIR of ad:           LMIR value between query and the title+description of ad
 20  LMIR of keyword:      LMIR value between query and the keywords of ad

CTR-related:
 21  keyCTR:               CTR of keywords
 22  titleCTR:             CTR of the title of ad
 23  adCTR:                CTR of title+description of ad
 24  keyTitleCTR:          CTR of keyword+title of ad
 25  keyAdCTR:             CTR of keyword+title+description of ad

Ads-related:
 26  title length:         the length of title of ad
 27  ad length:            the length of title+description of ad
 28  bidding price:        bidding price of keyword
 29  match type:           match type between query and ad (exact match, broad match)
 30  position:             position of ad in the ad list
and the overlap, which can be used to measure the relevance based on keyword matching. The high-level features include cosine similarity, BM25 and LMIR, which can be used to measure the relevance beyond keyword matching.

CTR-related features. AdCTR can be defined as the ratio of the number of ad clicks to the total number of ad impressions. Similarly, we can define keyCTR and titleCTR. KeyCTR corresponds to multiple advertising for the specific keyword, and titleCTR corresponds to multiple advertising with the same ad title. We also introduce the features keyTitleCTR and keyAdCTR, because the assignment of a keyword to an ad is usually determined by the sponsors and the search engine company, and the quality of this assignment will affect the ad CTR.

Ads-related features. We introduce some features for the ads themselves, such as the length of the ad title, the bidding price, the match type and the position.
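To make the low-level and high-level relevance features concrete, here is a small sketch (ours; a real system would add stemming, IDF weighting and highlight detection) of the overlap, TF and cosine-similarity features between a query and an ad title:

```python
import math
from collections import Counter

def overlap(query_terms, ad_terms):
    """Binary overlap feature: 1 if any query term appears in the ad text, else 0."""
    return int(any(t in ad_terms for t in query_terms))

def term_frequency(query_terms, ad_terms):
    """TF feature: how often the query terms occur in the ad text."""
    counts = Counter(ad_terms)
    return sum(counts[t] for t in query_terms)

def cosine_similarity(query_terms, ad_terms):
    """Cosine similarity between the term-count vectors of query and ad text."""
    q, a = Counter(query_terms), Counter(ad_terms)
    dot = sum(q[t] * a[t] for t in q)
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(sum(v * v for v in a.values()))
    return dot / norm if norm else 0.0

query = "cheap flight tickets".split()
title = "cheap tickets to beijing flight deals".split()
print(overlap(query, title), term_frequency(query, title), round(cosine_similarity(query, title), 3))
```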
5 Experimental Results
5.1 Evaluation, Baselines and Experiment Setting
Evaluation. We qualitatively evaluate all the methods by the total number of clicks for all queries in the test dataset: #click(q) = Σ_{p=1}^{nq} cq(p).
Fig. 6. (a) How the total number of clicks varies with the number of removed ads for all three methods; (b) How the total number of clicks and the total number of removed ads vary with the threshold of click entropy
For evaluation, we first remove a certain number of ads for a query in the test dataset in different ways, and then find the way which leads to the least reduction in the number of clicks.

Baselines. In order to quantitatively evaluate our approach, we compare our method with two other baselines. Assume that we want to cut down N ads in total. For the first baseline, LR CTR, for each query in the test dataset we predict the CTRs for the query-related ads, then pool the returned ads for all the queries and re-rank them by the predicted CTRs, finally removing the last N ads with the lowest CTRs. The major problem for LR CTR is that it cannot be updated in an online manner, that is, we need to know all the predicted CTRs for all the queries in the test dataset in advance. This is impossible for determining the removed ads for a given query. For the second baseline, LR RANDOM, we predict the CTRs of the query-related ads for each query in the test dataset, and then only remove the last ad with some probability for each query. We can tune the probability for removing a certain number of ads; the disadvantage is that there is no explicit correspondence between these two. For our proposed approach, LR CE, we first automatically determine the threshold of click entropy for a query and then use Algorithm 1 to remove the ads. Our approach does not suffer from the disadvantages of the above two baselines.

Experiment Setting. All the experiments are carried out on a PC running Windows XP with an AMD Athlon 64 X2 Processor (2GHz) and 2G RAM. We use the predicted CTRs from the ad ranking task to approximate the term P(ad|q) in Eq. 1 in this way: P(ad|q) = CTR(ad) / Σ_i CTR(ad_i), where CTR(ad) and CTR(ad_i) are the predicted CTRs of the current ad and the i-th related ad for query q, respectively. For training, we use the feature "position"; for testing, we set the feature "position" to zero for all instances.
5.2 Results and Analysis
#Removed ads vs. #Clicks. Figure 6(a) shows all the results of two baselines and our approach. From that, the main observations are as follows:
• Performance. The method LR CTR obtains the optimal solution by the measure #click. Our approach LR CE is near the optimal solution, and the baseline LR RANDOM is the worst.
• User specification. From the viewpoint of the search engines, they may want to cut down a specific number of ads to reduce the number of irrelevant impressions while preserving the relevant ones. To address this issue, our approach LR CE can first automatically determine the threshold of click entropy via the relationship in Figure 3 and then determine the displayed ads. This case cannot be dealt with by LR CTR, because it needs to know all the click-through information in advance and then make a global analysis for removing irrelevant ads. Further, for a specific query, it cannot determine exactly which ads should be displayed.

#Removed ads vs. CTR and #Clicks. Figure 6(b) shows how the total number of clicks and the number of removed ads vary with the threshold of click entropy. As the threshold of click entropy increases, the total number of clicks increases while the number of removed ads decreases.
5.3 Feature Contribution Analysis
All the following analyses are conducted on the dataset DS ExactMatch.

Features vs. keyTitleCTR. Figure 7 shows some statistics of the ad click-through data. When the values of the features in Figure 7 (a) and (c) increase, the keyTitleCTRs also increase, while as the feature in Figure 7 (b) increases, the keyTitleCTR first increases and then decreases.
Fig. 7. How keyTitleCTR varies with three different features: (a) Highlight, (b) TF, (c) Cosine
Feature Ranking. Recursive feature elimination (RFE) uses a greedy strategy for feature selection [22]. At each step, the algorithm tries to find the most useless feature and eliminate it. In this analysis, we use the Akaike Information Criterion (AIC) to select useful features. After excluding one feature, the lower the increase of the AIC is, the more useless the removed feature is. The process is repeated until only one feature is left. Finally, we obtain a ranking list of our features, and the top three are keyTitleCTR, position and cos sim of title.
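A sketch of this backward-elimination procedure (our illustration; the AIC is computed from a scikit-learn logistic model rather than the authors' exact setup, and the synthetic data merely stands in for the 30 query-ad features) might look as follows:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def aic(X, y):
    """AIC = 2k - 2 ln L for a logistic model fit on the given feature columns."""
    model = LogisticRegression(max_iter=1000).fit(X, y)
    p = np.clip(model.predict_proba(X)[:, 1], 1e-12, 1 - 1e-12)
    log_lik = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    return 2 * (X.shape[1] + 1) - 2 * log_lik

def rank_features(X, y, names):
    """Repeatedly drop the feature whose removal increases the AIC the least."""
    remaining, dropped = list(range(X.shape[1])), []
    while len(remaining) > 1:
        scores = [(aic(X[:, [j for j in remaining if j != i]], y), i) for i in remaining]
        _, worst = min(scores)            # smallest AIC after removal = most useless feature
        remaining.remove(worst)
        dropped.append(names[worst])
    dropped.append(names[remaining[0]])
    return dropped[::-1]                  # most useful feature first

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=500) > 0).astype(int)
print(rank_features(X, y, ["f0", "f1", "f2", "f3", "f4"]))
```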
6 Related Work
CTR-based advertisement. In this category, people try to predict CTRs, by which the query-related ads can be ranked. These methods can be divided into two main categories: click models [12,23] and regression models [15]. Regarding click models, Agarwal et al. propose a spatio-temporal model to estimate CTR by a dynamic Gamma-Poisson model [1]. Craswell et al. propose four simple hypotheses for explaining the position bias, and find that the cascade model is the best one [9]. Chapelle and Zhang propose a dynamic Bayesian network to provide an unbiased estimation of relevance from the log data [5]. Guo et al. propose the click chain model based on Bayesian modeling [13]. Regarding regression models, Richardson et al. propose a positional model and leverage logistic regression to predict the CTR for new ads [17]. Chen et al. design and implement a highly scalable and efficient algorithm based on a linear Poisson regression model for behavioral targeting in the MapReduce framework [6]. There are also many other works [14]. For example, Dembczyński et al. propose a method based on decision rules [10].

Revenue-based advertisement. In this category, people try to take relevance or revenue into consideration rather than CTR while displaying ads. Radlinski et al. propose a two-stage approach to select ads which are both relevant and profitable by rewriting queries [16]. Zhu et al. propose two novel learning-to-rank methods to maximize search engine revenue while preserving high quality of displayed ads [22]. Ciaramita et al. propose three online learning algorithms to maximize the number of clicks based on preference blocks [8]. Streeter et al. formalize the sponsored search problem as an assignment of items to positions which can be efficiently solved in the no-regret model [19]. Carterette and Jones try to predict document relevance from the click data [4].

Threshold-based methods. In this category, people try to utilize thresholds for determining whether to display ads or where to cut off the ranking list. Broder et al. propose a method based on a global threshold to determine whether to show ads for a query, because showing irrelevant ads will annoy the user [3]. Shanahan et al. propose a parameter-free threshold relaxation algorithm to ensure that support vector machines will have excellent precision and relatively high recall [18]. Arampatzis et al. propose a threshold optimization approach for determining where to cut off a ranking list based on score distributions [2].
7 Conclusion
In this paper, we study the interesting problem of how many ads should be displayed for a given query. There are two challenges: ad ranking and ad number prediction. First, we conduct extensive analyses on real click-through data of ads; the two main observations are 1) when the click entropy of a query exceeds a threshold, the CTR of that query will be very near zero; 2) the threshold of click entropy can be automatically determined when the number of removed ads
is given. Second, we propose a learning approach to rank the ads and to predict the number of displayed ads for a given query. Finally, the experimental results on a commercial search engine validate the effectiveness of our approach.

Learning to recommend ads in sponsored search presents a new and interesting research direction. One interesting issue is how to predict the user intention before recommending ads [7]. Another interesting issue is how to exploit click-through data in different domains, where the click distributions may be different, for refining ad ranking [21]. It would also be interesting to study how collective intelligence (social influence between users for sentiment opinions on an ad) can help improve the accuracy of ad number prediction [20].

Acknowledgments. Songcan Chen and Bo Wang are supported by NSFC (60773061) and Key NSFC (61035003). Jie Tang is supported by NSFC (61073073, 60703059, 60973102), Chinese National Key Foundation Research (60933013, 61035004) and the National High-tech R&D Program (2009AA01Z138).
References
1. Agarwal, D., Chen, B.-C., Elango, P.: Spatio-temporal models for estimating click-through rate. In: WWW 2009, pp. 21–30 (2009)
2. Arampatzis, A., Kamps, J., Robertson, S.: Where to stop reading a ranked list?: threshold optimization using truncated score distributions. In: SIGIR 2009, pp. 524–531 (2009)
3. Broder, A., Ciaramita, M., Fontoura, M., Gabrilovich, E., Josifovski, V., Metzler, D., Murdock, V., Plachouras, V.: To swing or not to swing: learning when (not) to advertise. In: CIKM 2008, pp. 1003–1012 (2008)
4. Carterette, B., Jones, R.: Evaluating search engines by modeling the relationship between relevance and clicks. In: NIPS 2007 (2007)
5. Chapelle, O., Zhang, Y.: A dynamic bayesian network click model for web search ranking. In: WWW 2009, pp. 1–10 (2009)
6. Chen, Y., Pavlov, D., Canny, J.F.: Large-scale behavioral targeting. In: KDD 2009, pp. 209–218 (2009)
7. Cheng, Z., Gao, B., Liu, T.-Y.: Actively predicting diverse search intent from user browsing behaviors. In: WWW 2010, pp. 221–230 (2010)
8. Ciaramita, M., Murdock, V., Plachouras, V.: Online learning from click data for sponsored search. In: WWW 2008, pp. 227–236 (2008)
9. Craswell, N., Zoeter, O., Taylor, M., Ramsey, B.: An experimental comparison of click position-bias models. In: WSDM 2008, pp. 87–94 (2008)
10. Dembczyński, K., Kotlowski, W., Weiss, D.: Predicting ads' click-through rate with decision rules. In: TROA 2008, Beijing, China (2008)
11. Dou, Z., Song, R., Wen, J.: A large-scale evaluation and analysis of personalized search strategies. In: WWW 2007, pp. 581–590 (2007)
12. Dupret, G.E., Piwowarski, B.: A user browsing model to predict search engine click data from past observations. In: SIGIR 2008, pp. 331–338. ACM, New York (2008)
13. Guo, F., Liu, C., Kannan, A., Minka, T., Taylor, M., Wang, Y.-M., Faloutsos, C.: Click chain model in web search. In: WWW 2009, pp. 11–20 (2009)
14. Gupta, M.: Predicting click through rate for job listings. In: WWW 2009, pp. 1053–1054 (2009)
518
B. Wang et al.
15. König, A.C., Gamon, M., Wu, Q.: Click-through prediction for news queries. In: SIGIR 2009, pp. 347–354 (2009)
16. Radlinski, F., Broder, A.Z., Ciccolo, P., Gabrilovich, E., Josifovski, V., Riedel, L.: Optimizing relevance and revenue in ad search: a query substitution approach. In: SIGIR 2008, pp. 403–410 (2008)
17. Richardson, M., Dominowska, E., Ragno, R.: Predicting clicks: estimating the click-through rate for new ads. In: WWW 2007, pp. 521–530 (2007)
18. Shanahan, J., Roma, N.: Boosting support vector machines for text classification through parameter-free threshold relaxation. In: CIKM 2003, pp. 247–254 (2003)
19. Streeter, M., Golovin, D., Krause, A.: Online learning of assignments. In: NIPS 2009, pp. 1794–1802 (2009)
20. Tang, J., Sun, J., Wang, C., Yang, Z.: Social influence analysis in large-scale networks. In: KDD 2009, pp. 807–816 (2009)
21. Wang, B., Tang, J., Fan, W., Chen, S., Yang, Z., Liu, Y.: Heterogeneous cross domain ranking in latent space. In: CIKM 2009, pp. 987–996 (2009)
22. Zhu, Y., Wang, G., Yang, J., Wang, D., Yan, J., Hu, J., Chen, Z.: Optimizing search engine revenue in sponsored search. In: SIGIR 2009, pp. 588–595 (2009)
23. Zhu, Z.A., Chen, W., Minka, T., Zhu, C., Chen, Z.: A novel click model and its applications to online advertising. In: WSDM 2010, pp. 321–330 (2010)
TeamSkill: Modeling Team Chemistry in Online Multi-player Games

Colin DeLong, Nishith Pathak, Kendrick Erickson, Eric Perrino, Kyong Shim, and Jaideep Srivastava

Department of Computer Science, University of Minnesota, 200 Union St SE, Minneapolis, MN
{delong,pathak,kjshim,srivasta}@cs.umn.edu, {kendrick,perr0273}@umn.edu
http://www.cs.umn.edu
Abstract. In this paper, we introduce a framework for modeling elements of “team chemistry” in the skill assessment process using the performances of subsets of teams and four approaches which make use of this framework to estimate the collective skill of a team. A new dataset based on the Xbox 360 video game, Halo 3, is used for evaluation. The dataset is comprised of online scrimmage and tournament games played between professional Halo 3 teams competing in the Major League Gaming (MLG) Pro Circuit during the 2008 and 2009 seasons. Using the Elo, Glicko, and TrueSkill rating systems as “base learners” for our approaches, we predict the outcomes of games based on subsets of the overall dataset in order to investigate their performance given differing game histories and playing environments. We find that Glicko and TrueSkill benefit greatly from our approaches (TeamSkill-AllK-EV in particular), significantly boosting prediction accuracy in close games and improving performance overall, while Elo performs better without them. We also find that the ways in which each rating system handles skill variance largely determines whether or not it will benefit from our techniques. Keywords: Player rating systems, competitive gaming, Elo, Glicko, TrueSkill.
1 Introduction
Skill assessment has long been an active area of research. Perhaps the most well-known application is to the game of chess, where the need to gauge the skill of one player versus another led to the development of the Elo rating system [1]. Although mathematically simple, Elo performed well in practice, treating skill assessment for individuals as a paired-comparison estimation problem, and was subsequently adopted by the US Chess Federation (USCF) in 1960 and the World Chess Federation (FIDE) in 1970. Other ranking systems have since been developed, notably Glicko [2], [3], a generalization of Elo which sought to address
Elo’s ratings reliability issue, and TrueSkill [4], the well-known Bayesian model used for player/team ranking on Microsoft’s Xbox Live gaming service. With hundreds of thousands to millions of players competing on networks such as Xbox Live, accurate estimations of skill are crucial because unbalanced games - those giving a distinct advantage to one player or team over their opponent(s) ultimately lead to player frustration, reducing the likelihood they will continue to play. For multiplayer-focused games, this is a particularly relevant issue as their success or failure is tied to player interest sustained over a long period of time. While previous work in this area [4] has been evaluated using data from a general population of players, less attention has been paid to certain boundary conditions, such as the case where the entire player population is highly-skilled individually. As in team sports [5], [6], less tangible notions, such as “team chemistry”, are often cited as key differentiating factors, particularly at the highest levels of play. However, in existing skill assessment approaches, player performances are assumed to be independent from one another, summing individual player ratings in order to arrive at an overall team rating. In this work, we describe four approaches (TeamSkill-K, TeamSkill-AllK, TeamSkill-AllK-EV, and TeamSkill-AllK-LS) which make use of the observed performances of subsets of players on teams as a means of capturing “team chemistry” in the ratings process. These techniques use ensembles of ratings of these subsets to improve prediction accuracy, leveraging Elo, Glicko, and TrueSkill as “base learners” by extending them to handle entire groups of players rather than strictly individuals. To the best of our knowledge, no similar approaches exist in the domain of skill assessment. For evaluation, we introduce a rich dataset compiled over the course of 2009 based on the Xbox 360 game Halo 3, developed by Bungie, LLC in Kirkland, WA. Halo 3 is a first-person shooter (FPS) played competitively in Major League Gaming (MLG), the largest professional video game league in the world, and is the flagship game for the MLG Pro Circuit, a series of tournaments taking place throughout the year in various US cities. Our evaluation shows that, in general, predictive performance can be improved through the incorporation of subgroup ratings into a team’s overall rating, especially in high-level gaming contexts, such as tournaments, where teamwork is likely more prevalent. Additionally, the modeling of variance in each rating system is found to play a large role in determining the what gain (or loss) in performance one can expect from using subgroup rating information. Elo, which uses a fixed variance, is found to perform worse when used in concert with any TeamSkill approach. However, when the Glicko and TrueSkill rating systems are used as base learners (both of which model variance as player-level variables), several TeamSkill variants achieve the highest observed prediction accuracy, particularly TeamSkill-AllKEV. Upon further investigation, we find this performance increase is especially apparent for “close” games, consistent with the competitive gaming environment in which the matches occur. The paper is structured as follows. Section 2 reviews some of the relevant related work in the fields of player and team ratings/ranking systems and
competitive gaming. In Section 3, we introduce our proposed approaches, TeamSkill-K, TeamSkill-AllK, TeamSkill-AllK-EV, and TeamSkill-AllK-LS. In Section 4, we describe the Halo 3 dataset, how it was compiled, its characteristics, and where it can be found should other researchers be interested in studying it. In Section 5, we evaluate the TeamSkill approaches and compare them to “vanilla” versions of Elo, Glicko, and TrueSkill in game outcome prediction accuracy. Finally, in Section 6 we provide a number of conclusions and discuss our future work.
2 Related Work
In games, the question of how to rank (or provide ratings of) players is old, tracing its roots to the work of Louis Leon Thurstone in the mid-1920’s and BradleyTerry-Luce models in the 1950’s. In 1927 [7], Thurstone proposed the “law of comparitive judgement”, a means of measuring the mean distance between two physical stimuli, Sa and Sb . Thurstone, working with stimuli such as the sensedistance between levels of loudness, asserted that the distribution underlying each stimulus process is normal and that as such, the mean difference between the stimuli Sa and Sb can therefore be quantified in terms of their standard deviation. This work laid the foundation for the formulation of Bradley-Terry-Luce (BTL) models in 1952 [8], a logistic variant of Thurstone’s model which provided a rigorous mathematical examination of the paired comparison estimation problem, using taste preference measurements as its experimental example. The BTL model framework provided the basis for the Elo rating system, introduced by Arpad Elo in 1959 [1]. Elo, himself a master chess player, developed the Elo rating system to replace the US Chess Federation’s Harkness rating system with one more grounded in statistical theory. Like Thurstone, the Elo rating system assumes each player’s skill is normally distributed, where player i’s expected performance is pi ∼ N (μi , β 2 ). Notably, though, Elo also assumes players’ skill distributions share a constant variance β 2 , greatly simplifying the mathematical calculation at the expense of capturing the relative certainty of each player’s skill. In 1993 [3], Mark Glickman sought to improve upon the Elo rating system by addressing the ratings reliability issue in the Glicko rating system. By introducing a dynamic variance for each player, the confidence in a player’s skill rating could be adjusted to produce more conservative skill estimates. However, the inclusion of this information at the player level also incurred significant computational cost in terms of updates, and so an approximate Bayesian updating scheme was devised which estimates the marginal posterior distribution P r(θ|s), where θ and s correspond to the player strengths and the set of game outcomes observed thus far, respectively. With the advent of large-scale console-based multiplayer gaming on the Microsoft Xbox in 2002 via Xbox Live, there was a growing need for a more generalized ratings system not solely designed for individual players, but teams - and any number of them - as well. TrueSkill [4], published in 2006 by Ralf Herbrich and Thore Graepel of Microsoft Research, used a factor graph-based approach
to accomplish this. Like Glicko, TrueSkill also maintains a notion of variance for each player, but unlike it, TrueSkill samples an expected performance pi given a player’s expected skill, which is then summed for all players on i’s team to represent the collective skill of that team. This expected performance pi is also assumed to be distributed normally, but similar to Elo, a constant variance is assumed across all players. Of note, TrueSkill’s summation of expected player performances in quantifying a team’s expected performance assumes player performances are independent of one another. In the case of team games, especially those occurring at high levels of competition where team chemistry and cooperative strategies play much larger roles, this assumption may prove problematic in ascertaining which team has the true advantage a priori. We explore this topic in more depth later on. Other variants of the aforementioned approaches have also been proposed. Coulom’s Whole History Rating (WHR) method [9] is, like other rating systems such as Elo, based on the dynamic BTL model. Instead of incrementally updating the skill distributions of each player after a match, it approximates the maximum a posteri over all previous games and opponents, resulting in a more accurate skill estimation. This comes at the cost of some computational ease and efficiency, which the authors argue is still minimal if deployed on large-scale game servers. Others [10] have extended the BTL model to use group comparisons instead of paired comparisons, but also assume player performance independence by defining a team’s skill as the sum of its players’. Birlutiu and Heskes [11] develop and evaluate variants of expectation propagation techniques for analysis of paired comparison data by rating tennis players, stating that the methods are generalizable to more complex models such as TrueSkill. Menke, et al. [12] develop a BTL-based model based on the logistic distribution, asserting that weaker teams are more likely to win than what a normally-distributed framework would predict. They also conclude that models based on normal distributions, such as TrueSkill, lead to an exponential increase in team ratings when one team has more players than another. The field of game theory includes a number of related concepts, such as the Shapley value [13], which considers the problem of how to fairly allocate gains among a coalition of players in a game. In the traditional formulation of skill assessment approaches, however, gains or losses are implicitly assumed to be equal for all players given the limitation to win/loss/team formation history during model construction and evaluation. That is, no additional information is available to measure the contribution of each player to a team’s win or loss.
3 Proposed Approaches
As discussed, the characteristic common to existing skill assessment approaches is that the estimated skill of a team is quantified by summing the individual skill ratings of each player on the team. Though understandable from the perspective of minimizing computational costs and/or model complexity, the assumption is not well-aligned with either intuition or research in sports psychology [5], [6]. Only in cases where the configuration of players remains constant throughout a
team’s game history can the summation of individual skill ratings be expected to closely approximate a team’s true skill. Where that assumption cannot be made, as is the case in the dataset under study in this paper, it is difficult to know how much of a player’s skill rating can be attributed to the individual and how much is an artifact of the players he/she has teamed with in the past. Closely related to this issue is the notion of team chemistry. “Team chemistry” or “synergy” is a well-known concept [5], [6] believed to be a critical component of highly-successful teams. It can be thought of as the overall dynamics of a team resulting from a number of difficult-to-quantify qualities, such as leadership, confidence, the strength of player/player relationships, and mutual trust. These qualities are also crucial to successful Halo teams; Halo is sometimes described by its players as “real-time chess”, where teamwork is believed to be the key factor separating good teams from great ones. The integration of any aspect of team chemistry into the modeling process doesn’t suggest an obvious solution, though. However, a key insight is that one need not maintain skill ratings only for individual players - they can be maintained for groups of players as well. The skill ratings of these groups can then be combined to estimate the overall skill of a team. Here, we describe four methods which make use of this approach - TeamSkill-K, TeamSkill-AllK, TeamSkill-AllK-EV, and TeamSkill-AllK-LS.
3.1 TeamSkill-K
At a high level, this approach is simple: for a team of K players, choose a subgroup size k ≤ K, calculate the average skill rating for all k-sized player groups for that team using some “base learner” (such as Elo, Glicko, or TrueSkill), and finally scale this average skill rating up by K/k to arrive at the team’s skill rating. For k = 1, this approach is equivalent to simply summing the individual player skill ratings together. As such, TeamSkill-K can be thought of as a generalized approach for combining skill ratings for any K-sized team given player subgroup histories of size k. Formally, let s∗i be the estimated skill of team i and fi(k) be a function returning the set of skill ratings for player subgroups of size k in team i. Let each member of the set of skill ratings returned by fi(k) be denoted as sikl, corresponding to the l-th configuration of size k for team i. Here, sikl is assumed to be a random variable drawn from some underlying distribution. Then, given some k, the collective strength of a team of size K can be estimated as follows:

s^*_i = \frac{K}{k}\,E[f_i(k)] = \frac{(k-1)!\,(K-k)!}{(K-1)!}\sum_{l=1}^{K!/(k!(K-k)!)} s_{ikl}    (3.1)
Though simple to implement and useful as a generalized approach for estimating a team’s skill given ratings for player subgroups of size k, this choice of k
introduces a potentially problematic trade-off between two desirable skill estimation properties - game history availability and player subgroup specificity. As k becomes larger, less history is available, and as k becomes smaller, subgroups capture lower-level interaction information.
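To make the aggregation concrete, the following is a minimal sketch of the TeamSkill-K combination in Equation 3.1. It is illustrative only: the subgroup_rating lookup, the default value, and the toy rating table are hypothetical stand-ins for whatever Elo, Glicko, or TrueSkill implementation serves as the base learner.

```python
from itertools import combinations
from statistics import mean

def teamskill_k(team, k, subgroup_rating, default=25.0):
    """Estimate a team's skill from ratings of its k-player subgroups (Eq. 3.1).

    team            -- iterable of player identifiers (size K)
    k               -- subgroup size, 1 <= k <= K
    subgroup_rating -- callable mapping (frozenset of players, default) to the
                       rating a base learner maintains for that group,
                       returning `default` when it has no history
    """
    team = list(team)
    K = len(team)
    ratings = [subgroup_rating(frozenset(g), default)
               for g in combinations(team, k)]
    # Average over all k-sized subgroups, then scale the average up by K / k.
    return (K / k) * mean(ratings)

# Toy example with a hypothetical rating table.
table = {frozenset({"a", "b"}): 30.0, frozenset({"c", "d"}): 22.0}
lookup = lambda group, default: table.get(group, default)
print(teamskill_k(["a", "b", "c", "d"], k=2, subgroup_rating=lookup))
```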
3.2 TeamSkill-AllK
To address this issue, a second approach was developed. Here, all available player subgroup information, 1 ≤ k ≤ K, is used to estimate the skill rating of a team. The general idea is to model a team’s skill rating as a recursive summation over all player subgroup histories, building in the (k − 1)-level interactions present in a player subgroup of size k in order to arrive at the final rating estimate. This approach can be expressed as follows. Let s∗ikl be the estimated skill rating of the l-th configuration of size k for team i and gi(k) be a function returning the set of estimated skill ratings s∗ikl, where 1 ≤ l ≤ K!/(k!(K−k)!), for player sets of size k in team i. When k = 0, gi(k) = {∅} and s∗ikl = 0. As before, let sikl be the skill rating of the l-th configuration of size k for team i. Additionally, let αk be a user-specified parameter in the range [0, 1] signifying the weight of the k-th level of estimated skill ratings. Then,

s^*_{ikl} = \alpha_k s_{ikl} + (1-\alpha_k)\,\frac{k}{k-1}\,E[g_i(k-1)]
          = \alpha_k s_{ikl} + (1-\alpha_k)\,\frac{k}{k-1}\,\frac{\sum_{s^*_{i(k-1)l} \in g_i(k-1)} s^*_{i(k-1)l}}{|g_i(k-1)|}    (3.2)
To compute s∗i, let s∗i = s∗ikl where k = K and l = 1 (since there is only one player subset rating when k = K). This recursive approach ensures that all player subset history is used.
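A rough sketch of the recursion in Equation 3.2 follows. The base case at k = 1 (returning the individual rating directly), the lack of memoisation, and the rating lookup with scaled defaults are simplifying assumptions for illustration, not the authors' implementation.

```python
from itertools import combinations

def teamskill_allk(team, rating, alpha, default=25.0):
    """Recursive TeamSkill-AllK estimate in the spirit of Equation 3.2.

    rating(group, fallback) -- base-learner rating for a frozenset of players,
                               returning `fallback` when the group is unseen
    alpha                   -- dict mapping level k to the blending weight alpha_k
    """
    def s_star(group):
        k = len(group)
        s_k = rating(frozenset(group), default * k)   # observed rating at level k
        if k == 1:
            return s_k                                # simplifying base case
        # Mean of the recursively estimated (k-1)-player subgroup ratings.
        subs = [s_star(sub) for sub in combinations(group, k - 1)]
        prior = (k / (k - 1)) * (sum(subs) / len(subs))
        return alpha[k] * s_k + (1 - alpha[k]) * prior

    return s_star(tuple(team))
```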
3.3 TeamSkill-AllK-EV
In TeamSkill-AllK, if no history is available for a particular subgroup, default values (scaled to size k) are used instead in order to continue the recursion. Problematically, cases where limited player subset history is available will produce team skill ratings largely dominated by default rating values, potentially resulting in inaccurate skill estimates. As such, another approach was developed, called TeamSkill-AllK-EV. The core idea behind TeamSkill-AllK - the usage of all available player subgroup histories - was retained, but the new implementation eschewed default values for all player subsets save those of individual players (consistent with existing skill assessment approaches), instead focusing on the evidence drawn solely from game history. Re-using notation, TeamSkill-AllK-EV is as follows:
s^*_i = \sum_{k=1}^{K} \frac{1}{\sum_{k=1}^{K} |h_i(k) \neq \emptyset|}\,\frac{K}{k}\,E[h_i(k)]
      = \frac{\sum_{k=1}^{K} \frac{K}{k}\,E[h_i(k)]}{\sum_{k=1}^{K} |h_i(k) \neq \emptyset|}    (3.3)
Here, hi(k) = fi(k) where there exists at least one player subset history of size k, else ∅ is returned.
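A sketch of Equation 3.3, under the reading that the normaliser counts the subgroup sizes k with non-empty history; history_rating is a hypothetical lookup returning None for unseen groups.

```python
from itertools import combinations
from statistics import mean

def teamskill_allk_ev(team, history_rating):
    """TeamSkill-AllK-EV (Equation 3.3): average, over the subgroup sizes k
    that have any recorded history, of the scaled expected subgroup rating.

    history_rating -- callable mapping a frozenset of players to a rating,
                      or None when that group has no game history
    """
    team = list(team)
    K = len(team)
    level_estimates = []
    for k in range(1, K + 1):
        observed = [history_rating(frozenset(g)) for g in combinations(team, k)]
        observed = [r for r in observed if r is not None]
        if observed:                      # level k contributes only if non-empty
            level_estimates.append((K / k) * mean(observed))
    return mean(level_estimates) if level_estimates else 0.0
```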
3.4 TeamSkill-AllK-LS
In this context, it is natural to hypothesize that the most accurate team skill ratings could be computed using the largest possible player subsets covering all members of a team. That is, given some player subset X and its associated rating, ratings for subsets of X should be disregarded since they represent lower-level interaction information X would have already captured in its rating. Formally, such an approach can be represented as follows:

s^*_i = \frac{K}{\sum_{m=K}^{1} m\,\bigl|\{h_i(m) \subseteq h_i(m < j \le K)\} = \emptyset\bigr|}\;\sum_{k=K}^{1} E\bigl[h_i(k) \subseteq h_i(k < j \le K)\bigr]    (3.4)
One obvious advantage to this approach is its speed, since this method prunes away from consideration ratings of subsets of previously-used supersets.
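The pruning idea can be sketched as follows. This follows the prose description (keep only the largest subgroups with history that are not subsumed by an already-used superset) rather than a literal transcription of Equation 3.4, and history_rating is again a hypothetical lookup.

```python
from itertools import combinations

def teamskill_allk_ls(team, history_rating):
    """Largest-subsets sketch: walk levels from K down to 1, keep subgroups
    with history that are not contained in an already-selected larger
    subgroup, and average their scaled ratings."""
    team = list(team)
    K = len(team)
    selected = []                                    # (group, scaled rating) kept so far
    for k in range(K, 0, -1):
        for g in combinations(team, k):
            group = frozenset(g)
            r = history_rating(group)
            if r is None:
                continue                             # no history for this subgroup
            if any(group < kept for kept, _ in selected):
                continue                             # subsumed by a larger used superset
            selected.append((group, (K / k) * r))
    if not selected:
        return 0.0
    return sum(r for _, r in selected) / len(selected)
```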
4 Dataset
The data under study in this paper was collected throughout 2009 as part of a larger project to produce a high-quality, competitive gaming dataset. Halo 3, released in September 2007 on the Xbox 360 game console, is played professionally as the flagship game in Major League Gaming (as were its predecessors Halo:Combat Evolved and Halo 2). Major League Gaming (MLG) is the largest video gaming league in the world and has grown rapidly since its inception in 2002, with Internet viewership for 2009 events topping 750,000. After its release, Halo 3 replaced Halo 2 beginning with the 2008 season (known as the Pro Circuit). The dataset contains Halo 3 multiplayer games between two teams of four players each. Each game was played in one of two environments - over the Internet on Microsoft’s Xbox Live service in custom games (known as scrimmages) or on a local area network at an MLG tournament. Information on each game includes the players and teams involved, the date of the game, the map and game type, the result (win/loss) and score, and per-player statistics such as kills, deaths, assists (where one player helps another player from the same team kill an opponent), and score. The dataset has several interesting characteristics, such as the high frequency of team changes from one tournament to the next. With four players per team, it is not uncommon for a team with a poor showing in one tournament to replace
one or two players before the next. As such, the resulting dataset lends itself to analyses of skill at the group level since the diversity of player assignments can aid in isolating interesting characteristics of teams who do well versus those who do not. Additionally, since the players making up the top professional and semi-professional teams are all highly skilled individually, “basic” game familiarity (such as control mechanics) is not considered as important a factor in winning/losing as overall team strategy, execution, and adaptation to the opposition. This focus also helps mitigate issues pertaining to the personal motivations of players since all must be dedicated to winning in order to have earned a spot in the top 32 teams in the league, winnowing out those who might intentionally lose games for their teams (as is commonplace in standard Halo 3 multiplayer gaming). Taken together, these elements make for a very high quality research dataset for those interested in studying competitive gaming, skill ratings systems, and teamwork. The dataset has been made available on the HaloFit web site in two formats. The first, http://stats.halofit.org, contains several views into the dataset similar to statistics pages of professional sports leagues such as Major League Baseball. Users can drill down into the dataset using a series of filters to find data relevant to favorite teams or players. The second, http://halofit.org, contains partial and full comma-separated exports of the dataset. The dataset currently houses information on over 9,100 games, 566 players, and 186 teams.
5 Experimental Analysis
The four proposed TeamSkill approaches were evaluated by predicting the outcomes of games occurring prior to 10 Pro Circuit tournaments and comparing their accuracy to unaltered versions (k = 1) of their base learner rating systems - Elo, Glicko, and TrueSkill. For TeamSkill-K, all possible choices of k for teams of 4, 1 ≤ k ≤ 4, were used. Given two teams, t1 and t2, the prior probability of t1 winning is a straightforward derivation from the negative CDF at 0 of the distribution describing the difference between two independent, normally-distributed random variables:

P(t_1 > t_2) = 1 - F\bigl(0;\, \mu_1 - \mu_2,\, \sigma_1^2 + \sigma_2^2\bigr)
             = 1 - \tfrac{1}{2}\Bigl(1 + \operatorname{erf}\Bigl(\tfrac{0 - (\mu_1 - \mu_2)}{\sqrt{2(\sigma_1^2 + \sigma_2^2)}}\Bigr)\Bigr)
             = \tfrac{1}{2}\Bigl(1 - \operatorname{erf}\Bigl(\tfrac{\mu_2 - \mu_1}{\sqrt{2(\sigma_1^2 + \sigma_2^2)}}\Bigr)\Bigr)    (5.1)
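For reference, Equation 5.1 amounts to the following short routine (a sketch; the example means and variances are made up).

```python
from math import erf, sqrt

def win_probability(mu1, sigma1_sq, mu2, sigma2_sq):
    """P(team 1 beats team 2) under Equation 5.1, treating each team's skill
    as an independent normal with the given mean and variance."""
    return 0.5 * (1.0 - erf((mu2 - mu1) / sqrt(2.0 * (sigma1_sq + sigma2_sq))))

# Example: team 1 slightly stronger, both uncertain (hypothetical numbers).
print(round(win_probability(26.0, 9.0, 24.0, 9.0), 3))
```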
For each tournament, we evaluated each rating approach using:
– 3 types of training data sets - games consisting only of previous tournament data, games from online scrimmages only, and games of both types.
– 3 periods of game history - all data except for the data between the test tournament and the one preceding it (“long”), all data between the test tournament and the one preceding it (“recent”), and all data before the test tournament (“complete”).
– 2 types of games - the full dataset and those games considered “close” (i.e., prior probability of one team winning close to 50%).
In the case where only tournament data is used as training set data, the most recent tournament preceding the test tournament replaced the inter-tournament scrimmage data for the “long” and “recent” game history configurations. Similarly, “recent” game history when considering both tournament and scrimmage data included the most recent tournament. “Close” games were defined using a slightly modified version of the “challenge” method [4] in which the top 20% closest games were selected for one rating system and presented to the other (and vice versa). In this evaluation, the closest games from the “vanilla” versions of each rating system (i.e., k = 1) were presented to each of the TeamSkill approaches, while the closest games from TeamSkill-AllK-EV were presented to the “vanilla” versions. The reason these two were chosen is that all the TeamSkill approaches are intended to improve upon their respective “vanilla” versions and that repeated testing had shown TeamSkill-AllK-EV to be the best performing approach on full datasets in many cases. The default values used during the evaluation of Elo (α = 0.07, β = 193.4364, μ0 = 1500, σ0^2 = β^2), Glicko (q = log(10)/400, μ0 = 1500, σ0^2 = 100^2), and TrueSkill (ε = 0.5, μ0 = 25, σ0^2 = (μ0/3)^2, β = σ0^2/2) correspond to the defaults outlined in [4] and [3]. Additionally, for Glicko, a rating period of one game was assigned due to the continuity of game history over the course of 2008 and 2009, as well as to approximate an “apples to apples” comparison with respect to Elo and TrueSkill. In the interest of space, a subset of the 3,780 total evaluations are presented, corresponding to the “complete” cases. The “long” results essentially mirrored the “complete” results, while the “recent” results were virtually identical across all TeamSkill variations for all non-close games and produced no clear patterns for close games (with differences only emerging after one or two tournaments, as can be seen in the “complete” results).
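A minimal sketch of the modified “challenge” selection described above; the game-record layout and the win_prob callable are assumptions for illustration.

```python
def closest_games(games, win_prob, fraction=0.2):
    """Select the `fraction` of games whose predicted outcome is closest to a
    coin flip under one rating system, for evaluation by another system.

    games    -- list of (team1, team2, winner) records
    win_prob -- callable giving P(team1 wins) under the selecting system
    """
    ranked = sorted(games, key=lambda g: abs(win_prob(g[0], g[1]) - 0.5))
    n = max(1, int(len(ranked) * fraction))
    return ranked[:n]
```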
5.1 Findings and Analysis
The results in figures 1, 2, and 3 show that in general, Glicko and TrueSkill benefit from the incorporation of team chemistry components and tend to improve the prediction accuracy overall in comparison to the “vanilla” versions (k = 1). The TeamSkill-AllK and TeamSkill-AllK-EV approaches - TeamSkill-AllK-EV in particular - outperform k = 1 in nearly all cases. TeamSkill-AllK-LS, on the other hand, shows no similar performance gain, nor do any of the TeamSkill versions in the range 1 < k ≤ 4. These results suggest that group-level ratings alone are insufficient for accurately assessing the strength of a team - player-level ratings must be incorporated as well. No similarly positive effect is observed for Elo, although TeamSkill-AllK-EV’s accuracy does approach that of k = 1. In fact, the accuracy for all non-k = 1 approaches is, at best, equal to k = 1. Interestingly, Elo still performs well for k = 1, in some cases outperforming Glicko and TrueSkill. Considering Elo
was developed in the mid-1950’s, that it still competes with state-of-the-art approaches is an impressive result unto itself. As to the source of Glicko and TrueSkill’s improved overall performance, further inspection (figures 4, 5, and 6) reveals significant performance increases (with respect to k = 1) in close games. At times, the margin of difference is as much as 8%. It can also be seen that over time, this margin tends to widen. Taken together, these results indicate that the group-level ratings have the effect of better distinguishing which team has the true advantage in close match-ups, a key finding well-aligned with prior research [5], [6].

Fig. 1. Prediction accuracy for both tournament and scrimmage/custom games using complete history (plots omitted; one panel each for the Elo, Glicko, and TrueSkill base learners, with curves for k = 1..4, AllK, AllK-EV, and AllK-LS)

Fig. 2. Prediction accuracy for tournament games using complete history (plots omitted; same panels and curves as Fig. 1)
5.2 Discussion
As mentioned, Elo doesn’t benefit from the inclusion of group-level ratings information. The reason stems from Elo’s use of a constant variance and as such, Elo is not sensitive to the dynamics of a player’s skill over time. For groups of players, this issue is compounded since the higher the k under consideration, the less prior game history can be drawn on to infer their collective skill. With the TeamSkill approaches, the net effect is that incorporating (k > 1)-level group ratings ‘dilute’ the overall team rating, resulting in a higher number of closer games since there is no provision for Elo’s constant variance to differ depending on the size of the group under consideration. Similarly, variance also accounts for much of the differences between Glicko and TrueSkill’s performances. Both make use of player-level variances (and, thus, group-level variances using the TeamSkill approaches). However, TrueSkill also
maintains a constant “performance” variance, β^2, across all players, which is applied just prior to computing the predicted ordering of teams during updates. β^2 is a user-provided parameter which, when increased, similarly increases the probability of TrueSkill believing teams will draw, discounting the potentially small differences between them in collective skill. As such, this “performance” variance has a similar ‘dilution’ effect as in Elo, but it is less pronounced because TrueSkill also maintains player/group-level variances. These results highlight the critical role played by skill variance in estimating the skill of a group of players. The ways in which it is maintained can result in certain biases which arise when models’ prior beliefs are different relative to the observations. Methods for tackling such an issue could consist of maintaining a prior distribution over the skill variance itself or using a mixture model for the skill variance. Such extensions to Glicko or TrueSkill could result in techniques that can better assimilate new observations with prior beliefs in order to generate superior predictions. Additionally, given the ensemble methodology employed by the TeamSkill approaches, a logical next step is to consider boosted (or online) versions of the TeamSkill framework to see if any further gains can be made. The additional computation cost of boosting in this context, though, could render it unfeasible in a real-world deployment, but it would be of academic interest with respect to studying how accurate skill assessment can be using only win/loss/team formation information. Given these real-world constraints, a fully online ensemble framework for TeamSkill would be ideal and, as such, our future work is concerned with developing this idea more fully.

Fig. 3. Prediction accuracy for scrimmage/custom games using complete history (plots omitted; one panel each for the Elo, Glicko, and TrueSkill base learners)

Fig. 4. Prediction accuracy for both tournament and scrimmage/custom games using complete history, close games only (plots omitted)
Fig. 5. Prediction accuracy for tournament games using complete history, close games only (plots omitted; one panel each for the Elo, Glicko, and TrueSkill base learners)

Fig. 6. Prediction accuracy for scrimmage/custom games using complete history, close games only (plots omitted)
6 Conclusions
Our experiments demonstrate that in many cases, the proposed TeamSkill approaches can outperform the “vanilla” versions of their respective base learner, particularly in close games. We find that the ways in which skill variance is addressed by each base learner has a large effect on the prediction accuracy of the TeamSkill approaches, the results suggesting that those employing a dynamic variance (i.e., Glicko, TrueSkill) can benefit from group-level ratings. In our future work, we intend to investigate ways of better representing skill uncertainty, possibly by modeling the uncertainty itself as a distribution, and constructing an online version of TeamSkill in order to improve skill estimates.
Acknowledgments

We would like to thank the Data Analysis and Management Research group, as well as the reviewers, for their feedback and suggestions. We would also like to thank Major League Gaming for making their 2008-2009 tournament data available.
References

1. Elo, A.: The Rating of Chess Players, Past and Present. Arco Publishing, New York (1978)
2. Glickman, M.: Paired Comparison Model with Time-Varying Parameters. PhD thesis, Harvard University, Cambridge (1993)
3. Glickman, M.: Parameter estimation in large dynamic paired comparison experiments. Applied Statistics 48, 377–394 (1999)
4. Herbrich, R., Graepel, T.: TrueSkill: A Bayesian skill rating system. Microsoft Research, Tech. Rep. MSR-TR-2006-80 (2006)
5. Yukelson, D.: Principles of effective team building interventions in sport: A direct services approach at Penn State University. Journal of Applied Sport Psychology 9(1), 73–96 (1997)
6. Martens, R.: Coaches guide to sport psychology. Human Kinetics, Champaign (1987)
7. Thurstone, L.: Psychophysical analysis. American Journal of Psychology 38, 368–389 (1927)
8. Bradley, R.A., Terry, M.: Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika 39(3/4), 324–345 (1952)
9. Coulom, R.: Whole-history rating: A Bayesian rating system for players of time-varying strength. Computers and Games, Beijing (2008)
10. Huang, T., Lin, C., Weng, R.: Ranking individuals by group comparisons. Journal of Machine Learning Research 9, 2187–2216 (2008)
11. Birlutiu, A., Heskes, T.: Expectation propagation for rating players in sports competitions. In: Kok, J.N., Koronacki, J., Lopez de Mantaras, R., Matwin, S., Mladenič, D., Skowron, A. (eds.) PKDD 2007. LNCS (LNAI), vol. 4702, pp. 374–381. Springer, Heidelberg (2007)
12. Menke, J.E., Reese, C.S., Martinez, T.R.: Hierarchical models for estimating individual ratings from group competitions. American Statistical Association (2007) (in preparation)
13. Shapley, L.S.: A value for n-person games. Contributions to the Theory of Games (Annals of Mathematics Studies) 2(28), 307–317 (1953)
Learning the Funding Momentum of Research Projects Dan He and D.S. Parker Computer Science Dept., Univ. of California, Los Angeles, CA, 90095-1596, USA {danhe,stott}@cs.ucla.edu
Abstract. In developing grant proposals for funding agencies like NIH or NSF, it is often important to determine whether a research topic is gaining momentum — where by ‘momentum’ we mean the rate of change of a certain measure such as popularity, impact or significance — to evaluate whether the topic is more likely to receive grants. Analysis of data about past grant awards reveals interesting patterns about successful grant topics, suggesting it is sometimes possible to measure the degree to which a given research topic has ‘increasing momentum’. In this paper, we develop a framework for quantitative modeling of the funding momentum of a project, based on the momentum of the individual topics in the project. This momentum follows certain patterns that rise and fall in a predictable fashion. To our knowledge, this is the first attempt to quantify the momentum of research topics or projects. Keywords: Grants, Momentum, Bursts, Technical stock analysis, Classification.
1 Introduction
Research grants are critical to the development of science and the economy. Every year billions of dollars are invested in diverse scientific research topics, yet there is far from sufficient funding to support all researchers and their projects. Funding of research projects is highly selective. For example, in the past roughly only 20% of submitted projects have been funded by NIH. As a result, to maximize their chances of being funded, researchers often feel pressured to submit grant projects on topics that have ‘increasing momentum’, where ‘momentum’ is defined as the rate of change of a certain measure such as popularity, impact or significance. It would be helpful if one could model this pressure quantitatively.

1.1 Basic Models of Topic Popularity
In [6], indicators such as popularity, impact, and significance can serve as important measures of a research topic. Popularity can be measured by the number of publications relevant to the topic, reflecting the volume of attention devoted
to it. Impact is usually defined in terms of the number of citations to publications involving the topic, which intuitively measures influence. Significance can be defined as the number of citations per article or the number of highly-cited articles involving a topic, giving a measure of overall visibility. In this work, we focus on using a momentum indicator to study successful funding awards over the past few decades from NIH on specific biomedical research topics. We want to identify how the momentum of research topics changes over time, using it as an indicator of success in gaining awards. Can we model the momentum of a research topic, or of funding success? We believe the answer is yes: changes in topic momentum in awards have followed measurable trends. Topic modeling is not new, and there has been considerable work in modeling the popularity of a topic in a way that reflects increasing momentum. Popularity of a topic is often measured using frequency of occurrence, and trends in frequency are used as measures of trends in popularity. However, this leads to relatively naive analysis, such as simple changes in trends involving increase, decrease, etc. Another weakness of this analysis is that it focuses on the direction of changes, and not on more specific features of change. For example, it cannot answer questions like: How ‘rapidly’ is the popularity of a topic increasing? We can answer questions like this by adapting more powerful trend models, such as models of ‘bursts’ (intervals of elevated occurrence [10]). The popularity in burst periods usually increases ‘rapidly’ and therefore the occurrence of a burst indicates a significant change. More detailed discussion is given later.

(This research was supported by NIH grants RL1LM009833 (Hypothesis Web) and UL1DE019580, in the UCLA Consortium for Neuropsychiatric Phenomics.)
1.2 Modeling Funding Momentum
We are interested in applying trend models to study trends in research funding. For example the NIH collects extensive information on its funding awards in RePORTER [2], a database tracking biomedical topics for successful (funded) proposals since 1986; each award includes a set of topic keywords from a predefined set (‘terms’). We can use this historical data to model trends in research interests over the past 25 years. Thus, in successful awards, we can study not only trends for single topics, but also trends for projects (sets of topics). In this paper, we first propose a definition of momentum for research topics. According to our definition, the momentum of a topic is a measure of its burst strength over time, computing derivatives of moving averages as a model of its momentum. To calculate momentum, we adapt technical analysis methods used for analyzing stocks. Our experimental results show that our method is able to model momentum for a research topic, by training classifiers on historical data. To our knowledge, this is the first attempt to develop quantitative models for occurrence of research topics in successful grant awards. Other indicators such as impact and significance can be easily integrated into this model to evaluate different aspects of momentum. With a model for the momentum of individual research topics, we study funding momentum of research projects. As just mentioned, each project involves a set of research topics. We can therefore model the funding momentum of a research project in terms of the funding momentum of each research topic it
contains. This gives what we believe is the first quantitative definition of funding momentum of a research project. The remainder of the paper is structured as follows. In Section 2 we summarize previous work on analyzing grant databases, and explore the relevance of stock market momentum models to development of ‘research topic portfolios’ — such as identifying burst periods and momentum trends. In Section 3, we define a model of funding momentum for research topics and projects. We propose a framework to predict funding momentum in Section 4. We report experimental results and comparisons in Section 5, and conclude the paper in Section 6.
2 Related Work
There has been a great deal of work in the analysis of grant portfolios, such as in the MySciPI project [1] and the NIH RePORTER system [2]. However, these works have focused on development of databases to store information about awards. For example, MySciPI [1] provides a database of projects that can be queried by keywords or by research topics. When given a research topic, extensive information about all projects related to the topic can be extracted. However, beyond the extraction of project information, only basic statistics (such as the number of hits of the topic in the database, etc.) are shown. The ability to mine information in these databases has been lacking. We try to analyze this information based on indicators such as popularity, impact and significance. We can easily calculate these indicators over a time window for certain research topics. We are interested in the problem of identification and prediction of periods in which the indicators have a strong upward momentum movement, or ‘burst’. Much work has been done to identify bursts of a topic over certain time series. Kleinberg [14] and Leskovec et al. [11] use an automaton to model bursts, in which each state represents the arrival rate of occurrences of a topic. Shasha and co-workers [19] defined bursts based on hierarchies of fixed-length time windows. He and Parker [10] model bursts with a ‘topic dynamics’ model, where bursts are intervals of positive change in momentum. Then they apply trend analysis indicators such as EMA and MACD histogram, which are well-developed momentum computation methods used in evaluating stocks, to identify burst periods. They show that their topic dynamics model is successful in identifying bursts for individual topics, while Kleinberg’s model is more appropriate to identify bursts from a set of topics. He and Parker also point out that the topic dynamics model may permit adaptation of the multitude of technical analysis methods used in analyzing market trends. Of course, an enormous amount of work has gone into prediction of stock prices and of market trends (such as upward or downward movement, turning points, etc.), and a multitude of models has been proposed. For example, AlQaheri et al. [5] recently applied rough sets to develop rules for predicting stock prices. Classical methods involve neural networks (Lawrence [15], Gryc [8], Nikooa et al. [17], Sadd [18]) for forecasting stock prices. Hassan and Nath [9] applied Hidden Markov Models to forecast prices. Genetic Algorithms have
also been applied [13], [16]. Recently Agrawal et al. [4] developed an adaptive Neuro-Fuzzy Inference System (ANFIS) to analyze stock momentum. Bao and Yang [7] build an intelligent stock trading system using confirmation of turning points and probabilistic reasoning. This paper shows how this wealth of mining experience might be adapted in analyzing grant funding histories.

Table 1. Key symbols used in the paper

  Symbol            Description
  FM(T, m)          Funding Momentum for the topic or project T in a period of m months
  BS(T, m)          Burst Strength for the topic or project T in a period of m months
  FI(T, m)          Frequency Increase indicator for topic or project T in a period of m months
  F(T)_i            Frequency of the topic or project T at month i
  CF(T)             Current frequency, or Start Frequency of the topic or project T
  histogram(T)_i    MACD Histogram value of the topic or project T at month i
  cor(A, B)         correlation between two topics A, B
  co(A, B)          co-occurrences between two topics A, B
  freq(A)           frequency of the topic A
3 Funding Momentum

3.1 Technical Analysis Indicators of Momentum
Key symbols in the paper are summarized in Table 1. To define the momentum of research topics and projects, we adapt the stock market trend analysis indicators. Here we first include very well-established background about technical analysis indicators of momentum.

– EMA: the exponential moving average of the momentum, or the ‘first derivative’ of the momentum over a time period:

  EMA(n)[x]_t = \alpha x_t + (1-\alpha)\,EMA(n-1)[x]_{t-1} = \sum_{k=0}^{n} \alpha(1-\alpha)^k x_{t-k}

  where x has a corresponding discrete time series {x_t | t = 0, 1, ...} and \alpha is a smoothing factor. We also write EMA(n)[x]_t = EMA(n) for an n-time-unit period EMA.

– MACD: the difference between EMAs over two different time periods:

  MACD(n_1, n_2) = EMA(n_1) - EMA(n_2)

– MACD Histogram: the difference between MACD and its moving average, or the ‘second derivative’ of the momentum:

  signal(n_1, n_2, n_3) = EMA(n_3)[MACD(n_1, n_2)]
  histogram(n_1, n_2, n_3) = MACD(n_1, n_2) - signal(n_1, n_2, n_3)

  where EMA(n_3)[MACD(n_1, n_2)] denotes the n_3-time-unit moving average of the sequence MACD(n_1, n_2).

– RSI: the relative strength of the upwards movement versus the downwards movement of the momentum over a time period:

  RS(n) = \frac{EMA_U(n)}{EMA_D(n)}, \qquad RSI(n) = 100 - \frac{100}{1 + RS(n)}

  U_t = \begin{cases} x_t - x_{t-1} & x_t > x_{t-1} \\ 0 & \text{otherwise} \end{cases}
  \qquad
  D_t = \begin{cases} x_{t-1} - x_t & x_t < x_{t-1} \\ 0 & \text{otherwise} \end{cases}

  where EMA_U(n) and EMA_D(n) are EMA for time series U and D, respectively.
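The indicators above can be computed directly from a monthly frequency series. The sketch below uses the conventional smoothing factor α = 2/(n+1), which the text does not pin down, so treat that choice (and the seeding of the EMA with the first observation) as assumptions.

```python
def ema(series, n):
    """Exponential moving average with smoothing factor alpha = 2 / (n + 1)."""
    alpha, out, prev = 2.0 / (n + 1), [], None
    for x in series:
        prev = x if prev is None else alpha * x + (1 - alpha) * prev
        out.append(prev)
    return out

def macd_histogram(series, n1=12, n2=26, n3=9):
    """MACD histogram: MACD(n1, n2) minus its n3-period signal line."""
    macd = [a - b for a, b in zip(ema(series, n1), ema(series, n2))]
    signal = ema(macd, n3)
    return [m - s for m, s in zip(macd, signal)]

def rsi(series, n=14):
    """Relative Strength Index of the most recent point in `series`."""
    ups = [max(b - a, 0.0) for a, b in zip(series, series[1:])]
    downs = [max(a - b, 0.0) for a, b in zip(series, series[1:])]
    avg_up, avg_down = ema(ups, n)[-1], ema(downs, n)[-1]
    if avg_down == 0:
        return 100.0
    return 100.0 - 100.0 / (1.0 + avg_up / avg_down)
```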
3.2 Funding Momentum for Research Topics
The degree to which a research topic has ‘increasing momentum’ (funding-wise) can depend on many characteristics. As mentioned earlier, traditional measures of topic popularity rest on the topic’s frequency of occurrence in the literature. However, simple measures do not permit us to model things such as intervals in which popularity of the topic increases ‘sharply’. In this paper we characterize a topic’s funding success in terms of momentum. Specifically, we follow previous work of He and Parker [10], which measures momentum (using the well-known MACD histogram value from technical stock analysis) in order to define a ‘burst’. The histogram value measures change in momentum, so they define the period over which it is positive as the burst period, and the value of the histogram indicates how strong the burst is. The stronger the burst, the greater its momentum. For funding, one is often interested in selecting the best time to start investment in a topic. If momentum plunges after we invest, we have entered the market at a bad time, and should perhaps have waited. We are also interested in the ‘staying power’ of a topic, so that the frequency of a topic remains high even after bursts. Rapid drops in frequency after a burst can make a topic a poor investment (toward a funded portfolio of topics). Therefore, the funding momentum of a topic can depend on the presence of bursts, their strength, and the frequency afterwards. With this in mind we define the funding momentum FM of a topic for a period of m months as follows:

FM(topic, m) = 1 - e^{-\frac{BS(topic,\,m)\times FI(topic,\,m)}{\alpha}}    (1)

BS(topic, m) = \sum_{i=1}^{m} H_i(topic)    (2)

FI(topic, m) = \begin{cases} 1 & CF(topic) < \min_{1\le i\le m} F(topic)_i \\ 0 & \text{otherwise} \end{cases}    (3)

H_i(topic) = \begin{cases} histogram(topic)_i & histogram(topic)_i > 0 \\ 0 & \text{otherwise} \end{cases}    (4)
The burst strength of a topic over an m-month period BS(topic, m) is the sum of the histogram values H for the topic in burst periods, or periods where the values are all positive. The frequency increase indicator FI(topic, m) indicates if the frequency of the topic increases or drops within the m-month period. We define the value of the funding momentum of a topic over an m-month period FM(topic, m) with an exponential model to normalize the value within the range of [0, 1] with α as a decay parameter. In this model, a higher burst strength or a higher momentum increase ratio yields higher funding momentum, with α controlling the rate of increase with respect to these factors. The definition encodes the following intuition about funding momentum: (1) if there is no burst in the m-month period, then no matter how high the frequency or popularity, the m-month period has no funding momentum; (2) if there is a burst, but prior to the burst there is a drop in momentum, the m-month period has no funding momentum. (Hence, it may be advantageous to invest in the topic after the drop ends.) (3) if there is a burst, but after the burst there is a
significant drop in popularity, the m-month period has no funding momentum. Instead, we may reduce the investment period to a smaller interval m. We also set a threshold h (0 ≤ h ≤ 1) modeling whether a topic has increasing momentum or not given its funding momentum. If the momentum is greater than h, we say the topic has increasing momentum. As explained below, in our experiments, setting α as 10 and h as 0.2 and following the three selection criteria above has yielded results that are consistent with the RePORTER data. This model formalizes three intuitive ideas: (1) research topics have bursts of popularity, and bursts can be defined in terms of momentum (as in [10]); (2) funded research proposals often explore novel potential correlations between popular research topics; (3) real data (like RePORTER) can be used to learn quantitative models of successful funding. The concept of funding momentum of a set of topics developed in this paper draws directly on these ideas.
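Formulas 1-4 and the thresholds quoted above (α = 10, h = 0.2) translate into a short routine like the following sketch; the argument layout is illustrative.

```python
from math import exp

def funding_momentum(freqs, hist, current_freq, alpha=10.0):
    """Funding momentum of a topic over an m-month window (Formulas 1-4).

    freqs        -- monthly frequencies F(topic)_i for the m months, i = 1..m
    hist         -- MACD histogram values histogram(topic)_i for the same months
    current_freq -- CF(topic), the frequency at the start of the window
    """
    burst_strength = sum(h for h in hist if h > 0)             # BS, Formulas 2 and 4
    freq_increase = 1 if current_freq < min(freqs) else 0      # FI, Formula 3
    return 1.0 - exp(-burst_strength * freq_increase / alpha)  # FM, Formula 1

def has_increasing_momentum(freqs, hist, current_freq, alpha=10.0, h=0.2):
    """Threshold test used in the experiments (alpha = 10, h = 0.2)."""
    return funding_momentum(freqs, hist, current_freq, alpha) > h
```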
3.3 Funding Momentum of Research Projects: Percentage Model
Although we have explored other models, in this paper we consider only what we call the percentage model for funding momentum: a research project has increasing momentum if it contains sufficiently many topics that have increasing momentum for that period. Specifically, we say a research project has increasing momentum whenever the percentage of these topics exceeds a threshold parameter t. In our experiments, this definition with t=0.2 has been adequate to accurately identify increasing momentum intervals.
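The percentage model itself is a one-line test; a sketch with t = 0.2 as in the experiments:

```python
def project_has_momentum(topic_flags, t=0.2):
    """Percentage model: True when the fraction of topics flagged as having
    increasing momentum exceeds the threshold t."""
    return bool(topic_flags) and sum(topic_flags) / len(topic_flags) > t

print(project_has_momentum([True, True, False, False, False]))  # 0.4 > 0.2 -> True
```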
4 Methods
As mentioned earlier, we have adapted methods of technical analysis to compute momentum. In the stock market, despite claims to the contrary, a common assumption is that past performance is an indicator of likely future achievement. Of course, this assumption is often violated, since the market is news-driven and fluctuates rapidly. As we show in our experiments, for our definitions of momentum and funding momentum, the assumption often works well. Therefore, training classifiers on past funding momentum makes sense, and in some cases may even be adequate to forecast future funding momentum. Gryc [8] shows that indicators such as EMA, MACD histogram, RSI can help to improve prediction accuracy. Although we are not able to validate whether the definitions can accurately identify intervals that are not of increasing momentum, the prediction methods we propose work well for the selection criteria encoded in Formulas 1–4. Alternatively speaking, when we select the increasing momentum topics and projects based on these criteria, the prediction methods have good accuracy. Again, when negative information about success in funding is available to reveal more criteria, or any modification of the criteria, it can be assimilated into our model. In our experiments, we have used four kinds of classifier to model whether a topic has increasing momentum: Linear Regression, Decision Tree, SVM and Artificial Neural Networks (ANN). The linear regression classifier estimates the
output as a linear combination of the attributes, fit with a least squares approach. However, trends of the indicators are usually non-linear. Therefore, non-linear classifiers are generally able to achieve better performance. ANN is a popular technique for predicting stock market prices, with excellent capacity for learning non-linear relationships without prior knowledge. SVM is another well-known non-linear classifier that has been widely applied in stock analysis. Compared with ANN and SVM, the Decision tree is much less popular, but as we will show later for determining whether a topic has increasing momentum or not, in our experiments the decision tree classifier performed as well as the SVM and ANN. (We used classifiers implemented in Weka [3], a well-known machine learning and data mining tool, with default parameter settings. The corresponding classifier implementations in Weka for C4.5 decision trees, ANN and SVM are the J48, the MultiLayerPerceptron, and LibSVM classifiers, respectively.)
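The setup can be mimicked with any off-the-shelf toolkit. The sketch below uses scikit-learn analogues of the Weka classifiers named above and synthetic stand-in data; the feature construction and parameter choices are illustrative assumptions, not the authors' exact configuration.

```python
# Rough scikit-learn analogue of the experimental setup (the paper itself used
# Weka's J48, MultiLayerPerceptron and LibSVM); feature columns stand in for
# [momentum, MACD histogram, RSI] and the labels are synthetic.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))          # one row per (topic, month) example
y = (X[:, 1] > 0.5).astype(int)         # stand-in "increasing momentum" label

for name, clf in [("J48-like tree", DecisionTreeClassifier()),
                  ("SVM", SVC()),
                  ("ANN", MLPClassifier(max_iter=500))]:
    scores = cross_val_score(clf, X, y, cv=10)   # ten-fold cross-validation
    print(name, "error rate:", round(1 - scores.mean(), 3))
```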
5 Experimental Results

5.1 Analyzing Bursts for Research Topics and Projects
The RePORTER database [2] provides information for NIH grant awards since 1986. Each award record includes a set of project terms, which can be considered as research topics in this project (we use ‘topic’ and ‘term’ interchangeably). Since the total number of years is only 24, we consider the funding momentum for the terms at each month, and term frequencies are calculated by month. As RePORTER imposes limits on volume of downloaded data, we considered awards only for the state of California, a dataset containing 12,378 unique terms and 119,079 awards. We use technical analysis indicators such as the MACD histogram value and RSI to compute momentum and identify burst periods for the terms, adapting the definition of bursts as intervals of increasing momentum studied in He and Parker [10]. Burst periods for the terms ‘antineoplastics’, ‘complementary-DNA’, and ‘oncogenes’ are shown in Figure 1, as well as their frequencies and funding momentum. The funding momentum covers the 6-month period beyond any time point. We set the threshold value to 0.2 for selection of increasing momentum years (we call these years increasing momentum years). Clearly increasing momentum years are highly correlated with bursts. Strong bursts usually define intervals that have increasing momentum. Weak bursts, such as the one for ‘gene expression’ around 1993, are omitted by the threshold filter. However, it is not necessarily the case that a strong burst period has increasing momentum. For example, ‘oncogenes’ has a strong burst from year 1997 to 1999, but the increasing momentum years associated with the burst extended only from 1997 to the middle of 1998, because the burst levels off. According to criterion (3) in section 3.1, we say after the middle of year 1998, ‘oncogenes’ does not have increasing momentum. We can also observe how criteria (1) and (2) affect increasing momentum. For example, for ‘antineoplastics’, the increasing momentum years start after 1997, which is the end of its frequency plunge; criterion (2) avoids labeling years in a plunge followed by a burst as having increasing momentum.
Fig. 1. Burst periods for terms ‘antineoplastics’, ‘complementary-DNA’, ‘oncogenes’ with histogram parameters (12, 26, 9), with funding momentum for all terms. The dashed line shows the threshold 0.2 for selection of increasing momentum years. All values are scaled to fit within one figure. (Plots omitted; each panel shows the histogram(12,26,9), term frequency, investment potential, and funding momentum curves by month, 1986-2009.)
5.2 Experimental Validation of the Funding Momentum Definition
To validate our definition of the funding momentum, we have checked years labeled as increasing momentum against historical events. For example, during the early 1990s, advancements in genetics boosted deeper research on DNA. Two well-known projects, the Human Genome Project and the cloning of Dolly the sheep, started in 1990 and 1996, respectively. We checked the increasing momentum years for all ‘genetic’ terms, and most have increasing momentum years consistent with the year of the development of these two projects. For illustration purposes, we randomly drew five terms and summarized their increasing momentum years in Table 2. There has been a steady increase of research on HIV and AIDS since 1986, when the FDA approved the first anti-retroviral drug to treat AIDS [12]. Therefore we checked the increasing momentum years for terms related to ‘AIDS’, and obtained the results for the five randomly drawn terms shown in Table 2. We observed similar consistency between our increasing momentum years and the periods of expansion in HIV and AIDS research, supporting our funding momentum definition. Following the percentage model of funding momentum, we do only the simple evaluation in which a project has increasing momentum if the percentage of the topics it contains that have increasing momentum is greater than a threshold t. This situation for research topics appears to differ significantly from that Table 2. The increasing momentum years for all terms related to ‘genetic’ and ‘AIDS’ ‘genetic’ Terms pharmacogenetics genetic mapping genetic transcription genetic regulation virus genetics
Increasing Momentum ‘AIDS’ Terms Increasing Momentum 1996 - 1998 AIDS /HIV diagnosis 1988 - 1990 1986 - 1988, 1996 - 1998 AIDS 1986 - 1990 1986 - 1991, 1996 - 1999 AIDS /HIV test 1986 - 1990 1986 - 1993, 1996 - 1998 AIDS therapy 1988 - 1989 1986 - 1991, 1996 - 1998 antiAIDS agent 1988 - 1990
540
D. He and D.S. Parker Table 3. Accuracy of the funding momentum definition of research projects t = 0.2 t = 0.5 t = 0.8 average term # 83.89% 43.13% 11.12% 16
Table 4. The prediction error rates of the classifiers J48, SVM, Linear Regression, ANN and Naive for 1,000 randomly selected terms, for different M (month) periods from the current month

Classifier          M=6     M=7     M=8     M=9     M=10    M=11    M=12
J48                 0.084   0.085   0.086   0.086   0.088   0.088   0.088
SVM                 0.084   0.085   0.086   0.086   0.089   0.091   0.092
Linear Regression   0.112   0.119   0.125   0.132   0.138   0.143   0.146
ANN                 0.108   0.105   0.111   0.108   0.115   0.112   0.121
Naive               0.217   0.235   0.240   0.254   0.256   0.272   0.269
for news 'memes' presented in [11], where topics are more correlated and their clustering is stronger. We can define the accuracy of our funding momentum definition for research projects as the percentage of consistency between the increasing momentum years of the projects and their grant years; the more consistent they are with one another, the higher the accuracy. Varying the threshold over 0.2, 0.5 and 0.8 and randomly selecting 1,000 projects, we obtained the results shown in Table 3, which also reports the average number of terms contained in each project, around 16. Therefore, a threshold t = 0.2 requires at least 3 topics of a project to have increasing momentum, while a threshold t = 0.8 requires at least 13 such topics. Accuracy drops as the threshold increases. For t = 0.2, the accuracy was sufficiently high that we adopted t = 0.2 in all remaining experiments.

5.3 Prediction of Funding Momentum for Research Topics
In order to predict the funding momentum of the research topics, we create a dataset whose attributes are the momentum of the topic and the technical analysis indicators, the MACD histogram value and the RSI value. The class is binary: whether the M-month period from the current month has increasing momentum or not. We apply the J48, SVM, ANN and Linear Regression classifiers to 1,000 randomly selected research topics, and vary the length of the period M from half a year (6 months) to one year. We conduct ten-fold cross-validation, and the error rate of each classifier is shown in Table 4. As we can see, J48 and SVM achieve low error rates in general, and their error rates remain almost constant across the different time periods. ANN is superior to Linear Regression, but both classifiers degrade as M increases. Figure 2 shows the observed sensitivity (true positive rate) and specificity (1 - false positive rate) of all classifiers. ANN achieved the highest sensitivity, and J48, SVM and ANN all had much better sensitivity than Linear Regression. All classifiers had specificity close to 1; SVM had the highest specificity, while ANN had the lowest. These results suggest that all classifiers tend to make incorrect predictions on positive instances.
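A rough sketch of this experimental setup follows, using scikit-learn classifiers as stand-ins for the Weka [3] implementations of J48 and SVM, and random arrays as placeholders for the RePORTER-derived features and labels; both substitutions are assumptions for illustration only:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# One row per (term, month): [momentum, MACD histogram value, RSI];
# label = 1 if the following M-month period has increasing momentum.
rng = np.random.default_rng(0)
X = rng.random((5000, 3))
y = (rng.random(5000) < 0.1).astype(int)   # unbalanced: mostly negative

for name, clf in [("J48-like tree", DecisionTreeClassifier()), ("SVM", SVC())]:
    error = 1 - cross_val_score(clf, X, y, cv=10, scoring="accuracy").mean()
    print(f"{name}: ten-fold CV error rate {error:.3f}")
```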
[Figure 2: sensitivity (left panel) and specificity (right panel) of J48, ANN, SVM and LR plotted against the prediction period M = 6 to 12 months.]
Fig. 2. Sensitivity vs. Specificity for classifiers J48, SVM, Linear Regression and ANN
This behavior may be caused by the unbalanced nature of our dataset, in which most instances are negative. This is reasonable since, generally speaking, burst periods are much shorter than non-burst periods, and only periods with strong bursts are reported as having increasing momentum. Earlier, in the related work section, we mentioned a few techniques for stock price prediction; however, these are not easily adapted to our problem of funding momentum prediction. We therefore used a Naive method as the baseline for predicting funding momentum. The Naive method simply assigns the funding momentum of the current month to the next month, on the assumption that a trend movement usually lasts for a while. The performance of the Naive method, reported in Table 4, was much worse than that of any of the classifiers, indicating that much better performance can be achieved with intelligent classifier design. Next we compared the performance of the classifiers with and without the TA attributes (technical analysis indicators: the MACD histogram value and RSI). Since the overall error rates of the J48 and SVM classifiers were superior to the others, we reviewed only the performance of these two classifiers. We again randomly selected 1,000 research topics and conducted ten-fold cross-validation. Comparing the prediction accuracy for M = 6 only, the experimental results reported in Table 5 show that when the technical analysis indicators were included, the performance of both classifiers improved significantly. The main improvement when TAs are included is in sensitivity; when TAs were not included, the two classifiers tended to make incorrect predictions for almost all positive instances.

Table 5. Prediction error rates with J48 and SVM for 1,000 randomly selected terms, with and without TA (technical analysis indicators), for 6-month periods extending beyond the current month

Classifier   Error Without TA   Error With TA   Sensitivity Without TA   Sensitivity With TA
J48          13.1%              8.4%            0.065                    0.451
SVM          13.3%              8.4%            0.10                     0.373
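A minimal sketch of the Naive baseline described above (the function name is ours; labels are assumed to be a chronologically ordered 0/1 sequence for one term):

```python
import numpy as np

def naive_error_rate(labels):
    """Error of predicting each month's increasing-momentum label as a copy
    of the previous month's label (the Naive baseline)."""
    labels = np.asarray(labels)
    return float(np.mean(labels[1:] != labels[:-1]))
```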
Table 6. Prediction accuracy with the classifiers J48, SVM, ANN, Linear Regression and Naive for 1,000 randomly selected projects, with threshold 0.2, for the 6-month period from the current month

J48     SVM     ANN     Linear Regression   Naive
87.9%   88.0%   83.2%   82.4%               58.9%
5.4 Prediction of Funding Momentum for Projects
To predict the funding momentum of a project, we first predicted the funding momentum of the terms in the project. We then determined whether the project has increasing momentum according to the percentage of its terms with increasing momentum (if the percentage is greater than the threshold t, the project has increasing momentum). Again we compared the performance of the J48, SVM, ANN, Linear Regression and Naive classifiers. We conducted the experiments on 1,000 randomly selected projects with threshold 0.2 for the 6-month period from the current month. Results are reported in Table 6: the performance of the classifiers on projects was positively related to their performance on research topics. Among these classifiers, J48 and SVM remained the best. However, since the variance of the classification error on terms accumulates, the performance of each classifier is worse than on individual topics, especially for the Naive method.
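A small sketch of this percentage model, assuming term_labels is the list of predicted 0/1 increasing-momentum labels for the terms of one project (the helper name is ours):

```python
def project_has_increasing_momentum(term_labels, t=0.2):
    """Percentage model: a project has increasing momentum if more than a
    fraction t of its (predicted) term labels indicate increasing momentum."""
    return sum(term_labels) / len(term_labels) > t
```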
6 Conclusion and Future Work
In this paper, by analyzing historical NIH grant award data (from RePORTER [2]), we modeled occurrence patterns of biomedical topics in successful grant awards with a corresponding measure that we call funding momentum. We also developed a classification method to predict the funding momentum of these topics in projects, and showed that this method achieves good prediction accuracy. It seems possible that indicators such as impact and significance could be addressed with variations on funding momentum. To our knowledge, this is the first quantitative model of funding momentum for research projects. We also showed that the classification problem is highly unbalanced; as a result, the sensitivity of all the classifiers is unsatisfactory, and techniques for unbalanced classification might be used to improve performance. We proposed a percentage model for the funding momentum of research projects, but other models are possible. For example, the topics a project contains may have semantic correlations, in that some topics often appear together, so a more sophisticated model may be needed to define the momentum of a research project. Another possible model is, instead of considering the percentage of increasing momentum topics in the project, to add the frequencies of the topics to form the 'frequency' of the project; we can then apply the same trend models to identify intervals of increasing momentum for the project. The intuition behind this additive model comes from stock market
analogies, considering topics as independent stocks and the project as a 'sector', with a sector index defined as the sum of the stocks it covers. Similarly, we add the frequencies of the topics to approximate the frequency of the project.
References

1. MySciPI (2010), http://www.usgrd.com/myscipi/index.html
2. RePORTER (2010), http://projectreporter.nih.gov/reporter.cfm
3. Weka (2010), http://www.cs.waikato.ac.nz/ml/weka
4. Agrawal, S., Jindal, M., Pillai, G.N.: Momentum analysis based stock market prediction using adaptive neuro-fuzzy inference system (ANFIS). In: Proc. of the International MultiConference of Engineers and Computer Scientists, IMECS 2010 (2010)
5. Al-Qaheri, H., Hassanien, A.E., Abraham, A.: Discovering Stock Price Prediction Rules Using Rough Sets. Neural Network World Journal (2008)
6. Andelin, J., Naismith, N.C.: Research Funding as an Investment: Can We Measure the Returns? U.S. Government Printing Office, Washington, DC (1986)
7. Bao, D., Yang, Z.: Intelligent stock trading system by turning point confirming and probabilistic reasoning. Expert Systems with Applications 34, 620–627 (2008)
8. Gryc, W.: Neural Network Predictions of Stock Price Fluctuations. Technical report, http://i2r.org/nnstocks.pdf (accessed July 02, 2010)
9. Hassan, M.R., Nath, B.: Stock market forecasting using hidden Markov model: a new approach
10. He, D., Parker, D.S.: Topic Dynamics: an alternative model of 'Bursts' in Streams of Topics. In: The 16th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, SIGKDD 2010, July 25-28 (2010)
11. Kleinberg, J., Leskovec, J., Backstrom, L.: Meme-tracking and the dynamics of the news cycle. In: Proceedings of the Fifteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Paris, France (July 2009)
12. Johnston, M.I., Hoth, D.F.: Present status and future prospects for HIV therapies. Science 260(5112), 1286–1293 (1993)
13. Kaboudan, M.A.: Genetic programming prediction of stock prices. Computational Economics 16(3), 207–236 (2000)
14. Kleinberg, J.M.: Bursty and hierarchical structure in streams. Data Min. Knowl. Discov. 7(4), 373–397 (2003)
15. Lawrence, R.: Using neural networks to forecast stock market prices. University of Manitoba (1997)
16. Li, J., Tsang, E.P.K.: Improving technical analysis predictions: an application of genetic programming. In: Proceedings of The 12th International Florida AI Research Society Conference, Orlando, Florida, pp. 108–112 (1999)
17. Nikooa, H., Azarpeikanb, M., Yousefib, M.R., Ebrahimpourb, R., Shahrabadia, A.: Using A Trainable Neural Network Ensemble for Trend Prediction of Tehran Stock Exchange. IJCSNS 7(12), 287 (2007)
18. Saad, E.W., Prokhorov, D.V., Wunsch, D.C.: Comparative study of stock trend prediction using time delay, recurrent and probabilistic neural networks. IEEE Transactions on Neural Networks 9(6), 1456–1470 (1998)
19. Zhu, Y., Shasha, D.: Efficient elastic burst detection in data streams. In: Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, August 24-27, pp. 336–345 (2003)
Local Feature Based Tensor Kernel for Image Manifold Learning

Yi Guo¹ and Junbin Gao²,*

¹ Freelance Researcher
[email protected]
² School of Computing and Mathematics, Charles Sturt University, Bathurst, NSW 2795, Australia
[email protected]
* The author to whom all correspondence should be addressed.
Abstract. In this paper, we propose a tensor kernel on images that are described as sets of local features, and then apply a novel dimensionality reduction algorithm called Twin Kernel Embedding (TKE) [1] to it for image manifold learning. The local features of the images, extracted by feature extraction methods like SIFT [2], are represented as tuples of coordinates and feature descriptors, which are regarded as highly structured data. In [3], different kernels were used for intra- and inter-image similarity. This is problematic because different kernels refer to different feature spaces and hence represent different measures. This heterogeneity, embedded in the kernel Gram matrix that is input to a dimensionality reduction algorithm, is carried over to the image embedding space and therefore leads to an unclear interpretation of the embedding. We address this problem by introducing a tensor kernel that treats the different sources of information in a uniform kernel framework. The kernel Gram matrix generated by the tensor kernel is homogeneous, that is, all elements are drawn from the same measure. The image manifold learned from this kernel is more meaningful. Experiments on image visualization are used to show the effectiveness of this method.
1 Introduction
Conventionally, raster grayscale images are represented as vectors by stacking pixel brightness row by row. This is convenient for computer processing and storage of image data. However, it is not natural for recognition and perception: human brains are more likely to handle images as collections of features lying on a highly nonlinear manifold [4]. In recent research, learning image manifolds has attracted great interest in the computer vision and machine learning communities. There are two major strategies towards manifold learning for within-class variability and appearance manifolds from different views [5]: (1) local feature based methods; and (2) key point based methods. There exist quite a few feature extraction algorithms, such as colour histograms [6], auto-associators [7], shape context [8], etc. Among them, local appearance
based methods such as SIFT have drawn a lot of attention due to their success in generic object recognition [3]. Nevertheless, it was also pointed out in [3] that it is quite difficult to study the image manifold from the local feature point of view; moreover, the descriptor itself poses an obstacle to learning a smooth manifold because it is not in a vector space. The authors of [3] proposed a learning framework called Feature Embedding (FE), which takes local features of images as input and constructs an interim layer of embedding in a metric space where a dis/similarity measure can be easily defined. This embedding is then utilized in subsequent processes such as classification and visualization. As the other main stream, key point based methods have been successful in shape modeling, matching and recognition, as demonstrated by Active Shape Models (ASM) [9] and Shape Contexts [10]. Generally speaking, key points capture the spatial information and arrangement of the objects of interest in images, while local features detail object characterization. An ideal strategy for image manifold learning should incorporate both kinds of information to assist the learning procedure. Indeed, the combination of spatial and feature information has been applied to object recognition in recent work, e.g., visual scene recognition in [11]. This kind of approach is closely linked to the task of learning from multiple sources; see [12,13]. Kernel methods are among the approaches that can be easily adapted to multiple sources via the so-called tensor kernels [14] and additive kernels [15,16]. The theoretical assumption is that the new kernel function is defined over a tensor space determined by the multiple source spaces. In our scenario, there are two sources, i.e., the source for the spatial information, denoted by y, and the source for the local feature, denoted by f. Thus each hyper "feature" is the tensor y ⊗ f. For the purpose of learning image manifolds, we aim in this paper at constructing appropriate kernels for these hyper features. The paper is organized as follows. Section 2 proposes tensor kernels suitable for learning image manifolds. Section 3 gives a brief introduction to Twin Kernel Embedding, which is used for manifold embedding. In Section 4, we present several examples of using the proposed tensor kernels for visualization.
2 Tensor Kernels Built on Local Features
In the sequel, we assume that a set of $K$ images $\{P_k\}_{k=1}^{K}$ is given, each of which, $P_k$, is represented by a data set that is a collection of tensors of spatial and local features $\{y_i^k \otimes f_i^k\}_{i=1}^{N_k}$, where $f_i^k$ is the local feature extracted at the location $y_i^k$ and $N_k$ is the number of features extracted from $P_k$. $y_i^k \otimes f_i^k$ is the $i$-th hyper feature in image $k$, and $\{y_i^k \otimes f_i^k\}_{i=1}^{N_k}$ can be regarded as a tensor field. A tensor kernel was implicitly defined in [3] based on two kernels, $k_y(y_i, y_j)$ and $k_f(f_i, f_j)$, which are for the spatial space and the feature space respectively, and the kernel between two hyper features is defined as
$$
k_p(y_i^k \otimes f_i^k,\; y_j^l \otimes f_j^l) =
\begin{cases}
k_y(y_i^k, y_j^l), & k = l \\
k_f(f_i^k, f_j^l), & k \neq l
\end{cases}
\qquad (1)
$$
where $i = 1, \ldots, N_k$, $j = 1, \ldots, N_l$ and $k, l = 1, \ldots, K$. In other words, when two hyper features $y_i^k \otimes f_i^k$ and $y_j^l \otimes f_j^l$ are from two different images, $k_p(\cdot,\cdot)$ evaluates only the features; otherwise $k_p(\cdot,\cdot)$ focuses only on the coordinates. If we denote by $K_y$ the kernel Gram matrix of $k_y(\cdot,\cdot)$, and define $K_f$ and $K_p$ likewise, then $K_p$ has $K_y$ blocks on the main diagonal and $K_f$ elsewhere. A kernel is associated with a particular feature mapping function from object space to feature space [15], and the kernel function is the inner product of the images of the objects in feature space. There is seldom any proof that two kernels share the same feature mapping, and hence different kernels work on different feature spaces. If we see kernels as similarity measures [17], we conclude that every kernel represents a unique measure on its input. This leads to the observation that the above $k_p$ is problematic, since two kernels are integrated together without any treatment. $k_y$ and $k_f$ operate on different domains, coordinates and features respectively. Thus the similarity of any two hyper features is determined by the similarity of either the spatial or the feature information, while the joint contribution of the two sources is totally ignored in the above construction of the tensor kernel. A subsequent dimensionality reduction algorithm based on this separated tensor kernel may therefore erroneously evaluate the embeddings under an inconsistent measure. Every feature in the images will be projected onto this lower dimensional space as a point; however, under this framework, the relationships of projected features from one image are not comparable with projected features from different images. As a consequence, the manifold learnt by the dimensionality reduction is distorted and therefore not reliable or interpretable. To tackle this problem, we need a universal kernel for the images, bearing in mind that we have two sources of information, i.e. spatial information as well as feature description. The multiple source integration property of tensor kernels [14] brings a homogeneous measurement for the similarity between hyper features. What follows is then how to construct a suitable tensor kernel. Basically, we have two options to choose from, the productive and the additive tensor kernel, which are stated below:
$$
k_t(y_i^k \otimes f_i^k,\; y_j^l \otimes f_j^l) =
\begin{cases}
\rho_y k_y(y_i^k, y_j^l) + \rho_f k_f(f_i^k, f_j^l) & \text{additive tensor} \\
k_y(y_i^k, y_j^l) \times k_f(f_i^k, f_j^l) & \text{productive tensor}
\end{cases}
\qquad (2)
$$
where $i = 1, \ldots, N_k$, $j = 1, \ldots, N_l$, $\rho_y + \rho_f = 1$, $\rho_y \geq 0$, $\rho_f \geq 0$. As we can see, the tensor kernel unifies the spatial and feature information in harmony. It is symmetric, positive semi-definite and still normalized. We are particularly interested in the additive tensor kernel. The reason is that the productive tensor kernel tends to produce very small values, forcing the Gram matrix to be close to the identity matrix in practice, which brings numerical difficulties for dimensionality reduction. The additive tensor kernel does not have this problem. However, the additive tensor kernel $k_t$ defined in (2) takes into account the spatial similarity between two different images, which makes little sense in practice. So we adopt a revised version of the additive tensor kernel:
$$
k_t(y_i^k \otimes f_i^k,\; y_j^l \otimes f_j^l) =
\begin{cases}
\rho_y k_y(y_i^k, y_j^l) + \rho_f k_f(f_i^k, f_j^l), & k = l \\
k_f(f_i^k, f_j^l), & k \neq l
\end{cases}
\qquad (3)
$$
In both (2) and (3) we need to determine two extra parameters, $\rho_y$ and $\rho_f$. To the best of the authors' knowledge, there is no principled way to set them; in practice, we can optimize them using cross validation.
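The following is a minimal sketch of the revised additive tensor kernel (3), assuming Gaussian base kernels for $k_y$ and $k_f$ and the weights $\rho_y = 0.3$, $\rho_f = 0.7$ used later in the experiments; the base-kernel choice and the function names are our own illustration, not prescribed by the paper:

```python
import numpy as np

def rbf(a, b, gamma=1.0):
    """Gaussian base kernel; one possible choice for k_y and k_f."""
    return float(np.exp(-gamma * np.sum((np.asarray(a) - np.asarray(b)) ** 2)))

def additive_tensor_kernel(yi, fi, img_i, yj, fj, img_j, rho_y=0.3, rho_f=0.7):
    """Revised additive tensor kernel of Eq. (3): the spatial kernel k_y is only
    used when the two hyper features come from the same image (img_i == img_j)."""
    k_f = rbf(fi, fj)
    if img_i == img_j:
        return rho_y * rbf(yi, yj) + rho_f * k_f
    return k_f
```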
3 Manifold Learning with Twin Kernel Embedding
Given a tensor kernel $k_t$ defined in either (2) or (3) and a set of images, a kernel matrix $K_t$ can be calculated. $K_t$ contains all the similarity information among the hyper features contained in the given images. The matrix can then be fed to a kernel-based dimensionality reduction algorithm to find the embedding. We start with a brief introduction of the TKE algorithm and then proceed to image manifold learning with the tensor kernel described in the last section. In this section, for the sake of simplicity, we use $o_i$ to denote the hyper feature data we are dealing with and $x_i$ the corresponding embedding of $o_i$. So $o_i$ could be an image $P_i$, i.e. a collection of features, as mentioned before.

3.1 Twin Kernel Embedding
Twin Kernel Embedding (TKE) preserves the similarity structure of the input data in the latent space by matching the similarity relations represented by two kernel Gram matrices, one for the input data and the other for the embedded data. It simply minimizes the following objective function with respect to the $x_i$:
$$
L = -\sum_{ij} k(x_i, x_j)\, k_t(o_i, o_j) + \lambda_k \sum_{ij} k^2(x_i, x_j) + \lambda_x \sum_i \|x_i\|^2, \qquad (4)
$$
where $k(\cdot,\cdot)$ is the kernel function on the embedded data and $k_t(\cdot,\cdot)$ the kernel function on the hyper feature data of the images. The first term performs the similarity matching; it shares some traits with Laplacian Eigenmaps in that it replaces $W_{ij}$ by $k_t(\cdot,\cdot)$ and the Euclidean distance on the embedded data, $\|x_i - x_j\|^2$, by $k(\cdot,\cdot)$. The second and third terms are regularizers that control the norms of the kernel and of the embeddings. $\lambda_k$ and $\lambda_x$ are tunable positive parameters controlling the strength of the regularization. The logic is to preserve the similarities among the input data and reproduce them in the lower dimensional latent space, expressed again as similarities among the embedded data. $k(\cdot,\cdot)$ is normally a Gaussian kernel, i.e.
$$
k(x_i, x_j) = \gamma \exp\{-\sigma \|x_i - x_j\|^2\}, \qquad (5)
$$
because of its analytical form and strong relationship with the Euclidean distance. A gradient-based algorithm has to be employed for the minimization of (4). The conjugate gradient (CG) algorithm [18] can be applied to obtain the optimal $X$, the matrix of the embeddings $X = [x_1, \ldots, x_N]^{\top}$. The hyper-parameters
of the kernel function $k(\cdot,\cdot)$, $\gamma$ and $\sigma$, can also be optimized in the minimization procedure, which frees us from setting too many parameters. To start the CG, an initial state must be provided. Any other dimensionality reduction method could supply it; however, if applicability to non-vectorial data is desirable, only a few, such as KPCA [19] and KLE [20], are suitable. It is worth explaining how locality is preserved in TKE. This is done by k-nearest neighboring: given an object $o_i$, for any other input $o_j$, $k_t(o_i, o_j)$ is artificially set to 0 if $o_j$ is not one of the k nearest neighbors of $o_i$. The parameter $k\;(>1)$ in the k-nearest neighboring controls the locality that the algorithm preserves. This process is a kind of filtering that retains what we are interested in while leaving out minor details. However, the algorithm also works without filtering, in which case TKE becomes a global approach. The out-of-sample problem [21] can be easily solved by introducing a kernel mapping
$$
X = K_t A \qquad (6)
$$
where $K_t$ is the Gram matrix of the kernel $k_t(\cdot,\cdot)$ and $A$ is a parameter matrix to be determined. Substituting (6) into TKE and optimizing the objective function with respect to $A$ instead of $X$ gives us a mapping from the original space to the lower dimensional space. Given a new input, its embedding can be found by $x_{new} = k_t(o_{new}, O)A$, where $O$ denotes the collection of all the given training data. This algorithm is called BCTKE in [22], where details are provided.

3.2 Manifold Learning Process
An elegant feature of TKE is that it can handle non-vectorial data, since its objective function involves only kernels, which can accept non-vectorial inputs. This is particularly useful here, since the only available information about the images is the tensor kernel, which is built on local features represented in non-vectorial form. In the last section, we discussed the additive tensor kernel $k_t$. For each hyper feature $y_i \otimes f_i$ in every image, we can find a point $x_i$ in a $d$-dimensional space through TKE, where $d$ is pre-specified. This yields a collection of coordinates $\{x_i^k\}_{i=1}^{N_k}$ for the $k$-th image $P_k$, denoted $\hat{P}_k$, which is the projection of $P_k$ into the so-called feature embedding space. The feature embedding space is only an interim layer of the final image manifold learning; its purpose is to transform the information in the form of local features into objects in a metric space where a distance can be easily defined. There are several distance metrics for comparing two sets of coordinates [23]. A Hausdorff-based distance is a suitable candidate since it handles the situation where the cardinalities of the two sets differ, which is common in real applications. Once we have the distance between two sets of coordinates, i.e. two images, we can proceed to manifold learning using TKE again. Supposing the distance is $d(\hat{P}_i, \hat{P}_j)$, we revise the objective function of TKE as follows:
$$
L = \sum_{ij} k(z_i, z_j)\, d(\hat{P}_i, \hat{P}_j) + \lambda_k \sum_{ij} k^2(z_i, z_j) + \lambda_z \sum_i \|z_i\|^2, \qquad (7)
$$
which differs from (4) in that the kernel $k_t(\cdot,\cdot)$ is replaced by a distance metric. We can still minimize (7) with respect to the $z_i$. The logic is that when two images are close in the feature embedding space, they are also close on the manifold. Another, easier way to learn the manifold using TKE is to convert the distance to a kernel by
$$
k(\hat{P}_i, \hat{P}_j) = \exp\{-\sigma_k d(\hat{P}_i, \hat{P}_j)\} \qquad (8)
$$
where $\sigma_k$ is a positive parameter, and substitute this kernel into (4) in TKE. So we minimize the following objective function:
$$
L = -\sum_{ij} k(z_i, z_j) \exp\{-\sigma_k d(\hat{P}_i, \hat{P}_j)\} + \lambda_k \sum_{ij} k^2(z_i, z_j) + \lambda_z \sum_i \|z_i\|^2. \qquad (9)
$$
To conclude this section, we restate the procedure here:
1. Apply the tensor kernel $k_t$ to the images represented as collections of hyper features;
2. Use TKE with $K_t$ to learn the feature embedding space and the projections $\hat{P}_i$ of the images;
3. Obtain the image manifold by using TKE again or other DR methods (in the following experiments we use KLE and KPCA for comparison).
Actually, in step 2 we could use other kernel-applicable methods such as KPCA or KLE. Interestingly, if we integrate steps 1 and 2 and use (8), the Gram matrix of the kernel $k(\hat{P}_i, \hat{P}_j)$ can be seen as a kernel Gram matrix on the original images, and therefore steps 1 and 2 together amount to a kernel construction on histograms (collections of local features). The dimensionality of the feature embedding space and of the image manifold could be detected using automated algorithms such as rank priors [24]. If visualization is the purpose, a 3D or 2D manifold is preferable.
4 Experimental Results
We applied the tensor kernel and TKE to image manifold learning on several image data sets: the ducks from the COIL data set, the Frey faces, and handwritten digits. They are widely available online for machine learning and image processing tests. For TKE, we fixed $\lambda_x = 0.001$ and $\lambda_k = 0.005$ as stated in the original paper. We chose a Gaussian kernel for $k_x$, Eq. (5), as described in Section 3; its hyperparameters were initialized to 1 and updated at runtime. We used the additive tensor kernel (3) and set $\rho_y = 0.3$ and $\rho_f = 0.7$, values picked by repeating the same experiment with different $\rho_y$ and $\rho_f$ until the best combination was found; this shows a preference for local features over coordinates. The dimensionality of the feature embedding space, $d_e$, and the number of features extracted from the images are maximized according to the capacity of the computational platform. For demonstration purposes, we visualize the images in the 2D plane to examine the structure of the data.

4.1 COIL 3D Images
We used 36 128×128 greyscale images of ducks and extracted 60 features from each image. TKE projected all the features into an 80-dimensional feature embedding space. Since all the images are perfectly aligned and noise free, traditional
Fig. 1. Ducks: embeddings produced by (a) TKE, (b) KLE, (c) KPCA
methods like PCA and MDS can achieve good embeddings using the vectorial representation. As we can see from Fig. 1, the tensor kernel on local features captures the intrinsic structure of the ducks, that is, the horizontal rotation of the toy duck. This is revealed successfully by KLE, which gives a perfect circle-like embedding; the order of the images shows the rotation. TKE seems to focus more on the classification information: its embedding shows three connected linear components, each of which represents a different facing direction. KPCA tries to do the same thing as KLE, but the result is not as satisfactory.

4.2 Frey Faces
In this subsection, the objects are 66 images extracted from the whole data set of 1,965 images (each 28 × 20 grayscale) of a single person's face. The data set came from a digital movie and was also used in [25]. Two parameters control the images, namely the face direction and the expression. Ideally, there should be two axes in the 2D plane for these two parameters, one for face direction from left to right and one for face expression from happy to
Fig. 2. Frey faces: embeddings produced by (a) TKE, (b) KLE, (c) KPCA
sad. However, an interpretation like this is somewhat artificial and may not even be close to the truth, but we hope our algorithms can give some idea of these two dimensions. In this case, $d_e = 30$ and 80 features were extracted from each image. The choice of $d_e$ reflects the high computational cost of TKE, which is a major drawback of this algorithm: as the number of samples grows, the number of variables to be optimized in TKE increases linearly, so when the number of images is doubled, $d_e$ has to be halved given the limited computational resources. In this case, KLE does not reveal any meaningful patterns (see Fig. 2). On the other hand, TKE's classification property is exhibited very well: it successfully separates happy and unhappy expressions into two different groups, and in each group, from top to bottom, the face direction turns from right to left. So we can draw two perpendicular axes on TKE's result, a horizontal one for mood and a vertical one for face direction. KPCA reveals a similar pattern to TKE; the only difference is that TKE's result shows a clearer cluster structure.
Fig. 3. Handwritten digits: embeddings produced by (a) TKE, (b) KLE, (c) KPCA
4.3 Handwritten Digits
In this section, a subset of handwritten digit images was extracted from a binary alphadigits database which contains 20×16 images of digits "0" through "9" and capital letters "A" through "Z", with 39 examples in each class. It is from the Algoval system (available at http://algoval.essex.ac.uk/). Because of the limited resources of the computation platform, we used only the digits from 0 to 4, with 20 images per class. We extracted 80 features from each image and cast them into a $d_e = 20$ feature embedding space. Compared with the previous two experiments, this one is much harder for dimensionality reduction algorithms. It is not clear what the intrinsic dimensionality is; if we choose too small a dimension, DR methods will have to throw away too much information. As a matter of fact, we do not know what the manifold should be: the images of the digits are not simply governed by a few parameters as in the previous experiments. So we can only expect to see clusters of digits, which is a quite intuitive interpretation.
It is worth mentioning that for TKE plus the tensor kernel, we used KPCA in the last step instead of TKE because of computational difficulty. Fig. 3 shows the final image manifolds learnt by the three different algorithms with the tensor kernel. TKE shows its good classification capability even more clearly in this experiment: all classes have clear dominant clusters with some overlap. Interestingly, by examining the TKE visualization closely, we can see that the digit "1" class has two subclasses corresponding to two different styles of writing. They are properly separated; moreover, because they are all "1" from the local feature point of view, these two subclasses are very close to each other and together form the whole digit "1" class. KLE does a very good job of separating digit "1" from the others; however, the other classes overlap significantly. KPCA has clear "2" and "4" classes, but the other classes are not distinguishable. This experiment once again confirms the classification ability of TKE and the effectiveness of the tensor kernel on local features in depicting the structural relationships between images in terms of classification, recognition and perception.
5 Conclusion
In this paper, we proposed using a tensor kernel on local features together with TKE for image manifold learning. The tensor kernel provides a homogeneous kernel solution for images described as collections of local features instead of the conventional vector representation. The most attractive advantage of this kernel is that it integrates multiple sources of information in a uniform measure framework, so that the subsequent algorithm can be applied without difficulty in theoretical interpretation. TKE shows very strong potential in classification when used in conjunction with a local-feature-focused kernel such as the tensor kernel, so it would be interesting to explore more applications of this method in other areas such as bioinformatics. One drawback of TKE which may limit its application is its high computational cost: the number of parameters to be optimized is about $O(n^2)$, where $n$ is the product of the target dimension and the number of samples. Further research on whether some efficient approximation is achievable would be very interesting.
References

1. Guo, Y., Gao, J., Kwan, P.W.: Twin kernel embedding. IEEE Transactions on Pattern Analysis and Machine Intelligence 30(8), 1490–1495 (2008)
2. Lowe, D.G.: Object recognition from local scale-invariant features. In: Proceedings of the International Conference on Computer Vision, pp. 1150–1157 (1999)
3. Torki, M., Elgammal, A.: Putting local features on a manifold. In: CVPR (2010)
4. Seung, H., Lee, D.: The manifold ways of perception. Science 290(22), 2268–2269 (2000)
5. Murase, H., Nayar, S.: Visual learning and recognition of 3D objects from appearance. International Journal of Computer Vision 14, 5–24 (1995)
6. Swain, M.J., Ballard, D.H.: Indexing via color histograms. In: Proceedings of the International Conference on Computer Vision, pp. 390–393 (1990)
7. Verma, B., Kulkarni, S.: Texture feature extraction and classification. LNCS, pp. 228–235 (2001)
8. Belongie, S., Malik, J., Puzicha, J.: Shape matching and object recognition using shape contexts. IEEE Transactions on Pattern Analysis and Machine Intelligence 24(24), 509–522 (2002)
9. Cootes, T.F., Taylor, C.J., Cooper, D.H., Graham, J.: Active shape models: Their training and application. Computer Vision and Image Understanding 61(1), 38–59 (1995)
10. Belongie, S., Malik, J., Puzicha, J.: Shape matching and object recognition using shape contexts. IEEE Trans. on Pattern Analysis and Machine Intelligence 24(4), 509–522 (2002)
11. Sudderth, E.B., Torralba, A., Freeman, W.T., Willsky, A.S.: Describing visual scenes using transformed objects and parts. International Journal of Computer Vision 77(1-3), 291–330 (2008)
12. Crammer, K., Kearns, M., Wortman, J.: Learning from multiple sources. Journal of Machine Learning Research 9, 1757–1774 (2008)
13. Cesa-Bianchi, N., Hardoon, D.R., Leen, G.: Guest editorial: Learning from multiple sources. Machine Learning 79, 1–3 (2010)
14. Hardoon, D.R., Shawe-Taylor, J.: Decomposing the tensor kernel support vector machine for neuroscience data with structured labels. Machine Learning 79, 29–46 (2010)
15. Schölkopf, B., Smola, A.: Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. The MIT Press, Cambridge (2002)
16. Evgeniou, T., Micchelli, C.A., Pontil, M.: Learning multiple tasks with kernel methods. Journal of Machine Learning Research 6, 615–637 (2005)
17. Gärtner, T., Lloyd, J.W., Flach, P.A.: Kernels for structured data. In: Proceedings of the 12th International Conference on Inductive Logic Programming (2002)
18. Nabney, I.T.: NETLAB: Algorithms for Pattern Recognition. In: Advances in Pattern Recognition. Springer, London (2004)
19. Schölkopf, B., Smola, A.J., Müller, K.: Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation 10, 1299–1319 (1998)
20. Guo, Y., Gao, J., Kwan, P.W.: Kernel Laplacian eigenmaps for visualization of non-vectorial data. In: Sattar, A., Kang, B.-h. (eds.) AI 2006. LNCS (LNAI), vol. 4304, pp. 1179–1183. Springer, Heidelberg (2006)
21. Bengio, Y., Paiement, J., Vincent, P., Delalleau, O., Roux, N.L., Ouimet, M.: Out-of-sample extensions for LLE, Isomap, MDS, eigenmaps, and spectral clustering. In: Advances in Neural Information Processing Systems, vol. 16
22. Guo, Y., Gao, J., Kwan, P.W.: Twin Kernel Embedding with back constraints. In: HPDM in ICDM (2007)
23. Cuturi, M., Fukumizu, K., Vert, J.P.: Semigroup kernels on measures. Journal of Machine Learning Research 6, 1169–1198 (2005)
24. Geiger, A., Urtasun, R., Darrell, T.: Rank priors for continuous non-linear dimensionality reduction. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 880–887 (2009)
25. Roweis, S.T., Saul, L.K.: Nonlinear dimensionality reduction by locally linear embedding. Science 290(22), 2323–2326 (2000)
Author Index
Adams, Brett I-136 Akinaga, Yoshikazu I-525 Anand, Rajul II-51 Azevedo, Paulo II-432 Bai, Kun I-500 Baquero, Carlos II-123 Bench-Capon, Trevor II-357 Berrada, Ghita II-457 Bhat, Harish S. I-399 Bhattacharyya, Dhruba Kumar I-225 Borgelt, Christian II-493 Bow, Mark I-351 Bruza, Peter I-363 Buza, Krisztian II-149 Cao, Wei II-370 Caragea, Doina II-75 Chawla, Sanjay II-345 Chen, Hui-Ling I-249 Chen, Songcan II-506 Cheng, Victor I-75 Coenen, Frans II-357 Costa, Joaquim II-432 Dai, Bi-Ru I-1 de Keijzer, Ander II-457 DeLong, Colin II-519 Deng, Zhi-Hong II-482 De Raedt, Luc II-382 de Sá, Cláudio Rebelo II-432 Desai, Aditya II-469 De Smet, Wim I-549 Di, Nan I-537 Ding, Chris I-148 Ding, Zhiming I-375 Dobbie, Gillian I-387 Du, Jun II-395 Du, Xiaoyong II-407 Erickson, Kendrick II-519 Etoh, Minoru I-525 Faloutsos, Christos II-13 Fan, Jianping II-87
Fan, Xiannian II-309 Fang, Gang I-338 Fujimoto, Hiroshi I-525 Fujiwara, Yasuhiro II-38 Fung, Pui Cheong Gabriel I-26
Gallagher, Marcus II-135 Gao, Jun II-270 Gao, Junbin II-544 Gao, Liangcai I-500 Gao, Ning II-482 Garg, Dinesh I-13 Gong, Shaogang II-296 Greiner, Russell I-124 G¨ unnemann, Stephan II-444 Guns, Tias II-382 Guo, Yi II-544 Guo, Yuanyuan I-100 Gupta, Sunil Kumar I-136 He, Dan II-532 He, Dongxiao II-123 He, Jiangfeng II-258 He, Jing I-375 He, Jun II-407 He, Qinming II-258 He, Xian II-420 Hirose, Shuichi II-26 Hospedales, Timothy M. II-296 Hsu, Shu-Ming I-1 Hu, Weiming II-270 Hu, Xuegang I-313 Huang, Bingquan I-411 Huang, Guangyan I-375 Huang, Hao II-258 Huang, Heng I-148 Huang, Houkuan I-38 Huang, Joshua I-171 Huang, Xuanjing I-50 Huang, Ying I-411 Huynh, Dat I-476 Inge, Meador II-198 Ivanovi´c, Mirjana I-183 Iwai, Hiroki II-185
Jia, Peifa I-448 Jiang, Jia-Jian II-482 Jin, Di II-123 Jing, Liping I-38, I-171 Jorge, Alípio Mário II-432
Luo, Chao II-370 Luo, Dan II-370 Luo, Dijun I-148 Luo, Jun II-87 Luo, Wei II-135
Kang, U II-13 Kantarcioglu, Murat II-198 Kasabov, Nikola II-161 Kashima, Hisashi I-62, II-222 Kechadi, M.-T. I-411 Khoshgoftaar, Taghi M. I-124 Kimura, Daisuke I-62 Kinno, Akira I-525 Kitsuregawa, Masaru II-38 Koh, Yun Sing I-387 Kremer, Hardy II-444 Kuboyama, Tetsuji I-62 Kudo, Mineichi II-234 Kumar, Vipin I-338 Kutty, Sangeetha I-488
Ma, Lianhang II-258 Ma, Wanli I-476, II-246 Makris, Dimitrios II-173 Mao, Hua II-420 Mao, Xianling I-537 Marukatat, Sanparith I-160 Masada, Tomonari I-435 Mayers, Andr´e I-265 Meeder, Brendan II-13 Mladeni´c, Dunja I-183 Moens, Marie-Francine I-549 Monga, Ernest I-265 Morstatter, Fred I-26 Muzammal, Muhammad II-210
Lau, Raymond Y.K. I-363 Laufk¨ otter, Charlotte II-444 Le, Huong Thanh I-512 Le, Trung II-246 Lewandowski, Michal II-173 Li, Chao II-87 Li, Chun-Hung I-75, I-460 Li, Jhao-Yin II-111 Li, Lian II-63 Li, Nan I-423 Li, Pei II-407 Li, Peipei I-313 Li, Xiaoming I-537 Li, Yuefeng I-363, I-488 Li, Yuxuan II-321 Li, Zhaonan II-506 Liang, Qianhui I-313 Ling, Charles X. II-395 Liu, Bing I-448 Liu, Da-You I-249 Liu, Dayou II-123 Liu, Hongyan II-407 Liu, Huan I-26 Liu, Jie I-249 Liu, Wei II-345 Liu, Xiaobing I-537 Liu, Ying I-500 Lu, Aidong II-1
Nakagawa, Hiroshi I-87 Nakamura, Atsuyoshi II-234 Nanopoulos, Alexandros II-149 Napolitano, Amri I-124 Nayak, Richi I-488, II-99 Nebel, Jean-Christophe II-173 Nguyen, Thien Huu I-512 Nguyen, Thuy Thanh I-512 Nijssen, Siegfried II-382 Oguri, Kiyoshi I-435 Okanohara, Daisuke II-26 Onizuka, Makoto II-38 Pan, Junfeng I-289 Parimi, Rohit II-75 Parker, D.S. II-532 Pathak, Nishith II-519 Pears, Russel I-387, II-161 Perrino, Eric II-519 Phung, Dinh I-136 Pudi, Vikram II-469 Qing, Xiangyun I-301 Qiu, Xipeng I-50 Qu, Guangzhi I-209 Radovanovi´c, Miloˇs I-183 Raman, Rajeev II-210 Reddy, Chandan K. II-51 Ru, Liyun II-506
Sam, Rathany Chan I-512 Sarmah, Rosy Das I-225 Sarmah, Sauravjyoti I-225 Sato, Issei I-87 Schmidt-Thieme, Lars II-149 Segond, Marc II-493 Seidl, Thomas II-444 Sharma, Dharmendra I-476, II-246 Shevade, Shirish I-13 Shibata, Yuichiro I-435 Shibuya, Tetsuo I-62 Shim, Kyong II-519 Singh, Himanshu II-469 Sinthupinyo, Wasin I-160 Soares, Carlos II-432 Spencer, Bruce I-100 Srivastava, Jaideep II-519 Steinbach, Michael I-338 Su, Xiaoyuan I-124 Sun, Xu II-222 Sundararajan, Sellamanickam I-13 Tabei, Yasuo II-26 Takamatsu, Shingo I-87 Takasu, Atsuhiro I-435 Tang, Jie I-549, II-506 Tang, Ke II-309 Tomašev, Nenad I-183 Tomioka, Ryota II-185, II-222 Tran, Dat I-476, II-246 Tsai, Flora S. II-284 Tsuda, Koji II-26 Ueda, Naonori II-222 Urabe, Yasuhiro II-185 Venkatesh, Svetha I-136
Wan, Xiaojun I-326 Wang, Baijie I-196 Wang, Bin I-100 Wang, Bo II-506 Wang, Gang I-249 Wang, Shengrui I-265 Wang, Su-Jing I-249 Wang, Xin I-196 Wang, Xingyu I-301 Wang, Yang I-289 Wardeh, Maya II-357 Weise, Thomas II-309
Widiputra, Harya II-161 Wu, Hui I-209 Wu, Jianxin I-112 Wu, Leting II-1 Wu, Ou II-270 Wu, Xindong I-313 Wu, Xintao II-1 Wyner, Adam II-357 Xiang, Tao II-296 Xiang, Yanping II-420 Xiong, Tengke I-265 Xu, Hua I-448 Xu, Yue I-363 Xue, Gui-Rong I-289 Yamanishi, Kenji II-185 Yan, Hongfei I-537 Yang, Bo I-249, II-123 Yang, Jing II-63 Yang, Pengyi II-333 Yang, Weidong I-423 Yeh, Mi-Yen II-111 Yin, Jianping I-237 Ying, Xiaowei II-1 Yoo, Jin Soung I-351 Yu, Hang II-482 Yu, Haoyu I-338 Yu, Jeffrey Xu II-407 Yu, Jian I-38, I-171 Yu, Yong I-289 Yun, Jiali I-38, I-171 Zaelit, Daniel I-399 Zeng, Yifeng II-420 Zhai, Zhongwu I-448 Zhan, Tian-Jie I-460 Zhan, Yubin I-237 Zhang, Chengqi II-370 Zhang, Harry I-100 Zhang, Kuo II-506 Zhang, Xiaoqin II-270 Zhang, Xiuzhen II-321 Zhang, Yanchun I-375 Zhang, Yi II-284 Zhang, Yuhong I-313 Zhang, Zhongfei (Mark) II-270 Zhang, Zili II-333 Zhao, Yanchang II-370 Zhao, Zhongying II-87
Zhou, Bing B. II-333 Zhou, Jinlong I-50 Zhou, Xujuan I-363 Zhou, Yan II-198 Zhou, Zhi-Hua II-1
Zhu, Guansheng I-423 Zhu, Hao I-423 Zhu, Xingquan I-209 Žliobaitė, Indrė I-277 Zomaya, Albert Y. II-333