Lecture Notes in Artificial Intelligence Subseries of Lecture Notes in Computer Science Edited by J. G. Carbonell and J. Siekmann
Lecture Notes in Computer Science Edited by G. Goos, J. Hartmanis, and J. van Leeuwen
2430
Tapio Elomaa Heikki Mannila Hannu Toivonen (Eds.)
Machine Learning: ECML 2002 13th European Conference on Machine Learning Helsinki, Finland, August 19-23, 2002 Proceedings
Series Editors
Jaime G. Carbonell, Carnegie Mellon University, Pittsburgh, PA, USA
Jörg Siekmann, University of Saarland, Saarbrücken, Germany
Volume Editors
Tapio Elomaa, Heikki Mannila, Hannu Toivonen
University of Helsinki, Department of Computer Science
P.O. Box 26, 00014 Helsinki, Finland
E-mail: {elomaa, heikki.mannila, hannu.toivonen}@cs.helsinki.fi
Cataloging-in-Publication Data applied for Die Deutsche Bibliothek - CIP-Einheitsaufnahme Machine learning : proceedings / ECML 2002, 13th European Conference on Machine Learning, Helsinki, Finland, August 19 - 23, 2002. Tapio Elomaa ... (ed.). - Berlin ; Heidelberg ; New York ; Barcelona ; Hong Kong ; London ; Milan ; Paris ; Tokyo : Springer, 2002 (Lecture notes in computer science ; Vol. 2430 : Lecture notes in artificial intelligence) ISBN 3-540-44036-4
CR Subject Classification (1998): I.2, F.2.2, F.4.1
ISSN 0302-9743 ISBN 3-540-44036-4 Springer-Verlag Berlin Heidelberg New York This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer-Verlag. Violations are liable for prosecution under the German Copyright Law. Springer-Verlag Berlin Heidelberg New York, a member of BertelsmannSpringer Science+Business Media GmbH http://www.springer.de © Springer-Verlag Berlin Heidelberg 2002 Printed in Germany Typesetting: Camera-ready by author, data conversion by DA-TeX Gerd Blumenstein Printed on acid-free paper SPIN: 10873900 06/3142 543210
Preface
We are pleased to present the proceedings of the 13th European Conference on Machine Learning (LNAI 2430) and the 6th European Conference on Principles and Practice of Knowledge Discovery in Databases (LNAI 2431). These two conferences were colocated in Helsinki, Finland during August 19–23, 2002. ECML and PKDD were held together for the second year in a row, following the success of the colocation in Freiburg in 2001. Machine learning and knowledge discovery are two highly related fields and ECML/PKDD is a unique forum to foster their collaboration.

The benefit of colocation to both the machine learning and data mining communities is most clearly displayed in the common workshop, tutorial, and invited speaker program. Altogether six workshops and six tutorials were organized on Monday and Tuesday. As invited speakers we had the pleasure to have Erkki Oja (Helsinki Univ. of Technology), Dan Roth (Univ. of Illinois, Urbana-Champaign), Bernhard Schölkopf (Max Planck Inst. for Biological Cybernetics, Tübingen), and Padhraic Smyth (Univ. of California, Irvine).

The main events ran from Tuesday until Friday, comprising 41 ECML technical papers and 39 PKDD papers. In total, 218 manuscripts were submitted to these two conferences: 95 to ECML, 70 to PKDD, and 53 as joint submissions. All papers were assigned at least three reviewers from our international program committees. Out of the 80 accepted papers, 31 were first accepted conditionally; the revised manuscripts were accepted only after the conditions set by the reviewers had been met.

Our special thanks go to the tutorial chairs Johannes Fürnkranz and Myra Spiliopoulou and the workshop chairs Hendrik Blockeel and Jean-François Boulicaut for putting together an exciting combined tutorial and workshop program. Also the challenge chair Petr Berka deserves our sincerest gratitude. All the members of both program committees are thanked for devoting their expertise to the continued success of ECML and PKDD. The organizing committee chaired by Helena Ahonen-Myka worked hard to make the conferences possible. A special mention has to be given to Oskari Heinonen for designing and maintaining the web pages and Ilkka Koskenniemi for maintaining CyberChair, which was developed by Richard van de Stadt. We thank Alfred Hofmann of Springer-Verlag for cooperation in publishing these proceedings. We gratefully acknowledge the financial support of the Academy of Finland and KDNet.

We thank all the authors for contributing to what in our mind is a most interesting technical program for ECML and PKDD. We trust that the week in late August was most enjoyable for all members of both research communities.

June 2002
Tapio Elomaa Heikki Mannila Hannu Toivonen
ECML/PKDD-2002 Organization
Executive Committee

Program Chairs: Tapio Elomaa (Univ. of Helsinki), Heikki Mannila (Helsinki Inst. for Information Technology and Helsinki Univ. of Technology), and Hannu Toivonen (Nokia Research Center and Univ. of Helsinki)
Tutorial Chairs: Johannes Fürnkranz (Austrian Research Inst. for Artificial Intelligence) and Myra Spiliopoulou (Leipzig Graduate School of Management)
Workshop Chairs: Hendrik Blockeel (Katholieke Universiteit Leuven) and Jean-François Boulicaut (INSA Lyon)
Challenge Chair: Petr Berka (University of Economics, Prague)
Organizing Chair: Helena Ahonen-Myka (Univ. of Helsinki)
Organizing Committee: Oskari Heinonen, Ilkka Koskenniemi, Greger Lindén, Pirjo Moen, Matti Nykänen, Anna Pienimäki, Ari Rantanen, Juho Rousu, Marko Salmenkivi (Univ. of Helsinki)
ECML Program Committee H. Blockeel, Belgium I. Bratko, Slovenia P. Brazdil, Portugal H. Boström, Sweden W. Burgard, Germany N. Cristianini, USA J. Cussens, UK L. De Raedt, Germany M. Dorigo, Belgium S. Džeroski, Slovenia F. Esposito, Italy P. Flach, UK J. Fürnkranz, Austria J. Gama, Portugal J.-G. Ganascia, France T. Hofmann, USA L. Holmström, Finland
A. Hyvärinen, Finland T. Joachims, USA Y. Kodratoff, France I. Kononenko, Slovenia S. Kramer, Germany M. Kubat, USA N. Lavrač, Slovenia C. X. Ling, Canada R. López de Màntaras, Spain D. Malerba, Italy S. Matwin, Canada R. Meir, Israel J. del R. Millán, Switzerland K. Morik, Germany H. Motoda, Japan R. Nock, France E. Plaza, Spain
G. Paliouras, Greece J. Rousu, Finland L. Saitta, Italy T. Scheffer, Germany M. Sebag, France J. Shawe-Taylor, UK A. Siebes, The Netherlands D. Sleeman, UK M. van Someren, The Netherlands P. Stone, USA
H. Tirri, Finland P. Turney, Canada R. Vilalta, USA P. Vitányi, The Netherlands S. Weiss, USA G. Widmer, Austria R. Wirth, Germany S. Wrobel, Germany Y. Yang, USA
PKDD Program Committee H. Ahonen-Myka, Finland E. Baralis, Italy J.-F. Boulicaut, France N. Cercone, Canada B. Crémilleux, France L. De Raedt, Germany L. Dehaspe, Belgium S. Džeroski, Slovenia M. Ester, Canada R. Feldman, Israel P. Flach, UK E. Frank, New Zealand A. Freitas, Brazil J. Fürnkranz, Austria H.J. Hamilton, Canada J. Han, Canada R. Hilderman, Canada S.J. Hong, USA S. Kaski, Finland D. Keim, USA J.-U. Kietz, Switzerland R. King, UK M. Klemettinen, Finland W. Klösgen, Germany Y. Kodratoff, France J.N. Kok, The Netherlands S. Kramer, Germany S. Matwin, Canada
S. Morishita, Japan H. Motoda, Japan G. Nakhaeizadeh, Germany Z.W. Raś, USA J. Rauch, Czech Republic G. Ritschard, Switzerland M. Sebag, France F. Sebastiani, Italy M. Sebban, France B. Seeger, Germany A. Siebes, The Netherlands A. Skowron, Poland M. van Someren, The Netherlands M. Spiliopoulou, Germany N. Spyratos, France E. Suzuki, Japan A.-H. Tan, Singapore S. Tsumoto, Japan A. Unwin, Germany J. Wang, USA K. Wang, Canada L. Wehenkel, Belgium D. Wettschereck, Germany G. Widmer, Austria R. Wirth, Germany S. Wrobel, Germany M. Zaki, USA
Additional Reviewers
N. Abe F. Aiolli Y. Altun S. de Amo A. Appice E. Armengol T.G. Ault J. Azé M.T. Basile A. Bonarini R. Bouckaert P. Brockhausen M. Brodie W. Buntine J. Carbonell M. Ceci S. Chikkanna-Naik S. Chiusano R. Cicchetti A. Clare M. Degemmis J. Demsar F. De Rosis N. Di Mauro G. Dorffner G. Dounias N. Durand P. Erästö T. Erjavec J. Farrand S. Ferilli P. Floréen J. Franke T. Gaertner P. Gallinari P. Garza A. Giacometti
S. Haustein J. He K.G. Herbert J. Himberg J. Hipp S. Hoche J. Hosking E. Hüllermeier P. Juvan M. Kääriäinen D. Kalles V. Karkaletsis A. Karwath K. Kersting J. Kindermann R. Klinkenberg P. Koistinen C. Köpf R. Kosala W. Kosters M.-A. Krogel M. Kukar L. Lakhal G. Lebanon S.D. Lee F. Li J.T. Lindgren J. Liu Y. Liu M.-C. Ludl S. Mannor R. Meo N. Meuleau H. Mogg-Schneider R. Natarajan S. Nijssen G. Paaß
L. Peña Y. Peng J. Petrak V. Phan Luong K. Rajaraman T. Reinartz I. Renz C. Rigotti F. Rioult M. Robnik-Šikonja M. Roche B. Rosenfeld S. Rüping M. Salmenkivi A.K. Seewald H. Shan J. Sinkkonen J. Struyf R. Taouil J. Taylor L. Todorovski T. Urbancic K. Vasko H. Wang Y. Wang M. Wiering S. Wu M.M. Yin F. Zambetta B. Ženko J. Zhang S. Zhang T. Zhang M. Zlochin B. Zupan
Tutorials

Text Mining and Internet Content Filtering
José María Gómez Hidalgo

Formal Concept Analysis
Gerd Stumme

Web Usage Mining for E-business Applications
Myra Spiliopoulou, Bamshad Mobasher, and Bettina Berendt

Inductive Databases and Constraint-Based Mining
Jean-François Boulicaut and Luc De Raedt

An Introduction to Quality Assessment in Data Mining
Michalis Vazirgiannis and M. Halkidi

Privacy, Security, and Data Mining
Chris Clifton
Workshops

Integration and Collaboration Aspects of Data Mining, Decision Support and Meta-Learning
Marko Bohanec, Dunja Mladenić, and Nada Lavrač

Visual Data Mining
Simeon J. Simoff, Monique Noirhomme-Fraiture, and Michael H. Böhlen

Semantic Web Mining
Bettina Berendt, Andreas Hotho, and Gerd Stumme

Mining Official Data
Paula Brito and Donato Malerba

Knowledge Discovery in Inductive Databases
Mika Klemettinen, Rosa Meo, Fosca Giannotti, and Luc De Raedt

Discovery Challenge Workshop
Petr Berka, Jan Rauch, and Shusaku Tsumoto
Table of Contents
Contributed Papers

Convergent Gradient Ascent in General-Sum Games . . . 1
Bikramjit Banerjee and Jing Peng
Revising Engineering Models: Combining Computational Discovery with Knowledge . . . 10
Stephen D. Bay, Daniel G. Shapiro, and Pat Langley
Variational Extensions to EM and Multinomial PCA . . . 23
Wray Buntine
Learning and Inference for Clause Identification . . . 35
Xavier Carreras, Lluís Màrquez, Vasin Punyakanok, and Dan Roth
An Empirical Study of Encoding Schemes and Search Strategies in Discovering Causal Networks . . . 48
Honghua Dai, Gang Li, and Yiqing Tu
Variance Optimized Bagging . . . 60
Philip Derbeko, Ran El-Yaniv, and Ron Meir
How to Make AdaBoost.M1 Work for Weak Base Classifiers by Changing Only One Line of the Code . . . 72
Günther Eibl and Karl Peter Pfeiffer
Sparse Online Greedy Support Vector Regression . . . 84
Yaakov Engel, Shie Mannor, and Ron Meir
Pairwise Classification as an Ensemble Technique . . . 97
Johannes Fürnkranz
RIONA: A Classifier Combining Rule Induction and k-NN Method with Automated Selection of Optimal Neighbourhood . . . 111
Grzegorz Góra and Arkadiusz Wojna
Using Hard Classifiers to Estimate Conditional Class Probabilities . . . 124
Ole Martin Halck
Evidence that Incremental Delta-Bar-Delta Is an Attribute-Efficient Linear Learner . . . 135
Harlan D. Harris
Scaling Boosting by Margin-Based Inclusion of Features and Relations . . . 148
Susanne Hoche and Stefan Wrobel
Multiclass Alternating Decision Trees . . . 161
Geoffrey Holmes, Bernhard Pfahringer, Richard Kirkby, Eibe Frank, and Mark Hall
Possibilistic Induction in Decision-Tree Learning . . . 173
Eyke Hüllermeier
Improved Smoothing for Probabilistic Suffix Trees Seen as Variable Order Markov Chains . . . 185
Christopher Kermorvant and Pierre Dupont
Collaborative Learning of Term-Based Concepts for Automatic Query Expansion . . . 195
Stefan Klink, Armin Hust, Markus Junker, and Andreas Dengel
Learning to Play a Highly Complex Game from Human Expert Games . . . 207
Tony Kråkenes and Ole Martin Halck
Reliable Classifications with Machine Learning . . . 219
Matjaž Kukar and Igor Kononenko
Robustness Analyses of Instance-Based Collaborative Recommendation . . . 232
Nicholas Kushmerick
iBoost: Boosting Using an Instance-Based Exponential Weighting Scheme . . . 245
Stephen Kwek and Chau Nguyen
Towards a Simple Clustering Criterion Based on Minimum Length Encoding . . . 258
Marcus-Christopher Ludl and Gerhard Widmer
Class Probability Estimation and Cost-Sensitive Classification Decisions . . . 270
Dragos D. Margineantu
On-Line Support Vector Machine Regression . . . 282
Mario Martin
Q-Cut – Dynamic Discovery of Sub-goals in Reinforcement Learning . . . 295
Ishai Menache, Shie Mannor, and Nahum Shimkin
A Multistrategy Approach to the Classification of Phases in Business Cycles . . . 307
Katharina Morik and Stefan Rüping
A Robust Boosting Algorithm . . . 319
Richard Nock and Patrice Lefaucheur
Case Exchange Strategies in Multiagent Learning . . . 331
Santiago Ontañón and Enric Plaza
Inductive Confidence Machines for Regression . . . 345
Harris Papadopoulos, Kostas Proedrou, Volodya Vovk, and Alex Gammerman
Macro-Operators in Multirelational Learning: A Search-Space Reduction Technique . . . 357
Lourdes Peña Castillo and Stefan Wrobel
Propagation of Q-values in Tabular TD(λ) . . . 369
Philippe Preux
Transductive Confidence Machines for Pattern Recognition . . . 381
Kostas Proedrou, Ilia Nouretdinov, Volodya Vovk, and Alex Gammerman
Characterizing Markov Decision Processes . . . 391
Bohdana Ratitch and Doina Precup
Phase Transitions and Stochastic Local Search in k-Term DNF Learning . . . 405
Ulrich Rückert, Stefan Kramer, and Luc De Raedt
Discriminative Clustering: Optimal Contingency Tables by Learning Metrics . . . 418
Janne Sinkkonen, Samuel Kaski, and Janne Nikkilä
Boosting Density Function Estimators . . . 431
Franck Thollard, Marc Sebban, and Philippe Ezequel
Ranking with Predictive Clustering Trees . . . 444
Ljupčo Todorovski, Hendrik Blockeel, and Sašo Džeroski
Support Vector Machines for Polycategorical Classification . . . 456
Ioannis Tsochantaridis and Thomas Hofmann
Learning Classification with Both Labeled and Unlabeled Data . . . 468
Jean-Noël Vittaut, Massih-Reza Amini, and Patrick Gallinari
An Information Geometric Perspective on Active Learning . . . 480
Chen-Hsiang Yeang
Stacking with an Extended Set of Meta-level Attributes and MLR . . . 493
Bernard Ženko and Sašo Džeroski
Invited Papers

Finding Hidden Factors Using Independent Component Analysis . . . 505
Erkki Oja
Reasoning with Classifiers . . . 506
Dan Roth
A Kernel Approach for Learning from Almost Orthogonal Patterns . . . 511
Bernhard Schölkopf, Jason Weston, Eleazar Eskin, Christina Leslie, and William Stafford Noble
Learning with Mixture Models: Concepts and Applications . . . 529
Padhraic Smyth

Author Index . . . 531
Convergent Gradient Ascent in General-Sum Games Bikramjit Banerjee and Jing Peng Electrical Engineering and Computer Science Department, Tulane University New Orleans, LA 70118, USA {banerjee,jp}@eecs.tulane.edu http://www.eecs.tulane.edu/Peng
Abstract. In this work we look at the recent results in policy gradient learning in a general-sum game scenario, in the form of two algorithms, IGA and WoLF-IGA. We address the drawbacks in convergence properties of these algorithms, and propose a more accurate version of WoLF-IGA that is guaranteed to converge to Nash Equilibrium policies in self-play (or against an IGA learner). We also present a control theoretic interpretation of variable learning rate which not only justifies WoLF-IGA, but also shows it to achieve fastest convergence under some constraints. Finally we derive optimal learning rates for fastest convergence in practical simulations.
1 Introduction
Game theory has been a driving impetus for modeling concurrent reinforcement learning problems. With booming e-commerce (notwithstanding the recent slump), the day is not far when automated buyers and sellers will control the electronic economy. There are other potential applications like disaster relief by robots (in potentially hazardous environments, especially after September 11), automated and robotic control of applications (ranging from households to Mars exploration), etc., where coordination among multiple agents will hold the key. This makes the focus on multiagent learning research extremely timely and justified. Several algorithms for multiagent learning have been proposed [5,4,3,1], mostly guaranteed to converge to equilibrium in the limit. Bowling and Veloso note in [2] that none of these methods simultaneously satisfies rationality and convergence, two of the desirable criteria for any multiagent learning algorithm. A recent work [8] demonstrated that policy gradient ascent (which they called “Infinitesimal Gradient Ascent” or IGA) could achieve convergence to either Nash Equilibrium policies or Nash Equilibrium payoffs (when the policies don’t converge) in self-play. This algorithm was rational [2] but not convergent to Equilibrium policies in all general-sum games. Subsequently, it was modified with a variable learning rate [2], but the resulting algorithm (WoLF-IGA) might still not converge in some general-sum games. The rationale behind WoLF (Win or
Learn Fast) was to allow the opponent to adapt to the learner’s policy by learning slowly (i.e., changing its policy slowly) when the learner is “winning”, but learn fast when it is not “winning”. However, they [2] used an approximate criterion to determine when the learner was “winning”, and consequently could fail to converge in some general-sum games. The motivation of the present work is to fill this gap to produce the first multiagent learning algorithm that is both rational and universally convergent. We then proceed to analyse why learning at a variable rate is essential for convergence in policy gradient learning, and show that WoLF-IGA is the best we can do under some constraints. Finally, we address the question of how the learning rates should be chosen to achieve the fastest convergence.
2 Definitions
Here we provide definitions of key concepts for our work. We refer to Ai as the set of possible actions available to the ith agent.

Definition 1. A bimatrix game is given by a pair of matrices (M1, M2), each of size |A1| × |A2| for a two-agent game, where the payoff of the kth agent for the joint action (a1, a2) is given by the entry Mk(a1, a2), ∀(a1, a2) ∈ A1 × A2, k = 1, 2.

A constant-sum game (also called a competitive game) is a special bimatrix game where M1(a1, a2) + M2(a1, a2) = c, ∀(a1, a2) ∈ A1 × A2, where c is a constant. If c = 0, then it is also called a zero-sum game.

Definition 2. A mixed-strategy Nash Equilibrium for a bimatrix game (M1, M2) is a pair of probability vectors (π1*, π2*) such that

π1*ᵀ M1 π2* ≥ π1ᵀ M1 π2*   ∀π1 ∈ PD(A1),
π1*ᵀ M2 π2* ≥ π1*ᵀ M2 π2   ∀π2 ∈ PD(A2),

where PD(Ai) is the set of probability distributions over the ith agent’s action space. No player in this game has any incentive for unilateral deviation from its Nash equilibrium strategy, given the other’s strategy. There always exists at least one such equilibrium profile for an arbitrary finite bimatrix game [6].
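The condition in Definition 2 is easy to check numerically for small games. The sketch below is ours (the helper name is arbitrary, not from the paper); it compares a candidate strategy pair against all pure-strategy deviations, which is sufficient because the expected payoff is linear in each player’s own strategy. The example game is the one used as a counterexample later in Section 5.

```python
import numpy as np

def is_nash(M1, M2, p1, p2, tol=1e-9):
    """Check the mixed-strategy Nash condition of Definition 2 for (p1, p2)."""
    v1 = p1 @ M1 @ p2            # row player's expected payoff
    v2 = p1 @ M2 @ p2            # column player's expected payoff
    best1 = (M1 @ p2).max()      # best pure-strategy deviation for the row player
    best2 = (p1 @ M2).max()      # best pure-strategy deviation for the column player
    return v1 >= best1 - tol and v2 >= best2 - tol

# Game from Section 5; its only Nash equilibrium is (0.5, 0.5).
R = np.array([[0.0, 3.0], [1.0, 2.0]])
C = np.array([[3.0, 2.0], [0.0, 1.0]])
print(is_nash(R, C, np.array([0.5, 0.5]), np.array([0.5, 0.5])))   # True
```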
3 Policy Gradient Ascent in Bimatrix Games
The basic idea of such an algorithm is to iteratively update an agent’s strategy based on the consequent improvement in the agent’s expected payoff. When both agents fail to improve their strategies any further (which may never happen), they must have converged to some Nash Equilibrium of the game [8]. The
simplified domain for studying this problem is a two-agent, two-action scenario, with the following payoff matrices

R = [ r11  r12 ; r21  r22 ]   and   C = [ c11  c12 ; c21  c22 ].

Let α and β denote the probabilities of the two agents selecting the first actions from their respective sets of available actions. Then the expected payoffs of the two agents (Vr for the row agent, and Vc for the column agent) are given by

Vr(α, β) = r11(αβ) + r22((1 − α)(1 − β)) + r12(α(1 − β)) + r21((1 − α)β)
Vc(α, β) = c11(αβ) + c22((1 − α)(1 − β)) + c12(α(1 − β)) + c21((1 − α)β)

Then given a strategy pair (α, β) (constrained to lie in the unit square), and letting u = (r11 + r22) − (r21 + r12) and u′ = (c11 + c22) − (c12 + c21), the gradients are given by

∂Vr(αk, βk)/∂α = βk u − (r22 − r12)                          (1)
∂Vc(αk, βk)/∂β = αk u′ − (c22 − c21)                         (2)

and the strategy pair can be updated as

αk+1 = αk + η ∂Vr(αk, βk)/∂α,    βk+1 = βk + η ∂Vc(αk, βk)/∂β        (3)
The new gradients generated by the above rules are constrained to lie in the unit square by taking their projections on the boundary whenever they cross out. For η → 0, the algorithm is called infinitesimal gradient ascent, or IGA. It is known from game theory [7] that the algorithm may never converge in (α, β), but the expected payoffs have been proved to always converge to those of some Nash pair [8].
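A minimal simulation of the update rule (3) can make the projected dynamics concrete. The following sketch is our illustration (the step size η and the iteration count are arbitrary choices, not values from the paper); clipping each strategy to [0, 1] plays the role of the projection onto the unit square.

```python
import numpy as np

def iga(R, C, alpha, beta, eta=1e-3, steps=20000):
    """Gradient ascent on Vr, Vc of a 2x2 bimatrix game using (1)-(3)."""
    u  = (R[0, 0] + R[1, 1]) - (R[1, 0] + R[0, 1])
    u_ = (C[0, 0] + C[1, 1]) - (C[0, 1] + C[1, 0])
    for _ in range(steps):
        g_a = beta * u - (R[1, 1] - R[0, 1])      # gradient (1)
        g_b = alpha * u_ - (C[1, 1] - C[1, 0])    # gradient (2)
        alpha = np.clip(alpha + eta * g_a, 0.0, 1.0)   # update (3) with projection
        beta  = np.clip(beta  + eta * g_b, 0.0, 1.0)
    return alpha, beta
```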
4 WoLF-IGA
Using equations 1, 2, 3 and η → 0, we get the unconstrained dynamics of the strategy pair given by the differential equations

[ ∂α/∂t ]   [ 0   u ] [ α ]   [ −(r22 − r12) ]
[ ∂β/∂t ] = [ u′  0 ] [ β ] + [ −(c22 − c21) ]          (4)

It has been proved [8] that the points of zero gradients (projected) are Nash Equilibria. However, the algorithm may not converge to such a point in case the matrix U = [ 0  u ; u′  0 ] has imaginary eigenvalues and the center (point of zero
gradient) lies within the unit square. Consequently, the algorithm was modified with a variable learning rate [2] to converge to a Nash pair even in this remaining subcase. The notion of variable learning rate changes the update rules in 3 to

αk+1 = αk + η lkr ∂Vr(αk, βk)/∂α,    βk+1 = βk + η lkc ∂Vc(αk, βk)/∂β        (5)

where lkr, lkc ∈ {lmin, lmax}. Since the proof of convergence of IGA for the other subcases depends only on the sign of the gradients, the above learning rules have the same convergence properties as long as lmin, lmax > 0. Moreover, for lmin < lmax, the algorithm can be made to converge to a Nash pair in the remaining subcase [2], by choosing

lr(t) = { lmin  when Vr(αt, βt) ≥ Vr(αe, βt);  lmax  otherwise }
lc(t) = { lmin  when Vc(αt, βt) ≥ Vc(αt, βe);  lmax  otherwise }          (6)

where (αe, βe) is some Nash pair. The unconstrained dynamics of the system now follows the differential equations

[ ∂α/∂t ]   [ 0        lr(t)u ] [ α ]   [ −lr(t)(r22 − r12) ]
[ ∂β/∂t ] = [ lc(t)u′  0      ] [ β ] + [ −lc(t)(c22 − c21) ]          (7)

The algorithm given by equations 5, 6 is called WoLF-IGA. When Vr(αt, βt) ≥ Vr(αe, βt), the row agent is said to be winning, in the sense that it would prefer its current strategy to its Nash strategy against the opponent’s current strategy. However, in order to find out whether an agent is winning, Bowling et al. [2] used an approximate criterion for winning, defined by Vr(αt, βt) ≥ Vr(αe, βe), since Vr(αe, βt) cannot be directly computed without knowing αe. As they note, this criterion fails to guarantee convergence in some general-sum games. In the next section we derive an alternative criterion that can be easily computed and guarantees convergence.
5 Accurate Criterion for Winning
We are concerned with the subcase where U has purely imaginary eigenvalues and the center is within the unit square. The differential equation 4 has the following solution for α(t) for the unconstrained dynamics [8]:

α(t) = B √u cos(√(uu′) t + φ) + αe
where B and φ depend on the initial α, β. This also describes the constrained motion of the row agent when the strategies have come down to an ellipse fully contained within the unit square. We note that

∂²α(t)/∂t² = −|uu′| (α − αe)

From [2],

Vr(αt, βt) − Vr(αe, βt) = (αt − αe) ∂Vr(αt, βt)/∂α = −(1/|uu′|) (∂²α(t)/∂t²) (∂Vr(αt, βt)/∂α)

Hence we have Vr(αt, βt) ≥ Vr(αe, βt) when (∂²α(t)/∂t²)(∂Vr(αt, βt)/∂α) < 0. Thus for the iterative updates, if ∆k = αk − αk−1 and ∆²k = ∆k − ∆k−1, then the row agent is winning if ∆k ∆²k < 0. This is not only easy to compute, but because it accurately estimates Vr(αt, βt) − Vr(αe, βt), the new method is now guaranteed to converge in all general-sum games. There is a similar criterion of winning for the column agent. We note that the same criterion can be extended to the system in equation 7, since ∂lr(t)/∂t = 0 and lr(t)lc(t) = constant within each quadrant, as is evident from figure 1. Figure 2 shows more clearly how this criterion works in this case.

We demonstrate the working of our criterion on the bimatrix game

R = [ 0  3 ; 1  2 ],    C = [ 3  2 ; 0  1 ].

This game is directly adopted from [2], where they used it as a counterexample for their approximate criterion. The only Nash Equilibrium is (0.5, 0.5), and the starting point is (α, β) = (0.1, 0.9). Figure 3 shows our algorithm converging to the equilibrium while theirs fails. The experiment was run for 10,000 iterations, and the accurate version converged in 1369 iterations. The choices for the various parameters in both cases were: lmax = 1.0, lmin = 0.08, and the required precision¹ was ε = 0.01. We now turn to looking at learning with a variable rate from a control theory perspective.

¹ I.e., the stopping condition is |∂Vr(α, β)/∂α| < ε and |∂Vc(α, β)/∂β| < ε.
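The criterion is straightforward to implement. The sketch below is ours, not the authors’ code; it uses the game and the parameter values lmax = 1.0, lmin = 0.08, ε = 0.01 quoted above, and chooses each agent’s learning rate for the next step from the sign of the most recent ∆k∆²k (a one-step-lagged reading of the rule in the text).

```python
import numpy as np

def wolf_iga_accurate(R, C, alpha, beta, eta=1e-3, l_min=0.08, l_max=1.0,
                      steps=100_000, eps=0.01):
    """WoLF-IGA where 'winning' is detected from Delta_k * Delta2_k < 0."""
    u  = (R[0, 0] + R[1, 1]) - (R[1, 0] + R[0, 1])
    u_ = (C[0, 0] + C[1, 1]) - (C[0, 1] + C[1, 0])
    da_prev = db_prev = 0.0            # previous Delta_k for each agent
    l_r = l_c = l_max                  # learn fast until winning is detected
    for _ in range(steps):
        g_a = beta * u - (R[1, 1] - R[0, 1])      # gradient (1)
        g_b = alpha * u_ - (C[1, 1] - C[1, 0])    # gradient (2)
        if abs(g_a) < eps and abs(g_b) < eps:     # precision reached (footnote 1)
            break
        new_a = np.clip(alpha + eta * l_r * g_a, 0.0, 1.0)   # update (5)
        new_b = np.clip(beta  + eta * l_c * g_b, 0.0, 1.0)
        da, db = new_a - alpha, new_b - beta                  # Delta_k
        l_r = l_min if da * (da - da_prev) < 0 else l_max     # winning -> slow down
        l_c = l_min if db * (db - db_prev) < 0 else l_max
        da_prev, db_prev = da, db
        alpha, beta = new_a, new_b
    return alpha, beta

R = np.array([[0.0, 3.0], [1.0, 2.0]])
C = np.array([[3.0, 2.0], [0.0, 1.0]])
print(wolf_iga_accurate(R, C, 0.1, 0.9))   # should approach the equilibrium (0.5, 0.5)
```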
6 Variable Learning Rate
Observe that when uu′ < 0, equation 4 specifies a conserved-energy oscillation system. For the constrained system, the motion may reach an edge of the unit square, but sooner or later it will come down to an ellipse contained completely in the unit square. Since energy is conserved, convergence is improbable. The only way to force it to converge is to apply a force of attenuation proportional to its velocity, and in a direction opposing its velocity. For real-world free oscillations, e.g. in the case of a swinging pendulum, friction with air provides such
Fig. 1. The learning rate curves and the derivative for lr

a force. We can create a second order differential equation from equation 7 for the row agent, having the form

∂²α/∂t² − H(t) ∂α/∂t − F(t) α = G(t)

where H(t) = (dlr/dt)/lr, F(t) = lr lc uu′ (which is incidentally a constant) and G(t) = (r22 − r12)H(t) − (c22 − c21) lr u. Notice that in this case, H(t) ∂α/∂t specifies the force² of attenuation, with magnitude proportional to H(t). Hence if the learning rate is a constant at all times, the oscillation will never be damped, and consequently convergence will not be achieved. If lmin > 0 and lmax is finite, then H(t) is also given by the bottom curve in figure 1. This involves a short force of infinite magnitude exactly at the points where (α, β) cross quadrants in the phase plane. At all other points the force is zero. Given that we are constrained to have lr and lc constant within each quadrant (to maintain a piecewise elliptical trajectory for (α, β)), the attenuation within any quadrant is necessarily zero, since (dlr/dt)/lr = 0 inside any quadrant. This

² This force was absent in the second order differential equation that could be created from equation 4.
Fig. 2. The ∆ and ∆² for the row agent with WoLF-IGA. The gradient components are also shown
Fig. 3. Left: Failure to converge with Bowling and Veloso’s criterion. Right: Successful convergence with the accurate criterion
leaves only the quadrant-crossing points available for damping. An infinite force at such points is clearly the best attenuation achievable. This shows that, with the given constraints, WoLF-IGA achieves the fastest convergence to the center. However, theoretically it is still possible to have overdamping (in case of any oscillation), which leads to the fastest convergence possible and which WoLF-IGA does not achieve. We would no longer be constrained to have elliptical orbits, and could design new lr(t) and lc(t) functions. This would also necessitate a proof of convergence for all possible cases from scratch. The chief motivation for exploring this avenue is that WoLF is not sound when it comes to competitive games. Allowing the learner to slow down when “winning” assumes that the opponent will do the same when the learner is not “winning”, at least in self-play. However, in order to develop a robust algorithm, we cannot discount the possibility that the opponent is an opportunist or that the domain is strictly competitive. We shall look into this prospect in future work. We note in passing that even if the opponent does not slow down when the learner is “losing”, WoLF-IGA is guaranteed to converge, albeit more slowly.
7 Optimal Learning Rate
Another issue that was not addressed in [2] is the choice of values for lmin and lmax. In order to leave the convergence of the system in the other subcases unaffected, it is necessary to have lmin, lmax > 0. We also need lmin < lmax to guarantee convergence for the remaining subcase. The choice of lmax is constrained by the fact that ηlmax → 0 for infinitesimal gradient ascent. Hence we cannot choose an lmax that is arbitrarily high. Now given lmax, we cannot make lmin arbitrarily close to zero, since that makes one of the agents stagnant (in that quadrant) and consequently slows down the other (notice from equations 1, 2 that the gradient of α depends on β and vice versa). Actually, as lmin → 0, the convergence may slow down remarkably. We state and prove the theorem for the optimal range of lmin given lmax below.

Theorem 1. Given the choice of lmax, the fastest convergence of WoLF-IGA to precision ε can be achieved for a value of lmin in the range

lmax √(2ε) ≥ lmin ≥ 2ε lmax

Proof: (Outline) As noted in [2], the rate of convergence is proportional to (lmin/lmax)² per “revolution”³. Since the maximum length of a half-axis in the unit square is 0.5, in order to converge to a precision ε we require (lmin/lmax)^(2n) ≤ 2ε, where n is the number of revolutions. This gives us the minimum number of revolutions required for convergence to precision ε. Since we cannot have convergence in less than a half “revolution” (this is deduced from the directions of the components of the gradients in each quadrant, in figure 2), and we should
³ By “revolution” we mean a complete rotation around the center, through 360°. We assume that as lr,c η → 0 the number of revolutions is a monotonic function of time.
not require more than one “revolution” (provided we choose lmin optimally), we require

1 ≥ log(2ε) / (2 log(lmin/lmax)) ≥ 1/2

which gives

lmax √(2ε) ≥ lmin ≥ 2ε lmax
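A quick numerical check of this range (ours, under the reading of the bound reconstructed above) for the values used in Section 5, lmax = 1.0 and ε = 0.01, shows that the choice lmin = 0.08 used there indeed falls inside it:

```python
import math

l_max, eps = 1.0, 0.01
upper = l_max * math.sqrt(2 * eps)   # ~0.1414
lower = 2 * eps * l_max              # 0.02
print(lower, upper)                  # 0.02 0.1414...
print(lower <= 0.08 <= upper)        # True: the lmin of Section 5 is admissible
```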
8 Conclusion
We have presented an accurate version of WoLF-IGA that is guaranteed to converge to Nash Equilibrium policies in all general-sum games. We have also argued from a control theoretic perspective why “Win or Learn Fast” makes sense in concurrent learning. This also revealed the possibility of having more complex learning rate functions, with unexplored effects on convergence, that could alleviate the irrational “slow down” in learning in potentially competitive environments. Finally, we have derived optimal learning rates for fast convergence of policy gradient learning in practical simulations. In the future, we intend to explore the dynamics in situations with more than two actions available to the agents, and to experiment on real-world applications.
References

1. Bikramjit Banerjee, Sandip Sen, and Jing Peng. Fast concurrent reinforcement learners. In Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence, Seattle, WA, 2001.
2. M. Bowling and M. Veloso. Multiagent learning using a variable learning rate. Artificial Intelligence, 2002. In press.
3. Caroline Claus and Craig Boutilier. The dynamics of reinforcement learning in cooperative multiagent systems. In Proceedings of the Fifteenth National Conference on Artificial Intelligence, pages 746–752, Menlo Park, CA, 1998. AAAI Press/MIT Press.
4. J. Hu and M. P. Wellman. Multiagent reinforcement learning: Theoretical framework and an algorithm. In Proc. of the 15th Int. Conf. on Machine Learning (ML'98), pages 242–250, San Francisco, CA, 1998. Morgan Kaufmann.
5. M. L. Littman. Markov games as a framework for multi-agent reinforcement learning. In Proc. of the 11th Int. Conf. on Machine Learning, pages 157–163, San Mateo, CA, 1994. Morgan Kaufmann.
6. John F. Nash. Non-cooperative games. Annals of Mathematics, 54:286–295, 1951.
7. G. Owen. Game Theory. Academic Press, UK, 1995.
8. S. Singh, M. Kearns, and Y. Mansour. Nash convergence of gradient dynamics in general-sum games. In Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence, pages 541–548, 2000.
Revising Engineering Models: Combining Computational Discovery with Knowledge Stephen D. Bay, Daniel G. Shapiro, and Pat Langley Institute for the Study of Learning and Expertise 2164 Staunton Court, Palo Alto, CA 94306
[email protected] [email protected] [email protected]
Abstract. Developing mathematical models that represent physical devices is a difficult and time consuming task. In this paper, we present a hybrid approach to modeling that combines machine learning methods with knowledge from a human domain expert. Specifically, we propose a system for automatically revising an initial model provided by an expert with an equation discovery program that is tightly constrained by domain knowledge. We apply our system to learning an improved model of a battery on the International Space Station from telemetry data. Our results suggest that this hybrid approach can reduce model development time and improve model quality.
1 Introduction
Building accurate mathematical models of physical devices is an important engineering task. For example, engineers at NASA have developed detailed models that describe the electrical power system on the International Space Station (ISS). The engineers use these models for many tasks, including mission planning, monitoring, and fault diagnosis [3,4]. Because the components on the space station are run close to operating limits, the models must be very accurate, as there is little room for error. However, accurately modeling a physical device is a difficult problem for several reasons. First and foremost, device modeling is an inverse problem that involves reasoning backward from observations of a device’s behavior to possible equations that may have generated the data. Second, our knowledge of most devices is incomplete. For instance, engineers commonly assume constant operating conditions for variables whose effect is not fully understood. Finally, device modeling involves many practical difficulties. For example, data for model development is often available only for a limited range of conditions and may not cover the deployed situation. This is especially true for ISS components, whose operating conditions cannot be easily duplicated. Additionally, testing a component on a lab bench will not account for interactions with nearby devices or changes as the device ages.
If the structure of the model (i.e. the forms of the equations) is known, but not the specific values for parameters, many techniques can learn the missing parameter values from data. However, a more likely situation is that the structure of the equations, and perhaps even the set of relevant variables, are not completely known. This leaves the engineer with the difficult task of building an appropriate model manually from domain principles and her intuitions. Building models manually is an iterative and time consuming process whereby an engineer may specify an initial model, tune its parameters, and then test it against data. If the model’s performance is inadequate, the engineer will revise the model and repeat the process until she is sure that it is accurate enough for the intended task. This trial and error approach is cumbersome, especially with many parameters or possible model structures. An alternative is to rely on computational methods to automatically discover a model. For example, equation discovery programs, such as Bacon [5] and Lagramge [9], take data in the form of observations and attempt to find equations that govern the relationship between independent and dependent variables. This approach is appealing because it automates much of the modeling process. However, equation discovery methods can suffer from very large search spaces and require strong constraints to limit the search space [9]. In this paper, we propose and formalize a hybrid modeling technique that combines the engineer’s knowledge about a device with machine learning methods. In particular, we use engineering knowledge to constrain the search for better models and we use computational discovery programs to manage search, parameter fitting, and model scoring. We believe this approach has several advantages. From the engineer’s perspective, a hybrid approach would let them focus on identifying possible refinements and explore a wider set than could be done manually. From a computational perspective, domain knowledge massively constrains the search space and makes equation discovery feasible. We demonstrate this hybrid approach by revising battery models to better explain real-world behavior. In the next section, we begin by describing a simple battery model and showing how an engineer might revise it to explain complex non-linear behavior. In Section 3, we present our method for combining equation discovery and background knowledge. In Section 4, we evaluate our method on revising the battery model and show that much of the non-linearity can be recovered. In Section 5, we test our approach on improving battery models for the International Space Station from telemetry data. We then discuss limitations and related work, and conclude with a discussion of future research.
2 An Engineering Approach to Model Revision
Iterative refinement is a common engineering approach for modeling a device. An engineer starts with an initial model that is not perfect, but that explains much of the known behavior. Next, the engineer makes successive changes to the model to improve its predictive power. In this section, we give an example of this process from battery modeling. Although battery models have existed for many
Fig. 1. A battery model
years, they are complex electro-chemical devices that are not well understood. Battery modeling is an active research area and new models are continually being published. Figure 1 shows a simple battery model drawn as an equivalent electric circuit. In the model, Vcb represents the battery voltage of an ideal cell. The term Rs represents a resistor connected in series to the battery cell and models the battery’s internal resistance to current flow when the circuit is completed. The term Rp is a resistor connected in parallel to the battery cell and represents resistance to self-discharge. In this model, Vcb, Rp, and Rs are constants and cannot be directly observed. The state of charge (soc) is a measure of the total electric charge stored in the battery. To complete the electric circuit, the battery must be connected to another device, which we will call a controller. For this paper, we assume that the controller is an active device that regulates the charging and discharging of the battery. It charges the battery at constant current and discharges at constant resistive load. The battery interacts with the controller through i and Vt, which are the current into (or out of) the battery and the voltage at the battery terminals, respectively. The variables i, Vt, and soc are observable.¹

Although this component model appears simple, it maps onto a complex set of equations that govern the input/output relationships of the battery. The terminal voltage, Vt, is determined by Equation 1 during charge and Equation 2 during discharge. The battery’s state of charge is modeled by Equation 3, which is a differential equation that states the rate of change is equal to the current flow minus loss through the resistor Rp.

Vt^charge = Vcb + i × Rs                                  (1)
Vt^discharge = (Vcb × Rload) / (Rs + Rload)               (2)
dsoc/dt = i − Vcb / Rp                                    (3)
¹ State of charge may not be observable in some batteries. For our work modeling components on the space station, the batteries are Nickel-Hydrogen pressure cells and soc can be observed indirectly through the battery’s temperature and pressure.
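Equations 1–3 are simple enough to simulate directly. The sketch below is our illustration, not code from the paper: the parameter values, the soc thresholds, and the forward-Euler integration are placeholders chosen only to show how the charge/discharge switching interacts with the three equations.

```python
import numpy as np

def simulate_battery(V_cb=6.0, R_s=0.1, R_p=100.0, R_load=2.0,
                     i_charge=3.0, soc0=90.0, dt=1.0, steps=1000):
    """Forward-Euler integration of Equations 1-3 with a toy controller that
    charges at constant current and discharges through R_load (illustrative
    units and thresholds, not the values used later in the paper)."""
    soc, charging = soc0, True
    vt_hist, soc_hist = [], []
    for _ in range(steps):
        if charging:
            i = i_charge
            v_t = V_cb + i * R_s                        # Equation 1
        else:
            i = -V_cb / (R_s + R_load)                  # current drawn by the load
            v_t = V_cb * R_load / (R_s + R_load)        # Equation 2
        soc += dt * (i - V_cb / R_p)                    # Equation 3
        if soc >= 98.0:
            charging = False                            # battery "full": discharge
        elif soc <= 73.0:
            charging = True                             # battery "low": recharge
        vt_hist.append(v_t)
        soc_hist.append(soc)
    return np.array(vt_hist), np.array(soc_hist)
```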
This model can explain much of a battery’s behavior, but it is not adequate for many applications. Chan and Sutanto [2] point out several deficiencies and suggest modifications to improve its fidelity.2 First, the model fails to explain changes such as the apparent series resistance, Rs , depending on whether the battery is charging or discharging. They suggest an improvement where Rs is equivalent to a resistor Rc during charge and a resistor Rd during discharge. Second, the model ignores dependence of battery properties on the state of charge. For example, real batteries become much more difficult to charge when they are nearly full compared to when they are empty. This could be represented in the model by making Rc a monotonically increasing function of soc. In general, all of the terms Vcb , Rp , Rc , and Rd will depend on battery properties and are not constants. Chan and Sutanto focused their paper by modeling a specific battery from a given manufacturer. They made Rc , Rd , and Rp functions of Vcb , which in turn depended on soc. Although there is some expectation about the general shape of these functions, the exact forms were not known and they resorted to the manufacturer’s test data to determine the functions empirically. Figure 2 shows the functional forms, with the dependent variable on the y axis. The curves in Figure 2 can be obtained by performing in-depth battery testing, ideally for each specific physical device. However, some tests can be destructive and shorten the lifespan of the battery, such as those involving deep discharge. Manufacturers often provide these curves for a typical battery, but they are not specific to an individual physical device and may not cover the relevant operating conditions or external effects. This provides a perfect opportunity for machine learning techniques to improve existing models by allowing adaptation in response to observational data.
3 Combining Equation Discovery with Knowledge
Our goal is to help the engineer with the revision process and to support the types of refinements described in the previous section. We envision a system where the user can input information about her modeling problem, including data on the specific device she is modeling, and the system would suggest several revisions to the model that better explain the observed data. The key insight of our work is that engineers will not suggest arbitrary changes to a model. Although they may not know the exact changes needed, they have a good idea of where their model is wrong because they know the approximations and assumptions made in the model’s development. We feel this knowledge can be leveraged by computational tools. 2
² In their paper, Chan and Sutanto examine five historical models and point out their deficiencies before suggesting an improved version. The model in Figure 1 is not identical to any of the five models but has many common elements with them.
Fig. 2. Dependence of battery parameters on other variables. (a) Rc, Rd, and Rp versus Vcb. Resistance is scaled by the maximum observed value. (b) Vcb versus soc

3.1 Problem Definition
We can state the problem of revising an engineering model as follows:

– Given: an initial set of equations that describe the system’s behavior;
– Given: data on the observable variables in the equations;
– Given: knowledge about the equations and how they might be modified;
– Find: an improved model that better explains the data.
Knowledge about the equations takes two forms in our current system. First, the user can specify plausible values for parameters, such as a valid range or an initial guess. For example, in the battery model in Figure 1 the user can state that Rs is between 0 and 10 ohms with an initial guess of 0.1. Second, the user can specify that a term which is a parameter in the initial model may depend functionally on other variables in the analysis. The user can also specify a set of plausible independent variables and possible functional forms. For example, she may believe that Vcb is not a constant and is possibly a quadratic or sigmoidal function of other variables such as soc or temperature.

Our problem definition is stated as a “single shot” process that is solved once, but clearly refinement can be iterative. Often the errors from one stage of revision will suggest new refinements that can further improve the model. This leads to a set of relevant models, each progressively explaining more of the data.

3.2 Transformation into Equation Discovery
We transform our problem into an equation discovery task. We use Lagramge [9], which is a program for equation discovery that can find both ordinary differential
equations and regular algebraic equations that describe the data. The system uses a context-free grammar to define a space of possible equations that may explain the observed data. Lagramge searches through the space of equations defined by the grammar, evaluates each candidate model on the data, and returns the best models according to a score function. Our system takes the knowledge specified by the engineer and compiles a highly constrained grammar to search for revisions of the initial model. The knowledge is transformed according to three rules:

– the initial equation becomes the starting state of the grammar;
– variable dependencies are encoded as symbol expansions in the grammar;
– knowledge about the values of constant parameters is passed to Lagramge to be used in parameter fitting.

For example, consider trying to revise the model of Vt, the voltage at the battery terminals. We first transform Equations 1 and 2 into an initial starting sentence,

Vt → δ(i)(Vcb + i × Rs) + δ(−i) (Vcb × Rload)/(Rs + Rload)

where δ(x) is an indicator function that is one when x is positive and zero otherwise. Note that i is defined as positive when current flow is into the battery, and the above production covers both charge and discharge conditions. Next, if we believe that Vcb may depend on the variables time, soc, and temperature, with possible forms that are sigmoidal or linear (in one or two variables), we obtain the following productions for the grammar:

Vcb → const1 + const2 / (1 + e^((X − const3) const4))
    | const1 + const2 X + const3 X
    | const1 + const2 X
X → temperature | soc | time

Finally, any information about the constants is passed through the grammar to Lagramge, for instance,

Rload → const[0 : 10 : 2]
Rs → const

The constant Rload is given an allowable range from 0 to 10 with an initial guess of 2, and Rs is left unspecified. To select the best revision produced with the grammar, we use Lagramge’s minimum description length (MDL) score function, which evaluates a candidate model by taking into consideration both the sum of squared error on the training set and the model’s complexity, measured as the size of its parse tree.
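The grammar above defines a small space of candidate replacements for Vcb. The sketch below is ours and is not Lagramge’s input format or search procedure; it only makes the idea concrete by enumerating (form, variable) candidates, fitting their constants by least squares, and ranking them by an error-plus-complexity score in the spirit of the MDL criterion.

```python
import numpy as np
from scipy.optimize import curve_fit

# Candidate expansions of Vcb, mirroring the productions above; X stands for
# one of the allowed independent variables (temperature, soc, or time).
CANDIDATES = {
    "constant": lambda X, c1: c1 + 0.0 * X,
    "linear":   lambda X, c1, c2: c1 + c2 * X,
    "sigmoid":  lambda X, c1, c2, c3, c4: c1 + c2 / (1.0 + np.exp((X - c3) * c4)),
}

def rank_revisions(data, target, variables):
    """Fit every (form, variable) pair to the target signal and score it by
    squared error plus a small penalty per free constant (a crude stand-in
    for Lagramge's MDL score)."""
    results = []
    for var in variables:
        X = np.asarray(data[var], dtype=float)
        for name, form in CANDIDATES.items():
            n_params = form.__code__.co_argcount - 1
            try:
                params, _ = curve_fit(form, X, target,
                                      p0=np.ones(n_params), maxfev=10000)
            except RuntimeError:          # fit did not converge; skip candidate
                continue
            sse = float(np.sum((form(X, *params) - target) ** 2))
            results.append((sse + 0.1 * n_params, name, var, params))
    return sorted(results)                # best-scoring revision first
```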
4 Revising a Battery Model with Synthetic Data
To demonstrate the feasibility of our revision approach, we used synthetic data to test our system’s ability to refine initial models. Synthetic data lets us compare the discovered changes with the true structure. We used Equations 1 to 3 in conjunction with the battery parameters in Figure 2 to generate synthetic data by simulating it in Matlab with an ordinary differential equation solver (ode113). We assumed that the controller cycles and charges the battery at constant current followed by discharge at constant resistive load (Rload = 2Ω). We examined two cases: (1) charging occurs with current i = 3A, which results in steady cycling of soc from about 98% to 73%; (2) charging occurs with current i = 2A, which results in a gradual loss of soc from 92% to 30%. For each case, we generated data for 1000 time points (eight cycles). We added an irrelevant variable, temperature, that varied sinusoidally with a period matched to the charge-discharge cycles.

For the initial model, we used Equations 1 to 3 with all parameters considered constants. We tried a simple scenario in which an engineer might believe that Rs and Rp are well modeled as constants with respective ranges and initial values of [0:100:1] and [0:200:100]. The engineer may also believe that Vcb is not a constant and could depend on other variables such as temperature, soc, or time, with a functional form that is a polynomial (up to third degree) or a sigmoid. The above statements were automatically compiled into a grammar from a file specification and then used as an input to Lagramge. During execution, the program expands the grammar and examines 13 different revisions. The best revision according to Lagramge’s MDL score involves expanding Vcb to be a linear function of soc. Figure 3a shows the target signal, and Figures 3b and 3c show the reconstruction error for the initial and revised models. The results indicate that the revised model was better able to reconstruct the signal.

Although the revised model reduced the error for case 1, the error was still sizeable. We performed another refinement iteration in which we let Rs depend on the variables time, temperature, or soc with a polynomial form. We recompiled the grammar file and reran Lagramge, which explored 240 possible revisions and suggested expanding Rc as a quadratic function of soc. Figure 3d shows the reduced error of this new revision compared with the first refinement. Equations 4 and 5 show the final results, and the revisions have moved the initial model closer to the curves in Figure 2. The linear expansion of Vcb on soc partially reconstructs the curve in Figure 2b, and the quadratic expansion of Rc attempts to model the sharp increase in Rc with increasing soc.
Vt^charge = (5.84 + 0.00451 soc) + i × (0.145 − 0.00527 soc + 4.90×10⁻⁵ soc²)        (4)
Vt^discharge = (5.41 + 0.00876 soc) × 2 / (2 + 0.00495)                              (5)
Finally, we note that the parameters in the linear equations that represent Vcb differ slightly in the case of charge and discharge. This is caused by a limitation of Lagramge, which its authors are addressing.
Fig. 3. Original signal and reconstruction error for (a) target Vt . (b) case 1 for the initial model and first refinement, (c) case 2 for the initial model and first refinement, and (d) case 1 for the first and second refinements
5 Modeling Batteries on the Space Station
Our experiments with revising battery models on synthetic data showed that our system can refine initial models to explain complex, non-linear behavior. In this section, we apply our approach to battery models for the International Space Station with real telemetry data to show that it can develop accurate models and is robust to problems in data quality. We modeled the batteries for a single power channel on the Space Station. Within a power channel there are three battery units that each contain two sets of 36 nickel-hydrogen cells. We treated the entire collection of 216 cells as a single battery, and here we focus on modeling the battery’s terminal voltage, Vt . We have telemetry data for 24 hours with samples approximately every ten seconds. Only a fraction of the cells are instrumented with sensors, so we aver-
aged readings from six cells to obtain the battery’s temperature and pressure. We estimated the state of charge by the ratio of pressure to temperature. The current and voltage were available for each group of 36 cells (six total) and we summed and averaged them to get total current and terminal battery voltage. The data are of very poor quality and suffer from several problems. First, the signals for the observed variables have many dropouts for long time periods, and this affects approximately 1/4 of all time points. Second, because of bandwidth limitations, the signals are encoded at low resolution. For example, the sensors can only report current flow to the nearest Ampere. Finally, the data show evidence of non-Gaussian noise that manifests itself as large spikes in the signal. We linearly interpolated the data to register time points at ten second intervals and to impute missing values. Figure 4a shows the target variable Vt after this processing. We divided the data into a training set of approximately three quarters of the data (before the dashed line in Figure 4a), and a test set consisting of the remaining data. We used Equations 1, 2, and 3 for our initial model. As possible refinements, we let Vcb be a function of the variables temperature, pressure, or soc with possible functional forms that are polynomial (up to third degree), sigmoidal, or linear in two variables. We let Rs depend on the same variables with a polynomial form. Lagramge explores 6859 revisions and takes approximately nine hours of computation time on a 1.5 GHz Pentium 4.³
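The preprocessing steps described here (averaging the instrumented cells, estimating soc from pressure and temperature, registering the samples on a ten-second grid, and interpolating dropouts) are routine. The pandas sketch below is ours; the column names and file layout are hypothetical, not the actual ISS telemetry format, and it only illustrates the kind of pipeline used.

```python
import pandas as pd

def preprocess(telemetry: pd.DataFrame) -> pd.DataFrame:
    """telemetry: raw samples with a DatetimeIndex and (hypothetical) columns
    temp_1..temp_6, press_1..press_6, i_1..i_6, v_1..v_6."""
    out = pd.DataFrame(index=telemetry.index)
    out["temperature"] = telemetry.filter(like="temp_").mean(axis=1)   # average six cells
    out["pressure"]    = telemetry.filter(like="press_").mean(axis=1)
    out["soc"]         = out["pressure"] / out["temperature"]          # state-of-charge estimate
    out["i"]           = telemetry.filter(like="i_").sum(axis=1)       # total current
    out["Vt"]          = telemetry.filter(like="v_").mean(axis=1)      # terminal voltage
    # Register on a 10-second grid and linearly interpolate dropouts.
    return out.resample("10s").mean().interpolate(method="linear")
```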
charge
discharge
= (36.2 + 76.2 × soc) − i × 0.214
= (20.3 + 36.2 × soc) × 5.77/(2.60 + 0.408)
(6) (7)
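To make the revised model concrete, the short sketch below (my own, not from the paper) evaluates Equations 6 and 7 from the state of charge and current; the function name and signature are hypothetical.

import numpy as np  # not strictly needed here; kept for consistency with later sketches

def terminal_voltage(soc, current, charging):
    """Evaluate the revised battery model of Equations 6 and 7.

    soc:      state of charge as a fraction
    current:  battery current i in Amperes
    charging: True selects the charge equation (6), False the discharge equation (7)
    """
    if charging:
        # Eq. (6): Vcb is a linear function of soc, minus an ohmic drop proportional to i.
        return (36.2 + 76.2 * soc) - current * 0.214
    # Eq. (7): discharge form with the constants found by Lagramge.
    return (20.3 + 36.2 * soc) * 5.77 / (2.60 + 0.408)

# Example: predicted terminal voltage while charging at 10 A with 50% state of charge.
print(terminal_voltage(soc=0.5, current=10.0, charging=True))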
Table 1 shows summary statistics for the initial model and the top three revisions returned by Lagramge. These results indicate that the revised models greatly improved the test error compared with the initial model. The mean squared error (MSE) for the revised models is approximately one third that of the initial model. However, MSE is sensitive to outliers, so we also report mean absolute error, which is more robust. On this measure the revised models all obtained an average error of about one volt. This is surprisingly good, considering that the individual sensors only resolve to one volt. Finally, the difference in predictive performance between the revised models is not substantial. Because the second and third models add extra complexity but do not significantly improve the models, they are rated worse with Lagramge's MDL score function.
³ The number of revisions is much greater than for the synthetic example in Section 4 because the number of sentences that can be produced by the grammar can expand exponentially with additional productions.
Table 1. Error statistics on the training and test data: Lagramge's MDL score, mean squared error (MSE), and mean absolute error

                                                               Training           Test
                                                               MDL     MSE     MSE     Mean Abs.
Initial Model
  Vcb, Rc, and Rd are constants                                n.a.    12.3    20.5    2.75
Best Revised Models
  1. Vcb is a linear function of soc                           2.65    2.14    6.99    1.01
  2. Vcb and Rd are linear functions of soc                    2.67    2.12    6.95    1.00
  3. Vcb is a linear function of soc;
     Rd is a linear function of temperature                    2.68    2.13    6.99    1.00
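The error statistics in Table 1 are straightforward to compute from an observed and a predicted voltage trace; the helper below is a generic sketch rather than the authors' code.

import numpy as np

def error_statistics(v_observed, v_predicted):
    """Mean squared error and mean absolute error between two voltage traces."""
    residual = np.asarray(v_observed) - np.asarray(v_predicted)
    return {"mse": float(np.mean(residual ** 2)),
            "mean_abs": float(np.mean(np.abs(residual)))}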
Fig. 4. Battery terminal voltage Vt and prediction error. (a) Training and test data for Vt. (b) Error predicting Vt
6 Limitations
We demonstrated with experiments that our system can successfully refine initial models to better explain data. However, our revision approach has four important limitations that we discuss here. First, our system focuses the search for better models by exploring revisions that are near an initial model. This provides tremendous power if the true model is close to the initial model. However, if the true model is structurally very different, then searching near the initial model will not find the necessary revisions. Second, our system depends on an expert to suggest plausible functional forms that may explain the values of dependent variables in the initial model. Again, this provides tremendous power if the expert provides specific forms that closely match behavior in the real physical device. However, our experiments
suggest that our system may be robust to mis-specification. On synthetic data, it was still able to significantly improve the initial model even though none of the functional forms exactly matched the relationships in Figure 2. On real data, even though we have a limited knowledge of battery dynamics, the forms we suggested were capable of greatly lowering the prediction error. Third, the suggested revisions are conditional on the data seen and may not generalize well to new operating conditions. For example, the models in Section 4 were revised on data that represented a battery whose state of charge varied from 98% to 30%. The revised model may not perform well outside this region, such as at very low charge levels. This limitation is not unique to our system, but applies to all induction algorithms. Finally, Lagramge took over nine hours to revise the battery model for the Space Station. This is clearly too slow to support iterative and interactive refinement with an engineer. We are examining methods to speed up Lagramge with techniques such as error bounds to eliminate poor candidates quickly.
7 Related Work
Our approach builds on recent work by Todorovski and Dzeroski [8] who proposed revising a mathematical model by providing Lagramge with a grammar that encodes a specific set of changes. They let Lagramge refit the value of known constants based on the data, and they supported replacing a polynomial in the original equation with a polynomial of arbitrary degree (on the same variables). In their application, a major goal was minimal change with the initial model, so their implementation examines each revision separately and then considers only a few combinations. Our application of revising models of engineering devices, specifically the devices on the space station, has driven our work in a slightly different direction. We start with the assumption that the model is wrong, because of approximations and different operating conditions in orbit, and that the engineers can (mostly) identify the parts of the model that need to be revised. Because of these assumptions we allow many more changes to the model. Specifically, we allow revisions involving a variety of functional forms, we allow these forms to depend on different sets of variables, and we consider all changes at once to catch interactions. Other work in equation discovery has also incorporated domain knowledge, but in different ways. Washio and Motoda [10] developed SDS, a program that uses dimensional analysis to constrain the possible equations. Bradley, Easley, and Stolle [1] developed PRET a program that automatically tries to find an ordinary differential equation model of a physical system. PRET uses automated reasoning about modeling techniques to select from a set of traditional system identification methods. Finally, we have focused on developing interpretable and transparent models that can be examined by the engineer. An alternative approach is black box techniques, such as neural networks (e.g., [6]), which model a device’s input/output
behavior without attempting to find a concise mathematical description. Neural networks are not always applicable because they are not transparent and are difficult to verify. However, recently Saito et al. [7] have started to address this drawback in the context of revision by examining methods that use neural networks to learn interpretable structures.
8 Conclusions and Future Work
We presented an approach for combining machine learning methods with an engineer's knowledge to revise models of physical devices. The engineer specifies an initial model and possible revisions to that model, and we combine this with an equation discovery program to manage the search process. Our experiments showed that this method can successfully revise models of physical devices from noisy sensor data and substantially improve their accuracy. Our work represents a first step toward a computer assisted environment for revising models of physical devices. This approach is promising and may speed model development by relieving the engineer of tedious computational tasks. We also believe this approach may lead to better models by allowing exploration of a wide set of refinements and adaptation to observed data. There are many directions for future work and we highlight three areas. First, we intend to apply our approach to improving models of other components on the space station. Second, we intend to expand the types of qualitative knowledge that an engineer can specify to constrain the search space. For example, in addition to specifying a set of relevant variables, the engineer can also specify the general effect of those variables. For instance, in our battery model Rs should increase with soc, since it becomes progressively more difficult to charge a battery as it nears maximum capacity. We can eliminate models without this behavior. Finally, we intend to explore how the engineer can interact with the search process, possibly by specifying a search order or viewing intermediate results and selecting particular paths to follow.
Acknowledgments This work was supported by grant NCC 2-1220 from NASA Ames Research Center. We thank Rick Alena and Daryl Fletcher for providing access to the data, Ljupco Todorovski and Saso Dzeroski for their help with Lagramge, and Javier Sanchez for assistance with simulation tools.
References 1. E. Bradley, M. Easley, and R. Stolle. Reasoning about nonlinear system identification. Artificial Intelligence, 133:139–188, 2001. 20 2. H. L. Chan and D. Sutanto. A new battery model for use with battery energy storage systems and electric vehicle power systems. In Proceedings of the IEEE Power Engineering Society Winter Meeting Conference, 2000. 13
3. J. S. Hojnicki, R. D. Green, T. W. Kerslake, D. B. McKissock, and J. J. Trudell. Space station freedom electrical performance model. In Proceedings of the 28th Intersociety Energy Conversion Engineering Conference, 1993. 10 4. T. W. Kerslake, J. S. Hojnicki, R. D. Green, and J. C. Follo. System performance predictions for space station freedom’s electrical power system. In Proceedings of the 28th Intersociety Energy Conversion Engineering Conference, 1993. 10 5. P. Langley, H. Simon, G. Bradshaw, and J. M. Zytkow. Scientific Discovery: Computational Explorations of the Creative Process. The MIT Press, 1987. 11 6. J. Peng, Y. Chen, and R. Eberhart. Battery pack state of charge estimator design using computational intelligence approaches. In Proceedings of the Fifteenth Annual Battery Conference on Applications and Advances, 2000. 20 7. K. Saito, P. Langley, T. Grenager, C. Potter, A. Torregrosa, and S. A. Klooster. Computational revision of quantitative scientific models. In Proceedings of the Fourth International Conference on Discovery Science, pages 336–349, 2001. 21 8. L. Todorovski and S. Dzeroski. Theory revision in equation discovery. In Proceedings of the Fourth International Conference on Discovery Science. 20 9. L. Todorovski and S. Dzeroski. Declarative bias in equation discovery. In Proceedings of the Fourteenth International Conference on Machine Learning, pages 376–384, 1997. 11, 14 10. T. Washio and H. Motoda. Discovering admissible models of complex systems based on scale-types and identity constraints. In Proceedings of the Fifteenth International Joint Conference on Artificial Intelligence, pages 810–817, 1997. 20
Variational Extensions to EM and Multinomial PCA

Wray Buntine

Helsinki Institute of Information Technology
P.O. Box 9800, FIN-02015 HUT, Finland
[email protected] http://www.hiit.fi/wray.buntine
Abstract. Several authors in recent years have proposed discrete analogues to principal component analysis intended to handle discrete or positive-only data, for instance suited to analyzing sets of documents. Methods include non-negative matrix factorization, probabilistic latent semantic analysis, and latent Dirichlet allocation. This paper begins with a review of the basic theory of the variational extension to the expectation-maximization algorithm, and then presents discrete component finding algorithms in that light. Experiments are conducted on both bigram word data and document bag-of-words data to expose some of the subtleties of this new class of algorithms.
1 Introduction
In text and image analysis, standard clustering algorithms are unsatisfactory because documents or images seem to mix additively, in contrast to the mutually exclusive mixture semantics of standard clustering. Principal component analysis (PCA) is unsatisfactory because it has a multivariate Gaussian interpretation [1] that is difficult to justify with the low discrete counts for words in documents, and it comes up with difficult-to-interpret components, e.g., with negative quantities. Its cousin, latent semantic indexing (LSI), also has interpretation problems due to its Gaussian nature [2]. Authors have proposed analogues to PCA intended to handle discrete or positive-only data. Methods include non-negative matrix factorization (NMF) [3], probabilistic latent semantic analysis (pLSI) [2], latent Dirichlet allocation (LDA) [4], and a general purpose extension of PCA itself to Bregman distances [5], which are a generalization of Kullback-Leibler (KL) divergence. A good discussion of the motivation for these techniques can be found in [2], and an analysis of related reduced dimension models and some of the earlier statistical literature which used simpler algorithms can be found in [6]. Related models using Dirichlets have been dubbed Dirichlet mixtures and applied extensively in molecular biology [7]. A common problem with the earlier formulations of discrete component analysis [3, 2, 5] is that they fail to make a full probability model of the target data in question, a model where hidden variables, observed data, and assumptions are all clearly exposed. Moreover, the relationship to LDA remains
unclear. In this paper I present the problem as a multinomial analogue to PCA. However, unlike standard PCA, spectral analysis does not come to the rescue to yield a simple solution. Instead, the usual statistical machinery for mixture distributions needs to be wheeled out and applied. This paper reviews the EM algorithm and its variational extension, and applies them to multinomial PCA.
2 The Problem of Document Components
Consider Tipping et al.'s [1] representation of standard PCA. The resultant algorithm is a simple application of numerical methods: find the largest k eigenvectors. Their span represents a projection of the data containing most of the "variance." A hidden variable m is sampled from K-dimensional Gaussian noise. Each entry represents the strength of the corresponding component and can be positive or negative. This is folded with the J × K matrix of component means Ω and then used as the mean of a J-dimensional Gaussian. For documents represented as a bag of words, J would represent the number of words in the application's dictionary and is expected to be considerably larger than K.

m ∼ Gaussian(0, I_K),
x ∼ Gaussian(Ωm + µ, I_J σ).

Satellite and telescope images, for instance, are often analyzed as "pure" high resolution, positive pixel elements added to form a lower resolution pixel, thus a convex combination of components is a suitable model of pixel types. Likewise, the Poisson distribution is more appropriate for the small counts seen in some telescope image data. Note that if the total sum for a set of independent Poisson variables is known, then their joint distribution becomes multinomial. The multinomial is used here as the basic discrete distribution. A discrete analogue to the above Gaussian formulation is first to sample a probability vector m that represents the proportional weighting of components, and then to mix it with a set of probability vectors Ω representing the component means:

m ∼ Dirichlet(α)  or  m ∼ Entropic(λ),
x ∼ Multinomial(Ωm, L),

where L is the total number of words in the document. By varying the distribution on m the model can have different behaviors. The Dirichlet has a K-dimensional vector of parameters α, and the entropic prior used here is an extension of Brand's [8] where p(m) ∝ exp(−λ H(m)). This analogue is an example of Collins et al.'s [5] generalization of PCA. For the Gaussian case, Ω can be folded into the covariance matrix leaving a mixture problem directly amenable to solution via the EM algorithm [9], but also simplified into a "K-th largest eigenvectors" problem. However, for the multinomial case, there appears to be no such transformation. So more sophisticated mixture modeling is needed.
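As an illustration of the generative story (my own sketch; the dictionary size, component count, and document length below are arbitrary assumptions), one document can be sampled as follows.

import numpy as np

rng = np.random.default_rng(0)

J, K, L = 1000, 10, 200                        # assumed dictionary size, components, words per document
alpha = np.full(K, 0.5)                        # Dirichlet parameters for the component proportions
Omega = rng.dirichlet(np.ones(J), size=K).T    # J x K matrix whose columns are component means

m = rng.dirichlet(alpha)                       # m ~ Dirichlet(alpha)
x = rng.multinomial(L, Omega @ m)              # x ~ Multinomial(Omega m, L): a bag-of-words count vector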
3 Background Theory
The theory of exponential family distributions and Kullback-Leibler approximations is briefly reviewed here. The formulations of Ghahramani and Beal [10] and Buntine [11] are roughly followed. A notation convention used here is that indices i, j, k, l in sums and products always range over 1 to I, J, K, L respectively. i usually denotes a sample index, j a dictionary word index, k a component index, and l a word-in-document index.

3.1 Exponential Family Distributions
The general exponential family distribution takes the form as follows. For an individual sample point, a vector of measurements x, we have a vector of T functions t(x) and some parameters θ, also of dimension T, and possibly with some additional constraints. The probability q(x | θ) has the form

q(x | θ) = \frac{1}{Y_t(x) Z_t(θ)} \exp\left( t(x)^† θ \right).

I also usually abbreviate Z_t(θ) to Z or add a distinguishing subscript. I use the notation E_{q(y|φ)}{A} to denote the expected value of the quantity A when y is distributed according to q(y|φ). Two key definitions needed [11] are:

µ_t ≡ E_{q(x|θ)}\{t(x)\} = \frac{∂ \log Z_t}{∂θ},    (1)

Σ_t ≡ E_{q(x|θ)}\{(t(x) − µ_t)(t(x) − µ_t)^†\} = \frac{∂^2 \log Z_t}{∂θ\, ∂θ} = \frac{∂µ_t}{∂θ}.
The mean vector µ_t has the same dimensionality as θ and the matrix Σ_t is the covariance of t(x). If θ is in a region where Σ_t is not of full rank, then the data vector t(x) is redundant. Thus the following remarkable condition holds: µ_t can be treated as a dual set of parameters to θ. When Σ_t is of full rank, it is the Hessian for the change of basis, while Σ_t is also the expected Fisher Information for the distribution. Note that both µ_t and Σ_t are directly derivable from Z_t. For a univariate Gaussian with mean µ and standard deviation σ, t(x) = (x, x²), θ = (µ/σ², −1/(2σ²)) and µ_t = (µ, σ² + µ²). Examples of this form for the multinomial distribution with probability vector α and count N and the Dirichlet distribution with probability vector α are given in Table 1. Here Γ(y) is the gamma function, Ψ0(y) is the digamma function, and the parameter vector for the multinomial has the constraint that \sum_k α_k = 1. Note that for the Dirichlet, the Hessian Σ_t is of full rank when each α_k > 0, and computing the parameter vector α from the dual can be done using Minka's fixed point method [12]. One more key result about the exponential family is computing maximum a posteriori (MAP) values for parameters given a sample of I data points. From the likelihood, all that matters is the so-called sufficient statistics \sum_i t(x_i). A conjugate prior has the same functional form as the likelihood. One way to model these is to have an "effective" prior sample whose sufficient statistics
Table 1. Exponential family characterizations

                Z_t                                    t_k(x)     θ_k         µ_{t,k}
Multinomial     1                                      x_k        log α_k     N α_k
Dirichlet       \prod_k Γ(α_k) / Γ(\sum_k α_k)         log r_k    α_k − 1     Ψ0(α_k) − Ψ0(\sum_k α_k)
are ν_t and the prior sample size is S_t. In this case, the unique MAP for the exponential family parameters yields an estimate for the dual

\hat{µ}_t = \frac{ν_t + \sum_i t(x_i)}{S_t + I}.    (2)

3.2 Kullback-Leibler Approximations
Consider a distribution p(x | φ) which is a posterior rendered impractical for use due to normalization or marginalization problems. Approximate it with a distribution of the form q(x | θ). The so-called "mean-field" approximation is to choose θ by minimizing the KL divergence between distributions q and p,

KL( q(x | θ) || p(x | φ) ) = E_{q(x|θ)}\left\{ \log \frac{q(x | θ)}{p(x | φ)} \right\}.

On the RHS, p(x, φ) can replace p(x | φ), and their normalizing constants can be ignored.

Kullback-Leibler Approximations on Exponential Family. Suppose q(x | θ) is in the exponential family as described in Section 3. For this minimization task, the following fixed-point update formula can be used to optimize for θ:

θ ←− \frac{∂}{∂µ_t} E_{q(x|θ)}\{ \log p(x | φ) + \log Y_t(x) \}.    (3)

Proof. (sketch) Substitute the expected log q() term in the KL with the entropy H(q(x | θ)) given by E_{q(x|θ)}\{\log Y_t(x)\} + \log Z_t − µ_t^† θ and differentiate w.r.t. µ_t.
Note that by Eqn. (1), µ_t often occurs linearly in the expected value, thus this can be easy. For the Gaussian and multinomial, the dual parameters µ_t are the ones we are familiar with anyway.

Kullback-Leibler Approximations on Products. Another view of these approximations can be gained by looking at a factored distribution [10]¹. Consider approximating p(x | φ) by a distribution that factors x into two independent, non-overlapping components, x1, x2: q(x) = q1(x1) q2(x2).
¹ Indeed, they show this also applies to Markov and Bayesian network models.
From functional analysis (when p() is non-zero everywhere), one can show the following minimizes the KL for all independent components of the form above:

q1(x1) ←− \frac{1}{Z_1} \exp E_{q2(x2)}\{ \log p(x | φ) \},
q2(x2) ←− \frac{1}{Z_2} \exp E_{q1(x1)}\{ \log p(x | φ) \}.    (4)
Note this is cyclic: q1() is defined in terms of q2() and vice versa, where the x1 and x2 terms in the log probability are repeatedly replaced by their mean. This result yields the same rules as (3) when they both apply.

3.3 Computational Methods with Hidden Variables
With so-called hidden variables, each data point in the sample is in the form x[i], the observed data, but there is also unknown h[i], the hidden data, for i = 1, . . . , I. The special subscript [i] is used to denote part of the i-th data. Denote the full set of these vectors by x{} and h{}. The observed data is in the exponential family when the hidden data is known, and the hidden data itself is in the exponential family. Suppose the parameter sets for these two distributions are φ1 and φ2 respectively, yielding φ as the full set. I won't flesh out the details of these until the examples later on. The full joint distribution then becomes:

p(φ, x{}, h{}) = p(φ) \prod_i p(h[i] | φ1) p(x[i] | h[i], φ2).
Several computational methods present themselves here for maximizing the joint p(φ, x{}) where the hidden variables are marginalized out.

Approximating the MAP Directly with Variational Methods. It is not known if one can compute the maximum a posteriori value for p(φ | x{}) exactly. Instead consider the following function [13]:

L(φ; θ) = \log p(x{}, φ) − KL( q(h{} | θ) || p(h{} | x{}, φ) )    (5)
        = E_{q(h{}|θ)}\{ \log p(x{}, h{}, φ) \} + H( q(h{} | θ) ).    (6)

The two lines are easily shown to be equivalent. Note that L() represents a lower bound on log p(x{}, φ). Thus maximizing it yields a variational algorithm [13]. Repeat the following:

1. Minimize the KL divergence in (5) w.r.t. θ.
2. Maximize the expected value in (6) w.r.t. φ.

Note the first step would use the methods just developed, Eqns. (3) and (4). The second step is usually done using Eqn. (2) directly because the expected value will look like the log probability of some exponential family likelihoods, thus having a unique global maximum.
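Schematically, the loop looks like the sketch below (mine, not from the paper); the two model-specific update functions are left abstract and are supplied later by the rewrite rules of Theorems 1 and 2.

def variational_map(x_data, phi0, theta0, kl_step, map_step, iters=50):
    """Alternate the two steps that raise the lower bound L(phi; theta).

    kl_step(x_data, phi, theta)  -> new theta  (step 1: minimize the KL term of Eq. 5)
    map_step(x_data, theta, phi) -> new phi    (step 2: maximize the expectation of Eq. 6)
    """
    phi, theta = phi0, theta0
    for _ in range(iters):
        theta = kl_step(x_data, phi, theta)    # fit the approximating distribution q()
        phi = map_step(x_data, theta, phi)     # re-estimate the model parameters
    return phi, theta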
Computing the MAP via EM. If the exponential family q(hi | θi , ψ) is rich enough to include p(hi | xi , φ), then this previous variational method becomes the EM algorithm. In this case, KL = 0 and therefore the MAP for φ is obtained. For instance, in standard clustering algorithms, hi is a discrete variable and q() is the discrete distribution on that variable. Thus EM is a variational algorithm at the end condition, where the bound is tight!
4 Algorithms for Document Clustering
First I will develop mixture models extending the basic idea of multinomial PCA to be mixture models of convenient exponential family distributions, and then we can grind through the formulas just presented to produce some algorithms.

4.1 Priors for Document Clustering
In the model, m, a K-dimensional probability vector, represents the proportion of components in a particular document. m must therefore represent quite a wide range of values with a mean representing the general frequency of components in the sample. Potential priors for a probability vector such as m are well discussed in the literature, including:

– A Dirichlet prior with equal parameters of 1 or 0.5 (uniform and Jeffreys' prior respectively), or C/K for prior sample size C, c.f., Eqn. (2).
– A hierarchical prior, letting the parameters α for a general Dirichlet be estimated using the data [4].
– The entropic prior [8], which tends to extinguish small components.

For convenience, I will assume the general Dirichlet and specialize to the other forms as required. For the general Dirichlet prior on the proportions m, priors on the α are needed. We can use a prior for α corresponding to a prior sample of uniform probability of size 1, and apply Eqn. (2) to compute the MAP for it. The component means, the columns of Ω represented as Ω_{k,·} for k = 1, . . . , K, are J-dimensional probability vectors over the J distinct words, and the prior on them is important to ensure low-count components are handled well. Given a suitable universe of words, one can use an empirical prior for these based on the empirical proportion of words in the universe, f, with \sum_j f_j = 1: Ω_{k,·} ∼ Dirichlet(2f), where 2 represents some small prior sample size.

4.2 Likelihoods for Document Clustering
Where Ordering is Relevant. In this case, iterate over the L words as they appear in the document. d_l is the dictionary index for the l-th word in the document, where d_l takes a value from 1 to J.

m ∼ Dirichlet(α),
k_l ∼ Discrete(m)            for l = 1, . . . , L,
d_l ∼ Discrete(Ω_{k_l,·})    for l = 1, . . . , L.
The hidden variables here are m and k for each document. This turns out to have an identical likelihood to the next case, except for a combinatoric term which, for instance, is canceled out by the log Y_t(x) term in Eqn. (3).

Where Ordering is Irrelevant. In this case, iterate over word counts which have been totalled for the document. The index j = 1, . . . , J runs over dictionary index values, and the full hidden data matrix for a single document w_{k,j} is the count of the number of times the j-th dictionary word occurs in the document representing the k-th component. Two derived vectors are the column-wise totals c_k, the number of words in the document representing the k-th component, and the row-wise totals r, the observed data, typically stored in sparse form. Denote by w_{·,j} the j-th row vector and w_{k,·} the k-th column vector of w.

m ∼ Dirichlet(α),
c ∼ Multinomial(m, L),
w_{k,·} ∼ Multinomial(Ω_{k,·}, c_k)    for k = 1, . . . , K.
The hidden variables here are m and w for each document. The full likelihood for a single document p(m, w | α, Ω) then simplifies to:

\frac{1}{Z_D(α)} \prod_k m_k^{α_k − 1} \; C^{L}_{w_{1,1}, \ldots, w_{K,1}, \ldots, w_{1,J}, \ldots, w_{K,J}} \prod_{k,j} m_k^{w_{k,j}} Ω_{k,j}^{w_{k,j}}.    (7)
Note that w_{·,j} can be marginalized out (because \sum_{k,j} m_k Ω_{k,j} = 1), yielding the original model proposed in Section 2. The important aspect of this model is that the hidden variables m and w remain linked in the likelihood, and thus if q() for the mean field approximation is a product distribution, KL in Eqn. (5) cannot be zero, so an EM algorithm does not appear feasible.

4.3 Multinomial PCA with Dirichlet Prior
The first algorithm here estimates the MAP parameters for Ω with the general Dirichlet prior on m using parameters α. This extends [4] with a prior, simpler handling of the Dirichlet parameters, and a proof of optimality of the product approximation.

Theorem 1. Given the likelihood model of Section 4.2 and Eqn. (7), and the priors m ∼ Dirichlet(α) and Ω_{k,·} ∼ Dirichlet(2f). The following updates converge to a local maximum of a lower bound of log p(Ω, α | r) that is optimal for all product approximations q(m)q(w) for p(m, w | Ω, α, r). The subscript [i] indicates values from the i-th document.
γ_{j,k,[i]} ←− \frac{1}{Z_{3,j,[i]}} Ω_{k,j} \exp\left( Ψ0(β_{k,[i]}) − Ψ0\left( \sum_k β_{k,[i]} \right) \right),

β_{k,[i]} ←− α_k + \sum_j r_{j,[i]} γ_{j,k,[i]},    (8)

Ω_{k,j} ←− \frac{1}{Z_{4,k}} \left( 2 f_j + \sum_i r_{j,[i]} γ_{j,k,[i]} \right),

Ψ0(α_k) − Ψ0\left( \sum_k α_k \right) ←− \frac{\log(1/K) + \sum_i \left( Ψ0(β_{k,[i]}) − Ψ0\left( \sum_k β_{k,[i]} \right) \right)}{1 + I}.    (9)
The exponential in the first equation is an estimate of m_k as exp(E_q{log m_k}), which reduces the component entropy H(m). Note the last two equations are the standard MAPs for a multinomial and a Dirichlet respectively. The last equation rewrites α in terms of its dual representation (according to exponential family convention), which is immediately inverted using Minka's methods. The proof is outlined below because it highlights the simplicity of this using the exponential family machinery of Section 3.

Proof. (sketch) For this, the first step is a KL approximation to the likelihood of the hidden variables p(h{} | x{}, φ). The observed data is r, the row totals of w, thus use the product approximation q(m) \prod_j q(w_{·,j}). Taking an expected value of m ∼ q(m) over log p(m, w, r | α, Ω) yields a form that is an independent multinomial for q(w_{·,j}) for each j. Taking the expected value for w ∼ q(w_{·,j}) yields a form that is Dirichlet in m. Thus the optimal product distribution has m ∼ Dirichlet(β) and w_{·,j} ∼ Multinomial(γ_{j,·}, r_j) for some parameter vectors β and γ. For this problem, either Eqn. (4) or Eqn. (3) works equally well. First, inspect the required log probabilities of Eqn. (4), E_{q(m|r)}{log p()} and E_{q(w|r)}{log p()}, together with Table 1. Eqn. (4) now becomes, ignoring constants,

\sum_k \log m_k (β_k − 1) ←− \sum_k \log m_k \left( α_k − 1 + \sum_j r_j γ_{j,k} \right),

\sum_{k,j} w_{k,j} \log γ_{j,k} ←− \sum_{k,j} w_{k,j} \left( Ψ0(β_k) − Ψ0\left( \sum_k β_k \right) + \log Ω_{k,j} \right).
And the rewrite rules for β and γ can be extracted using the left hand sides as the updated values. The second step of the algorithm is to re-estimate the model parameters α and Ω based on optimizing the expected log probability of Eqn. (6). This again can be done by inspection using Table 1 as a guide. Rearrange the full log probability to make it look like posteriors for Dirichlet sampling and multinomial sampling for α and Ω respectively and then apply Eqn. (2) to write down the MAP values. This gives the last two rewrites in the theorem.
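A compact rendering of the Theorem 1 rewrite rules is sketched below (my own code, not the author's; the α update is omitted, dense NumPy vectors are assumed, and the β initialization is an arbitrary choice).

import numpy as np
from scipy.special import digamma

def fit_document(r, Omega, alpha, iters=20):
    """Per-document rewrite rules of Theorem 1 (the gamma rule and Eq. 8).

    r     : length-J vector of word counts for the document
    Omega : K x J matrix whose rows are the component means (each row sums to one)
    alpha : length-K Dirichlet parameters
    """
    r, alpha = np.asarray(r, dtype=float), np.asarray(alpha, dtype=float)
    beta = alpha + r.sum() / alpha.size              # uniform initialization (an assumption)
    for _ in range(iters):
        # gamma_{j,k} proportional to Omega_{k,j} * exp(Psi0(beta_k) - Psi0(sum_k beta_k))
        weights = np.exp(digamma(beta) - digamma(beta.sum()))
        gamma = Omega.T * weights                    # shape (J, K)
        gamma /= gamma.sum(axis=1, keepdims=True)    # normalize over k (the Z_{3,j} factor)
        beta = alpha + r @ gamma                     # Eq. (8)
    return gamma, beta

def update_Omega(counts, gammas, f):
    """Omega_{k,j} proportional to 2 f_j + sum_i r_{j,[i]} gamma_{j,k,[i]}, normalized over j."""
    K = gammas[0].shape[1]
    Omega = np.tile(2.0 * np.asarray(f, dtype=float), (K, 1))
    for r, gamma in zip(counts, gammas):
        Omega += (gamma * np.asarray(r, dtype=float)[:, None]).T   # expected word counts per component
    return Omega / Omega.sum(axis=1, keepdims=True)                # the Z_{4,k} normalization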
A simpler version of this theorem is to optimize log p(m, α, Ω) jointly.
Theorem 2. In the context of Theorem 1, the following updates converge to a local maximum of log p(Ω, α, m | r).

γ_{j,k,[i]} ←− \frac{1}{Z_{5,j,[i]}} Ω_{k,j} m_{k,[i]},

m_{k,[i]} ←− \frac{1}{Z_{6,[i]}} \left( α_k − 1 + \sum_j r_{j,[i]} γ_{j,k,[i]} \right),

Ω_{k,j} ←− \frac{1}{Z_{7,k}} \left( 2 f_j + \sum_i r_{j,[i]} γ_{j,k,[i]} \right),

Ψ0(α_k) − Ψ0\left( \sum_k α_k \right) ←− \frac{\log(1/K) + \sum_i \log m_{k,[i]}}{1 + I}.
Proof. (sketch) Since this case has some hidden variables in the primary objective function, it is not covered by Eqns. (5) and (6). Move m across the probability terms to yield a modified formula for variational optimization of the log probability above. For this case β disappears because m is fixed as far as the KL approximation is concerned. Optimization for m now occurs in the second step of the algorithm. The minimum KL divergence will be zero because q(w | γ, r, m) can be exactly modeled with multinomials.
4.4 Comparisons
The algorithm of Thm. 2, ignoring priors, is equivalent to the NMF algorithm of Lee and Seung [3], where a final normalization using the Z's is not done, and the pLSI algorithm of Hofmann [2], which also includes a regularization step. These correspondences are tricky because, for instance, Hofmann marginalizes Ω in the reverse direction. Moreover, the only difference between the algorithms of Thms. 1 and 2 is the estimation of m_k: exp(E_q{log m_k}) versus the MAP estimate. Note if any other prior on m is used, then Thm. 1 only changes by using the different posterior to compute exp(E_q{log m_k}) and the replacement or removal of (9). We have used the entropic prior here with a large λ as a replacement for the Dirichlet in some experiments to produce quite different components.
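The two estimates of m_k can be set side by side as in the small sketch below (mine; the posterior Dirichlet parameters are made-up numbers for illustration).

import numpy as np
from scipy.special import digamma

beta = np.array([4.0, 2.5, 1.7])    # example posterior Dirichlet parameters (illustrative only)

# Theorem 1 uses exp(E_q{log m_k}); it enters the gamma update unnormalized.
m_thm1 = np.exp(digamma(beta) - digamma(beta.sum()))

# Theorem 2 uses the Dirichlet MAP: (beta_k - 1) / (sum_k beta_k - K).
m_thm2 = (beta - 1.0) / (beta.sum() - beta.size)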
5 Experimental Results
Previous researchers have presented results comparing this family of methods with LSI in tasks such as document recall based on the reduced features [2], perplexity evaluation of the dimensionality reduction achieved [2, 4], and comparisons with other statistical models for dimensionality reduction of discrete data [6]. Experiments presented here instead focus on a number of different aspects of the methods highlighted in the analysis. I apply the algorithms to both
bag-of-words document data and word bigram data, since these turn out to have very different statistical properties. Note analysis of bigram data should not be viewed as a thesaurus discovery task, since this should include part of speech information and word linkage from a dictionary as well.

Bigram data was collected about words from a significant portion of the English language documents in Google's August 2001 crawl. Identifying sentence breaks is a difficult task in HTML, as seemingly random lists of words occur not infrequently in web pages, and space does not allow description of methods here. The bigram data is 17% non-zero for the matrix of the top 5000 words. The top word "to" has 139,597,023 occurrences and the 5,000-th word "charity" has 920,343 occurrences. The most frequent bigram is "to be" with 20,971,200 occurrences, while the 1,000-th most frequent is "included in" at 2,333,447 occurrences. David Lewis' Reuters-21578 collection of newswires was used as the document data. Words occurring less than 10 times in the entire collection were discarded from the bag-of-words representation, and numbers converted to tokens, leaving 10,697 words as features for approximately 20,000 documents.

The code is 1500 lines of C. Space requirements for runtime are O(K ∗ (I + J) + S), where S is the size of the input data, and each iteration takes O(K ∗ (I + J + S)). Thus the computational requirements are comparable to an algorithm for extracting the top K eigenvectors usually used for PCA. Convergence is maybe 10-30 iterations, depending on the accuracy required, slower than its PCA counterpart. Nevertheless, all experiments reported were run on an old Linux laptop overnight. The code outputs an HTML file for navigating documents/words, components, and their assignments for a document. Useful diagnostic measures reported on below are as follows:

Expected words per component (EW/C): the conditional entropy of the word probability vectors in Ω given components, raised to power 2.
Expected components per document (EC/D): the entropy of the component probability vectors averaged over documents, raised to power 2.
Expected components (EC): the entropy of the observed component probabilities, raised to power 2; should be O(K).

The plots in Fig. 1 show the change of these values as K is increased. For the Reuters-21578 data, EC/D stays constant at about 2 (not plotted) while EC continues growing almost linearly. For the Google bigram data, EC/D grows as plotted, primarily because the sample sizes are very large. Intense subsampling (thinning out the data by a factor of 1000) eventually makes EC/D stay constant at about 4. Thus newswires/words in both data sets are almost equally distributed across components, but typically one newswire only belongs to 2 components, whereas one word (for the Google bigram data) belongs to several more components depending on the sample size. Note the efficacy of the priors used for parameters α of the component Dirichlet can be measured by computing the KL divergence between the expected component proportions (according to α) and the observed component proportions. Remarkably, on the Reuters-21578 data with K=1300 components, the KL divergence using a prior sample size of one for α is only 0.22503 compared to 2.24437
when the maximum likelihood estimates of Blei et al. are used, thus the estimates are on average a factor of four times more accurate with the prior, while other diagnostic measures did remain unchanged. The poor quality of the maximum likelihood estimates is to be expected since they are hierarchical parameters not directly estimated from the data.

Fig. 1. Diagnostics for the data runs (Reuters-21578: EW/C and EC; Google bigrams: EW/C, EC, and EC/D; each plotted against K = component dimension)

The components found from the bigram data vary for different values of K. For small K (e.g., 10), general word forms such as verbs, adjectives, etc. are found. As K increases (20,30,50) these break out into "people verbs", "internet nouns," etc. Once K increases to about 300, the components include things like months, measurements, US states, democracy verbs, emotions, media formats, body parts, and more abstract components such as "aspects of a thing," "new ideas," etc. It is this unfolding of components for increasing K that is most remarkable about the method, and in complete contrast to PCA which simply adds components to the existing top K.
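The diagnostics can be computed from a fitted model as in the sketch below (mine, not the paper's C code); it reads "raised to power 2" as the perplexity-style quantity 2 to the entropy, which is an assumption.

import numpy as np

def entropy2(p, eps=1e-12):
    """Shannon entropy in bits of a probability vector."""
    p = np.asarray(p, dtype=float) + eps
    p /= p.sum()
    return float(-(p * np.log2(p)).sum())

def expected_words_per_component(Omega, component_probs):
    """EW/C: 2 to the conditional entropy of words given components."""
    cond = sum(q * entropy2(row) for q, row in zip(component_probs, Omega))
    return 2.0 ** cond

def expected_components_per_document(doc_proportions):
    """EC/D: 2 to the entropy of the component proportions m, averaged over documents."""
    return 2.0 ** float(np.mean([entropy2(m) for m in doc_proportions]))

def expected_components(component_probs):
    """EC: 2 to the entropy of the observed component probabilities."""
    return 2.0 ** entropy2(component_probs)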
6 Conclusion
The model presented here is a multinomial analogue to PCA of the kind espoused by Collins et al. [5], where a multinomial mean is a convex combination of multinomial components (Sct. 2). It is also a model that assigns a single component/topic to each word of a document (Sct. 4.2, also [4]). Thus I support the name multinomial PCA: “non-negative matrix factorization” is better descriptive of a non-statistical matrix analysis task, “probabilistic LSI” has a clear comparison with LSI [2] but fails to indicate its far broader applicability, and “latent Dirichlet” focuses on a minor aspect (e.g., replacing the Dirichlet with an entropic prior leaves rewrite rules almost unchanged). The theorems presented here have extended the algorithm of Blei et al. [4], simplified their proof, and clarified the relationship with the earlier algorithms, NMF and pLSI. The relationship is rather like k-means clustering versus EM clustering: they use the same generative model but differ in optimization criteria.
Thus while the goal is analogous to PCA, the results of these methods depart radically from PCA and provide a rich source of new opportunities in analyzing discrete data. Finding greater numbers of components does not simply add more components, but rather develops refined components at a completely different scale. Multinomial PCA is therefore ideal for hierarchical analysis. Moreover, current versions of mPCA can have many components per document and it is only due to the small size of the newswires that typically there are 2 components per newswire. In the bigram task, components per word kept increasing beyond 30, highlighting that current mPCA algorithms are really suited for dimensionality reduction and not as a form of relaxed clustering that allows probabilistic assignment across just a few classes.
Acknowledgements

An earlier theoretical basis for this research was developed in NASA subcontract NAS2-00065 for the QSS Group in June 2001, and the bigram modeling as a machine learning research consultant for Google, Inc. with Peter Norvig in Sept.–Oct. 2001.
References [1] Tipping, M., Bishop, C.: Mixtures of probabilistic principal component analysers. Neural Computation 11 (1999) 443–482 23, 24 [2] Hofmann, T.: Probabilistic latent semantic indexing. In: Research and Development in Information Retrieval. (1999) 50–57 23, 31, 33 [3] Lee, D., Seung, H.: Learning the parts of objects by non-negative matrix factorization. Nature 401 (1999) 788–791 23, 31 [4] Blei, D., Ng, A., Jordan, M.: Latent Dirichlet allocation. In: NIPS*14. (2002) to appear. 23, 28, 29, 31, 33 [5] Collins, M., Dasgupta, S., Schapire, R.: A generalization of principal component analysis to the exponential family. In: NIPS*13. (2001) 23, 24, 33 [6] Hall, K., Hofmann, T.: Learning curved multinomial subfamilies for natural language processing and information retrieval. In: ICML 2000. (2000) 23, 31 [7] Sjolander, K., Karplus, K., Brown, M., Hughey, R., Krogh, A., Mian, I. S., Haussler, D.: Dirichlet mixtures: A method for improved detection of weak but signicant protein sequence homology. Computer Applications in the Biosciences 12 (1996) 327–345 23 [8] Brand, M.: Structure learning in conditional probability models via an entropic prior and parameter extinction. Neural Computation 11 (1999) 1155–1182 24, 28 [9] Roweis, S.: EM algorithms for PCA and SPCA. In: NIPS*10. (1998) 24 [10] Ghahramani, Z., Beal, M.: Propagation algorithms for variational Bayesian learning. In: NIPS. (2000) 507–513 25, 26 [11] Buntine, W.: Computation with the exponential family and graphical models. unpublished handouts, NATO Summer School on Graphical Models, Erice, Italy (1996) 25 [12] Minka, T.: Estimating a Dirichlet distribution. Course notes (2000) 25 [13] Jordan, M., Ghahramani, Z., Jaakkola, T., Saul, L.: An introduction to variational methods for graphical models. Machine Learning 37 (1999) 183–233 27
Learning and Inference for Clause Identification

Xavier Carreras¹,∗, Lluís Màrquez¹,∗∗, Vasin Punyakanok², and Dan Roth²,∗∗∗

¹ TALP Research Center – LSI Department, Universitat Politècnica de Catalunya
{carreras,lluism}@lsi.upc.es
² Department of Computer Science, University of Illinois at Urbana-Champaign
{punyakan,danr}@cs.uiuc.edu

∗ Supported by a grant from the Catalan Research Department.
∗∗ This research is partially funded by the Spanish Research Department (TIC20000335-C03-02, TIC2000-1735-C02-02) and the EC (NAMIC IST-1999-12392).
∗∗∗ Supported by NSF grants IIS-99-84168, ITR-IIS-00-85836 and an ONR MURI award.
Abstract. This paper presents an approach to partial parsing of natural language sentences that makes global inference on top of the outcome of hierarchically learned local classifiers. The best decomposition of a sentence into clauses is chosen using a dynamic programming based scheme that takes into account previously identified partial solutions. This inference scheme applies learning at several levels—when identifying potential clauses and when scoring partial solutions. The classifiers are trained in a hierarchical fashion, building on previous classifications. The method presented significantly outperforms the best methods known so far for clause identification.
1 Introduction
Partial parsing is studied as an alternative to full-sentence parsing. Rather than producing a complete analysis of sentences, the alternative is to perform only partial analysis of the syntactic structures in a text [6,1,5]. There are several possible levels of partial parsing—from the identification of base noun phrases[9] to the identification of several kinds of “chunks” [1,12] and to the identification of embedded clauses [1]. While earlier work in this direction concentrated on manual construction of rules, most of the recent work has been motivated by the observation that partial syntactic information can be extracted using local information—by examining the pattern itself, its nearby context and the local part-of-speech information. Thus, over the past few years there has been a lot of work on using statistical learning methods to recognize partial parsing patterns—syntactic phrases or words that participate in a syntactic relationship [9,7,8,2,12,3]. Earlier learning works on partial parsing have used mostly local classifiers; each detects the
beginning or end of a phrase of some type (noun phrase, verb phrase, etc.) or determines, for each word in the sentence, whether it belongs to a phrase or not. Recent work on this problem has achieved significant improvement by using global inference methods to combine the outcomes of these classifiers in a way that provides a coherent inference that satisfies some global constraints, for example, non-overlapping constraints [8,7]. The work presented here can be viewed as an extension of this approach to a more involved partial parsing problem. In this paper we study a deeper level of partial parsing, that of clause identification. A clause is a sequence of words in a sentence that contains a subject and a predicate [13]. The problem is to split a sentence into clauses, as in, ( Coach them in (handling complaints ) ( so that ( they can resolve problems immediately )). ) This problem has been found more difficult than simply detecting non-overlapping phrases in sentences [13]. Existing approaches to it use a large number of local classifiers to determine the beginning and end of clauses, as well as the embedding level of the clause. The work presented here builds on the success of a phrase identification approach that uses inference on top of learned classifier; it develops a scheme that allows the use of global information to combine local classifiers for the finer and more difficult task of identifying embedded clauses. The approach is related also to methods used in bottom-up parsing methods [4]. The key difference is that, as in the inference with classifiers approach in [8], all the information sources used in the inference are derived from hierarchical classifiers that are applied within a recursive scheme. Specifically, the best decomposition of a sentence into clauses is chosen using a dynamic programming scheme that takes into account previously identified partial solutions. This scheme applies learning at several levels, when identifying beginnings and ends of potential clauses and when scoring partial solutions. The classifiers are trained from annotated data in a hierarchical fashion, built on previous classifications. This work develops a general framework for clause identification that, while being more complex than previous approaches, is derived in a principled way, based on a clear formalism. In particular, the inference scheme can take several scoring functions that could be derived in different ways and make use of different information sources. We exemplify this by experimenting using three different scoring functions.
2 Clause Identification
Basic Definitions. Let wi be the i-th word in a sentence. Let wst denote the sentence fragment or sequence of words ws , ws+1 , ..., wt and, in particular, let w1n represent a sentence. In this paper we do not consider clause types. In this setting, thus, a clause c is an element of the set C = {(ws , wt )|1 ≤ s ≤ t ≤ n}. For brevity, from now on we will denote a clause simply using the indices of
the words of the sentence, and therefore a clause will be an element of the set C = {(s, t) | 1 ≤ s ≤ t ≤ n}. Given two clauses c1 = (s1, t1) and c2 = (s2, t2), we say that c1 and c2 are equal, denoted by c1 = c2, iff s1 = s2 and t1 = t2. We define that c1 and c2 overlap iff s1 < s2 ≤ t1 < t2 or s2 < s1 ≤ t2 < t1, and we note it as c1 ∼ c2. Furthermore, we define that c1 is embedded in c2 iff s2 ≤ s1 ≤ t1 ≤ t2 and c1 ≠ c2, and we note it as c1 < c2. A clause split for a sentence is a coherent set of clauses of the sentence, that is, a subset of C whose clauses do not overlap. Formally, a clause split can be seen as an element S of the set S = {S ⊆ C | ∀ c1, c2 ∈ S, c1 ≁ c2}. We will refer to a clause without any embedded clause as a base clause, and to a clause which embeds other clauses as a recursive clause.

Goal and Evaluation Metrics. The goal of the clause identification problem is to predict a clause split S^P for a sentence which "guesses" the correct split S^C for the sentence. For evaluating the task in a set of N sentences, the usual precision (P), recall (R) and F_{β=1} measures are used (| · | denotes the number of elements in a set):

P = \frac{\sum_{i=1}^{N} |S_i^C ∩ S_i^P|}{\sum_{i=1}^{N} |S_i^P|},    R = \frac{\sum_{i=1}^{N} |S_i^C ∩ S_i^P|}{\sum_{i=1}^{N} |S_i^C|},    F_{β=1} = \frac{2PR}{P + R}

Clause Identification in a Language Processor. A clause splitter is intended to be used after a part-of-speech (pos) tagger and a chunk parser. pos tags are the syntactic categories of words. Chunks are sequences of consecutive words in the sentence which form the basic syntactic phrases, subject to the constraints that chunks cannot overlap or have embedded chunks. In the example

( [Balcor] NP , ( [which] NP ( [has] VP [interests] NP [in] PP [real estate] NP )) , [said] VP ( [the position] NP [is newly created] VP ) . )

the chunks are annotated together with their types between square brackets, while the clause split is annotated with parentheses. In a correct syntactic tree, clause boundaries are always at some chunk boundaries. However, in a real system chunk boundaries may be imperfect, so our formalization allows the violation of this constraint.
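The evaluation metrics defined above translate directly into code; the sketch below (not from the paper) represents each clause as a (start, end) pair and each split as a set of such pairs.

def evaluate(correct_splits, predicted_splits):
    """Micro-averaged precision, recall and F_{beta=1} over clause splits.

    correct_splits, predicted_splits: lists (one entry per sentence) of sets of (s, t) pairs.
    """
    hits = sum(len(c & p) for c, p in zip(correct_splits, predicted_splits))
    n_pred = sum(len(p) for p in predicted_splits)
    n_corr = sum(len(c) for c in correct_splits)
    precision = hits / n_pred if n_pred else 0.0
    recall = hits / n_corr if n_corr else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Example with a single sentence: two correct clauses, one of them recovered exactly.
print(evaluate([{(1, 9), (4, 6)}], [{(1, 9), (5, 6)}]))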
3 Inference Scheme
The decision scheme used for splitting a sentence into clauses includes two main tasks: 1) identifying single clauses in the sentence, that is, building the set C; and 2) selecting the clauses which form the optimal coherent split, that is, choosing the best element S in S.

3.1 Identifying Clauses
The identification of clauses is done in two steps: the first identifies candidate clauses in the sentence and the second scores each candidate. We define an S
point of the sentence as a word at which clauses may start, and similarly we define an E point of the sentence as a word at which clauses may end. The first step consists of two functions, spoint(w) and epoint(w) which, given a word w, decide whether it is an S point and an E point, respectively. Each pair of S and E points, where E is not before S, is considered a clause candidate of the sentence. The S and E identification step reduces the space of clauses to be combined to form the solution, as a way to make the problem computationally feasible. The second step scores each clause candidate of the sentence. This consists of a function score(s,t) which, given a clause candidate (s, t), outputs a real number. Its sign is interpreted as an indication of whether the candidate is a clause (positive) or not; the magnitude of the score is interpreted as the confidence of the decision.

3.2 Selecting the Clause Split
Given a set of scored clauses in the sentence, a coherent subset S must be selected as the clause split for the sentence. Our criterion of optimality for a clause split is the maximization of the summation of the scores of the clauses in the split:

S^P = \arg\max_{S ∈ S} \sum_{(s,t) ∈ S} score(s, t).

Given a matrix BestSplit[s,t] that for each pair of words w_s and w_t stores the best split found in w_s^t, the best split for the whole sentence can be found at BestSplit[1,n]. Using dynamic programming, the matrix can efficiently be filled by exploring the sentence bottom-up.

3.3 A General Algorithm
Although we have described the whole process as two separate tasks, we want to perform them together. The main reason is that when a clause candidate is considered, we want to take advantage of the clause structure that is possibly embedded inside the candidate. The idea is that, syntactically, a clause c1 acts as an atomic constituent inside a clause c2 which embeds c1 so that, when considering c2 , all the constituents which form c1 can be reduced to a single constituent, making the structure of c2 simpler (which may affect the scoring function). The general algorithm is presented in Fig.1 as a recursive function. Two bidimensional matrices are maintained: BestSplit[s,t] stores the optimal split found in wst ; Score[s,t] stores the score for the clause candidate (s, t). The call to the function optimal clause split(1,n) explores the whole sentence and stores the optimal clause split for the sentence in BestSplit[1,n]. The first block of the function ensures the completeness of the exploration by making two recursive calls on the sentence fragments, one without the word at the end and the other without the word at the beginning. By induction, after the recursive calls all the clause splits inside the current sentence fragment are
function optimal clause split(s, t)
  if (s ≠ t) then
    optimal clause split(s, t − 1)
    optimal clause split(s + 1, t)
  π := { BestSplit[s, r] ∪ BestSplit[r + 1, t] | s ≤ r < t }
  S* := arg max_{S ∈ π} Σ_{(k,l) ∈ S} Score[k, l]
  if (spoint(s) and epoint(t)) then
    Score[s, t] := score(s, t)
    if (Score[s, t] > 0) then
      S* := S* ∪ {(s, t)}
  BestSplit[s, t] := S*
end function

Fig. 1. General algorithm for clause splitting
identified. The second block of the function computes the optimal split for the current sentence fragment. First, the optimal split is selected as the best union of two disjoint splits which cover the whole fragment. Then, the clause candidate for the current fragment is considered. If the score function classifies the current clause as positive, it is added to the optimal split. In the next section we will discuss several settings for this function. The solution given by the algorithm is guaranteed to be coherent by construction. A clause split is constructed by joining two disjoint clause splits, and only a clause which embeds all the clauses in the split may be added. Note that the algorithm, as described in Fig.1, repeats recursive calls. This recalculation is not needed and can be easily avoided by keeping track of the visited sentence fragments. It can also be noticed that a function call is relevant only if the fragment considered is bounded by an S point and an E point, and the algorithm can be adapted for avoiding unnecessary calls. In general, a sentence requires a function call for each clause candidate and there is a quadratic number of clause candidates over the n words in the sentence. The function requires a linear time for selecting the optimal split plus the cost of the scoring function. Thus, identifying a clause split in a sentence will take time O(n2 (n + cost(score))). 3.4
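A bottom-up, memoized rendering of the Fig. 1 algorithm is sketched below (my own code, not the authors'); spoint, epoint and score are assumed to be supplied by the learned classifiers, and score is passed the inner split explicitly rather than reading it from the BestSplit matrix.

def clause_split(n, spoint, epoint, score):
    """Return the best coherent clause split over words 1..n.

    spoint(s), epoint(t): booleans marking candidate start and end points.
    score(s, t, inner_split): real-valued confidence for candidate (s, t).
    """
    best = {}        # (s, t) -> (total score, set of clauses)
    for length in range(0, n):
        for s in range(1, n - length + 1):
            t = s + length
            # Best union of two disjoint splits covering the fragment (s, t).
            total, split = 0.0, set()
            for r in range(s, t):
                cand = best[(s, r)][0] + best[(r + 1, t)][0]
                if cand > total:
                    total, split = cand, best[(s, r)][1] | best[(r + 1, t)][1]
            # Consider (s, t) itself as a clause candidate.
            if spoint(s) and epoint(t):
                sc = score(s, t, split)
                if sc > 0:
                    total, split = total + sc, split | {(s, t)}
            best[(s, t)] = (total, split)
    return best[(1, n)][1]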
Scoring Functions
In this section we describe particular settings of the function score. Given a clause candidate, this function has to predict a score for the candidate being a clause in the sentence. It is defined as a composition of classifiers, each of which, given a clause candidate, output a real number that encodes its prediction (sign) and confidence (magnitude). Below we define such classifiers, and in Sect.4 we describe the learning process. Let (s, t) be the clause candidate to be scored. The variants are the following:
40
Xavier Carreras et al.
Plain Scoring. The clause structure is not considered. A clause is scored by a classifier, which we refer to as plain, which recognizes clauses as plain structures. This scoring function is independent of the decision taken inside. score(s,t) = plain(s, t) The cost of this scoring function is the cost of the plain classifier. Structured Scoring. In this setting, the score of a clause depends on its internal structure. This may require exploring all possible subclauses which, to guarantee optimal solutions, may be exponentially expensive. We present here two variations of scoring functions that provide some trade-off between the computational cost and the global optimality. Let π be the set computed by the algorithm in Fig.1, which contains a linear number of splits, and let S ∗ be the optimal element in π. We define the cascade function C(f1 , f2 , . . . , fn ) as a function which returns f1 if f1 > 0 and C(f2 , . . . , fn ) otherwise, having C() = −1. Best Split Scoring. The optimal split is considered for the scoring. The function is composed of three classifiers. A base classifier recognizes base clauses, i.e. clauses that do not embed other clauses. A recC classifier recognizes clauses assuming that the complete split of clauses inside the candidate is given. Finally, a recP classifier recognizes clauses assuming that only a partial split of clauses is given. When no clause has been identified inside the candidate, the function first applies the base classifier, and if it predicts false, it applies the recP classifier, assuming that initial clauses were missed. When a split is given, the function cascades the three classifiers. The recC classifier may give an accurate prediction if the split is correct and complete. If it predicts false, the classifier recP is applied, assuming that some clauses in the split were missed. If it also predicts false, the candidate is tested as a base clause, despite the identified split. This function score depends both on the clauses identified inside and the choice of the optimal split. It is designed to overcome misses in the given split, but incorrect clauses in the split may damage the performance. if S ∗ = ∅ C(base(s, t), recP (s, t)) score(s,t) = ∗ ∗ C(recC (s, t, S ), recP (s, t, S ), base(s, t)) otherwise The computational complexity of this scoring function is the cost of the involved classifiers. Linear Average Scoring. All the splits in π are considered for scoring the candidate. The function uses the same classifiers as in the Best Split Scoring. The idea here is that the confidence of a clause depends on all the clause structures that can be embedded inside, not only on the optimal. Thus, the score of the clause is given by the function avg+ which computes the average only over the scores
Learning and Inference for Clause Identification
41
which give positive evidence. As in the previous function, incorrect clauses in the splits may damage the performance. C(base(s, t), recP (s, t)) if π = ∅ score(s,t) = C(rec (s, t, S), rec (s, t, S)), base(s, t)) otherwise C(avg+ C P S∈π This scoring function requires linear exploration of the structure, and hence its cost is n times the cost of the classifiers.
4
Learning the Decisions
Here we describe the learning process of the functions involved in the system. When identifying candidates two classifiers are involved, spoint and epoint. The scoring functions use up to four classifiers, namely plain, base, recC and recP . We use AdaBoost with confidence rated predictions as the learning method. 4.1
AdaBoost
The purpose of boosting algorithms is to find a highly accurate classification rule by combining many base classifiers. In this work we use the generalized AdaBoost algorithm presented in [10] by Schapire and Singer. This algorithm has been applied, with significant success, to a number of problems in different research areas, including NLP tasks [11]. Let (x_1, y_1), ..., (x_m, y_m) be the set of m training examples, where each x_i belongs to an input space X and y_i ∈ Y = {+1, −1} is the corresponding class label. AdaBoost learns a number T of base classifiers, each time presenting to the base learning algorithm a different weighting over the examples. A base classifier is seen as a function h : X → R. The output of each h_t is a real number whose sign is interpreted as the predicted class, and whose magnitude is the confidence in the prediction. The AdaBoost classifier is a weighted vote of the base classifiers, given by the expression f(x) = Σ_{t=1}^{T} α_t h_t(x), where α_t represents the weight of h_t inside the whole classifier. Again, the sign of f(x) is the class of the prediction and the magnitude is its confidence. The base classifiers we use are decision trees of fixed depth. The internal nodes of a decision tree test the value of a Boolean predicate (e.g. "the first word of a clause candidate is that"). The leaves of a tree define a partition over the input space X, and each leaf contains the prediction of the tree for the corresponding part of X. We follow the criterion presented in [10] for growing base decision trees and computing the predictions in the leaves. A maximum depth is used as the stopping criterion.
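The snippet below illustrates only the prediction side of this combined classifier, i.e. the confidence-rated vote f(x) = Σ_t α_t h_t(x); the boosting updates of [10] are not reproduced, and the toy base classifiers and feature names are hypothetical stand-ins.

```python
# Sketch of an AdaBoost-style combined classifier with confidence-rated base
# classifiers: each h_t maps an example to a real number (sign = class,
# magnitude = confidence) and alpha_t is its weight in the vote.

def boosted_score(base_classifiers, alphas, x):
    """f(x) = sum_t alpha_t * h_t(x); sign(f) is the predicted class."""
    return sum(a * h(x) for a, h in zip(alphas, base_classifiers))

def boosted_predict(base_classifiers, alphas, x):
    return 1 if boosted_score(base_classifiers, alphas, x) >= 0 else -1

# Two toy confidence-rated "trees" over binary feature dictionaries
# (hypothetical predicates, for illustration only):
h1 = lambda x: 0.8 if x.get("first_word_is_that") else -0.3
h2 = lambda x: 0.5 if x.get("starts_chunk") else -0.6
print(boosted_predict([h1, h2], [1.2, 0.7],
                      {"first_word_is_that": True, "starts_chunk": False}))
```
4.2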
Features
An entity to be classified is represented by a set of binary features encoding local and global information in the entity. Features are grouped into several types:
Word Window. A word window of context size n anchored at the word w_i encodes the words in the fragment w_{i−n} ... w_{i+n} along with their position relative to the central word. For each word in the window, its pos forms a feature. For words whose pos is a determiner, conjunction, pronoun or verb, the word form is also a feature. When considered and available, features also encode whether the words are S or E points.

Chunk Window. A chunk window of context size n anchored at the word w_i codifies the chunk containing the word w_i, the previous n chunks and the following n chunks. For each chunk in the window, a feature is formed with the chunk tag and the distance to the central chunk.

Patterns. A pattern represents the structure of a sentence fragment which is relevant for distinguishing clauses. The following elements are considered: a) punctuation marks ('', ``, (, ), ,, ., :) and coordinate conjunctions; b) the word "that"; c) relative pronouns; d) verb phrase chunks; and e) CLAUSE constituents already recognized. A pattern for a fragment w_i ... w_j is a feature formed by concatenating the relevant elements inside the fragment.

Element Counts. Number of occurrences of relevant elements in a sentence fragment. Specifically, we consider the chunks which are verb phrases or relative pronouns, the word that, and the words whose pos is a punctuation mark. Given a sentence fragment, two features are generated for each element, one indicating the count of the element and the other indicating the existence of the element. If a clause split is given, elements inside clauses are not counted.
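As a small illustration of the word-window features, the sketch below builds the binary feature strings for one anchor position; the Penn Treebank pos tags and the set of closed-class tags used here are assumptions of this example, not part of the paper.

```python
# Sketch of word-window feature extraction (illustrative only). Each token is
# a dict with "word" and "pos" fields; the tag set is assumed (Penn Treebank).

CLOSED = {"DT", "CC", "IN", "PRP", "PRP$", "WP", "WDT",
          "MD", "VB", "VBD", "VBG", "VBN", "VBP", "VBZ"}

def word_window_features(tokens, i, n=2):
    """Binary features for a window of size n anchored at position i."""
    feats = set()
    for rel in range(-n, n + 1):
        j = i + rel
        if 0 <= j < len(tokens):
            feats.add(f"pos[{rel}]={tokens[j]['pos']}")
            # word forms only for determiners, conjunctions, pronouns and verbs
            if tokens[j]["pos"] in CLOSED:
                feats.add(f"word[{rel}]={tokens[j]['word'].lower()}")
    return feats

tokens = [{"word": "He", "pos": "PRP"}, {"word": "said", "pos": "VBD"},
          {"word": "that", "pos": "IN"}, {"word": "prices", "pos": "NNS"}]
print(sorted(word_window_features(tokens, 2, n=1)))
```
4.3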
Training the Classifiers
Classifiers in the decision scheme are used dynamically. Here we describe how to generate a static set of examples from a given set of annotated sentences. For the S and E identification, each word in the sentence produces an example to be classified. Since clause boundaries, by definition, only appear at chunk boundaries, we consider only the words at the beginning of a chunk as examples for the spoint classifier and the words at the end of a chunk as examples for the epoint classifier. Consistently, when labeling, the words between chunk boundaries are never considered S or E points. The system works from left to right, by first using the S predictor for the whole sentence and then the E. An example at a word is represented with word and chunk windows, considering the S and E already predicted, and a pattern and counts features for the fragments of the sentence before and after the word. The classifiers in the scoring function receive clause candidates as examples to be classified. The candidates are generated by the S and E identification so, clearly, the classifiers of the scoring functions depend on the performance of the S and E classifiers. In training, given a set of sentences, examples of candidates are generated with the correct set of S and E points plus a set of incorrect points which depends on the previously learned classifiers. Our criterion for selecting
such incorrect points is to use negative examples which are closer to the decision boundary of the spoint and epoint classifiers. Given a set of candidates, we generate and typify training examples into four positive labels ('+1'-'+4') and two negative labels ('-1', '-2') as follows:

- Each candidate which is a base clause generates one example of type '+1'.
- Each candidate which is a recursive clause generates: 1) one example of type '+2', without considering its internal clause split; 2) one example of type '+3', considering its complete clause split; and 3) k examples of type '+4', each considering one of the k partial splits formed by removing clauses from the complete split for up to three levels deep.
- Each candidate which is not a clause generates: 1) one example of type '-1', without considering any clauses inside; and 2) k examples of type '-2', considering possible splits with the clauses inside the candidate generated as in examples of type '+4'.

For training, the plain classifier takes positive examples of type '+1' and '+2', and negative examples of type '-1'. The base classifier takes '+1' positive examples and '-1' negative examples. The recC classifier takes '+3' for positives and '-2' for negatives. Finally, the recP classifier takes positive examples of type '+2', '+3' and '+4', and negative examples of both types. In these classifiers, a candidate is represented by word and chunk windows anchored both at the S and E points of the candidate, a pattern codifying the structure of the candidate, and counts of the relevant elements in the candidate. Note that when a clause split is considered within a candidate, clauses in the split are represented in the pattern as reduced elements and elements inside the clauses are not counted.
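The mapping from the typed candidate examples to the training sets of the four scoring classifiers can be written down directly; the sketch below is only an illustration of that bookkeeping, with hypothetical data structures.

```python
# Which example types feed which classifier, following the labels above.
TRAINING_SETS = {
    "plain": {"pos": {"+1", "+2"},        "neg": {"-1"}},
    "base":  {"pos": {"+1"},              "neg": {"-1"}},
    "recC":  {"pos": {"+3"},              "neg": {"-2"}},
    "recP":  {"pos": {"+2", "+3", "+4"},  "neg": {"-1", "-2"}},
}

def select_examples(classifier, typed_examples):
    """typed_examples: iterable of (features, type_label) pairs."""
    spec = TRAINING_SETS[classifier]
    return [(feats, +1 if label in spec["pos"] else -1)
            for feats, label in typed_examples
            if label in spec["pos"] or label in spec["neg"]]
```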
5
Experiments
In this section we describe the experiments we performed to evaluate the presented algorithm with its variations.

CoNLL 2001 Corpus. We used the Penn Treebank as data for training and testing the clause system, following the setting of the CoNLL 2001 shared task [13]. WSJ sections 15 to 18 were used as training material (8,936 sentences), section 20 as development material (2,012 sentences), and section 21 as test data (1,671 sentences).1 The data sets contain sentences with the words, the clause split solution, and automatically tagged pos tags and chunks.

Baseline: Open-Close. The best system presented in the CoNLL task, which we call the Open-Close [3], is used as the baseline for comparison. The scheme it follows first identifies the S and E points in a sentence. Then, an open classifier decides how many open brackets correspond to each S point. After that, for
1 Corpus freely available at http://lcg-www.iua.ac.be/conll2001/clauses.
each open bracket a close classifier scores each E point as a closing point for the bracket. A final procedure ensures the coherence of the solution by choosing the most confident decisions that form a correct split. The Open-Close scheme has a close relation to the Plain Scoring approach presented in this work, in the sense that the close and the plain classifiers score clause candidates in a similar way.

Training classifiers. All the classifiers involved in the scheme were trained using base decision trees of depth 4 (four levels of predicates plus the leaves with the predictions). Initial experiments showed a great improvement in using depths around 4 rather than the usual decision stumps (depth 1). Only features with more than three occurrences in the training data were considered. Up to 4,000 trees were learned for each classifier, and the optimal number was selected as the one with the best Fβ=1 measure on the development set.

Evaluating the scoring functions. In the first experiment we compared the performance of the three proposed scoring functions. The classifiers involved in the functions were learned without considering incorrect S and E points in the training set. Table 1 shows in the first three columns the results for each scoring function on the development set, together with the results of the Open-Close. The performance of the S and E point identification was 93.89% and 90.12% in F1, respectively. Regarding the three results, the Best Split Scoring obtained the best rates. The Plain Scoring obtained the same recall but less precision. Our hypothesis is that considering reduced clauses simplifies the structures to be classified and yields more precise predictions. The Linear Average Scoring is significantly worse than the other variants. Thus, it seems that in this problem taking into account the optimal identified structure helps the decisions, but further exploration of non-optimal solutions confuses them. Compared to the best results in CoNLL, both the Best Split and Plain Scoring variants significantly outperformed the Open-Close method. In order to show the bottleneck that the S and E identification introduces, we ran the systems considering the correct S and E points instead of using the predictions. The results are shown on the right side of Table 1. In this ideal setting, the performance is very good, clearly indicating that errors in the S and
Table 1. Results on the development set. Overall performance of the presented variants, using the predicted (left) or the correct (right) S and E points

                               predicted S E                    correct S E
                           prec.    rec.    Fβ=1           prec.    rec.    Fβ=1
Open-Close                87.18%  82.48%  84.77%          98.12%  96.16%  97.13%
Plain Scoring             88.33%  83.92%  86.07%          95.04%  96.14%  95.59%
Best Split Scoring        89.10%  83.92%  86.44%          96.74%  96.90%  96.82%
Linear Average Scoring    87.60%  81.91%  84.66%          95.88%  95.74%  95.81%
Robust Best Split Scoring 92.53%  82.48%  87.22%          97.47%  90.29%  93.74%
E layer significantly affect the general performance. The Best Split Scoring is again better than the other scoring variants. Here the Open-Close achieves a very high precision, possibly due to a heuristic which opens a clause at each S point.

Robust Training. The false positive errors in the S and E identification produce clause candidates that have not been considered when training the scoring classifiers. In this experiment we retrained such classifiers generating better sets of negative examples, and exploring different proportions of negative examples. As described in Sect. 4.3, such training examples were generated with the correct set of S and E points plus a P% of incorrect points, selecting those which were close to the decision boundary of the learned spoint and epoint classifiers, respectively. We used P values ranging from 0 to 100. In general, the higher the P, the more precise the classifiers we obtained. The Plain Scoring, despite the improvement in precision, did not improve the F measure because the recall rates dropped faster. The Linear Average Scoring slightly improved its F rate, but did not outperform the other variants on the default training. Finally, the Best Split Scoring obtained significant improvements: the best performance was achieved when adding 20% of incorrect S predictions and 40% of incorrect E predictions, giving an F rate of 87.22%. Table 1 (left) shows the results only for this improved model. Naturally, the performance using the correct S and E points (Table 1, right) deteriorated when incorrect predictions were also used. Table 2 presents the results obtained by the different scoring functions on the test set, together with the Open-Close results and the S and E performance. As observed in the systems tested in the CoNLL competition [13], the test set seems to be harder than the development set. Again, the Best Split Scoring performs significantly better than the other approaches, and the robust training of the function, with the setting tuned on the development set, yields a significant improvement in precision and the F measure.
Table 2. Results on the test set, both for the S and E identification (above) and the general performance of the scoring functions (below)

                            prec.    rec.    Fβ=1
S points                   93.96%  89.59%  91.72%
E points                   90.04%  88.41%  89.22%
Open-Close                 84.82%  73.28%  78.63%
Plain Scoring              85.25%  74.53%  79.53%
Best Split Scoring         86.44%  74.41%  79.98%
Linear Average Scoring     86.53%  72.54%  78.92%
Robust Best Split Scoring  90.18%  72.59%  80.44%
6
Conclusions and Future Work
We have presented a framework for the identification of embedded structure in sentences and investigated experimentally several instantiations of it. All the decisions involved in the scheme are derived using learned classifiers, and thus it is a scheme for doing inference with classifiers. We have shown that this approach improves over the top-performing clause identification system. Moreover, we believe that the general framework developed here can be generalized to the identification of embedded structures in other structure learning problems, such as information extraction and other natural language processing tasks, and this is one of the important directions that we intend to explore in future work. Several questions remain open with respect to the specific problem studied here. The key one is that of incorporating the chunk parsing stage into the framework rather than using its outputs. The idea is to maintain the ambiguity in the classification longer, perhaps until it can be resolved using other information sources, as our framework suggests. Other problems include investigating the use of additional linguistic knowledge, such as the type of the clause, and avoiding the significant bottleneck introduced by the S and E layer.
References
1. S. P. Abney. Parsing by chunks. In R. C. Berwick, S. P. Abney, and C. Tenny, editors, Principle-Based Parsing: Computation and Psycholinguistics, pages 257-278. Kluwer, Dordrecht, 1991.
2. S. Buchholz, J. Veenstra, and W. Daelemans. Cascaded grammatical relation assignment. In EMNLP-VLC'99, the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, June 1999.
3. Xavier Carreras and Luís Màrquez. Boosting trees for clause splitting. In Proceedings of CoNLL-2001, pages 73-75, Toulouse, France, 2001.
4. J. Goodman. Parsing algorithms and metrics. In Proceedings of the 34th Annual Meeting of the ACL, pages 177-183, 1996.
5. G. Grefenstette. Evaluation techniques for automatic semantic extraction: comparing semantic and window based approaches. In ACL'93 Workshop on the Acquisition of Lexical Knowledge from Text, 1993.
6. Z. S. Harris. Co-occurrence and transformation in linguistic structure. Language, 33(3):283-340, 1957.
7. M. Muñoz, V. Punyakanok, D. Roth, and D. Zimak. A learning approach to shallow parsing. In EMNLP-VLC'99, the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, June 1999.
8. V. Punyakanok and D. Roth. The use of classifiers in sequential inference. In NIPS-13: The 2000 Conference on Advances in Neural Information Processing Systems, 2001.
9. L. A. Ramshaw and M. P. Marcus. Text chunking using transformation-based learning. In Proceedings of the Third Annual Workshop on Very Large Corpora, 1995.
10. R. E. Schapire and Y. Singer. Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37(3):297-336, 1999.
11. R. E. Schapire. The boosting approach to machine learning: An overview. In Proceedings of the MSRI Workshop on Nonlinear Estimation and Classification, 2002.
12. E. F. Tjong Kim Sang and S. Buchholz. Introduction to the CoNLL-2000 shared task: Chunking. In Proceedings of CoNLL-2000 and LLL-2000, pages 127-132, 2000.
13. Erik F. Tjong Kim Sang and Hervé Déjean. Introduction to the CoNLL-2001 shared task: Clause identification. In Walter Daelemans and Rémi Zajac, editors, Proceedings of CoNLL-2001, pages 53-57, Toulouse, France, 2001.
An Empirical Study of Encoding Schemes and Search Strategies in Discovering Causal Networks Honghua Dai, Gang Li, and Yiqing Tu School of Computing and Mathematics, Deakin University 221 Burwood Highway, Vic 3125, Australia {hdai,gangli}@deakin.edu.au
[email protected]
Abstract. Efficiently inducing precise causal models that accurately reflect given data sets is the ultimate goal of causal discovery. The algorithm proposed by Wallace et al. [10] has demonstrated its ability to discover Linear Causal Models from data. To explore ways to improve efficiency, this research examines three different encoding schemes and four search strategies. The experimental results reveal that (1) the "specifying parents" encoding method is the best among the three encoding methods we examined; and (2) in the discovery of linear causal models, local hill-climbing works very well compared to other more sophisticated methods, like Markov Chain Monte Carlo (MCMC), Genetic Algorithm (GA) and Parallel MCMC searching.
1
Introduction
Graphical Models are a powerful knowledge representation and reasoning tool under uncertainty [8]. However, the manual construction of Graphical Models is usually time-consuming and subject to mistakes. Therefore, algorithms for automatic construction, which occasionally use information provided by an expert, can be of great help [5]. As Graphical Models can often be plausibly understood as describing causal relations, the automatic construction of Graphical Models is usually referred to as Causal Discovery. In the social sciences, there is a class of limited Graphical Models, usually referred to as Linear Causal Models, including Path Diagrams [12] and Structural Equation Models [1]. In Linear Causal Models, effect variables are strictly linear functions of exogenous variables. Although this is a significant limitation, its adoption allows for a comparatively easy environment in which to develop causal discovery algorithms. In 1996, Wallace et al. successfully introduced an information theoretic approach to the discovery of Linear Causal Models. This algorithm uses Wallace's Minimum Message Length (MML) criterion [9] to evaluate and guide the search for Linear Causal Models, and their experiments indicated that the MML criterion is capable of recovering a Linear Causal Model which is quite an accurate reflection of the original model [10]. In 1997, Dai et al. further studied the reliability and robustness issues in causal discovery [2]; they closely examined the
relationships among the complexity of the causal model to be discovered, the strength of the causal links, the sample size of the given data set and the discovery ability of individual causal discovery algorithms. Two main issues involved in the process of Causal Discovery using MML are Encoding and Searching. In order to improve the efficiency of the discovery algorithm, an optimal encoding scheme and an efficient search strategy are required. In this paper, we examine three different encoding schemes for describing the structure of Linear Causal Models, and compare four different search strategies to explore possibilities for improving discovery efficiency while preserving discovery accuracy. The paper is organized into 5 sections. Section 2 describes two structure encoding schemes proposed in [10], and gives a new encoding scheme. Section 3 describes four different search strategies. In Sec. 4 the three encoding schemes and four search strategies are compared. Finally, we conclude this paper in Sec. 5.
2
Causal Structure Encoding Schemes
As reported in [10], the basic idea of causal discovery via MML is that an encoding scheme based on the minimum message length principle needs to be provided to describe

1. the causal structure, which is a Directed Acyclic Graph (DAG) for a Linear Causal Model;
2. the strength (model parameters) of the causality in the Linear Causal Model;
3. the data, assuming the Linear Causal Model is true.

For each candidate model from the model space, we calculate the total message length based on the given data, and the one with the shortest total message length will be chosen as the best model. According to information theory, the total message length L(M, D) is given by

L(M, D) = −log P(M) − log P(D|M) = L(M) + L(D|M)    (1)
where L(M) is the cost of encoding the causal model M, and L(D|M) is the cost of encoding the given data D assuming the model M is true. As a model is represented by a DAG and the path coefficients, L(M) is composed of two main parts: the cost of encoding the causal structure, L^(s), and the cost of encoding the model parameters, L^(p), i.e.,

L(M) = L^(s) + L^(p)    (2)
In general, the encoding scheme for describing the model parameters and the given data is relatively stable and mature. Here we mainly examine the encoding schemes for describing the causal structure, a Directed Acyclic Graph (DAG).
2.1
Scheme 1: Specifying a Total Ordering and Arc Connections
A Directed Acyclic Graph (DAG) with K nodes can be encoded by specifying a total ordering (requiring log K! bits) and specifying which pairs of nodes are connected1; this requires K(K−1)/2 bits on the assumption that the probability that a link is present is 1/2. This corresponds to maximal ignorance about the degree of connectedness of the graph. We avoid the use of explicit prior information about the causal models we are looking for. It is enough to specify the presence or absence of arcs, since directionality is implied by the ordering already provided. Since more than one ordering is consistent with the DAG, actually specifying a particular ordering is inefficient, so we reduce the message length by the number of bits needed to select among the φ total orderings consistent with the DAG. Hence,

L1^(s) = log K! + K(K − 1)/2 − log φ.    (3)
2.2
Scheme 2: Specifying Total Acyclic Orientations
The second method for calculating the cost of describing a DAG begins by specifying the undirected graph, which costs K(K−1)/2 bits, and then specifies the particular direction which each arc is to assume such that the result is an acyclic graph. That is, we count the number of possible acyclic orientations; the logarithm of that number is the number of additional bits required. In order to do this count, we can subtract the number of cyclic orientations ρ from the number of total orientations, which is 2^ν, where ν is the number of undirected arcs. Hence,

L2^(s) = K(K − 1)/2 + log(2^ν − ρ).    (4)
Previous experimental results show that these two methods result in MML costs that are very close for a wide variety of simple graph structures we tested [10], so we can expect that the choice of encoding method will make little difference to experimental results. In practice, until the introduction of L3^(s), our implementation of L1^(s) was the faster one. To further improve the efficiency of the discovery algorithm, we introduce the following new encoding scheme.
2.3
Scheme 3: Specifying Parents Set
The structure of a Directed Acyclic Graph can be described by specifying its parent set Parents(x) for each node of the DAG. This description consists of the number of parents, followed by the index of the set Parents(x) in some enumeration of all sets of its cardinality. So the cost for encoding the causal structure can be calculated using

L3^(s) = Σ_i [ log K + log (K choose r_i) ]    (5)
1 Schemes 1 and 2 were introduced by Wallace, Korb and Dai in 1996 [10].
where r_i = |Parents(X_i)|. To avoid intensive computation in calculating log (K choose r_i), we use Stirling's approximation formula x! ≈ x^x e^(−x) √(2πx), so we get

log (K choose r_i) ≈ (K − r_i) log( K / (K − r_i) ) + r_i log( K / r_i )    (6)

Thus, we have

L3^(s) = Σ_i [ log K + (K − r_i) log( K / (K − r_i) ) + r_i log( K / r_i ) ]    (7)

This encoding scheme works much faster than using the formulas L1^(s) and L2^(s). This can be seen from the experimental results reported in Section 4.
3
Model Space Search Strategies
For a given data set, the number of possible causal structures is exponential in the number of variables. To find the best structure in this huge space, an efficient search strategy is essential [5].2 Hill-climbing search was used in our previous work [3, 10].3 In the past decade, there has been an increasing amount of work on the application of Markov Chain Monte Carlo and Evolutionary Algorithms to complex learning, search and optimization problems. In 1993, Madigan proposed the MC3 algorithm,4 which uses Metropolis sampling to search over structures for graphical models [7]. In 1996, Larrañaga et al. tackled the problem of searching for a Bayesian network structure that maximizes the BDE metric with a hybrid genetic algorithm, given a total order of all variables [6]. This algorithm was later extended to the general case where no ordering between the variables is assumed, using a repair operator to convert cyclic offspring structures into acyclic ones. In 1999, Wong et al. used the MDL metric and evolutionary programming for the optimization in the search process [11]. In 1998, Holmes used genetic operators to inform the proposal distribution of a Metropolis sampling algorithm [4], proposed the Parallel Markov Chain Monte Carlo algorithm, and found that this sampler converged quicker than standard Metropolis sampling. In this section, we describe four different search strategies for discovering causal models: Hill-climbing, Genetic Algorithm, Markov Chain Monte Carlo (MCMC) and Parallel MCMC. Their performance will be compared in Sec. 4.
3.1
Hill-Climbing (HC)
This search method can start with a seed DAG provided by the user, or with a null graph without any edges, and then attempt to add an edge if there is none, or to
2 The search is guided by message length, and we assume that the model with minimum message length is the best model.
3 Random restarting can be integrated to overcome the problem of local optima.
4 MC3 means Markov Chain Monte Carlo Model Composition.
delete or reverse it if there already is one. Such adding, deleting or reversing is done only if the change results in a decrease of the total message length. If the new structure is better, it is kept and another change is tried. This process continues until no better structure is found within a given number of hill-climbing steps, or the search over the whole structure space is completed.
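A minimal sketch of this greedy search follows; edges are (parent, child) pairs and mml(edges) is a caller-supplied message-length function (a hypothetical stand-in for the MML evaluation described in Sec. 2).

```python
import itertools

def is_acyclic(edges, K):
    children = {i: [j for (a, j) in edges if a == i] for i in range(K)}
    done, on_path = set(), set()
    def visit(v):
        if v in on_path:
            return False                      # found a cycle
        if v in done:
            return True
        on_path.add(v)
        ok = all(visit(c) for c in children[v])
        on_path.discard(v)
        done.add(v)
        return ok
    return all(visit(v) for v in range(K))

def hill_climb(mml, K, start=frozenset()):
    """Greedy search over add / delete / reverse edge moves guided by mml."""
    current = frozenset(start)
    best = mml(current)
    improved = True
    while improved:                           # stop when no single change helps
        improved = False
        for i, j in itertools.permutations(range(K), 2):
            for cand in (current | {(i, j)},                   # add edge
                         current - {(i, j)},                   # delete edge
                         (current - {(i, j)}) | {(j, i)}):     # reverse edge
                if cand != current and is_acyclic(cand, K):
                    score = mml(cand)
                    if score < best:
                        current, best, improved = cand, score, True
    return current, best
```
3.2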
Markov Chain Monte Carlo (MCMC)
Given a data set D, the posterior probability p(M|D) of each model M can be directly calculated from Bayes' theorem,

p(M|D) = p(D|M) p(M) / p(D)    (8)
where p(M) is the prior probability of the model M, p(D|M) is the likelihood of the model for the data set D, and p(D) can be thought of as a normalizing constant. If we are interested in finding the model that maximizes the posterior probability, causal discovery can be formulated as an optimization problem that finds the maximum a posteriori probability:

M* = arg max_M p(M|D)    (9)
The MCMC method for solving this problem is to generate samples M from the distribution p(M|D) and select the best. If we could generate independent samples from the target distribution, the Law of Large Numbers would ensure that the approximation can be made as accurate as desired by increasing the sample size. The MCMC method draws samples from the target distribution through a Markov chain having p(M|D) as its stationary distribution. To perform this sampling, we use a version of the Metropolis algorithm. The current model structure is represented by a connectivity matrix, in which a cell C[i][j] having a non-zero value indicates that there exists an edge i → j in the model structure. Sampling from the posterior over model structures proceeds by a Markov process which steps from one DAG to another in such a way that the chance of visiting a DAG is proportional to its posterior probability. The proposal distribution is determined by the following four variation operators:

- Select two distinct variables i and j uniformly from the domain. If there exists an edge between them, attempt to remove it. Otherwise, attempt to add an edge in either direction.
- Select three distinct variables i, j and k uniformly from the domain; if there exists an edge i → j, then remove it and add another edge i → k.
- Select three distinct variables i, j and k uniformly from the domain; if there exists an edge i → j, then remove it and add another edge k → j.
- Check the resulting structure; if it contains a cycle, then randomly select and remove an edge from the cycle, so that the resulting structure is a DAG.
If we assume symmetry in the proposal distribution,5 then the candidate model M' will replace the current model M with the following probability:

α = min{ 1, p(M'|D) / p(M|D) }    (10)
This acceptance rule says that the candidate model is always accepted when its posterior probability is higher than that of the current model. Otherwise, it is accepted according to the ratio of the two probabilities. From formula (8), it can be seen that the ratio of posterior probabilities can be calculated as the ratio of joint probabilities. On the other hand, the theory of MML inference shows that the total message length L(M, D) closely approximates the negative logarithm of the joint probability of model M and data set D:

L(M, D) = − log p(M) − log p(D|M) = − log p(D, M)    (11)
So we have

p(M'|D) / p(M|D) = p(D, M') / p(D, M) = 2^( L(M,D) − L(M',D) )    (12)
Finally, the acceptance probability can be written as

α = min{ 1, 2^( L(M,D) − L(M',D) ) }    (13)
The sampling proceeds until the chain is thought to have converged.
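A minimal sketch of this Metropolis step, using the acceptance rule of Eq. (13), is given below; mml() and propose() are caller-supplied (a message-length evaluator and a proposal operator as described above) and are hypothetical stand-ins for the authors' code.

```python
import random

def metropolis_search(mml, propose, start, iterations=5000, rng=None):
    """Metropolis sampling over DAGs with acceptance min(1, 2^(L(M,D)-L(M',D)))."""
    rng = rng or random.Random(0)
    current, cur_len = start, mml(start)
    best, best_len = current, cur_len
    for _ in range(iterations):
        candidate = propose(current, rng)
        cand_len = mml(candidate)
        accept_prob = min(1.0, 2.0 ** (cur_len - cand_len))
        if rng.random() < accept_prob:
            current, cur_len = candidate, cand_len
        if cur_len < best_len:                 # keep the best model visited
            best, best_len = current, cur_len
    return best, best_len
```
3.3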
Genetic Algorithm (GA)
The third method we considered is a Genetic Algorithm. The chromosome representation of a model structure is a K × K connectivity matrix, in which a cell C[i][j] having a non-zero value indicates that there exists an edge i → j in the model structure. The fitness of a chromosome is defined according to the MML cost of the corresponding model structure: the lower the MML cost, the higher the fitness. With the matrix representation in mind, one crossover operator, three mutation operators, and one repair operator are defined as follows:

Crossover. Binary tournament selection is used to select pairs of structures for crossover: one structure with the higher fitness of two randomly selected structures, and another with the higher fitness of another two randomly selected structures, are selected for crossover. The crossover operator uniformly exchanges parent sets for each variable (a sketch of this operator is given after the operator descriptions below).
5 The algorithm is also referred to as the Metropolis-Hastings algorithm.
Simple Mutate. Randomly select two distinct variables i and j from the domain. If there exists an edge between them, attempt to remove it. Otherwise, attempt to add an edge in either direction.

Parent Shift Mutate. Randomly select an edge and randomly set the end of the edge to another variable.

Child Shift Mutate. Randomly select an edge and randomly set the start of the edge to another variable.

Repair. Illegal structures may be generated by the above operators. The repair operator tries to locate and break cycles in the structure until a directed acyclic graph is obtained.

During the process of evolution, a roulette-wheel selection strategy is adopted to ensure that better structures have a higher probability of being selected. Evolution proceeds until a given termination criterion is satisfied (in the experiments of this paper the evolution process stops when the number of MML calculations reaches a pre-set limit).
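The sketch below illustrates the crossover operator on connectivity matrices: for each variable the child inherits that variable's entire parent set from one of the two parents, chosen uniformly. It is an illustration only; the matrix encoding follows the description above and the resulting child may still need the repair operator.

```python
import random

def crossover(conn_a, conn_b, rng=None):
    """conn_x[i][j] != 0 means an edge i -> j; the parent set of X_j is column j."""
    rng = rng or random.Random(0)
    K = len(conn_a)
    child = [[0] * K for _ in range(K)]
    for j in range(K):                       # pick the whole column j at once
        source = conn_a if rng.random() < 0.5 else conn_b
        for i in range(K):
            child[i][j] = source[i][j]
    return child                             # may contain cycles; repair afterwards
```
3.4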
Parallel MCMC (PMCMC)
In 1998, Holmes and Mallick suggested exchanging information between different samplers as a way to improve mixing in MCMC samplers [4]. They demonstrated this method on a parameter training problem for a neural network and on a knot selection problem, and found that their algorithm can propose large changes without sacrificing acceptance probabilities. The parallel MCMC method uses a population of samplers to estimate features of the target distribution p(M|D) in an attempt to select a proposal distribution as close as possible to the target distribution. In this algorithm, candidate structures are not only generated by the operators defined in Sec. 3.2, but also by a crossover operator as defined in Sec. 3.3. However, instead of using the candidate directly as in a standard Genetic Algorithm, PMCMC uses formula (13) to either accept the candidate or leave the current structure unchanged.
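A single PMCMC update for one member of the population can therefore be sketched as a crossover proposal followed by the Metropolis acceptance of Eq. (13); mml() and propose_crossover() are caller-supplied placeholders in this illustration.

```python
import random

def pmcmc_step(population, lengths, idx, mml, propose_crossover, rng=None):
    """Update member idx: crossover with a random partner, then Eq. (13) acceptance."""
    rng = rng or random.Random(0)
    partner = rng.randrange(len(population))
    candidate = propose_crossover(population[idx], population[partner], rng)
    cand_len = mml(candidate)
    if rng.random() < min(1.0, 2.0 ** (lengths[idx] - cand_len)):
        population[idx], lengths[idx] = candidate, cand_len
```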
4
Empirical Results and Analysis
In this section, we compare the three encoding schemes and evaluate the four different search strategies. Eight Linear Causal Models are used in our experiments: Fiji, Evans, Blau, Rodgers & Maranto, case9, case10, case12 and case15, which have 4, 5, 6, 7, 9, 10, 12 and 15 variables respectively.
4.1
Comparison on Encoding Schemes
In order to compare the computational cost of these encoding schemes, we incorporate them with a hill-climbing search strategy. The CPU time costs of the search process using the different encoding schemes are compared in Table 1. From this table, we can see that scheme 1 is faster than scheme 2, which coincides with previous experimental results reported in [10], and that encoding scheme 3 is the fastest
Table 1. Comparison of Time Cost of Discovery using different Encoding Schemes

Data Set   Scheme 1          Scheme 2          Scheme 3
Fiji         0.84 seconds      0.92 seconds      0.96 seconds
Evans        2.29 seconds      3.01 seconds      2.25 seconds
Blau         5.18 seconds      6.35 seconds      3.42 seconds
Case9       19.23 seconds     23.19 seconds     16.32 seconds
Case10      59.97 seconds     71.19 seconds     20.10 seconds
Case12     126.20 seconds    181.61 seconds     36.20 seconds
Case15          -                 -            265.50 seconds
one among the three different encoding schemes. Due to the computational cost, the search process using encoding scheme 1 or 2 cannot give a result within reasonable time for the data set (case15) which has more than 12 variables; however, the search process using encoding scheme 3 is capable of discovering more complicated models with a larger number of variables. Although the encoding costs of the different schemes differ, the final discovery results are very close for the data sets we tested. As an example, we give the results on data sets Case10 and Case12 obtained by hill-climbing based on the different encoding schemes in Figs. 1 and 2, from which we can see that all these search processes return the same result. This indicates that encoding scheme 3 can improve the efficiency of the discovery process while preserving the discovery accuracy.
4.2
Test Results on Search Strategies
To evaluate the performance of different search strategies, a common representation of the causal structure is used with four different search strategies. All the message lengths of causal structures here are calculated using scheme 3 as described in Sec. 2.
Fig. 1. Comparison of Discovery Result on Case10 using different Schemes (figure: causal graphs for (a) Original, (b) Scheme 1, (c) Scheme 2, (d) Scheme 3)
Table 2. Minimum Message Length Comparison

Data Set   Original     HC        MCMC      GA        PMCMC
Fiji        5495.3    5489.0    5488.6    5489.0    5489.6
Evans       5470.3    5466.7    5462.3    5466.7    5466.7
Blau        7348.2    7352.3    7348.2    7348.1    7351.5
Rodgers     9346.2    9352.6    9344.3    9345.2    9344.3
Case9      10208.1   10208.2   10217.1   10217.1       -
Case10      3302.4    3302.4    3305.0    3302.4    3302.4
Case12      4462.8    4462.8    4470.2    4468.4    4462.8
Case15     17355.7   17371.0   17355.7   17482.4   17355.7
Considering the fact that the most computationally intensive part of the search process is the calculation of the MML cost, we use the number of MML-cost calculations as the basis of the termination rule for all the algorithms. For models with fewer than 8 nodes, the search proceeds until the number of MML calculations reaches 2000; for networks with 9 or more variables, the search proceeds until it reaches 5000. For the genetic algorithm and the parallel MCMC algorithm, we ran a set of exploratory experiments to find a proper set of parameters. In the following experiments, the population size is set to 12, the crossover probability is set to 0.5, and the mutation probability is 0.3. Table 2 shows the message length of the eight original models and the MML length of the corresponding models discovered by the HC, MCMC, GA and PMCMC search strategies. It should be noted that the MML-based procedure is derived from asymptotic approximations; thus for models with few variables (like Fiji and Evans), minimizing MML does not coincide with recovering the original model. For models with 9 or more variables, the MML-based procedure works well. From Table 2, we can see that hill-climbing search can discover 3 of the 8 original models, MCMC and PMCMC can discover 4 of the 8, and GA can find 2 of the 8. Although in some
Fig. 2. Comparison of Discovery Result on Case12 using different Schemes (figure: causal graphs for (a) Original, (b) Scheme 1, (c) Scheme 2, (d) Scheme 3)
Fig. 3. Search Curve on Test Models (figure: convergence curves on the Blau, Rodgers, Case9 and Case12 models)
Fig. 4. Original Model and Search results on Blau model (figure: (a) Original, (b) HC, (c) MCMC, (d) GA, (e) PMCMC)
cases the algorithms do not find the original models, all of them can approximate the original one accurately. Figure 3 illustrates the convergence speed of the four search strategies in discovering causal models from four different data sets. In these figures, the X-axis is the number of MML calculations so far, and the Y-axis is the best MML cost found so far. Although all the search strategies converge towards the original model, we can see that hill-climbing converges faster than the other three. In theory, the main drawback of the hill-climbing method is that its greedy nature makes it easy to get stuck in local optima. More theoretically robust methods like MCMC and GA were proposed as means to overcome this problem. However, from our experiments we can see that within a reasonable number of calculations, MCMC and GA seem to have no apparent advantage over hill-climbing. As for Parallel MCMC, although Holmes' work
Fig. 5. Original Model and Search results on Case9 model (figure: (a) Original, (b) HC, (c) MCMC, (d) GA, (e) PMCMC)
showed that it can converge faster than standard MCMC, this method can be viewed as a Genetic Algorithm using a different selection strategy. So, in order to overcome the curse of dimensionality, standard GA or MCMC do not seem to be a promising direction, at least for causal discovery from complete data sets.
5
Conclusions
To improve the efficiency of the discovery algorithm, this paper examined three different encoding schemes for describing the structure of a Linear Causal Model, and compared four different search strategies. Our empirical results indicated that encoding scheme 3 is an improvement over our previous work in terms of learning efficiency while preserving the discovery accuracy. This paper also reported the comparison results of four search strategies for the discovery of causal models from the model space. The experimental results revealed that more sophisticated search strategies seem to have no apparent advantage over hill-climbing, which works very well for the task of causal discovery from complete data sets. This result is consistent with what we found previously.
References
[1] K. A. Bollen. Structural Equations with Latent Variables. Wiley, New York, 1989.
[2] Honghua Dai, Kevin Korb, Chris Wallace, and Xindong Wu. A study of causal discovery with small samples and weak links. In Proceedings of the 15th International Joint Conference on Artificial Intelligence (IJCAI'97), pages 1304-1309. Morgan Kaufmann Publishers, 1997.
[3] Honghua Dai and Gang Li. An improved approach for the discovery of causal models via MML. In Proceedings of the 6th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD-2002), pages 304-315, Taiwan, 2002.
[4] C. Holmes and B. Mallick. Parallel Markov chain Monte Carlo sampling: an evolutionary based approach. Technical report, Imperial College, London, 1998.
[5] Michael I. Jordan. Learning in Graphical Models. MIT Press, Cambridge, MA, 1st edition, 1998.
[6] P. Larrañaga, M. Poza, Y. Yurramendi, R. H. Murga, and C. M. H. Kuijpers. Structure learning of Bayesian networks by genetic algorithms: A performance analysis of control parameters. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(9):912-926, 1996.
[7] D. Madigan and J. York. Bayesian graphical models for discrete data. Technical Report TR-259, University of Washington, Department of Statistics, Seattle, WA, 1993.
[8] Judea Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, San Mateo, California, 1988.
[9] Chris Wallace and D. M. Boulton. An information measure for classification. Computer Journal, 11(2):185-194, 1968.
[10] Chris Wallace, Kevin B. Korb, and Honghua Dai. Causal discovery via MML. In Proceedings of the 13th International Conference on Machine Learning (ICML'96), pages 516-524, San Francisco, 1996. Morgan Kaufmann Publishers.
[11] W. L. Wong, W. Lam, and K. S. Leung. Using evolutionary computation and minimum description length principle for data mining of probabilistic knowledge. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(2):174-178, 1999.
[12] Sewall Wright. The method of path coefficients. Annals of Mathematical Statistics, 5:161-215, 1934.
Variance Optimized Bagging Philip Derbeko1 , Ran El-Yaniv1 , and Ron Meir2, 1 Computer Science Department Technion - Israel Institute of Technology, Haifa 32000, Israel {philip,rani}@cs.technion.ac.il 2 Electrical Engineering Department Technion - Israel Institute of Technology, Haifa 32000, Israel
[email protected]
Abstract. We propose and study a new technique for aggregating an ensemble of bootstrapped classifiers. In this method we seek a linear combination of the base-classifiers such that the weights are optimized to reduce variance. Minimum variance combinations are computed using quadratic programming. This optimization technique is borrowed from Mathematical Finance where it is called Markowitz Mean-Variance Portfolio Optimization. We test the new method on a number of binary classification problems from the UCI repository using a Support Vector Machine (SVM) as the base-classifier learning algorithm. Our results indicate that the proposed technique can consistently outperform Bagging and can dramatically improve the SVM performance even in cases where the Bagging fails to improve the base-classifier.
1
Introduction
This paper is concerned with Bagging (Bootstrap Aggregation) of classifiers. Bagging works by applying a learning algorithm to a number of bootstrap samples of the training set. Each of these applications yields a classifier. The resulting pool of classifiers is combined by taking a uniform linear combination of all the constructed classifiers. This way a new (test) point is classified by the "master" classifier by taking a majority vote among the classifiers in the pool. Since its introduction in [3], Bagging has attracted considerable attention, and together with Boosting it is considered to be among the most popular techniques for constructing and aggregating an ensemble of classifiers. A number of theoretical and experimental studies attribute the success of Bagging to its ability to reduce variance; see e.g. [1], [10] and [4]. We ask and attempt to answer the following question: Is it possible to improve the performance of Bagging by optimizing the combined classifier over all weighted linear combinations so as to reduce variance? In the context of regression such a scheme is particularly appealing since the bias of a normalized weighted combination is unchanged if the original biases are all the same.
The research was supported by the fund for promotion of research at the Technion and by the Ollendorff center.
Although this result is not directly related to classification it may be suggestive, and if variance reduction of the base-classifier is one of the main effects of Bagging, one can expect that this question should be answered affirmatively. Indeed, we provide strong evidence that a Variance Optimized Bagging, which we term Vogging for short, consistently improves on Bagging. The main ideas behind the new technique are borrowed from Mathematical Finance. Specifically, we import the basic ideas of Markowitz Mean-Variance Portfolio Theory [16,17], which is used for generating low-variance portfolios of financial assets, and use it in our context to construct optimized "portfolios" of bootstrapped classifiers.1 This paper is organized as follows. In Section 2 we briefly overview the basic ideas of Markowitz portfolio theory. We then use these ideas and introduce the new Vogging technique in Section 3. In Section 4 we discuss our experimental design, and we present our results in Section 5. Related work is discussed in Section 6 and finally, in Section 7, we summarize our conclusions and suggest directions for further research.
2
Markowitz Mean-Variance Portfolio Optimization
In this section we provide a brief overview of the main ideas of the Markowitz Single-Period Mean-Variance Portfolio optimization technique. These ideas set the path for a most influential theory in mathematical finance. They will later be utilized in our new classifier aggregation technique. The single-period Markowitz algorithm solves the following problem. We consider m assets (e.g. stocks, bonds, etc.) S_1, ..., S_m. We are given: (i) a predicted expected monetary return r_i for each asset S_i; (ii) a predicted standard deviation σ_i of the return of S_i; and (iii) the m × m covariance matrix Q with Q_ii = σ_i^2 and Q_ij = ρ_ij σ_i σ_j, where ρ_ij is the correlation coefficient between the returns of S_i and S_j. A portfolio is a linear combination of assets. It is given by a vector w of m weights, w = (w_1, ..., w_m), with Σ_i w_i = 1. The expected return of a portfolio w is Σ_i w_i r_i. The risk of a portfolio is traditionally measured by its variance σ^2(w),

σ^2(w) = Σ_{i,j} w_i w_j Q_ij = w^t Q w.
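These two quantities are easy to compute; the short sketch below illustrates them (it assumes numpy and uses made-up numbers purely for illustration).

```python
import numpy as np

def portfolio_return(w, r):
    return float(np.dot(w, r))               # sum_i w_i r_i

def portfolio_risk(w, Q):
    return float(w @ Q @ w)                  # w^t Q w, the portfolio variance

w = np.array([0.5, 0.3, 0.2])                # weights summing to one
r = np.array([0.08, 0.05, 0.03])             # predicted expected returns
Q = np.array([[0.04, 0.01, 0.00],
              [0.01, 0.02, 0.00],
              [0.00, 0.00, 0.01]])           # covariance matrix of returns
print(portfolio_return(w, r), portfolio_risk(w, Q))
```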
It is assumed that investors are interested in portfolios that yield high returns but are averse to large variance. The exact risk aversion pattern of an investor is modeled via a utility function. Nevertheless, an empirical fact (which is backed up by economic theories) is that the return of an asset typically trades off against its variance; that is, assets with large average return tend to exhibit large variance and vice versa.
Thirty-eight years after Markowitz published his paper “Portfolio Selection” [16] he shared a Nobel Prize with Miller and Sharpe for his study that has become a well established theory of portfolio selection.
The output of the Markowitz algorithm is a set of portfolios with expected return greater than any other with the same or lesser risk, and lesser risk than any other with the same or greater return. This set is called the efficient frontier. The efficient frontier is conventionally plotted as a curve with the standard deviation (risk) on the horizontal axis, and the expected return on the vertical axis. An efficient frontier illustration is given in Figure 1.2 A useful feature of the single-period mean-variance portfolio problem is that it is soluble using quadratic programming. Using the efficient frontier, an investor seeking to invest in an "optimal" portfolio should choose one that lies on the frontier curve. The exact portfolio will be chosen using his/her personal utility function. A particular "off-the-shelf" recommended utility function was proposed by Sharpe and is called the Sharpe Ratio [22]. Sharpe's ratio is a risk-adjusted measure of return that divides a portfolio's return in excess of the riskless return by the portfolio's standard deviation. Specifically, let R0 be the return of a risk-free asset (i.e. cash or treasury bills) and let (R(w), σ(w)) be the return and risk pair of a portfolio w. Then, the Sharpe ratio of w is

Sharpe(w) = ( R(w) − R0 ) / σ(w).    (1)

3
Bagging and Vogging
Let H be a binary hypothesis class of functions from the input space X to {±1} and let S = (x_1, y_1), ..., (x_n, y_n) be a training sample, where x_i ∈ X and y_i ∈ {±1}. Bagging works as follows. We generate T bootstrap samples B_1, ..., B_T from S. Each bootstrap sample is generated by sampling with replacement n points from S. We train T classifiers h_j ∈ H, j = 1, ..., T, such that h_j is trained using the sample B_j. Given a new point x ∈ X we predict that its label is sign( (1/T) Σ_j h_j(x) ). Thus, the aggregated classifier is simply a threshold applied to a uniform average of the base classifiers h_j.

Fig. 1. Efficient Frontier illustration

The idea in Vogging (Variance Optimized Bagging) is to optimize a linear combination of classifiers so as to aggressively reduce variance while at-
2 A nice Java applet computing the efficient frontier of some familiar assets can be downloaded at http://www.duke.edu/~charvey/applets/EfficientFrontier/frontier.html.
tempting to preserve a prescribed accuracy. As in ordinary Bagging we generate T bootstrap samples from the training set and induce T hypotheses from H using these samples. Here again, let h_j denote the resulting hypotheses. Let A_j(B_i) denote the empirical accuracy achieved by h_j on the sample B_i. Let Ā_j = (1/(T−1)) Σ_{i≠j} A_j(B_i) be the average empirical accuracy over all the other bootstrap samples. Since each bootstrap sample B_j only contains a fraction of the data (on average, approximately 63%), we can view Ā_j as a proxy for an unbiased estimation of the error (more sophisticated "out-of-bag" methods can be considered, as discussed in Section 7). Consider the (column) vectors A_j = (A_j(B_1), ..., A_j(B_T)), j = 1, ..., T, and let Ā be their average, Ā = (1/T) Σ_{j=1}^{T} A_j. Let Q be the empirical covariance matrix of these vectors,

Q = (1/(T−1)) Σ_{j=1}^{T} (A_j − Ā)(A_j − Ā)^t    (2)
Using the empirical accuracies and the covariance matrix Q we now employ the Markowitz algorithm to estimate the efficient frontier of combined minimum-variance "portfolios" of base-classifiers and use the classifier with the highest Sharpe ratio (see below). Specifically, we estimate the dynamic range of achievable accuracies using the end points min_j Ā_j and max_j Ā_j and take k uniformly spread points, a_1, ..., a_k, in this interval. Each a_i is an empirical accuracy achievable by some linear combination of classifiers. Using the a_i's we interpolate the efficient frontier as follows. For each a ∈ {a_i} we solve the following quadratic program (QP) with linear constraints:

minimize (over w):  (1/2) w^t Q w                               (3)
subject to:         (Ā_1, ..., Ā_T)^t w ≥ a
                    Σ_j w_j = 1,  w ≥ 0.
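A minimal sketch of solving one such QP instance is given below; it assumes numpy and scipy and uses a general-purpose SLSQP solver purely for illustration (any dedicated QP solver would serve equally well).

```python
import numpy as np
from scipy.optimize import minimize

def min_variance_weights(Q, A_bar, a):
    """Minimize (1/2) w^T Q w s.t. A_bar . w >= a, sum(w) = 1, w >= 0."""
    T = len(A_bar)
    w0 = np.full(T, 1.0 / T)                         # start from uniform Bagging weights
    constraints = [
        {"type": "ineq", "fun": lambda w: np.dot(A_bar, w) - a},  # accuracy bound
        {"type": "eq",   "fun": lambda w: np.sum(w) - 1.0},       # weights sum to one
    ]
    res = minimize(lambda w: 0.5 * w @ Q @ w, w0, method="SLSQP",
                   bounds=[(0.0, 1.0)] * T, constraints=constraints)
    return res.x if res.success else None            # None if the constraint set is empty
```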
That is, by solving QP, we attempt to minimize variance while keeping the accuracy sufficiently large.

Remark 1. The solution of QP with a lower bound accuracy constraint a, if it exists, is a weight vector w that corresponds to a mean accuracy and variance pair (a', σ^2), and a' may be larger than a.

In the Markowitz-Sharpe framework, in order to compute the weighted combination with the largest Sharpe ratio we need to use the return of a "riskless" asset (see Eq. (1)). The best analogy in our context is the expected accuracy of the trivial classifier that always predicts according to the label of the largest class in the training set. We call this classifier the baseline classifier. In Figure 2 we provide pseudo-code of the Vogging learning algorithm. The output of the algorithm is a single classifier based on the weighted combination that achieved the highest Sharpe ratio. We call this classifier the Sharpe Classifier. The motivation for using the Sharpe classifier is purely heuristic. While we would like
Input:
1. T (number of bagged classifiers)
2. k (number of efficient frontier points)
3. S = (x_1, y_1), ..., (x_n, y_n) (training set)
4. H (base classifier hypothesis class)

Training:
1. Generate T bootstrap samples B_1, ..., B_T from S
2. Train T classifiers h_1, ..., h_T such that h_j ∈ H is trained over B_j
3. Evaluate Ā_j for all j = 1, ..., T; evaluate Q (see Eq. (2))
4. Choose k uniformly spread points a_1, ..., a_k in [min_j Ā_j, max_j Ā_j]
5. Solve k instances of QP (Eq. (3)) with the accuracy constraints a_1, ..., a_k. For i = 1, ..., k, let w_i and (a_i, σ_i) be the resulting weight vector and mean-variance pair corresponding to a_i
6. Let p_0 be the proportion of the larger class in S

Output: "Vogging weight vector" w_{i*} with i* = arg max_i (a_i − p_0) / σ_i

Fig. 2. Pseudo-code for Vogging learning algorithm
to use a classifier with a small risk, this would make little sense if the variance of the classifier is very large. In order to reach a compromise between the risk and variance, we select a classifier with a small risk, subject to a constraint that its standard deviation is not too large. Eq. (1) provides an approximate implementation of this idea. Note also that a similar type of argument is used in the construction of the classic Fisher discriminant function.
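The final selection step of Figure 2 is simple enough to spell out; the sketch below assumes the frontier has already been computed as (accuracy, standard deviation, weight vector) triples and uses toy values for illustration only.

```python
def sharpe_classifier(frontier, p0):
    """Pick the frontier solution maximizing the Sharpe ratio (a_i - p0) / sigma_i.

    frontier: list of (accuracy, std_dev, weights) triples; p0 is the baseline
    accuracy (proportion of the largest class in the training set).
    """
    return max(frontier, key=lambda t: (t[0] - p0) / t[1])[2]

# Toy example: the low-variance point wins despite a slightly lower accuracy.
frontier = [(0.90, 0.05, "w1"), (0.93, 0.12, "w2"), (0.88, 0.02, "w3")]
print(sharpe_classifier(frontier, p0=0.61))   # -> "w3"
```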
4
Experimental Design
Our main goal in these experiments is to analyze and better understand the new Vogging technique and compare its performance to ordinary Bagging. Many previous studies of Bagging used as their base-classifiers inductive learning algorithms such as decision trees, neural networks and naive Bayes; see e.g. [3], [19], [14], [7], [19], [1] and [5]. As argued by [3] and [4], Bagging becomes effective when the base-classifier is unstable; intuitively this means that its decision boundaries significantly vary with perturbations of the training set. We chose to use a Support Vector Machine (SVM) as our base classifier; see [21] and [23]. While SVMs are considered to be rather stable classifiers, even an SVM classifier exhibits instabilities when trained with small samples, especially when polynomial kernels are used. Since our main focus here is on situations where only a small amount of data is available, in all experiments described below we always used 30% of the available labeled samples to train our classifiers while leaving the rest of the data for testing. While Support Vector Machines have been widely used for many problems, and shown to yield state-of-the-art results, their behavior for particularly small data sets has not
been thoroughly investigated. In ongoing work we are looking at other classifiers (including decision trees and neural networks). It should be emphasized that for small data sets variance is known to be a major problem, and thus we expect that variance reduction techniques should be particularly useful in this case. In fact, several exact calculations support this observation [18,13], stressing the particular advantage of sub-sampling. The new algorithm was tested on a number of datasets from the UCI repository [2]. Table 1 provides some essential properties of the datasets used. Note that the baseline (i.e. the proportion of the largest class in the training set) of each dataset is used for computing the Sharpe ratio (Step 6 in the algorithm pseudo-code). In each experiment we used 10-fold cross-validation. Each fold consisted of a 30%-70% random partition where the 30% portion was used for training. The remaining 70% was used solely for testing, and was in no way accessible to the learning algorithm. For example, while the Ion dataset contains 351 labeled instances, in each of our folds we only used 105 labeled instances for training. Following [3] we generated, in most cases, T = 50 bootstrap samples (and 50 base-classifiers) from each training set. Due to the computational intensity, for the larger sets we generated 25 bootstrap samples.3 In all experiments we used a polynomial kernel SVM with degree 20. The polynomial kernel is particularly convenient to use in our method due to its relative instability compared to other popular kernels such as RBF and linear. In most of the experiments we report on the performance of the following classifiers: (i) the Vogging classifier; (ii) the Bagging classifier; (iii) the "full-set" base-classifier, which is trained over the entire training set.
5 Results
As an illustration of the proposed algorithm we first present one experiment in some detail. In Figure 3 we depict the training results of a single fold of the Vogging algorithm on the Voting dataset (130 training examples). The figure shows the mean-variance points corresponding to the observed accuracy and variance (estimated on the training data) of 50 base-classifiers.
Table 1. Some essential details of the datasets used. The "Baseline" attribute is the trivial accuracy that can be achieved (proportion of the largest class in the training set)

    Dataset      Instances (training set size)  Attributes  Baseline
    Voting       435 (130)                      16          0.61
    Diabetes     768 (230)                       8          0.65
    Ion          351 (105)                      34          0.64
    Sonar        208 (62)                       60          0.53
    Breast       683 (204)                      10          0.65
    WDBC         569 (170)                      30          0.62
    Credit-G     653 (195)                      15          0.54
    Tic-Tac-Toe  958 (287)                       9          0.65
Table 2. 10-fold cross-validated mean/std error performance comparison between Vogging, Bagging, the full-sample classifier, and the Vogging advantage over Bagging; see also Figure 5

    Dataset (training set size)  Vogging      Bagging       Full-sample base-classifier  Vogging advantage
    Voting (130)                 13.11±4.12   23.90±11.31   37.22±12.18                  45.15%
    Diabetes (230)               33.46±1.46   35.55±1.10    42.24±14.14                   5.88%
    Ion (105)                    15.89±2.37   29.51±10.21   32.64±15.31                  46.16%
    Sonar (62)                   38.36±4.42   45.96±7.00    *40.71±5.67                  16.53%
    Breast (204)                  4.97±1.72   19.12±4.04    23.68±11.61                  74.00%
    WDBC (170)                   22.76±8.56   26.12±6.67    36.77±17.05                  12.86%
    Credit-G (195)               40.41±5.35   46.22±1.00    48.41±8.29                   12.57%
    Tic-Tac-Toe (287)            32.33±3.37   36.30±7.27    51.77±10.98                  10.93%
    Average                      25.16±3.92   32.84±6.08    39.18±11.9                   28.01%
On the left part of the figure we see the efficient frontier and the Sharpe classifier (on the frontier). As can be seen, the top composite classifiers on the efficient frontier achieve somewhat smaller training accuracy than the best base-classifiers in the pool, but the composite classifiers show a noticeable reduction in standard deviation. Unlike financial assets, which usually exhibit a trade-off between return and variance (see illustration in Figure 1), the training performance of the base-classifiers does not exhibit this trade-off and the better (high accuracy) classifiers also have smaller variance. In Figure 4 we see the final 10-fold cross-validation average accuracy of Vogging and Bagging on the Voting dataset. On the top left corner we depict the average accuracy and standard deviation of the Vogging classifier. To the right of the Vogging classifier we see the Bagging classifier. The layered cloud of circles that fill the bulk of the figure is the test performance of all the 50 × 10 base-classifiers that were generated during the entire 10-fold experiment. Evidently, the Vogging classifier is significantly better both in terms of accuracy and variance. We should emphasize that the tiny circles (depicting base-classifiers) do not represent cross-validated performance. Interestingly, a large fraction of the base-classifiers converged to the baseline classifier performance (61%), while another subset reduced to the counter-baseline classifier (39%). It is interesting to examine the components of the aggregated classifiers on the frontier. In Figure 3 we identify the 9 largest components of the Sharpe classifier. These components are numbered 1-9 in the figure (in decreasing order of their weights in the composite classifier). Although the weights are diversified among more classifiers, these 9 classifiers hold most of the weight. It is evident that lower accuracy base-classifiers are included with large weights (e.g. classifier 2 has a weight of 0.15).
We could generate a (training set) behavior more similar to this financial assets’ pattern by aggressively over-fitting the base-classifiers.
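To make the selection step concrete, the following simplified Python sketch scores a given set of candidate weight vectors by their Sharpe ratio, (mean accuracy − baseline)/standard deviation, and keeps the best one. It is only an approximation of the paper's Step 6: the candidate weightings (e.g., points on the efficient frontier) are assumed to be supplied externally, the composite accuracy is treated as linear in the weights, and all names are ours.

    # Simplified sketch (assumptions, not the authors' exact procedure): pick the
    # maximum-Sharpe weighting among candidate weight vectors, given the 0/1
    # correctness indicators of the base-classifiers on the training set.
    import numpy as np

    def sharpe_select(correct, candidate_weights, baseline):
        # correct: (n_samples, n_classifiers) 0/1 matrix of per-classifier hits
        mu = correct.mean(axis=0)                   # per-classifier accuracy
        cov = np.cov(correct, rowvar=False)         # covariance of the indicators
        best_w, best_sharpe = None, -np.inf
        for w in candidate_weights:
            w = np.asarray(w, dtype=float)
            acc = w @ mu                            # composite accuracy (linear approx.)
            std = np.sqrt(max(w @ cov @ w, 1e-12))  # composite standard deviation
            sharpe = (acc - baseline) / std
            if sharpe > best_sharpe:
                best_w, best_sharpe = w, sharpe
        return best_w, best_sharpe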
In Figure 5 we see a 10-fold cross-validated performance comparison on 8 datasets between Vogging, Bagging and the full-sample classifier. These results and the relative advantage of Vogging over Bagging are numerically presented in Table 2. In all these results the base-classifier is a degree 20 polynomial kernel SVM. In Table 2 the last column summarizes the relative error improvement of Vogging over Bagging, given by (Err(Bagging) − Err(Vogging)) / Err(Bagging). The asterisk in one entry in the second-to-last column corresponds to a case where Bagging could not improve on the full-sample base-classifier. Note that in all cases Vogging outperformed both the Bagging and full-sample classifiers. We note the following. First, it is striking that Vogging achieves higher accuracy than Bagging using polynomial kernel SVMs. Overall we see an average error improvement (over these datasets) of 28% over Bagging (calculated over polynomial kernels only). In most cases the standard deviations exhibited by Vogging were significantly smaller than those of Bagging (and the other base-classifiers). Overall, we see a 35% average variance reduction improvement over Bagging. The absolute errors reported here are not directly comparable to other published results on Bagging performance on the same datasets in [19,14,5], which used much larger training set sizes (e.g. most of these studies used 90% of the data for training). In general, our absolute errors are larger than those reported in these studies. In the full version of the paper we will include a comparison with other known algorithms.
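For example, the relative improvement for the Voting row of Table 2 works out as follows (a plain arithmetic check, not part of the original experiments):

    # Quick check of the relative-improvement formula on the Voting row of Table 2.
    err_bagging, err_vogging = 23.90, 13.11
    advantage = (err_bagging - err_vogging) / err_bagging
    print(f"{advantage:.2%}")   # -> 45.15%, matching the last column of Table 2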
6 Related Work
Bagging falls within the sub-domain of "ensemble methods". This is an extensive domain whose coverage is clearly beyond the scope of this paper; see e.g. [6,12] and references therein. Bagging [3] and Boosting [9] are among the most popular re-sampling ensemble methods that generate and combine a diversity of classifiers using the same learning algorithm for the base-classifiers. Classical boosting algorithms like AdaBoost are considered stronger than bagging on noise-free data. They also lie on more solid theoretical grounds in the form of convergence, consistency and generalization results, and have many more extensions and variations than bagging (see the Boosting web site at http://www.boosting.org/). However, there are strong empirical indications that Bagging is much more robust than boosting in noisy settings; see e.g. [5]. An intuitive explanation for this resilience of bagging, suggested by Dietterich, is that the bootstrap samples avoid a large fraction of the noise while still generating a diverse pool of classifiers. We note that new regularized and improved boosting algorithms that can resist noise were recently proposed. To date there have not been many theoretical analyses of Bagging. A recent paper [4] proves that bagging with a non-stable base-classifier (where
Fig. 3. Voting dataset, estimated training accuracy and standard deviation of 50 base-classifiers (SVM, polynomial kernel of degree 20 with C = 100); the resulting (interpolated) efficient frontier is marked; the Sharpe classifier (on the frontier) is indicated by an arrow; the inner graph shows the Sharpe ratios of the various classifiers on the frontier (in the order of their appearance on the frontier)
stability is defined in an asymptotic sense similar to statistical consistency) will reduce variance. This paper also analyzes a sub-sampling variant of bagging called sub-agging. Other interesting discussions and analyses of bagging can be found in [8], [1] and [10]. The paper [11] (the exact date of this unpublished manuscript is unknown to the authors; we estimate it to be 2000) proposes a heuristic method for generating a weighted average of bagged classifiers. The proposed weighting for a base-classifier is a function of its relative accuracy advantage over the other classifiers, where these quantities are estimated over "out-of-bag" training samples. According to this paper, in a comparative empirical evaluation using ID3 decision tree learning as the base-classifier, this weighted bagging technique outperformed ordinary bagging and boosting on the majority of UCI datasets that were examined. The idea of employing Markowitz portfolio optimization for ensemble construction was proposed by [15] as a method for avoiding standard parameter tuning in neural networks (e.g., based on cross-validation). Specifically, Mani proposed to train a pool of neural nets with a diversity of parameters and then combine them using a Markowitz-optimized linear combination. He did not address the issue of choosing a weighted combination from the efficient frontier. As
Fig. 4. Voting dataset, 10-fold test accuracy (and standard deviation represented via error bars) of the Vogging classifier, Bagging and 50 × 10 base-classifiers. Each of these base-classifier accuracies is not cross-validated and appeared at a single fold. The base-classifiers are grouped by the folds; each vertical strip contains the base-classifiers of a single fold
far as we know, this idea was never tested. Along these lines, in the context of regression, [20] propose to consider a pool of predictors with a diversity of parameter values so as to span a reasonable range. They then propose to use an "out-of-bootstrap" sampling technique to estimate least-squares regression weights of members of the pool.
7 Conclusions and Open Directions
In this paper we proposed a novel and natural extension of Bagging that optimizes the weights in linear combinations of classifiers obtained by Bootstrap sampling. The motivation and main ideas of this new weighted bootstrap technique are borrowed from mathematical finance where it is used to optimize financial portfolios. The proposed algorithm attempts to aggressively reduce the variance of the combined estimator, while trying to retain high accuracy. We presented the results of a number of experiments with UCI repository datasets. Using an SVM as the base-classifier we focused on situations where the training sample size is relatively small. Such cases are of particular interest in practical applications. Our results indicate that the new technique can dramatically improve the (out of sample) test accuracy and variance of the base-classifier and of Bagging. Although these results are striking, due to the moderate scope of our experimental study we view them only as a proof of concept for the proposed method.
Fig. 5. 10-fold cross-validation test results on 8 datasets using a degree 20 polynomial kernel SVM base-classifier. Each dataset entry consists of 4 mean/std error bars corresponding (from left to right) to Vogging, Bagging and the full-set (training set) classifier. Dataset names appear to the left of their error bars. A numerical summary of these results appears in Table 2.

We concentrated on small training sample scenarios where the effects of estimation variance are particularly harmful to classification accuracy. It appears that the utilization of bootstrap samples allowed us to obtain reliable estimates of the variance (and covariance), quantities which cannot be reliably estimated from the same set of points used to train the classifiers. More sophisticated estimation techniques can possibly improve the estimation accuracy and the algorithmic efficiency. For instance, techniques similar to those used by [20] and [11] can potentially improve the sampling component of our algorithm. To the best of our knowledge the above results are the first reported experimental evidence of a successful use of an SVM as the base-classifier in Bagging. Instability of the base-classifier learning algorithm is a major factor in the ability to generate diversity in the form of anti-correlations between the various base-classifiers in the pool, which is the key to variance reduction. Therefore, one can expect that the relative advantage of our technique will increase if it is used with more unstable base-classifiers such as decision trees and neural networks. We plan to investigate this direction.
References

1. E. Bauer and R. Kohavi. An empirical comparison of voting classification algorithms: Bagging, boosting, and variants. Machine Learning, 36(1-2):105-139, 1999.
2. C. L. Blake and C. J. Merz. UCI repository of machine learning databases, 1998. URL: http://www.ics.uci.edu/~mlearn/MLRepository.html.
3. L. Breiman. Bagging predictors. Machine Learning, 24(2):123-140, 1996.
4. P. Bühlmann and B. Yu. Analyzing bagging. Annals of Statistics, 2001, in print.
5. T. G. Dietterich. An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization. Machine Learning, 40(2):139-157, 2000.
6. T. G. Dietterich. Ensemble Methods in Machine Learning, pages 1-15. MIT Press, 2nd edition, 2001.
7. P. Domingos. Knowledge acquisition from examples via multiple models. In Proc. 14th International Conference on Machine Learning, pages 98-106. Morgan Kaufmann, 1997.
8. P. Domingos. Why does bagging work? A Bayesian account and its implications. In D. Heckerman, H. Mannila, D. Pregibon, and R. Uthurusamy, editors, Proceedings of the Third International Conference on Knowledge Discovery and Data Mining, pages 155-158. AAAI Press, 1997.
9. Y. Freund and R. E. Schapire. Experiments with a new boosting algorithm. In International Conference on Machine Learning, pages 148-156, 1996.
10. J. Friedman and P. Hall. On bagging and nonlinear estimation, 2000. Preprint. URL: http://www-stat.stanford.edu/~jhf/ftp/bag.ps.
11. D. Grossman and T. Williams. Machine learning ensembles: An empirical study and novel approach. Unpublished manuscript, 2000. URL: http://www.cs.washington.edu/homes/~grossman/projects/573projects/learning.
12. S. Hashem, B. Schmeiser, and Y. Yih. Optimal linear combinations of neural networks: An overview. In 1994 IEEE International Conference on Neural Networks, 1994.
13. A. Krogh and P. Sollich. Statistical mechanics of ensemble learning. Physical Review E, 55(1):811-825, 1997.
14. R. Maclin and D. Opitz. An empirical evaluation of bagging and boosting. In The Fourteenth National Conference on Artificial Intelligence, pages 546-551. AAAI/IAAI, 1997.
15. G. Mani. Lowering variance of decisions by using artificial neural network portfolios. Neural Computation, 3(4):483-486, 1991.
16. H. Markowitz. Portfolio selection. Journal of Finance, 7:77-91, 1952.
17. H. Markowitz. Portfolio Selection: Efficient Diversification of Investments. New Haven: Yale University Press, 1959.
18. R. Meir. Bias, variance and the combination of least-squares estimators. In Advances in Neural Information Processing Systems 7, pages 295-302. Morgan Kaufmann, San Francisco, CA, 1994.
19. J. Quinlan. Bagging, boosting and C4.5. In Proceedings of the 13th Conference on AI, pages 725-730. MIT Press, 1996.
20. J. Rao and R. Tibshirani. The out-of-bootstrap method for model averaging and selection. Technical report, Statistics Department, Stanford University, 1997. URL: http://www-stat.stanford.edu/~tibs/ftp/outofbootstrap.ps.
21. B. Schölkopf and A. Smola. Learning with Kernels. MIT Press, 2002.
22. W. F. Sharpe. Adjusting for risk in portfolio performance measurement. Journal of Portfolio Management, Winter:29-34, 1975.
23. V. N. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, 1995.
How to Make AdaBoost.M1 Work for Weak Base Classifiers by Changing Only One Line of the Code

Günther Eibl and Karl Peter Pfeiffer

Institute of Biostatistics, Innsbruck, Austria
[email protected]
Abstract. If one has a multiclass classification problem and wants to boost a multiclass base classifier, AdaBoost.M1 is a well-known and widely applied boosting algorithm. However, AdaBoost.M1 does not work if the base classifier is too weak. We show that with a modification of only one line of AdaBoost.M1 one can make it usable for weak base classifiers, too. The resulting classifier, AdaBoost.M1W, is guaranteed to minimize an upper bound for a performance measure, called the guessing error, as long as the base classifier is better than random guessing. The usability of AdaBoost.M1W could be clearly demonstrated experimentally.
1 Introduction
A weak classifier is a map h : X → G (with G = {1, . . . , |G|}), which assigns an object with measurements x ∈ X to one of |G| prespecified groups with a high error rate. The task of a boosting algorithm is to turn a weak classifier into a strong classifier that has a low error rate. To simplify notation we define that arg max_{g∈G} u(g) is the group g which maximizes the function u.
Most papers about boosting theory consider two-class classification problems (|G| = 2). Multiclass problems can then be reduced to two-class problems using, for example, error-correcting codes [1,2,4,5]. However, if one has a multiclass problem and also a base classifier for multiclass problems, like decision trees, one would prefer a more direct boosting method. Freund and Schapire [3] proposed the algorithm AdaBoost.M1 (Fig. 1), which is a straightforward generalization of two-group AdaBoost to the multiclass problem using multiclass base classifiers. One of the main ideas of the algorithm is to maintain a distribution D of weights over the learning set L = {(x_1, g_1), . . . , (x_N, g_N); x_i ∈ X, g_i ∈ G}. The weight of this distribution on training example i on round t is denoted by D_t(i). On each round the weights of incorrectly classified examples are increased so that the weak learner h is forced to focus on the "hard" examples in the training set. The goal of the weak learner is to find a hypothesis h_t appropriate for the distribution D_t. The goodness of h_t
is measured by its weighted error rate

    ε_t = Σ_i D_t(i) I(h_t(x_i) ≠ g_i)
with I denoting the indicator function. In practice a subset of the training examples is sampled according to D_t, and these (unweighted) resampled examples are used to train the weak learner. After h_t has been received, AdaBoost.M1 chooses α_t, which measures the importance assigned to h_t. The sampling distribution D_t is next updated, where the weights of examples misclassified by h_t are increased and the weights of correctly classified examples are decreased. Thus, the weight tends to concentrate on the "hard" examples. The final hypothesis is a weighted majority vote of T weak hypotheses where α_t is the weight assigned to h_t. The most important property of AdaBoost.M1 concerns its ability to reduce the training error. An exponential decrease of an upper bound of the training error rate is guaranteed as long as the error rates of the base classifiers are less than 1/2. This also leads to the criterion to stop if ε_t is greater than or equal to 1/2. However, for more than two groups the condition that the error rate of the base classifier is less than 1/2 can be too restrictive if one uses weak base classifiers such as decision stumps.
    Input: learning set L = {(x_1, g_1), . . . , (x_N, g_N); x_i ∈ X, g_i ∈ G}, G = {1, . . . , |G|},
           classifier of the form h : X → G, T: number of boosting rounds.
    Initialization: D_1(i) = 1/N.
    For t = 1, . . . , T:
      – Train the weak classifier h_t with distribution D_t, where h_t should minimize the
        weighted error rate
            ε_t = Σ_i D_t(i) I(h_t(x_i) ≠ g_i) .
      – If ε_t ≥ 1/2: go to output with T := t − 1.
      – Set α_t = ln((1 − ε_t)/ε_t).
      – Update D:
            D_{t+1}(i) = D_t(i) exp(−α_t I(h_t(x_i) = g_i)) / Z_t
        where Z_t is a normalization factor (chosen so that D_{t+1} is a distribution).
    Output: Set the final classifier H(x):
        H(x) = arg max_{g∈G} f(x, g) = arg max_{g∈G} Σ_{t=1}^{T} α_t I(h_t(x) = g) .

Fig. 1. Algorithm AdaBoost.M1
Freund and Schapire [3] overcame this problem with the introduction of the pseudo-loss of a confidence-rated classifier h : X × G → [0, 1],

    pseudo-loss(h) = (1/2) [ 1 − h(x_i, g_i) + (1/(|G| − 1)) Σ_{g≠g_i} h(x_i, g) ] .
In the algorithm AdaBoost.M2 each base classifier has to minimize the pseudo-loss instead of the error rate. As long as the pseudo-loss is less than 1/2, which is a very weak condition, an exponential decrease of an upper bound of the training error rate is guaranteed. The drawback of this approach lies in the need to redesign the base classifier in order to give confidences and to minimize the pseudo-loss. In this paper we make a simple modification in AdaBoost.M1 in order to make it applicable for weak base classifiers. We call this algorithm AdaBoost.M1W. The modification consists of a change in only one line of the code, which concerns the definition of α_t(ε_t). First we give an ad-hoc derivation of the algorithm. Then we derive a theorem which states that an upper bound for a performance measure, called the guessing error, is guaranteed to be minimized as long as the base classifier is better than random guessing. Finally, we performed experiments with multiclass datasets to see if the algorithm also works in practice. We can clearly show that this is the case, because for all 8 datasets where the base classifier is too weak for AdaBoost.M1, AdaBoost.M1W still works.
2 Ad-hoc Derivation of AdaBoost.M1W
When we analyzed AdaBoost.M1 we wondered why AdaBoost.M1 does not work with weak base classifiers and whether it is possible to modify AdaBoost.M1 to make it work with weak base classifiers. We start by looking at the combination step

    H(x) = arg max_{g∈G} Σ_{t=1}^{T} α_t I(h_t(x) = g) .
There each base classifier h_t gives a vote for the group h_t(x) to which it would assign x. The votes are weighted by the factor α_t, which is bigger if the base classifier is better. The key point for the modification is the property of the algorithm AdaBoost.M1 that

    α_t = ln((1 − ε_t)/ε_t) ≥ 0   ⇔   ε_t ≤ 1/2 .        (1)

If the error rate is bigger than 1/2, the weight α_t gets negative, so the ensemble classifier H does the opposite of what the base classifier h_t proposes. In AdaBoost.M1 α_t does not get negative, because in the derivation of the bound
for the training error of AdaBoost.M1 α_t is assumed to be positive and AdaBoost.M1 stops if ε_t ≥ 1/2 (and therefore α_t ≤ 0 because of (1)). Nevertheless, this was the starting point for our modification. If one has a base classifier h with error rate greater than 1/2 but better than random guessing (which has an expected error rate of 1 − 1/|G|), the ensemble classifier H should not do the opposite of what the base classifier h proposes. So we wanted to find a choice for α_t(ε_t) such that

    α_t ≥ 0   ⇔   ε_t ≤ 1 − 1/|G| .        (2)
To derive α_t(ε_t) we assumed α_t(ε_t) to be basically of the same form as α_t in AdaBoost.M1, so we set

    α_t(ε_t) = ln( (a_n ε_t + b_n) / (a_d ε_t + b_d) ) =: ln(z(ε_t))        (3)

where n and d are subscripts for the numerator and denominator respectively. Then we wanted α_t to fulfill (2) and two additional conditions, which are also fulfilled by α_t(ε_t) of AdaBoost.M1:

    α_t(1 − 1/|G|) = 0
    α_t → −∞ for ε_t → 1
    α_t → ∞ for ε_t → 0 .

For z(ε_t) this means that

    z(1 − 1/|G|) = 1
    z(1) = 0
    ε_t = 0 ⇒ denominator of z = 0 .

These conditions directly result in the following conditions for the 4 constants:

    a_n (1 − 1/|G|) + b_n = a_d (1 − 1/|G|) + b_d
    b_n = −a_n
    b_d = 0 .

Substituting for b_n and b_d in the first equation and solving leads to a_n = a_d (1 − |G|). Now we substitute the constants a_n, b_n and b_d in (3), a_d gets cancelled, and we get

    α_t = ln( (|G| − 1)(1 − ε_t) / ε_t ) .        (4)
Note that up to this point this is just an ad-hoc modification without any proof of a decrease in the error rate. So we also don't have a stopping criterion any more. An intuitive ad-hoc stopping criterion would stop if

    ε_t ≥ 1 − 1/|G| .

For the experiments we stopped after a big, prespecified number T of boosting rounds and investigated whether the stopping criterion above would have done well. Since the rest of the algorithm AdaBoost.M1 is left untouched we can already write down the algorithm in Fig. 2.
    Input: learning set L = {(x_1, g_1), . . . , (x_N, g_N); x_i ∈ X, g_i ∈ G}, G = {1, . . . , |G|},
           classifier of the form h : X → G, T: number of boosting rounds.
    Initialization: D_1(i) = 1/N.
    For t = 1, . . . , T:
      – Train the weak classifier h_t with distribution D_t, where h_t should minimize the
        weighted error rate
            ε_t = Σ_i D_t(i) I(h_t(x_i) ≠ g_i) .
      – Set α_t = ln( (|G| − 1)(1 − ε_t) / ε_t ).
      – Update D:
            D_{t+1}(i) = D_t(i) exp(−α_t I(h_t(x_i) = g_i)) / Z_t
        where Z_t is a normalization factor (chosen so that D_{t+1} is a distribution).
    Output: Set the final classifier H(x):
        H(x) = arg max_{g∈G} f(x, g) = arg max_{g∈G} Σ_{t=1}^{T} α_t I(h_t(x) = g) .

Fig. 2. Algorithm AdaBoost.M1W
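For readers who want to try the modification directly, the following Python sketch implements AdaBoost.M1W as in Fig. 2, using a scikit-learn decision stump as the weak learner; the library choice, parameter defaults and function names are our assumptions, not the authors' implementation.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def adaboost_m1w(X, y, T=100):
        """Sketch of AdaBoost.M1W (Fig. 2): returns the weak learners and their weights."""
        classes = np.unique(y)
        G = len(classes)
        N = len(y)
        D = np.full(N, 1.0 / N)                        # D_1(i) = 1/N
        learners, alphas = [], []
        for _ in range(T):
            h = DecisionTreeClassifier(max_depth=1)    # decision stump, as in Section 4
            h.fit(X, y, sample_weight=D)
            pred = h.predict(X)
            eps = np.clip(D[pred != y].sum(), 1e-12, 1 - 1e-12)  # weighted error rate
            alpha = np.log((G - 1) * (1 - eps) / eps)  # the one changed line, Eq. (4)
            D = D * np.exp(-alpha * (pred == y))       # decrease weights of correct examples
            D = D / D.sum()                            # renormalize (Z_t)
            learners.append(h)
            alphas.append(alpha)
        return learners, alphas, classes

    def predict_m1w(learners, alphas, classes, X):
        votes = np.zeros((len(X), len(classes)))
        for h, a in zip(learners, alphas):
            pred = h.predict(X)
            for j, c in enumerate(classes):
                votes[:, j] += a * (pred == c)         # weighted vote f(x, g)
        return classes[votes.argmax(axis=1)]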
3 Theoretical Analysis of AdaBoost.M1W
Due to suggestions of the reviewers we also made a theoretical analysis of AdaBoost.M1W. We can show that the algorithm doesn't minimize an upper bound for the training error, but an upper bound for a new performance measure, which we call the guessing error. This performance measure compares the final classifier
with random guessing, which has a training error rate of 1 − 1/|G|. The guessing error guesserr is defined as the proportion of examples where the classifier performs worse than random guessing.

Definition 1. A classifier f : X × G → [0, 1] makes a guessing error in classifying an object x coming from group g, if

    f(x, g) < 1/|G| .

The corresponding estimate of the expected guessing error using the training set is called guesserr:

    guesserr := (1/N) Σ_{i=1}^{N} I( f(x_i, g_i) < 1/|G| ) .

Note that by dividing f from AdaBoost.M1W by Σ_t α_t we ensure that f(x, g) ∈ [0, 1].
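A tiny helper (ours, for illustration only) that computes guesserr from the normalized vote scores:

    # Fraction of training examples whose normalized vote score for the true class
    # falls below 1/|G| (Definition 1).
    import numpy as np

    def guessing_error(f_scores, y_idx, G):
        # f_scores: (N, G) matrix with f(x_i, g) already divided by sum_t alpha_t
        # y_idx:    (N,) integer indices of the true groups g_i
        true_scores = f_scores[np.arange(len(y_idx)), y_idx]
        return float(np.mean(true_scores < 1.0 / G))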
The following theorem guarantees an exponential decrease of the guessing error of AdaBoost.M1W as long as the base classifier is better than random guessing.

Theorem 1. If all base classifiers h_t : X → G satisfy

    ε_t = Σ_i D_t(i) I(h_t(x_i) ≠ g_i) ≤ 1 − 1/|G| − δ

for δ ∈ (0, 1 − 1/|G|), then the guessing error on the training set for AdaBoost.M1W fulfills

    guesserr < Π_t ε_t^{1−1/|G|} (1 − ε_t)^{1/|G|} / ( (1 − 1/|G|)^{1−1/|G|} (1/|G|)^{1/|G|} ) ≤ e^{−δ² T} .
Proof. (i) Similar to the calculations used to bound the error rate of AdaBoost, we begin by bounding guesserr in terms of the normalization constants Z_t. We make a guessing error for example i, if

    f(x_i, g_i) / Σ_t α_t < 1/|G|   ⇒   exp( −( f(x_i, g_i) − Σ_t α_t/|G| ) ) > 1 .

So

    guesserr := (1/N) Σ_{i=1}^{N} I( f(x_i, g_i) / Σ_t α_t < 1/|G| )
              < (1/N) Σ_{i=1}^{N} exp( −( f(x_i, g_i) − Σ_t α_t/|G| ) ) .        (5)
Now we unravel the update rule:

    1 = Σ_i D_{t+1}(i) = Σ_i D_t(i) exp(−α_t I(h_t(x_i) = g_i)) / Z_t = . . .
      = Σ_i (1/N) Π_{s=1}^{t} exp(−α_s I(h_s(x_i) = g_i)) / Z_s
      = ( Π_s 1/Z_s ) (1/N) Σ_i exp(−f(x_i, g_i)) .

So we get

    Π_t Z_t = (1/N) Σ_i exp(−f(x_i, g_i))

and, together with (5), we get

    guesserr ≤ exp( Σ_t α_t/|G| ) Π_t Z_t = Π_t exp(α_t/|G|) Z_t .        (6)

(ii) Now we bound Π_t exp(α_t/|G|) Z_t:
    Π_t exp(α_t/|G|) Z_t = Π_t exp(α_t/|G|) Σ_i D_t(i) exp(−α_t I(h_t(x_i) = g_i))
      = Π_t exp(α_t/|G|) [ exp(−α_t) Σ_{i: h_t(x_i)=g_i} D_t(i) + Σ_{i: h_t(x_i)≠g_i} D_t(i) ]
      = Π_t exp(α_t/|G|) [ exp(−α_t)(1 − ε_t) + ε_t ]
      = Π_t ( (|G| − 1)(1 − ε_t)/ε_t )^{1/|G|} ( ε_t/(|G| − 1) + ε_t )
      = Π_t ε_t^{1−1/|G|} (1 − ε_t)^{1/|G|} / ( (1 − 1/|G|)^{1−1/|G|} (1/|G|)^{1/|G|} ) .

So together with (6) we get

    guesserr ≤ Π_t ε_t^{1−1/|G|} (1 − ε_t)^{1/|G|} / ( (1 − 1/|G|)^{1−1/|G|} (1/|G|)^{1/|G|} ) .        (7)
(iii) Now we show that this bound for guesserr decreases exponentially if ε_t = 1 − 1/|G| − δ with δ ∈ (0, 1 − 1/|G|) for all t. We can rewrite (7) as

    guesserr ≤ Π_t ( 1 − δ/(1 − 1/|G|) )^{1−1/|G|} ( 1 + δ/(1/|G|) )^{1/|G|}
and bound both terms using the binomial series. The series of the first term has only negative terms. We stop after the term of first order and get

    ( 1 − δ/(1 − 1/|G|) )^{1−1/|G|} ≤ 1 − δ .

The series of the second term has both positive and negative terms. We stop after the positive term of first order and get

    ( 1 + δ/(1/|G|) )^{1/|G|} ≤ 1 + δ .

Thus

    guesserr ≤ Π_t (1 − δ)(1 + δ) = Π_t (1 − δ²) .

Using 1 + x ≤ e^x for x ≤ 0 leads to

    guesserr ≤ e^{−δ² T} .        (8)
Due to the theorem, not only the algorithm but also the ad-hoc stopping criterion of the previous section is now theoretically confirmed. There are some generalization possibilities of AdaBoost.M1W: the definition of the guessing error and the theorem can be generalized for any C ∈ (0, 1/2] replacing 1/|G| in a straightforward way, leading to the performance measure

    err_C := (1/N) Σ_{i=1}^{N} I( f(x_i, g_i) < C )

and

    α_t = ln( (1 − C)(1 − ε_t) / (C ε_t) ) .
This generalization also contains AdaBoost.M1 by setting C = 1/2. One can easily verify that for this case the theorem above and the theorem given in [3] coincide. Another apparent generalization would regard confidence-rated base classifiers h : X × G → [0, 1] instead of base classifiers h : X → G. We are currently working on generalizing the algorithm and the theorem to this case and are confident that we will finish this work soon.
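A quick numerical check (ours) that the generalized weight indeed reduces to the AdaBoost.M1 weight for C = 1/2 and to the AdaBoost.M1W weight (4) for C = 1/|G|:

    # Numerical sanity check of the generalized weight; the values chosen are arbitrary.
    import numpy as np

    def alpha_general(eps, C):
        return np.log((1 - C) * (1 - eps) / (C * eps))

    eps, G = 0.55, 10
    assert np.isclose(alpha_general(eps, 0.5), np.log((1 - eps) / eps))          # AdaBoost.M1
    assert np.isclose(alpha_general(eps, 1.0 / G), np.log((G - 1) * (1 - eps) / eps))  # AdaBoost.M1W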
4 Experiments
In our experiments we analyzed 9 multiclass datasets (Table 1) with both the algorithm AdaBoost.M1 and AdaBoost.M1W using decision stumps as base classifiers. The aim is to compare AdaBoost.M1 with AdaBoost.M1W. We decided not
Table 1. Properties of the databases, initial and minimal training error of AdaBoost.M1W, H_t := Σ_{s=1}^{t} α_s h_s

    database       N      groups  variables  err(h_1)  min_t err(H_t)
    digitbreiman   5000   10       7         81.1%     25.6%
    letter         20000  26      16         92.4%     53.0%
    optdigits      5620   10      64         79.7%      0.0%
    pendigits      10992  10      16         79.3%     21.8%
    satimage       6435    6      34         55.3%     20.7%
    segmentation   2310    7      19         71.1%      6.8%
    vehicle        846     4      18         58.1%     32.6%
    vowel          990    11      10         82.8%     49.8%
    waveform       5000    3      21         42.7%     15.1%
to compare it with AdaBoost.M2, because the latter uses confidence-rated base classifiers, which could give it a spurious advantage especially for big datasets [6]. However, we plan to compare the generalization of AdaBoost.M1W, which also uses confidence-rated base classifiers, to AdaBoost.M2. The main question to be answered by the experiments is whether AdaBoost.M1W is able to boost base classifiers with error rates greater than 1/2. The answer to this question is yes. For the 8 datasets where the error rate of a single decision stump exceeds 1/2, AdaBoost.M1 failed, because for all 8 datasets it couldn't decrease the training error rate at all, whereas AdaBoost.M1W worked for all 8 datasets (Table 1 and Fig. 3). Since AdaBoost.M1 didn't work for any of these 8 datasets we wanted to make an additional check that the algorithms are programmed properly. The waveform dataset is the only one where the error rate of a single decision stump is less than 1/2 and therefore AdaBoost.M1 (which was programmed without stopping criterion) is expected to work. This is the case: both algorithms can decrease the training error from 42.7% to below 20% (Fig. 4) (the Bayes error for this dataset is about 14%). It was surprising that AdaBoost.M1W was better than AdaBoost.M1 for this dataset. The base classifiers of AdaBoost.M1 had error rates greater than 1/2 already at iteration 35; the error rates of the base classifiers of AdaBoost.M1W were greater than 1 − 1/|G| from iteration 165 on. So AdaBoost.M1W is an ensemble of weaker trees, but the ensemble is bigger than the one of AdaBoost.M1. We don't want to overrate the result that AdaBoost.M1W also outperformed AdaBoost.M1 when the weak classifier had an initial error rate below 1/2, because it is a result for just one dataset. Further experiments with other datasets and other base classifiers are necessary to confirm this result. We also investigated the stopping criterion, which would stop the algorithm at the first round t_stop where ε_t ≥ 1 − 1/|G|. Figure 3 shows that the stopping
Fig. 3. Training (solid) and test error (dash-dotted) of AdaBoost.M1W dependent on the number of boosting rounds. The vertical line denotes t_stop
criterion is reasonable, but often stops before the training error has reached its minimum. This fact can be explained by Fig. 5. The training errors of the base classifiers by definition reach 1 − 1/|G| for the first time at t_stop, but then they can drop below 1 − 1/|G| again. When the training errors of the base classifiers are consistently above 1 − 1/|G| (right of the second vertical line), the training error of the ensemble is not improved any more. So the stopping criterion makes sense, but should be treated in a softer way. For example, one could stop if the last 5 training errors of the base classifiers are all above 1 − 1/|G|.
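Such a softened rule is easy to state in code; the following few lines (an illustration, with the window of 5 taken from the example above) are one possible formulation:

    # Softened stopping rule: stop once the last `window` base-classifier training
    # errors all exceed 1 - 1/|G|.
    def should_stop(base_errors, G, window=5):
        if len(base_errors) < window:
            return False
        return all(e > 1.0 - 1.0 / G for e in base_errors[-window:])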
Fig. 4. Training error of the base (dashed) and the ensemble classifier (solid) for the waveform dataset for AdaBoost.M1W (left panel) and AdaBoost.M1 (right panel). The vertical line denotes t_stop
5 Conclusion and Future Work
In this paper we proposed a new boosting algorithm, AdaBoost.M1W, which directly boosts multiclass base classifiers for multiclass problems. The algorithm is derived from the well-known algorithm AdaBoost.M1. The difference to AdaBoost.M1 concerns the definition of the weights of the base classifiers, which results in a change of only one line of the programming code. So everybody who has implemented AdaBoost.M1 can easily get AdaBoost.M1W. We introduced a performance measure, called the guessing error, which is the proportion of examples where the final classifier is worse than random guessing. Then we derived an upper bound for this guessing error, which gets minimized exponentially fast by AdaBoost.M1W as long as the base classifiers are better than random guessing. A generalization which contains both AdaBoost.M1W and AdaBoost.M1 and which leads to the already known upper bounds for the corresponding performance measures is straightforward. The change of this one line has much impact, because it makes the algorithm work for weak base classifiers, which could be clearly demonstrated in experiments. AdaBoost.M1W also had a slightly better result for the one dataset where the base classifier is strong enough for AdaBoost.M1 to work. To explore this further we plan to make more experiments with AdaBoost.M1W for stronger base classifiers. We will also work on generalizing the algorithm for confidence-rated base classifiers.
Fig. 5. Training error of the base (dashed) and the ensemble classifier (solid) for the vehicle (left panel) and letter (right panel) datasets. The first vertical line denotes t_stop
References

1. E. L. Allwein, R. E. Schapire, and Y. Singer, 2000. Reducing multiclass to binary: a unifying approach for margin classifiers. Machine Learning 1, 113-141.
2. T. G. Dietterich and G. Bakiri, 1995. Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research 2, 263-286.
3. Y. Freund and R. E. Schapire, 1997. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences 55(1), 119-139.
4. V. Guruswami and A. Sahai, 1999. Multiclass learning, boosting, and error-correcting codes. Proceedings of the Twelfth Annual Conference on Computational Learning Theory, 145-155.
5. R. E. Schapire, 1997. Using output codes to boost multiclass learning problems. Machine Learning: Proceedings of the Fourteenth International Conference, 313-321.
6. R. E. Schapire and Y. Singer, 1999. Improved boosting algorithms using confidence-rated predictions. Machine Learning 37, 297-336.
Sparse Online Greedy Support Vector Regression

Yaakov Engel^1, Shie Mannor^2, and Ron Meir^2

1 Center for Neural Computation, Hebrew University, Jerusalem 91904, Israel
  [email protected]
2 Dept. of Electrical Engineering, Technion Institute of Technology, Haifa 32000, Israel
  {shie@tx,rmeir@ee}.technion.ac.il
Abstract. We present a novel algorithm for sparse online greedy kernel-based nonlinear regression. This algorithm improves current approaches to kernel-based regression in two aspects. First, it operates online: at each time step it observes a single new input sample, performs an update and discards it. Second, the solution maintained is extremely sparse. This is achieved by an explicit greedy sparsification process that admits into the kernel representation a new input sample only if its feature space image is linearly independent of the images of previously admitted samples. We show that the algorithm implements a form of gradient ascent and demonstrate its scaling and noise tolerance properties on three benchmark regression problems.
1 Introduction
Kernel machines have become by now a standard tool in the arsenal of Machine Learning practitioners. Starting from the seventies a considerable amount of research was devoted to kernel machines and in recent years focused on Support Vector Machines (SVMs) [13]. A basic idea behind kernel machines is that under certain conditions the kernel function can be interpreted as an inner product in a high dimensional Hilbert space (feature space). This idea, commonly known as the “kernel trick”, has been used extensively in generating non-linear versions of conventional linear supervised and unsupervised learning algorithms, most notably in classification and regression; see [5,8,11] for recent reviews. SVMs have the noteworthy advantage of frequently yielding sparse solutions. By sparseness we mean that the final classifier or regressor can be written as a combination of a relatively small number of input vectors, called the support vectors (SVs). Besides the practical advantage of having the final classifier or regressor depend on a small number of SVs, there are also generalization bounds that depend on the sparseness of the resulting classifier or regressor (e.g., [5]). However, the
The research of R. M. was supported by the fund for promotion of research at the Technion and by the Ollendorff center.
solutions provided by SVMs have been empirically shown to be not maximally sparse [1,6]. Additionally, support vector regression involves using twice as many variables as the number of samples, rendering the computational problem more difficult. The solution of SVM problems usually involves a non-trivial quadratic optimization problem. There have been several attempts to find efficient algorithms for this problem (e.g., [9,4]), most of which are based on the special form of the dual quadratic optimization problem. These methods display a super-linear dependence of the computation time on the number of samples and require repeated access to the training samples, making them suitable only for batch learning. Achieving sparseness by using linear dependence was suggested in [6]. The idea there is to solve the SVM problem using standard methods and then simplify the solution by eliminating linear dependencies in feature space. This simplification procedure effectively reduces the number of SVs while keeping the final classifier/regressor unchanged. An important requirement of online algorithms is that their per-time-step complexity should be bounded by a constant independent of t (t being the timestep index), for it is assumed that samples arrive at a constant rate. Since the complexity of SV learning is super-linear in the number of samples [4], performing aggressive sparsification concurrently with learning is essential. In this work we take advantage of an efficient greedy sparsification method in order to obtain an approximate, provably convergent, online SVR algorithm that we call SOG-SVR (sparse online greedy SVR). The remainder of the paper is organized as follows: In Section 2 we briefly overview support vector regression along with a sequential implementation which inspired our algorithm. In Section 3 we introduce a method for detecting linear dependencies in feature space and for sparsifying the solution based on these dependencies. Section 4 presents the SOG-SVR algorithm along with a convergence proof. In Section 5 we apply SOG-SVR to two benchmark problems, and we conclude with Section 6.
2 Support Vector Regression
Consider a training sample S = {(x_1, y_1), . . . , (x_ℓ, y_ℓ)}, x_i ∈ ℝ^d and y_i ∈ ℝ, where y_i = g(x_i) + ζ_i, for some unknown function g and noise variable ζ_i. The objective is to reconstruct a good approximation to g(·) from the finite data set S. The derivation of the SVR equations can be found in [11] and will not be repeated here, but for completeness we recall the main results. Let φ be a nonlinear mapping from input space to some high-dimensional feature space. For the linear regressor (in feature space) defined by f(·) := ⟨w, φ(·)⟩ + b, we wish to minimize

    R(ξ, ξ*, w) = (1/2) ‖w‖² + C Σ_{i=1}^{ℓ} (ξ_i + ξ_i*) ,        (2.1)
subject to (for all i ∈ {1, . . . , ℓ})

    y_i − f(x_i) ≤ ε + ξ_i* ,
    f(x_i) − y_i ≤ ε + ξ_i ,
    ξ_i, ξ_i* ≥ 0 ,        (2.2)
where ε defines the width of the error-insensitive zone of the cost function and ξ_i* and ξ_i are slack variables measuring the deviation of y_i − f(x_i) from the boundaries of the error-insensitive zone. This primal problem is transformed into its dual quadratic optimization problem: maximize

    L(α, α*) = −(1/2) (α* − α)^T K (α* − α) − ε (α* + α)^T e + (α* − α)^T y        (2.3)

subject to

    (α* − α)^T e = 0        (2.4)
    0 ≤ α*, α ≤ C e        (2.5)
where k(x, x′) = ⟨φ(x), φ(x′)⟩, [K]_{i,j} = k(x_i, x_j), α^(∗) = (α_1^(∗), . . . , α_ℓ^(∗))^T and e = (1, . . . , 1)^T. In order to simplify notation, here and in the sequel we use the standard notation α^(∗) to refer to either α or α*. The Representer Theorem [15] assures us that the solution to this optimization problem may be expressed solely in terms of the kernel function over the training set:

    f(·) = Σ_{i=1}^{ℓ} β_i k(x_i, ·) + b .        (2.6)
¯ i , x), (α∗i − αi )k(x
(2.7)
i=1
¯ , x) = k(x , x) + λ2 . The transformation to the homogeneous form is where k(x equivalent to adding a usually small, positive constant term to the kernel func¯ 2 in the primal Lagrangian tion. Note however, that the regularization term w (2.1) now includes the free term b. Homogenizing the regressor not only simplifies the analysis but also allows us to get rid of the first constraint (2.4) in the dual problem. This is significant because the remaining constraints can be enforced locally, in the sense that they do not mix different components of α(∗) . From this point on we assume that this transformation has been performed and we drop the “bar” notation. In [14] two sequential algorithms for learning support vector classification and regression, respectively, were introduced. While conventional SV learning algorithms operate in batch mode, requiring all training samples to be given in
Sparse Online Greedy Support Vector Regression
87
advance; in these two algorithms a single training sample is observed at each time step and the update is based solely on that sample, keeping the time complexity per time step O(1). The regression algorithm they propose (SVRseq) solves the kernel regression problem sequentially, and is presented in pseudo-code form in Table 1. The notation u is shorthand for the truncation function defined by u = max(0, min(C, u)); i.e., values are truncated so as to remain within the bounds specified by the constraints (2.5). η is a learning rate parameter.
Table 1. The SVRseq Algorithm 1. Parameters: η, ε, C. 2. Initialize: ∗ = 0, = 0. 3. For i = 1, . . . , di = yi − f (xi ) ∆α∗i = η(di − ε) ∆αi = −η(di + ε) α∗i = α∗i + ∆α∗i αi = αi + ∆αi 4. If training has converged stop, else repeat step 3.
3
Sparsity by Approximate Linear Dependence
When dealing with very large data sets we may not be able to afford, memorywise, to maintain the parameter vectors α∗ and α for the entire training sample. A large number of SVs would also slow the system down at its testing/operation phase. Moreover, sequential SV algorithms rely on multiple passes over the training set, since each update affects only one component of α∗ and α and makes only a small uphill step on the dual Lagrangian. In online learning no training sample is likely to be seen more than once, so, an alternative approach is required. Hence, we must devise a way to select which input samples should be remembered in the kernel representation, and to update their corresponding coefficients. In [6] a sparsification step is performed after a conventional SV solution is obtained. This is done by eliminating SVs that can be represented as linear combinations of other SVs in feature space, and appropriately updating the coefficients of the remaining ones in order to obtain exactly the same classifier/regressor1. Our aim is to incorporate a similar selection mechanism into an online learning scheme based on the sequential SV learning algorithm of [14]. The general idea is as follows. Let us assume that, at time step t, after having observed t − 1 training samples we have collected a dictionary of m linearly independent basis xj represent those elements in the training set which were vectors {φ(˜ xj )}m j=1 (˜ retained up to the t-th step). Now we are presented with a new sample xt . We 1
Note however, that the resulting solution does not necessarily conform to the constraints of the original problem.
88
Yaakov Engel et al.
test whether φ(xt ) is linearly dependent on the dictionary vectors. If not, we add it to the dictionary and increment m by 1. Due to the online mode of operation we do not wish to revisit and revise past decisions, and for this reason the addition of vectors to the dictionary must be done in a greedy manner. The requirement for exact linear dependence may lead us to numerical instabilities. Instead, for training sample xt , we will be content with finding coefficients {at,j }m j=1 with at least one non-zero element satisfying the approximate linear dependence condition 2 m δt = a φ(˜ x ) − φ(x ) (3.8) t,j j t ≤ ν . j=1 In other words, φ(xt ) can be approximated to within a squared error of ν by some linear combination of current dictionary members. By minimizing the left hand side of (3.8) we can simultaneously check whether this condition is satisfied and obtain the coefficient vector for input sample xt , at = (at,1 , . . . , at,m )T that best satisfies it. For reasons of numerical stability, we may sometimes prefer to sacrifice the optimality of at in return for a reduction in the size of its components. In this case we can add an 2 norm regularization term of the form γa2 to the minimization problem defined in (3.8), ending up with the optimization problem ˜t + ktt ˜ + γI)a − 2aT k minm aT (K (3.9) a∈IR
˜ i,j = k(˜ ˜t )i = k(˜ xi , x ˜j ), (k xi , xt ), ktt = k(xt , xt ), with i, j = 1, . . . , m. where [K] ˜t , and the condition for approximate ˜ + γI)−1 k Solving (3.9) yields at = (K linear dependence becomes ˜t + γat )T at ≤ ν . δt = ktt − (k
(3.10)
By defining [A]i,j = ai,j and taking the linear dependence (3.8) to be exact (i.e., ˜ T . For ν = 0), we may express the full × training set Gram matrix K as AKA T ˜ ν sufficiently small, AKA is a good approximation of K, and from this point ˜ T. on we will assume that indeed K = AKA
4
Sparse Online Greedy SVR
In this section we show how the idea of sparsification by linear dependence can be utilized in an online SVR algorithm. 4.1
A Reduced Problem
As already hinted above, we would like to end up with a SVR algorithm with time and memory requirements that are independent of the number of training samples that, for online algorithms, equals the time index t. Our aim is to
Sparse Online Greedy Support Vector Regression
89
obtain an algorithm with memory and time bounds, per time step, that are dependent only on the intrinsic dimensionality m of the data in feature space. In order to do that we define a new set of 2m “reduced” variables α ˜ (∗) = AT α(∗) ,
(4.11)
where it should be observed that α ˜ (∗) ∈ IRm , while α(∗) ∈ IR , and typically m . For clarity let us first consider the case where and m are fixed, i.e., we have a predefined set of dictionary vectors on which all other −m training-set feature ˜ α ˜ ∗ − α). ˜ In this vectors are linearly dependent. Let y = (y1 , . . . , y )T , f = AK( case the following rule can be used repeatedly to update the reduced variables: ∆α ˜ ∗ = η ∗ AT (y − f − εe) ∆α ˜ = −ηAT (y − f + εe) α ˜∗ = α ˜ ∗ + ∆α ˜∗ α ˜=α ˜ + ∆α ˜
(4.12)
where η ∗ , η are small positive learning rate parameters. There are several points worth stressing here. First, while the update in SVRseq is scalar in nature, with only a single component of α∗ and α updated in each step; here, in each update, all of the components are affected. Second, in the online case A becomes At which is a growing t × m matrix to which the row vector aTt is appended at each time step t. yt , ft and e are also increasingly large t-dimensional vectors. Luckily, we need only maintain and update their m-dimensional images under the trans˜ α ˜ ∗ − α) ˜ and ATt e. Third, the evaluation formation ATt : ATt yt , ATt ft = ATt At K( of the regressor at point xt can be performed using the reduced variables: f (xt ) =
t
(α∗i − αi )k(xi , xt )
i=1
=
t i=1
(α∗i − αi )
m
ai,j k(˜ xj , xt )
j=1
˜t = (α ˜t = (α∗ − α)T At k ˜ ∗ − α) ˜ Tk
ν→0
(4.13)
where the third equality above is exact only when ν = 0. Fourth, due to the quadratic nature of the dual Lagrangian, optimal values for the learning rates η (∗) can be analytically derived. We defer discussion of this issue to a longer version of the paper. Finally, note that here we do not truncate the updates ˜ (∗) vectors are not bounded within some box, nor do they ∆α ˜ (∗) , namely the α necessarily correspond to any α(∗) obeying the constraints (2.5). We will discuss the implications of this omission later. If, at time t, a new training sample is encountered for which the approximate linear dependence condition (3.8) is not satisfied, we append φ(xt ) to the dictionary, increasing m by 1. This has the following consequences:
90
Yaakov Engel et al.
– At is first appended with an additional column of t-1 zeros and then with the row vector aTt . Consequently, At is lower triangular. ˜ is also appended with a column and a row, first the column k ˜t and then – K T ˜ ˜ the row (kt , ktt ). Note that, contrary to K, K has always full rank and is therefore invertible. ˜ are each appended with an additional component, initialized at 0. – α ˜ ∗ and α A detailed pseudo-code account of the algorithm is given in Table 2. Concerning notation, we use “,” to denote horizontal concatenation, similarly, we use “;” to denote vertical concatenation. In order to make the distinction between matrices and vectors clear, we use [·] for a matrix and (·) for a vector.
Table 2. The SOG-SVR Algorithm – Parameters: ε, C, ν, γ.
= max(0, y1 /k1,1 ) , 1 = min(0, −y1 /k1,1 ) , – Initialize: ∗1 ˜ 1 = [k1,1 ], AT1 A1 = [1], AT1 y1 = (y1 ), AT1 e = (1), 0 = (0), I = [1], K e = (1), m = 1. – for t = 2, 3, . . . 1. Get new sample: (xt , yt ) ˜t )i = k(˜ ˜ t : (k xi , xt ) 2. Compute k 3. Approximate linear dependence test: ˜ m + γI)−1 k ˜t at = ( K ˜t + γat )T at δt = ktt − (k if δt > ν h% add xt to dictionary i ˜ m, k ˜ m+1 = K ˜t ; k ˜Tt , ktt K at = (0, . . . ; 1) ATt e = (ATt−1 e; 1) ATt yt = (ATt−1 yt−1 ; yt ) ATt At = [ATt−1 At−1 , 0; 0T , 1] ∗t−1 = (∗t−1 ; 0) t−1 = (t−1 ; 0) I = [I, 0; 0T , 1] 0 = (0; 0) m = m+1 else % dictionary remains unchanged ATt e = ATt−1 e + at ATt yt = ATt−1 yt−1 + at yt ATt At = ATt−1 At−1 + at aTt 4. Update ˜ ∗ and ˜: T ˜ ˜ t−1 ) ˜ ∗t−1 − At ft = ATt At K( ∆ ˜ ∗ = ηt∗ ATt (yt − ft − εe) ∆ ˜ = −ηt ATt (yt − ft + εe) ˜ ∗t = ˜ ∗t−1 + ∆˜ ∗ ˜ t = ˜ t−1 + ∆˜
Sparse Online Greedy Support Vector Regression
4.2
91
SOG-SVR Convergence
We now show that the SOG-SVR performs gradient ascent in the original Lagrangian, which is the one we actually aim at maximizing. Let us begin by noting that L(α∗ , α) from (2.3) may be written as the sum of two other Lagrangians, each defined over orthogonal subspaces of IR . ˆα ¯α L(α∗ , α) = L( ˆ ∗ , α) ˆ + L( ¯ ∗ , α) ¯
(4.14)
where α ˆ (∗) = AA† α(∗) , α ¯ (∗) = (I − AA† )α(∗) , (4.15) 1 ˆα L( ˆ ∗ , α) ˆ ∗ − α) ˆ = − (α ˆ T K(α ˆ ∗ − α) ˆ + (α ˆ ∗ − α) ˆ T y − ε(α ˆ ∗ + α) ˆ T e ,(4.16) 2 ¯α L( ¯ ∗ , α) ¯ = (α ¯ ∗ − α) ¯ T y − ε(α ¯ ∗ + α) ¯ Te , (4.17) and A† = (AT A)−1 AT is the pseudo-inverse of A. It is easy to see that Lˆ may be written entirely in terms of α ˜ (∗) (AA† is symmetric and therefore α ˆ = (A† )T α ˜ ∗ ): 1 ∗ ˜ α Lˆ = − (α ˜ ∗ − α) ˜ − α) ˜ T K( ˜ + (α ˜ ∗ − α) ˜ T A† y − ε(α ˜ ∗ + α) ˜ T A† e , 2
(4.18)
Theorem 1. For η ∗ and η sufficiently small, using the update rule (4.12) causes ˆ i.e., a non-negative change to the Lagrangian L; def ˆα ˆα ∆Lˆ = L( ˆ ∗ + ∆α ˆ ∗, α ˆ + ∆α) ˆ − L( ˆ ∗ , α) ˆ ≥0 .
(4.19)
Proof. To first order in η ∗ , ∆Lˆ is proportional to the inner product between the update ∆α ˜ ∗ and the gradient of Lˆ w.r.t. α ˜ ∗ . Differentiating (4.18) yields ∂ Lˆ ˜ α = A† y − K( ˜ ∗ − α) ˜ − εA† e = A† (y − f − εe) . ∂α ˜∗
(4.20)
The inner product mentioned above is: ∆α ˜∗
T
∂ Lˆ T = η ∗ (y − f − εe) AA† (y − f − εe) ≥ 0 . ∂α ˜∗
(4.21)
The last inequality is based on the fact that AA† is positive semi-definite2 . The exact same treatment can be given to the update ∆α, ˜ completing the proof. 2
More specifically, AA† is the projection operator on the subspace of IR spanned by the columns of A; therefore, its eigenvalues equal either 1 or 0.
92
Yaakov Engel et al.
A direct consequence of Theorem 1 is that, since α ¯ (∗) is not updated during learning, the change to the Lagrangian L is also non-negative. However, for a positive ε, if α ˜ (∗) and α ¯ (∗) are not constrained appropriately, neither Lˆ nor L¯ may possess a maximum. Currently, we do not see a way to maintain the feasibility of α ˜ (∗) w.r.t. the original constraints (2.5) on α(∗) with less than O( ) work per time-step. For the purposes of convergence and regularization it is sufficient to maintain box constraints similar to (2.5), in IRm ; i.e, ˜ ≤α ˜ . −Ce ˜∗, α ˜ ≤ Ce
(4.22)
Proving the convergence of SOG-SVR under the constraints (4.22) is a straightforward adaptation of the proof of Theorem 1, and will not be given here for lack of space. In practice, maintaining such constraints seems to be unnecessary.
5
Experiments
Here, we report the results of experiments comparing the SOG-SVR to the stateof-the-art SVR implementation SVMTorch [4]. Throughout, except for the parameters whose values are reported, all other parameters of SVMTorch are at their default values. We first used SOG-SVR for learning the 1-dimensional sinc function sin(x)/x defined on the interval [−10, 10]. The kernel is Gaussian with standard deviation σ = 3. The other SVR parameters are C = 104 / and ε = 0.01, while the SOG-SVR specific parameters are λ = 0.1, ν = 0.01 and γ = 0. Learning was performed on a random set of samples corrupted by additive i.i.d. zero-mean Gaussian noise. Testing was performed on an independent random sample of 1000 noise-free points. All tests were repeated 50 times and the results averaged. Figure 1 depicts the results of two tests. In the first we fixed the noise level (noise std. 0.1) and varied the number of training samples from 5 to 5000, with each training set drawn independently. We then plotted the generalization error (top left) and the number of support vectors as a percentage of the training set (top-right). As can be seen, SOG-SVR produces an extremely sparse solution (with a maximum of 12 SVs) with no significant degradation in generalization level, when compared to SVMTorch. In the second test we fixed the training sample size at 1000 and varied the level of noise in the range 10−6 to 10. We note that SVMTorch benefits greatly from a correct estimation of the noise level by its parameter ε. However, at other noise levels, especially in the presence of significant noise, the sparsity of its solution is seriously compromised. In contrast, SOG-SVR produces a sparse solution with complete disregard to the level of noise. It should be noted that SOG-SVR was allowed to make only a single pass over the training data, in accordance with its online nature. We also tested SOG-SVR on two real-world data-sets - Boston housing and Comp-activ, both from Delve3 . Boston housing is a 13-dimensional data-set with 3
³ http://www.cs.toronto.edu/~delve/data/datasets.html
[Figure 1 appears here: four panels, "Generalization Errors" (log m.s.e.) and "Percent of SVs" (linear scale), comparing TORCH and SOG for the two tests described above.]
Fig. 1. SOG-SVR compared with SVMTorch on the sinc function. The horizontal axis is scaled logarithmically (base 10). In the generalization error graphs we use a similar scale for the vertical axis, while on the SV percentage graphs we use a linear scale.

506 samples. Our experimental setup and parameters are based on [11]. We divided the 506 samples randomly into 481 training samples and 25 test samples. The parameters used were $C = 500$, $\varepsilon = 2$ and $\sigma = 1.4$. The SOG-SVR parameters are $\lambda = 0.1$, $\nu = 0.01$ and $\gamma = 0$. Preprocessing consisted of scaling the input variables to the unit hyper-cube, based on minimum and maximum values. Since this is a relatively small data-set, we let SOG-SVR run through the training data 5 times, reporting its generalization error after each iteration. The results shown in Table 3 are averages based on 50 repeated trials. The Comp-activ data-set is a significantly larger 12-dimensional data-set with 8192 samples. Training and test sets were 6000 and 2192 samples long, respectively, and the same preprocessing was performed as for the Boston data-set. Due to the larger size of the training set, SOG-SVR was allowed to make only a single pass over the data. We made no attempt to optimize learning parameters for either algorithm. The results are summarized in Table 4. As before, results are averaged over 50 trials.
Table 3. Results on the Boston housing data-set, showing the test-set mean-squared error, its standard deviation and the percentage of SVs. SVMTorch is compared with SOG-SVR after 1–5 iterations over the training set. SOG-SVR performs comparably using less than 1/2 of the SVs used by SVMTorch. Throughout, the percentage of SVs has a standard deviation smaller than 1%.

Boston   SVMTorch  SOG-SVR 1  SOG-SVR 2  SOG-SVR 3  SOG-SVR 4  SOG-SVR 5
MSE      13.3      40.9       14.1       13.6       13.3       13.1
STD      11.8      69.2        9.3        8.9        8.6        8.4
% SV     37        17         17         17         17         17
Table 4. Results on the Comp-activ data-set, again comparing SOG-SVR with SVMTorch. Here SOG-SVR delivers a somewhat higher test-set error, but benefits from a much more sparse solution. Parameters: training-set size 6000, $C = 10^6/6000 \approx 167$, $\varepsilon = 1$, $\sigma = 0.5$, $\lambda = 0.1$, $\nu = 0.001$, $\gamma = 0$.

Comp-activ  SVMTorch  SOG-SVR
MSE         8.8       10.9
STD         0.4        1.0
% SV        63         9
6 Conclusions
We presented a gradient based algorithm for online kernel regression. The algorithm's per-time-step complexity is dominated by an O(m²) incremental matrix inversion operation, where m is the size of the dictionary. For this reason sparsification, resulting in a small dictionary, is an essential ingredient of SOG-SVR. Somewhat related to our work are incremental algorithms for SV learning. [12] presented empirical evidence indicating that large data-sets can be split into smaller chunks learned one after the other, augmenting the data in each new chunk with the SVs found previously, with no significant increase in generalization error. This idea was refined in [2], where an SVM algorithm is developed which exactly updates the Lagrange multipliers based on a single new sample. While the method of [12] lacks theoretical justification, it seems that both methods would be overwhelmed by the steadily growing number of support vectors found in large online tasks. In [10] an incremental SVM method is suggested in which the locality of RBF-type kernels is exploited to update the Lagrange multipliers of a small subset of points located around each new sample. It is interesting to note that, for RBF kernels, proximity in input space is equivalent to approximate linear dependence in feature space. However, for other, non-local kernels (e.g., polynomial), sparsification by eliminating linear dependencies remains a possible solution. In [7] an incremental kernel classification method is presented which is capable of handling huge data-sets (10^9 examples). This method results from a quadratic unconstrained optimization problem more closely related to regularized least squares algorithms than to SVMs. The reported algorithm performs linear separation in input space; it would be interesting to see if it
could be extended to non-linear classification and regression through the use of Mercer kernels. The current work calls for further research. First, a similar algorithm for classification can be developed; this is rather straightforward and can be deduced from the current work. Second, application of SOG-SVR to problems requiring online learning is underway. In particular, we plan to apply SOG-SVR to reinforcement learning problems. As indicated by the results on the Boston data-set, SOG-SVR may also be used in an iterative, offline mode simply in order to obtain a sparse solution; additional tests are required here as well. Third, some technical improvements to the algorithm seem worthwhile. Specifically, the learning rates may be optimized, resulting in faster convergence. Fourth, self-tuning techniques may be implemented in the spirit of [3]. This would make the algorithm more resilient to scaling problems. Fifth, conditions on the data distribution and the kernel under which the effective rank of the Gram matrix is low should be studied. Preliminary results suggest that for "reasonable" kernels and large sample size the effective rank of the Gram matrix is indeed much lower than the sample size, and is in fact asymptotically independent of it.
References 1. C. J. C. Burges and B. Sch¨ olkopf. Improving the accuracy and speed of support vector machines. In Advances in Neural Information Processing Systems, volume 9. The MIT Press, 1997. 85 2. G. Cauwenberghs and T. Poggio. Incremental and decremental support vector machine learning. In Advnaces in Neural Information Systems, pages 409–415, 2000. 94 3. O. Chapelle, V. Vapnik, O. Bousquet, and S. Mukherjee. Choosing multiple parameters for support vector machines. Machine Learning, 46:131–160, 2002. 95 4. R. Collobert and S. Bengio. SVMTorch: Support vector machines for large-scale regression problems. Journal of Machine Learning Research, 1:143–160, 2001. 85, 92 5. N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines. Cambridge University Press, Cambridge, England, 2000. 84 6. T. Downs, K. Gates, and A. Masters. Exact simplification of support vector solutions. Journal of Machine Learning Research, 2:293–297, December 2001. 85, 87 7. G. Fung and O. L. Mangasarian. Incremental support vector machine classification. In Second SIAM Intrnational Conference on Data Mining, 2002. 94 8. R. Herbrich. Learning Kernel Classifiers. The MIT Press, Cambridge, MA, 2002. 84 9. J. Platt. Fast training of support vector machines using sequential minimal optimization. In Advances in Kernel Methods - Support Vector Learning, pages 42–65. MIT Press, 1999. 85 10. L. Ralaivola and F. d’Alch´e Buc. Incremental support vector machine learning: a local approach. In Proceedings of ICANN. Springer, 2001. 94 11. B. Sch¨ olkopf and A. Smola. Learning with Kernels. MIT Press, Cambridge, MA, 2002. 84, 85, 93
12. N. Syed, H. Liu, and K. Sung. Incremental learning with support vector machines. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI-99), 1999. 94 13. V. N. Vapnik. Statistical Learning Theory. Wiley Interscience, New York, 1998. 84 14. S. Vijayakumar and S. Wu. Sequential support vector classifiers and regression. In Proceedings of the International Conference on Soft Computing (SOCO'99), 1999. 86, 87 15. G. Wahba. Spline models for observational data. CBMS-NSF Regional Conference Series in Applied Mathematics, Vol. 59, Philadelphia: SIAM, 1990. 86
Pairwise Classification as an Ensemble Technique

Johannes Fürnkranz

Austrian Research Institute for Artificial Intelligence
Schottengasse 3, A-1010 Wien, Austria
[email protected]
Abstract. In this paper we investigate the performance of pairwise (or round robin) classification, originally a technique for turning multi-class problems into two-class problems, as a general ensemble technique. In particular, we show that the use of round robin ensembles will also increase the classification performance of decision tree learners, even though they can directly handle multi-class problems. The performance gain is not as large as for bagging and boosting, but on the other hand round robin ensembles have a clearly defined semantics. Furthermore, we show that the advantage of pairwise classification over direct multi-class classification and one-against-all binarization increases with the number of classes, and that round robin ensembles form an interesting alternative for problems with ordered class values.
1 Introduction
In a recent paper (F¨ urnkranz, 2001), we analyzed the performance of pairwise classification (which we call round robin learning) for handling multi-class problems in rule learning. Most rule learning algorithms handle multi-class problems by converting them into a series of two-class problems, one for each class, each using the examples of the corresponding class as positive examples, and all others as negative examples. This procedure is known as one-against-all class binarization. Round robin binarization, on the other hand, converts a c-class problem into a series of two-class problems by learning one classifier for each pair of classes, using only training examples for these two classes and ignoring all others. A new example is classified by submitting it to each of the c(c − 1)/2 binary classifiers, and combining their predictions via simple voting. The most important result of the previous study was that this procedure not only increases predictive accuracy, but that it is also no more expensive than the more commonly used one-against-all approach. Obviously, round robin classifiers may also be interpreted as an ensemble classifier that, similar to error-correcting output codes (Dietterich and Bakiri, 1995), constructs an ensemble by transforming the learning problem into multiple other problems and learning a classifier for each of them.1 In this paper, we will investigate the question whether round robin class-binarization can also improve 1
In fact, Allwein et al. (2000) show that pairwise classification (and other class binarization techniques) are a special case of a generalized version of error-correcting
performance for learning algorithms that can naturally handle multi-class problems, in our case decision tree learners. We will start with a brief recapitulation of our previous results on round robin learning (Section 2), and then investigate two questions: First, in Section 3, we will investigate the performance of round robin binarization as a general ensemble technique and compare it to bagging and boosting. We will also evaluate a straightforward integration of bagging and round robin learning. As more classes result in a larger ensemble of classifiers, it is reasonable to expect that the performance of round robin ensembles depends crucially on the number of classes of the problem. In Section 4, we will investigate this relation on classification problems with identical attributes but varying numbers of classes, which we obtain by discretizing the target variables of regression problems. Our results will show that round robin learning can indeed improve the performance of the c4.5 and c5.0 decision tree learners, and that a higher number of classes increases its performance, in particular in comparison to a one-against-all binarization.
2 Round Robin Classification
In this section, we will briefly review round robin learning (aka pairwise classification) in the context of our previous work in rule learning (F¨ urnkranz, 2001; 2002). Separate-and-conquer rule learning algorithms (F¨ urnkranz, 1999) are typically formulated in a concept learning framework, where the goal is to find a definition for an unknown concept, which is implicitly defined via a set of positive and negative examples. Within this framework, multi-class problems, i.e., problems in which examples may belong to (exactly) one of several categories, are usually addressed by defining a separate concept learning problem for each class. Thus the original learning problem is split into a series of binary concept learning problems—one for each class—where the positive training examples are those belonging to the corresponding class and the negative training examples are those belonging to all other classes. This technique for dealing with multiclass problems in rule learning has been proposed by Clark and Boswell (1991), but is also well-known in other areas such as neural networks, support vector machines, or statistics (cf. multi-response linear regression). A variant of the technique, in which classes are first ordered (e.g., according to their relative frequencies in the training set) is used in the ripper rule learning algorithm (Cohen, 1995). On the other hand, the basic idea of round robin classification is to transform a c-class problem into c(c − 1)/2 binary problems, one for each pair of classes. Note that in this case, the binary decision problems not only contain fewer training examples (because all examples that do not belong to the pair of classes are ignored), but that the decision boundaries of each binary problem may also be considerably simpler than in the case of one-against-all binarization. output codes, which allows to specify that certain classes should be ignored for some problems (in addition to assigning them to a positive or a negative class, as conventional output codes do).
Fig. 1. One-against-all class binarization (left) transforms each c-class problem into c binary problems, one for each class, where each of these problems uses the examples of its class as the positive examples (here o), and all other examples as negatives. Round robin class binarization (right) transforms each c-class problem into c(c−1)/2 binary problems, one for each pair of classes (here o and x), ignoring the examples of all other classes
In fact, in the example shown in Figure 1, each pair of classes can be separated with a linear decision boundary, while more complex functions are required to separate each class from all other classes.2 While this idea is known from the literature (cf. Section 8 of (F¨ urnkranz, 2002) for a brief survey), in particular in the area of support vector machines (Hsu and Lin, 2002, and references therein), the main contributions of (F¨ urnkranz, 2001) were to empirically evaluate the technique for rule learning algorithms and to show that it is preferable to the one-against-all technique that is used in most rule learning algorithms. In particular, round robin binarization helps ripper to outperform c5.0 on multiclass problems, whereas c5.0 outperforms the original version of ripper on the same problems. Our second, more important contribution was an analysis of the computational complexity of the approach. We demonstrated that despite the fact that its complexity is quadratic in the number of classes, the algorithm is no slower than the conventional one-against-all technique. It is easy to see this, if one considers that in the one-against-all case each training example is used c times (namely in each of the c binary problems), while in the round robin approach each example is only used c − 1 times, namely only in those binary problems, where its own class is paired against one of the other c − 1 classes. Furthermore, 2
Similar evidence was also seen in practical applications: Knerr et al. (1992) observed that the classes of a digit recognition task were pairwise linearly separable, while the corresponding one-against-all task was not amenable to single-layer networks, while Hsu and Lin (2002) obtained a larger advantage of round robin binarization over unordered binarization for support vector machines with a linear kernel than for support vector machines with a non-linear kernel.
Fig. 2. Error reduction ratios of boosting vs. round robin
the advantage of pairwise classification increases for computationally expensive learning algorithms. The reason is that super-linear learning algorithms learn many small problems faster than a few large problems. For details we refer to (F¨ urnkranz, 2002).
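To make the procedure concrete, the following sketch (added here for illustration, not the author's implementation) trains one binary classifier per pair of classes and combines predictions by simple voting; a scikit-learn decision tree merely stands in for c4.5/c5.0 or ripper, and ties are broken arbitrarily rather than in favor of larger classes as in the experiments below.

```python
import numpy as np
from itertools import combinations
from sklearn.tree import DecisionTreeClassifier

def train_round_robin(X, y):
    """One binary classifier per pair of classes, trained only on that pair."""
    classifiers = {}
    for ci, cj in combinations(np.unique(y), 2):
        mask = (y == ci) | (y == cj)      # examples of all other classes are ignored
        classifiers[(ci, cj)] = DecisionTreeClassifier().fit(X[mask], y[mask])
    return classifiers

def predict_round_robin(classifiers, x, classes):
    """Submit x to all c(c-1)/2 classifiers and combine their votes."""
    votes = {c: 0 for c in classes}
    for clf in classifiers.values():
        votes[clf.predict(x.reshape(1, -1))[0]] += 1
    return max(votes, key=votes.get)
```

Note that each training example enters only the c − 1 binary problems that involve its own class, which is the counting argument behind the run-time comparison above.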
3 Round Robin Ensembles
In this section we suggest that round robin classification may also be interpreted as an ensemble technique, and its performance gain may be viewed in this context. Like with conventional ensemble techniques, the final prediction is made by exploiting the redundancy provided by multiple models, each of them being constructed from a subset of the original data. However, contrary to subsampling approaches like bagging and boosting, these datasets are constructed deterministically.3 In this respect pairwise classification is quite similar to error-correcting output codes (Dietterich and Bakiri, 1995), but differs from them through its fixed procedure for setting up the new binary problems, and the fact that each of the new problems is smaller than the original problem. In particular, the latter fact may often cause the subproblems in pairwise classification to be conceptually simpler than the original problem (as illustrated in Figure 1). In previous work (F¨ urnkranz, 2001), we observed that the improvements in accuracy obtained by r3 (a round robin version of ripper) over ripper were quite similar to those obtained by c5.0-boost (c5.0 called with the option -b, i.e., 10 iterations of boosting) over c5.0 on the same problems. Round robin binarization seemed to work whenever boosting worked, and vice versa. Figure 2 plots the error ratios of r3 /ripper versus those of c5.0-boost/c5.0. The correlation coefficient r2 is about 0.618, which is in the same range as correlation coefficients 3
Boosting is also deterministic if the base learner is able to directly use weighted examples. Often, however, the example weights are interpreted as probabilities which are used for drawing the sample for the next boosting iteration.
Table 1. Boosting: A comparison between round robin binarization and boosting, both with c5.0 as a base learner. The first column shows the error rate of c5.0, while the next three column pairs show the results of round robin learning, boosting, and the combination of both, all with c5.0 as a base learner. For these, we give both the error rate and the performance ratio relative to the base learner c5.0. The last line shows the geometric average of all ratios (except letter). The final four columns show the run-times of all algorithms.

dataset            c5.0   round robin     boosting        both            run-times (for training)
                   error  error   ratio   error   ratio   error   ratio   c5.0    c5.0-rr   c5.0-boost   both
letter             12.48   8.80   0.705    5.78   0.463    5.45   0.437    7.04   107.06     70.17      325.94
abalone            78.48  75.08   0.957   77.88   0.992   74.67   0.951    2.587   38.532    28.019      81.883
car                 7.58   5.84   0.771    3.82   0.504    1.85   0.244    0.351    6.710     2.725       9.319
glass              35.05  24.77   0.707   27.57   0.787   22.90   0.653    0.228    5.252     1.960       8.710
image               3.20   2.90   0.905    1.60   0.500    1.73   0.541    0.230    5.259     2.150       7.523
lr spectrometer    51.22  51.79   1.011   46.70   0.912   51.98   1.015    0.051    2.520     0.184       3.863
optical             9.20   5.04   0.547    2.46   0.267    2.54   0.277    0.052    1.386     0.481       1.912
page-blocks         3.09   2.98   0.964    2.58   0.834    2.78   0.899    0.050    0.903     0.410       1.132
sat                13.82  13.16   0.953    9.32   0.675    9.00   0.651    5.582   93.241    58.605     102.235
solar flares (c)   15.77  15.69   0.995   16.41   1.041   16.70   1.059    0.033    2.351     0.265       2.883
solar flares (m)    4.90   4.90   1.000    5.90   1.206    5.83   1.191    5.398   44.581    44.319      92.370
soybean             9.66   6.73   0.697    6.59   0.682    6.44   0.667    0.991    6.559     8.653      19.376
thyroid (hyper)     1.11   1.14   1.024    1.03   0.929    1.33   1.190    4.854   21.997    44.672      70.576
thyroid (hypo)      0.58   0.69   1.182    0.32   0.545    0.53   0.909    0.662    6.181     5.260       9.000
thyroid (repl.)     0.72   0.74   1.037    0.90   1.259    0.90   1.259    0.111    9.245     0.788       9.578
vehicle            26.24  29.20   1.113   24.11   0.919   23.17   0.883    0.246    1.602     2.686       4.009
vowel              21.72  19.49   0.898    8.89   0.409   14.75   0.679    0.597    3.845     5.443       6.453
yeast              43.26  40.63   0.939   41.85   0.967   40.77   0.942    0.341    3.996     3.880       9.417
average                           0.909           0.735           0.757
for bagging and boosting (Opitz and Maclin, 1999). We interpreted this as weak evidence that the performance gains of round robin learning may be comparable to those of other ensemble methods and that it could be used as a general method for improving a learner's performance on multi-class problems. We will further investigate this question in this section and will in particular focus upon a comparison of round robin learning with boosting (Section 3.1) and bagging (Section 3.2), and upon the potential of combining it with these techniques. Large parts of this section also appeared in (Fürnkranz, 2002).

3.1 Comparison to Boosting
As a first step, we perform a direct comparison of the performance of c5.0 and c5.0-boost to c5.0-rr, a round robin procedure with c5.0 as the base learning algorithm. It transforms each c-class problem into c(c − 1)/2 binary problems and uses c5.0 to learn a decision tree for each of them. For predicting its class, a test example is submitted to all c(c − 1)/2 classifiers and their predictions are combined via unweighted voting. Ties are broken in favor of larger classes. Table 1 shows the results of 18 datasets with 4 or more classes from the UCI repository (Blake and Merz, 1998). For all datasets we estimated error rates with a 10-fold
stratified cross-validation, except for letter, where we used the standard 4000-example hold-out set.⁴ The first thing to note is that the performance of c5.0 does indeed improve by about 10% on average⁵ if round robin binarization is used as a pre-processing step for multi-class problems. This is despite the fact that c5.0 can directly handle multi-class problems and does not depend on a class binarization routine. However, the direct comparison between round robin classification and boosting shows that the improvement of c5.0-rr over c5.0 is not as large as the improvement provided by boosting: although there are a few cases where round robin outperforms boosting, c5.0-boost seems to be more reliable than c5.0-rr, producing an average error reduction of more than 26% on these 17 datasets. The correlation between the error reduction rates of c5.0-boost and c5.0-rr is very weak (r² = 0.276), which refutes our earlier hypothesis and brings up the question whether there is a fruitful combination of boosting and round robin classification. Unfortunately, the last column of Table 1 answers this question negatively: although there are some cases where the combination performs better than both of its constituents, using round robin classification with c5.0-boost as a base learner does not, on average, lead to performance improvements over boosting. In some sense, these results are analogous to those of Schapire (1997), who found that integrating error-correcting output codes into boosting did not improve performance. With respect to run-time, the performance of c5.0-rr (2nd column) cannot match that of c5.0 (first column). This was not to be expected, as c5.0 can directly learn multi-class problems and does not need to perform a class binarization (as opposed to ripper, where round robin learning is competitive; Fürnkranz 2001). However, in many cases, c5.0-rr, despite its inefficient implementation as a perl program that repeatedly writes training sets for c5.0 to the disc, can match the performance of c5.0-boost (3rd column), which tightly integrates boosting into c5.0.

3.2 Comparison to Bagging
A natural extension of the round robin procedure is to consider training multiple classifiers for each pair of classes (analogous to sports and games tournaments where each team plays each other team several times). For algorithms with random components (such as ripper's internal split of the training examples, or the random initialization of back-propagation neural networks) this could simply be performed by running the algorithm on the same dataset with different random seeds. For other algorithms there are two options: randomness could be injected into the algorithm's behavior (Dietterich, 2000) or random subsets of the available data could be used for training the algorithm. The latter procedure is

⁴ For this reason, we did not include the letter dataset into the computation of averages in this and subsequent sections.
⁵ As these are relative performance measures, we use a geometric average so that x and 1/x average to 1.
Table 2. Bagging: A comparison of round robin learning versus bagging and of the combination of both, using ripper, c5.0 and c5.0-boost as the base classifiers.

              base    round robin   bagging   both
ripper        1.000   0.747         0.811     0.685
c5.0          1.000   0.909         0.864     0.838
c5.0-boost    1.000   1.029         0.977     1.019
more or less equivalent to bagging (Breiman, 1996). We will evaluate this option in this section. Bagging was implemented by drawing 10 samples with replacement from the available data. Ties were broken in the same way as for round robin binarization, i.e., by simple voting using the a priori class probability as a tie breaker. Similarly, bagging was integrated with round robin binarization by drawing 10 independent samples of each pairwise classification problem. Thus we obtained a total of 10c(c − 1)/2 predictions for each c-class problem, which again were simply voted. The number of 10 iterations was chosen arbitrarily (to conform to c5.0-boost ’s default number of iterations) and is certainly not optimal (in both cases). Table 2 shows the results of a comparison of round robin learning, bagging, and a combination of both for ripper, c5.0, and c5.0-boost as base learners. We omit the detailed results here and show only the geometric average of the improvement rates of the ensemble techniques.6 The results show that the performance of the simple round robin (2nd column) can be improved considerably by integrating it with bagging (last column), in particular for ripper. The bagged round robin procedure reduces ripper’s error on the datasets to about 68.5% of the original error. Again, the advantage of the use of round robin learning is less pronounced for c5.0 (it is even below the improvement given by our simple bagging procedure), and the combination of c5.0-boost and round robin learning does not result in additional gains. Note that these average performance ratios are always relative to the base learner. This means they are only comparable within a line but not between lines. For example, c5.0’s performance as a base learner was considerably better than ripper’s by a factor of about 0.891. In terms of absolute performances, the best performing algorithm (on average) was bagged c5.0-boost, which has about 64% of the error rate of basic ripper. This confirms previous good results with combinations of bagging and boosting (Pfahringer, 2000; Krieger et al., 2001). In comparison, the combination of round robin and bagging for ripper (68.5% of ripper’s error rate) is relatively close behind, in particular if we consider the bad performance of ripper in comparison to c5.0. An evaluation of a boosting variant of ripper (such as slipper; Cohen and Singer, 1999) would be of interest. Even though they do not reach the same performance level as alternative ensemble methods, we believe that round robin ensembles nevertheless deserve 6
Detailed results for ripper can be found in (F¨ urnkranz, 2002).
attention because of the fact that each classifier in the ensemble has a clearly defined semantics, namely to predict whether an unseen example is more likely to be of class i or class j. This may result in a better comprehensibility of the predictions of the ensemble. In fact, Pyle (1999, p.16) proposes a very similar technique called pairwise ranking in order to facilitate human decision-making in ranking problems. He claims that it is easier for a human to determine an order between n items if one makes pairwise comparisons between the individual items and then adds up the wins for each item, instead of trying to order the items right away.
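The combination of bagging and round robin learning evaluated above can be sketched along the same lines. This is again an added illustration under the assumption of a scikit-learn tree as the base learner: ten bootstrap samples are drawn for each pairwise problem and all 10·c(c−1)/2 predictions are pooled by simple voting.

```python
import numpy as np
from itertools import combinations
from sklearn.tree import DecisionTreeClassifier

def train_bagged_round_robin(X, y, n_bags=10, seed=0):
    rng = np.random.default_rng(seed)
    ensemble = []
    for ci, cj in combinations(np.unique(y), 2):
        idx = np.flatnonzero((y == ci) | (y == cj))
        for _ in range(n_bags):
            # one bootstrap sample of the pairwise training set per bag
            boot = rng.choice(idx, size=idx.size, replace=True)
            ensemble.append(DecisionTreeClassifier().fit(X[boot], y[boot]))
    return ensemble

def predict_bagged_round_robin(ensemble, x, classes):
    votes = {c: 0 for c in classes}
    for clf in ensemble:
        votes[clf.predict(x.reshape(1, -1))[0]] += 1
    return max(votes, key=votes.get)
```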
4 Dependence on Number of Classes
The size of a round robin ensemble depends on the number of classes in the problem. In this section, we will analyze the behavior of round robin learning when varying the number of classes. With this goal in mind, we decided to follow the experimental set-up described by Frank and Hall (2001). They used a set of regression problems and transformed each of them into a series of classification problems, each with a different number of classes. The transformation was performed using equal-frequency discretization on the target variable. Thus the resulting problems were class-balanced. We use exactly the same datasets for our evaluation, and compare j48 (the c4.5 clone implemented in the Weka data mining library; Witten and Frank 2000) to j48-rr, a version that uses pairwise classification with j48 as a base learner.7 Table 3 shows the 10-fold cross-validation error rates of each algorithm on the 29 problems, together with a sign that indicates whether j48 (+) or the round robin version (−) had the higher estimated accuracy. No significance test was used to compute these individual differences, but in all three settings, j48-rr outperformed j48 in 22 out of 29 datasets. Even with the conservative sign test, which has a comparably high Type II error, we can reject the null hypothesis that the overall performance of j48 and j48-rr is identical on these 29 datasets with 99% confidence. However, four of the datasets (Pole Telecom, MV Artificial, Auto MPG, and Triazines) seem to be completely unamenable to pairwise classification, i.e., j48 performs better in all three classification settings. This, however, tells us nothing about the size of the improvement. Inspection of a few cases in Table 3 reveals that on several datasets the advantage of j48-rr over j48 seems to increase with the number of classes, at least for the step from three to five classes (cf., e.g., Abalone). In an attempt to make this observation more objective, we summarized the results of these two algorithms in Table 4, and also included the results of j48-1a, a version of j48 that uses a one-against-all binarization. We show the average performance of all algorithms, and
The implementation of j48-rr was provided by Richard Kirkby, which gave us the opportunity to check our previous findings with an independent implementation of the algorithm.
Table 3. Comparison of the error rates of j48 and j48-rr on 29 regression datasets, which were class-discretized to classification problems with 3, 5, and 10 class values.

                         3 classes              5 classes              10 classes
dataset                  j48    j48-rr  sign    j48    j48-rr  sign    j48    j48-rr  sign
Abalone                  36.10  35.37   −       53.66  49.37   −       73.16  68.83   −
Ailerons                 25.21  24.87   −       43.02  41.83   −       63.35  61.17   −
Delta Ailerons           19.67  19.61   −       44.46  42.54   −       58.73  54.80   −
Elevators                37.76  35.30   −       52.24  47.80   −       71.38  66.29   −
Delta Elevators          30.13  29.17   −       52.31  49.00   −       63.09  57.64   −
2D Planes                13.39  13.39   −       24.63  24.66   +       46.95  45.75   −
Pole Telecom              4.37   4.40   +        4.95   5.08   +        9.16   9.38   +
Friedman Artificial      19.73  19.52   −       35.15  34.31   −       58.96  56.79   −
MV Artificial             0.47   0.48   +        0.81   0.82   +        1.82   1.91   +
Kinematics               36.29  35.74   −       56.37  53.70   −       75.70  72.92   −
CPU Small                21.54  21.01   −       36.19  34.32   −       57.81  54.83   −
CPU Act                  19.25  18.82   −       33.23  31.46   −       54.27  51.63   −
House 8L                 30.46  29.53   −       49.75  47.32   −       69.59  65.81   −
House 16H                31.79  30.52   −       50.98  47.65   −       70.72  65.97   −
Auto MPG                 21.98  24.10   +       40.50  41.58   +       60.40  62.74   +
Auto Price               14.97  14.40   −       37.92  34.53   −       63.21  66.29   +
Boston Housing           25.10  24.11   −       40.38  38.44   −       61.05  58.16   −
Diabetes                 48.84  53.26   +       74.42  64.19   −       77.44  75.12   −
Pyrimidines              50.54  46.35   −       58.24  60.81   +       75.81  78.51   +
Triazines                46.72  46.88   +       61.08  63.17   +       83.06  83.17   +
Machine CPU              28.04  25.79   −       42.87  39.86   −       63.44  60.96   −
Servo                    24.31  19.88   −       44.67  39.52   −       65.33  57.72   −
Wisconsin Breast C.      63.20  63.35   +       76.60  74.43   −       88.87  86.65   −
Pumadyn 8NH              34.02  33.43   −       53.96  49.22   −       76.18  71.72   −
Pumadyn 32H              22.44  21.54   −       37.35  35.17   −       58.12  54.93   −
Bank 8FM                 13.98  14.16   +       26.86  26.25   −       50.29  48.33   −
Bank 32NH                44.21  43.53   −       62.59  60.84   −       75.74  71.04   −
California Housing       20.97  20.68   −       36.66  35.45   −       57.30  56.41   −
Stocks                    8.75   8.51   −       13.09  13.42   +       27.40  28.05   +
the geometric averages of the performance ratios of j48-rr over j48, and j48-1a over j48.8 The results show that the performance improvement of round robin over a one-against-all approach increases steadily by both measures. The performance improvement over j48 also increases in absolute terms, but stays about the same in relation to the error rate of j48 (the improvement is always approximately 3% of j48’s error rate). This seems to indicate that the one-against-all class binarization becomes more and more dangerous for larger numbers of classes. A possible reason could be that the class distributions of the binary problems in 8
Note that both measures are somewhat problematic: the average is dominated by results with large variations among the algorithms (particularly so for the run-time results, which are discussed below), while the performance ratios, which may be viewed as differences normalized by the performance of j48, are somewhat influenced by the fact that the default accuracy of the problems decreases with an increasing number of classes. Consequently, error differences for problems with more classes receive a lower weight (assuming there is some correlation of the performance of the algorithms and the default accuracy of the problem).
Table 4. Error and training time for a round robin version of j48, a one-against-all version of j48, regular j48, and the binarization technique for ordered classification of Frank and Hall (2001).

           error rates                                        run-times (for training)
classes    j48-rr   j48-1a   ratio   j48     ratio   j48-ORD   j48-rr   j48-1a   ratio   j48     ratio
3          26.82    26.57    0.99    27.39   1.02    26.30     17.65    35.34    1.66    15.99   0.82
5          40.92    42.48    1.04    42.93   1.04    41.43     27.90    53.52    1.53    24.58   0.78
10         58.40    63.83    1.10    60.63   1.03    58.92     45.47    84.78    1.38    35.76   0.64
the one-against-all case become more and more skewed for an increasing number of classes (because the number of examples for each class decreases). The fact that we chose almost the same experimental setup as Frank and Hall (2001) allows us to evaluate the performance of round robin learning in domains with ordered classification. The only difference is that we only used a single 10-fold cross-validation, while Frank and Hall (2001) averaged ten 10-fold cross-validation runs. However, these differences are negligible: in the six experiments that we both performed (those using j48 and j48-1a), their average accuracy estimates and our estimates differed by at most 0.05. Hence we are quite confident that the results for j48-ORD, which we computed from the tables published by Frank and Hall (2001), are comparable to our results for j48-rr. The interesting result is that there is almost no difference between the two. Apparently general round robin learning is as good for ordered classification as the modification of one-against-all learning that was suggested by Frank and Hall (2001). This opens up the question whether a suitable adaptation of round robin learning could further improve these results, which we leave open for future work. We also used these experiments to get confirmation from an independent implementation of round robin's favorable run-time results over one-against-all. The right-most part of Table 4 shows the summaries for the training times. As expected, round robin binarization is considerably faster than a one-against-all approach, despite the fact that round robin binarization generates c(c − 1)/2 binary problems for a c-class problem, while the one-against-all technique generates only c problems. However, the advantage seems to decrease with an increasing number of classes. This is not consistent with our expectations that the performance loss induced by the class binarization decreases with an increasing number of classes (Fürnkranz, 2002, Theorem 11). We are not exactly sure about the reason for this failed expectation. One explanation could be that the overhead for initializing the binary learning problems (which we did not take into account in our theoretical analysis) is worse than expected and may dominate the total run-time. Another reason could be memory swapping if not all c(c−1)/2 training sets can be held in memory. The first hypothesis is confirmed when we look at the average run-times, which are dominated by the performance on a few slow datasets. There, round robin is consistently almost twice as fast as one-against-all, which is approximately what we would expect from our theoretical results.
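The class-discretized problems used throughout this section are obtained by equal-frequency discretization of a regression target. The following small sketch of that preprocessing step is added here for illustration; the function name is ours, not from the paper.

```python
import numpy as np

def equal_frequency_classes(target, c):
    """Split the examples, sorted by target value, into c groups of (nearly) equal size."""
    order = np.argsort(target)
    labels = np.empty(len(target), dtype=int)
    for cls, chunk in enumerate(np.array_split(order, c)):
        labels[chunk] = cls
    return labels

# Example: a regression target turned into a balanced 3-class problem.
y = np.array([2.0, 0.5, 3.7, 1.1, 5.2, 4.4])
print(equal_frequency_classes(y, 3))   # -> [1 0 1 0 2 2]
```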
5 Conclusions
Pairwise classification is an increasingly popular technique for efficiently and effectively converting multi-class problems into binary problems. In this paper, we obtained two main results: First, we showed that round robin class binarization may be used as an ensemble method and improve classification performance even for learning algorithms that are in principle capable of directly handling multi-class problems, in particular the decision tree algorithms of the c4.5 family. However, the observed improvements are not as significant as the improvements we have obtained in previous experiments for the ripper rule learning algorithm, and do in general not reach the same performance level as boosting and bagging. We also showed how a straight-forward extension of round robin learning (namely to perform multiple experiments for each binary problem) may improve over the performance of both its constituents, round robin and bagging. Despite the fact that they did not reach the performance levels of bagging and boosting, we believe that round robin ensembles have advantages that make them a viable alternative, most notably the clearly defined semantics of each member in the ensemble. Our second main result shows that the performance improvements of round robin ensembles increase with the number of classes in the problem (at least for ordered classes). While the improvement over j48 grows approximately linearly with j48’s error rate, the growth of the performance increase over one-against-all class binarization is even more dramatic. We believe that this illustrates that handling many classes is a major problem for the one-against-all binarization technique, possibly because the resulting binary learning problems have increasingly skewed class distributions. At the same time, we were unable to confirm our expectations that the relative efficiency of round robin learning should improve with a larger number of classes. This might be due to the fact that our previous theoretical results underestimated the effect of the constant overhead that has to be spent for each binary problem. Nevertheless, run-times are still comparable to those of regular c4.5, so that the accuracy gain provided by round robin classification comes at very low additional costs. Finally, we also showed that round robin binarization is a valid alternative to learning from ordered classification. We repeated the experiments of Frank and Hall (2001) and found that round robin ensembles perform similar to the special-purpose technique that was suggested in their work. The most pressing issue for further research is an investigation of the effects of different voting schemes. At the moment, we have only tried the simplest technique, unweighted voting where each classifier may vote for exactly one class. A further step ahead might be to allow multiple votes, each weighted with a confidence estimate provided by the base classifier, or to allow a classifier only to vote for a class if it has a certain minimum confidence in its prediction. Several studies in various contexts have compared different voting techniques for combining the predictions of the individual classifiers of an ensemble (e.g., Mayoraz and Moreira, 1997; Allwein et al., 2000; F¨ urnkranz, to appear). Although the final word on this issue remains to be spoken, it seems to be the
case that techniques that include confidence estimates into the computation of the final predictions are in general preferable, and should be tried for round robin ensembles (cf. also Hastie and Tibshirani, 1998; Schapire and Singer, 1999).
Acknowledgments I would like to thank Eibe Frank and Mark Hall for providing the regression datasets (which were originally collected by Luis Torgo), Richard Kirkby for providing his implementation of pairwise classification in Weka, and the maintainers of and contributors to the UCI collection of machine learning databases. The Austrian Research Institute for Artificial Intelligence is supported by the Austrian Federal Ministry of Education, Science and Culture. This work is supported by the Austrian Fonds zur F¨ orderung der Wissenschaftlichen Forschung (FWF) under grant no. P12645-INF and an APART stipend of the Austrian Academy of Sciences.
References E. L. Allwein, R. E. Schapire, and Y. Singer. Reducing multiclass to binary: A unifying approach for margin classifiers. Journal of Machine Learning Research, 1:113–141, 2000. 97, 107 C. L. Blake and C. J. Merz. UCI repository of machine learning databases. http://www.ics.uci.edu/~mlearn/MLRepository.html, 1998. Department of Information and Computer Science, University of California at Irvine, Irvine CA. 101 L. Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996. 103 P. Clark and R. Boswell. Rule induction with CN2: Some recent improvements. In Proceedings of the 5th European Working Session on Learning (EWSL-91), pages 151–163, Porto, Portugal, 1991. Springer-Verlag. 98 W. W. Cohen. Fast effective rule induction. In A. Prieditis and S. Russell, editors, Proceedings of the 12th International Conference on Machine Learning (ML-95), pages 115–123, Lake Tahoe, CA, 1995. Morgan Kaufmann. 98 W. W. Cohen and Y. Singer. A simple, fast, and effective rule learner. In Proceedings of the 16th National Conference on Artificial Intelligence (AAAI99), pages 335–342, Menlo Park, CA, 1999. AAAI/MIT Press. 103 T. G. Dietterich. An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization. Machine Learning, 40(2):139–158, 2000. 102 T. G. Dietterich and G. Bakiri. Solving multiclass learning problems via errorcorrecting output codes. Journal of Artificial Intelligence Research, 2:263–286, 1995. 97, 100 E. Frank and M. Hall. A simple approach to ordinal classification. In L. D. Raedt and P. Flach, editors, Proceedings of the 12th European Conference on Machine Learning (ECML-01), pages 145–156, Freiburg, Germany, 2001. Springer-Verlag. 104, 106, 107
Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997. J. F¨ urnkranz. Separate-and-conquer rule learning. Artificial Intelligence Review, 13(1):3–54, February 1999. 98 J. F¨ urnkranz. Round robin rule learning. In C. E. Brodley and A. P. Danyluk, editors, Proceedings of the 18th International Conference on Machine Learning (ICML-01), pages 146–153, Williamstown, MA, 2001. Morgan Kaufmann Publishers. 97, 98, 99, 100, 102 J. F¨ urnkranz. Round robin classification. Journal of Machine Learning Research 2:721–747, 2002. 98, 99, 100, 101, 103, 106 J. F¨ urnkranz. Hyperlink ensembles: A case study in hypertext classification. Information Fusion, to appear. Special Issue on Fusion of Multiple Classifiers. 107 T. Hastie and R. Tibshirani. Classification by pairwise coupling. In M. Jordan, M. Kearns, and S. Solla, editors, Advances in Neural Information Processing Systems 10 (NIPS-97), pages 507–513. MIT Press, 1998. 108 C.-W. Hsu and C.-J. Lin. A comparison of methods for multi-class support vector machines. IEEE Transactions on Neural Networks, 13(2):415–425, March 2002. 99 S. Knerr, L. Personnaz, and G. Dreyfus. Handwritten digit recognition by neural networks with single-layer training. IEEE Transactions on Neural Networks, 3(6):962–968, 1992. 99 A. Krieger, A. J. Wyner, and C. Long. Boosting noisy data. In C. E. Brodley and A. P. Danyluk, editors, Proceedings of the 18th International Conference on Machine Learning (ICML-2001), pages 274–281, Williamstown, MA, 2001. Morgan Kaufmann Publishers. 103 E. Mayoraz and M. Moreira. On the decomposition of polychotomies into dichotomies. In Proceedings of the 14th International Conference on Machine Learning (ICML-97), pages 219–226, Nashville, TN, 1997. Morgan Kaufmann. 107 D. Opitz and R. Maclin. Popular ensemble methods: An empirical study. Journal of Artificial Intelligence Research, 11:169–198, 1999. 101 B. Pfahringer. Winning the KDD99 classification cup: Bagged boosting. SIGKDD explorations, 1(2):65–66, 2000. 103 J. C. Platt, N. Cristianini, and J. Shawe-Taylor. Large margin DAGs for multiclass classification. In S. A. Solla, T. K. Leen, and K.-R. M¨ uller, editors, Advances in Neural Information Processing Systems 12 (NIPS-99), pages 547– 553. MIT Press, 2000. D. Pyle. Data Preparation for Data Mining. Morgan Kaufmann, San Francisco, CA, 1999. 104 J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA, 1993. J. R. Quinlan. Bagging, boosting, and C4.5. In Proceedings of the 13th National Conference on Artificial Intelligence (AAAI-96), pages 725–730. AAAI/MIT Press, 1996.
R. E. Schapire. Using output codes to boost multiclass learning problems. In D. H. Fisher, editor, Proceedings of the 14th International Conference on Machine Learning (ICML-97), pages 313–321, Nashville, TN, 1997. Morgan Kaufmann. 102 R. E. Schapire and Y. Singer. Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37(3):297–336, 1999. 108 I. H. Witten and E. Frank. Data Mining — Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann Publishers, 2000. 104
RIONA: A Classifier Combining Rule Induction and k-NN Method with Automated Selection of Optimal Neighbourhood

Grzegorz Góra and Arkadiusz Wojna

Institute of Informatics, Warsaw University
ul. Banacha 2, 02-097 Warsaw, Poland
{ggora,wojna}@mimuw.edu.pl
Abstract. The article describes a method combining two widely-used empirical approaches: rule induction and instance-based learning. In our algorithm (RIONA) the decision is predicted not on the basis of the whole support set of all rules matching a test case, but on the basis of the support set restricted to a neighbourhood of the test case. The size of the optimal neighbourhood is automatically induced during the learning phase. The empirical study shows the interesting fact that it is enough to consider a small neighbourhood to preserve classification accuracy. The combination of k-NN and a rule-based algorithm results in a significant acceleration of the algorithm using all minimal rules. We study the significance of different components of the presented method and compare its accuracy to well-known methods.
1 Introduction
Many techniques of inductive concept learning from its instances have been developed so far [10]. Empirical comparison of these approaches shows that each performs well on some, but not all, domains. Great progress has been made in multistrategy learning to combine these approaches in order to construct a classifier that has properties of two or more techniques. Although the problem of inductive generalisation has no general solution (what is known as the conservation law for generalisation performance [11]), the goal is to increase the average accuracy for the real-world domains at the expense of accuracy for the domains that never occur in practice. We present a multi-strategy learning approach combining the rule induction [9] and the instance-based techniques [3,5]. There has been a lot of work done in this area [4,6,7]. Our algorithm considers all minimal decision rules, i.e. the most general rules consistent with training examples. It simulates classification based on the most frequent class in the support set of minimal rules covering a test object. The main idea is that the support set is restricted to the neighbourhood of a test example. The neighbourhood of a test example consists of either the objects within some distance from the test example or a number of objects closest to the test example (as in the k-NN method). The appropriate size of
a neighbourhood to be taken for classification is automatically induced during the process of learning. The crucial empirical observation is that taking a neighbourhood that is much smaller than the whole training set preserves or even improves accuracy. It enables both to induce the optimal neighbourhood during the learning phase and to classify objects effectively. The paper is organised as follows. In Section 2 the paper will be placed in the context of related work. Section 3 outlines the main features of two techniques that are most relevant to this work, i.e. rule induction and instance-based learning. Our algorithm, combining these approaches, is presented in Section 4. Section 5 provides experimental results evaluating the accuracy and the speed of the presented system. Section 6 concludes this paper with a brief summary and discussion of possible directions for future research.
2 Related Work
In recent literature there has been a number of works combining instance-based and decision rule induction methods. RISE system [4] is based on unification of these two methods. The difference between RISE system and our approach is that RISE selects the class for a test object on the basis of the closest rule. First, RISE generates decision rules. At the beginning instances are treated as maximally specific rules and these rules are then gradually generalised as long as the global leave-one-out accuracy is improving. An object is classified according to the closest rule. The distance between an object and a rule is measured with the metric combining the normalised Manhattan metric for numerical attributes and the Simple Value Difference Metric (SVDM) for symbolic attributes. An approach more similar to our method is presented in DeEPs and DeEPsNN [7]. The first difference is that DeEPs uses a different form of rule conditions and different criteria for rule selection. DeEPs classifies objects on the basis of all rules that have high frequency-changing rate (a measure similar to confidence). While classifying a test object the system computes the support set using all rules with high frequency-changing rate and selects the most frequent class in the support set. In our system the computed support set is limited to a certain neighbourhood of a test object. DeEPsNN combines 3-NN and DeEPs: if a certain fixed neighbourhood of a test object covers at least one training object, 3-NN is applied, otherwise DeEPs is used. In [1] an algorithm with the lazy rule induction approach is presented. It computes the support set of all minimal rules covering a test object in the following way. For each training object the algorithm constructs the local rule containing the common conditions of the test and the training objects and checks whether the training objects supporting the local rule are in the same decision class. Finally, the algorithm selects the class most frequent in the support set. This algorithm treats all attributes as symbolic. We generalised this approach for symbolic attributes and extended it to numerical attributes.
A detailed study of k-NN algorithms is presented in [12]. In particular, that paper describes research on selection of the optimal value of k. The experiments presented in that paper showed that the accuracy of k-NN is insensitive to the exact choice of k when the optimal k is large enough. Different methods for adapting the value of k locally within different parts of the input space have also been investigated. The local selection of k improves accuracy for data that contain noise or irrelevant features. Our approach combines the idea used in [1] (extended as described above) with k-NN method in such a way that it considers local rules only for the training examples from the k-nearest neighbourhood of a test example. The distance is measured with the metric used in RISE [4]. Moreover, the algorithm searches for the global optimal value k during the learning phase. This combination improves the accuracy of a k-NN classifier with a fixed value k and helps to reach the accuracy comparable to a rule-based classifier in case when the accuracy of the k-NN method is low.
3 Preliminaries and Definitions
We assume that a training set, denoted in the paper trnSet, consists of a finite set of examples. Each example is described by a finite set of attributes (features) $A \cup \{d\}$, i.e. $a : trnSet \to V_a$ for $a \in A \cup \{d\}$, where $d \notin A$ denotes the decision attribute and $V_a$ is a value domain of the attribute $a$. Two groups of attributes are considered: symbolic and numerical (real-valued). We denote by $Class(v)$ the subset of training examples with a class $v$. We also assume that $V_d = \{1, \ldots, |V_d|\}$.

3.1 Minimal and Lazy Rule Induction
Rule induction algorithms induce decision rules from a training set. A decision rule consists of a conjunction of attribute conditions and a consequent. The commonly used conditions are equations attribute = value for symbolic attributes and interval inclusion for numerical attributes, e.g.

IF ($a_1 = 2 \wedge a_3 \in [3, 7] \wedge a_6 = 5$) THEN ($d = 1$).

Many systems compute a set of such decision rules and then use it in the classification process. Another approach is the lazy concept induction that does not require calculation of decision rules before classification of new objects. An example of such an algorithm is presented in [1]. It generates only decision rules relevant for a new test object and then classifies it like algorithms generating rules in advance. Below we briefly describe this algorithm, generalised for symbolic attributes and extended to the case of numerical attributes.

Definition 1. For objects tst, trn we denote by $rule_{tst}(trn)$ the local rule with decision $d(trn)$ and the following conditions $c_i$ for each attribute $a_i$:

$$c_i = \begin{cases} a_i \in [\min(a_i(tst), a_i(trn)),\, \max(a_i(tst), a_i(trn))] & \text{when } a_i \text{ is numerical} \\ a_i \in B\big(a_i(tst),\, \delta(a_i(tst), a_i(trn))\big) & \text{when } a_i \text{ is symbolic} \end{cases}$$
where $B(c, R)$ is a ball centered in $c$ with radius $R$ and $\delta$ is a measure of attribute value similarity. The conditions in Definition 1 are chosen so that both the training and the test example satisfy the rule and the conditions are maximally specific. The condition used in [1] is a particular case of the above condition defined for symbolic attributes when the Hamming metric is used ($\delta(x, y) = 1$ if $x \ne y$ and 0 otherwise). Below we present the lazy rule induction algorithm (RIA). The function isConsistent(r, verifySet) checks whether a local rule r is consistent with a verifySet.

Algorithm 1 RIA(tst)
1. for each class v ∈ V_d
2.   supp(v) = ∅
3.   for each trn ∈ trnSet with d(trn) = v
4.     if isConsistent(rule_tst(trn), trnSet)
5.       then supp(v) = supp(v) ∪ {trn}
6. RIA = arg max_{v ∈ V_d} |supp(v)| / |Class(v)|

It was shown in [1] that RIA is equivalent to the algorithm based on calculating all rules that are maximally general and consistent with the training set. The time complexity of RIA for a single test object is O(n²), where n = |trnSet|. One of the motivations behind our work was to reduce this complexity.
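As an added illustration (not the authors' code), the sketch below implements the local rule of Definition 1 and the lazy classification of Algorithm 1, restricted to the simple case where symbolic attributes are compared with the Hamming metric mentioned above. The training set is represented as a list of attribute tuples trn_X with labels trn_y, and numeric is the set of indices of the numerical attributes.

```python
def satisfies_local_rule(tst, trn, z, numeric):
    """Does example z satisfy rule_tst(trn) from Definition 1?"""
    for i in range(len(tst)):
        if i in numeric:
            lo, hi = min(tst[i], trn[i]), max(tst[i], trn[i])
            if not (lo <= z[i] <= hi):
                return False
        else:
            # Hamming case: radius 0 if tst and trn agree on the attribute
            # (z must then equal the test value), radius 1 otherwise
            # (every value lies in the ball, so there is no constraint).
            if tst[i] == trn[i] and z[i] != tst[i]:
                return False
    return True

def ria(tst, trn_X, trn_y, numeric):
    classes = sorted(set(trn_y))
    class_size = {v: trn_y.count(v) for v in classes}
    support = {v: 0 for v in classes}
    for trn, v in zip(trn_X, trn_y):
        # a local rule is consistent if every training example it covers has class v
        consistent = all(
            y == v
            for z, y in zip(trn_X, trn_y)
            if satisfies_local_rule(tst, trn, z, numeric)
        )
        if consistent:
            support[v] += 1
    return max(classes, key=lambda v: support[v] / class_size[v])
```

The quadratic cost in the training-set size is visible here: every local rule is checked against every training example.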
It was shown in [1] that RIA is equivalent to the algorithm based on calculating all rules that are maximally general and consistent with the training set. The time complexity of RIA for a single test object is O(n2 ), where n = |trnSet|. One of the motivations behind our work was to reduce this complexity. 3.2
Instance-Based Learning
A commonly used instance-based learning method is the k nearest neighbours algorithm (k-NN ). It is based on the concept of similarity. Given a number of training examples the class for a test case is inferred from the k nearest examples in the sense of a similarity measure. Different measures are used for numerical and symbolic domains. For domains with both types of attributes a combination of these approaches may be used: δa (x, y) (x, y) = a∈A
where x, y are objects and δ_a(·, ·) is a measure of attribute value similarity. In the paper we used the normalised Manhattan distance for numerical attributes and SVDM (see e.g. [4]) for symbolic attributes:

δ_a(x, y) = { |a(x) − a(y)| / (a_max − a_min)                              for a numerical
            { Σ_{v ∈ V_d} |P(Class(v) | a(x)) − P(Class(v) | a(y))|       for a symbolic
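A sketch of this combined metric in Python; the table cond_prob[(a, value, v)], holding P(Class(v) | a = value) estimated from the training set, is an assumed precomputed structure rather than something defined in the paper:

    def rho(x, y, numeric_attrs, symbolic_attrs, a_min, a_max, classes, cond_prob):
        # Normalised Manhattan distance on numerical attributes,
        # SVDM on symbolic attributes, summed over all attributes.
        d = 0.0
        for a in numeric_attrs:
            d += abs(x[a] - y[a]) / (a_max[a] - a_min[a])
        for a in symbolic_attrs:
            d += sum(abs(cond_prob[(a, x[a], v)] - cond_prob[(a, y[a], v)])
                     for v in classes)
        return d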
4 Rule Induction with Optimal Neighbourhood Algorithm (RIONA)
Instead of considering all training examples in building a support set like in RIA, we can limit it to a certain neighbourhood of a test example. The intuition
behind it is that training examples far from a test object are less relevant for classification than closer examples. We consider two classes of neighbourhood:

Definition 2. For each test example tst we define S(tst, k) as the set of k training examples that are most similar to tst according to a similarity measure ρ.

Definition 3. For each test example tst we define B(tst, R) as the set of training examples trn such that ρ(tst, trn) ≤ R.

The former neighbourhood is similar to the one used in the k-NN algorithm. From now on we use the S(tst, k) neighbourhood in the paper, although we studied both classes of neighbourhoods in parallel; the empirical difference between them is discussed in Section 5. Now we are ready to present an approach to induction that is a kind of combination of case-based learning (see Section 3.2) and lazy minimal rule induction (see Section 3.1). The main idea is that we apply the following strategy for conflict resolving:

NormNStrength(tst, v) = | ⋃_{r ∈ MinRules^v_tst} supp(r) ∩ S(tst, k) | / |Class(v)|    (1)

where v denotes the v-th class, tst is a test example, supp(r) is the set of training examples matching a rule r, and MinRules^v_tst is the set of all rules that are maximally general and consistent with the training set, whose premise is satisfied by tst and whose consequent is the class v. In the classification process we assume that the parameter k of the neighbourhood is fixed. The proper size of the neighbourhood is found in the learning phase (see Section 4.1). In order to calculate the measure (1) we used a modified version of Algorithm 1. First, in line 3 of the algorithm only the examples trn ∈ S(tst, k) should be considered. Furthermore, it is not necessary to consider all the examples from the training set to check the consistency of rule_tst(trn). Please note that from Definition 1 we have:

Proposition 1. If trn′ satisfies rule_tst(trn) then ρ(tst, trn′) ≤ ρ(tst, trn).

Hence, the examples that are more distant from the test example tst than the training example trn cannot cause inconsistency of rule_tst(trn). The resulting classification algorithm is presented below. It predicts the most common class among the training examples that are covered by the rules satisfied by a test example and that are in the specified neighbourhood.
Algorithm 2 RIONA(tst)
  neighbourSet = S(tst, k)
  for each class v ∈ V_d
    supp(v) = ∅
    for each trn ∈ neighbourSet with d(trn) = v
      if isConsistent(rule_tst(trn), neighbourSet)
        then supp(v) = supp(v) ∪ {trn}
  RIONA = arg max_{v ∈ V_d} |supp(v)| / |Class(v)|
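A sketch of Algorithm 2 in Python, reusing local_rule and is_consistent from the RIA sketch in Section 3.1; nearest(tst, k), returning S(tst, k), is an additional assumed helper:

    def riona(tst, trn_set, classes, class_size, k, nearest):
        # RIA restricted to the k-nearest neighbourhood of the test example.
        neighbour_set = nearest(tst, k)
        supp = {v: 0 for v in classes}
        for trn in neighbour_set:
            if is_consistent(local_rule(tst, trn), neighbour_set):
                supp[trn['d']] += 1
        return max(classes, key=lambda v: supp[v] / class_size[v])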
For the maximal neighbourhood the algorithm RIONA works exactly like the RIA algorithm. On the other hand, taking the neighbourhood to be the single nearest training example, we obtain the nearest neighbour algorithm. In this sense RIONA lies between the nearest neighbour and the rule induction classifier.

4.1 Selection of Optimal Neighbourhood
During the experiments (see Section 5) we found that the performance of the algorithm can depend significantly on the size of the chosen neighbourhood, and a different size is appropriate for different problem domains. In fact, it is possible to estimate the optimal value of k for the S(tst, k) neighbourhood, much as the optimal value of k would be estimated for the k-NN method. The idea is to use the leave-one-out method on the training set to estimate the accuracy of the classifier for different values of k (1 ≤ k ≤ k_max) and then choose the value of k for which the estimate is greatest. Applying this directly would require repeating the leave-one-out estimation k_max times. However, we emulated this process in a time comparable to a single leave-one-out test for k equal to the maximal possible value k = k_max. This idea is realised in Algorithm 3.

Algorithm 3 findOptimalK(k_max)
  for each trn ∈ trnSet
    A_trn = getClassificationVector(trn, k_max)
  return arg max_k |{trn ∈ trnSet : d(trn) = A_trn[k]}|
function getClassificationVector(tst, k_max)
  NN = vector of k_max training examples NN_1, ..., NN_{k_max} nearest to tst, sorted according to the distance ρ(tst, ·)
  for each class v ∈ V_d
    decStrength[v] = 0
  currentDec = the most frequent class in trnSet
  for k = 1, 2, ..., k_max
    if isConsistent(rule_tst(NN_k), NN) then
      v = d(NN_k)
      decStrength[v] = decStrength[v] + 1
      if decStrength[v] / |Class(v)| > decStrength[currentDec] / |Class(currentDec)| then currentDec = v
    D[k] = currentDec
  return D

Ignoring the consistency checking in the function getClassificationVector(·, ·) we obtain the k nearest neighbours algorithm with selection of the optimal k
(ONN ). An experimental comparison of RIONA and ONN is presented in the next section.
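A Python sketch of Algorithm 3 and the function above, under the same assumed helpers; nearest_sorted(tst, k_max) is assumed to return the k_max training examples nearest to tst sorted by distance, excluding tst itself when used for the leave-one-out emulation:

    def classification_vector(tst, classes, class_size, k_max, nearest_sorted, majority_class):
        nn = nearest_sorted(tst, k_max)
        strength = {v: 0 for v in classes}
        current = majority_class
        decisions = []
        for nn_k in nn:
            if is_consistent(local_rule(tst, nn_k), nn):
                v = nn_k['d']
                strength[v] += 1
                if strength[v] / class_size[v] > strength[current] / class_size[current]:
                    current = v
            decisions.append(current)      # prediction with neighbourhood size k
        return decisions

    def find_optimal_k(trn_set, classes, class_size, k_max, nearest_sorted, majority_class):
        # Emulate leave-one-out for every 1 <= k <= k_max in a single pass.
        correct = [0] * (k_max + 1)
        for trn in trn_set:
            vec = classification_vector(trn, classes, class_size, k_max,
                                        nearest_sorted, majority_class)
            for k in range(1, k_max + 1):
                correct[k] += (trn['d'] == vec[k - 1])
        return max(range(1, k_max + 1), key=lambda k: correct[k])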
5 Experimental Study
Table 1 presents experimental results for 24 data sets from the UCI repository [2]. For data sets that are split into a training and a testing set the experiments were performed on the joined data. The accuracies for C5.0, DeEPs and DeEPsNN are taken from the paper [8]. The remaining algorithms were tested on an 800 MHz Pentium III PC with 512 MB of RAM. The algorithm RIA is time-expensive, so it was tested only on the smaller data sets. The results were obtained by performing 10-fold cross-validation 10 times for each data set. All implemented algorithms, RIONA, RIA, ONN, 3-NN and RIONA(B), were tested with exactly the same folds, and the significance of the difference between algorithms was estimated using a one-tailed paired t test.¹ The SVDM metric and the optimal neighbourhood were computed from the training set independently for each run in a cross-validation test. The total average accuracy was computed over all data sets except breast, bupa-liver and primary (for RIA it was computed only over the data sets for which its accuracy is given). For all data sets the presented results were obtained with the metric described in Section 3.2 and the NormNStrength measure for conflict resolving (see Section 4). Although during the preliminary experiments we tried other types of metric, none appeared better than the presented one in terms of accuracy across a range of problem domains. We also tried omitting the normalisation factor in the measure NormNStrength, which gave almost identical results. The optimal size of the neighbourhood was searched for during the learning process on the basis of the training examples. From the time complexity perspective it was important to limit the search for the optimal k to a small fixed range of values from 1 to k_max, in such a way that sorting and consistency checking of the k_max nearest neighbours were efficient. Since the values of k_max optimal in this sense are close to the square root of the training set size (see Section 5.2), we set k_max = 200 (close to the square root of the size of the largest domains). In the next subsection we examine the significance of this setting. In Table 1 one can see that significant differences in accuracy between RIONA and ONN (k-NN with selection of the optimal neighbourhood) occurred mostly for smaller data sets (breast, bupa-liver, chess, primary, solar-flare and yeast). The only difference between RIONA and ONN is the operation of consistency checking. In order to explain the similarity of results we checked what part of the k-neighbourhood for the optimal k is eliminated by the operation of consistency checking and found that only for the domains breast, primary
The result of a single cross-validation test was the accuracy averaged over all 10 folds and the final average accuracy and the confidence level for difference between RIONA and the corresponding algorithm were computed from 10 repeats of the cross-validation test (for census-income and shuttle only 4 repeats).
Table 1. The average optimal k, the average accuracy (%) and the standard deviation for RIONA with the optimal k-best neighbourhood, and the average accuracy (%) for the other systems: RIA, ONN, 3-NN, RIONA with the optimal B(tst, R) neighbourhood, C5.0, DeEPs and DeEPsNN. The superscripts denote the confidence levels: 5 is 99.9%, 4 is 99%, 3 is 97.5%, 2 is 95%, 1 is 90%, and 0 is below 90%. A plus sign indicates that the average accuracy of an algorithm is higher than that of RIONA, and a minus sign otherwise
kopt
australian (690, 14, 2)
41,2
86,1±0,4 65,0−5 85,7−2 85,0−4
RIONA
RIA
breast (277, 9, 2)
77,9
73,4±1,0
73,00
breast-wis (683, 9, 2)
3,0
97,0±0,3 89,7−5 97,00
bupa-liver (345, 6, 2)
40,6 66,6±1,7 63,0−5 64,1−4
census (45222, 16, 2)
42,1
83,8±0,0
-
chess (3196, 36, 2)
11,9
98,0±0,1
-
73,90
−5
ONN
3-NN
RIONA(B)
C5.0
85,7−2
85,9
84,9
68,6−5
73,60
-
-
-
97,10
96,1−5
95,4
96,4
96,3
88,4
66,40
-
-
-
84,1+5 82,0−5
83,9+5
85,8
85,9
85,9
96,9−5 97,0−5 −1
66,00
DeEPs DeEPsNN
97,5−5
99,4
97,8
97,8
−5
73,1−4
71,3
74,4
74,4 68,0
german (1000, 20, 2)
29,2 74,5±0,5 70,1
glass (214, 9, 6)
2,1
70,7±1,9 39,5−5 70,70 71,9+1
63,9−5
70,0
58,5
heart (270, 13, 2)
19,4
83,2±1,0 62,8−5 83,10
81,3−5
83,40
77,1
81,1
81,1
iris (150, 4, 3)
37,1
94,6±0,6 90,5−5 94,40
95,3+4
94,70
94,0
96,0
96,0
letter (20000, 16, 26)
3,8
95,8±0,1
95,80
95,80
94,0−5
88,1
93,6
95,5
lymph (148, 18, 4)
1,4
85,4±1,3 76,4−5 86,3+1 84,4−2
81,4−5
74,9
75,4
mushroom (8124, 22, 2)
1,0 100,0±0,0
-
100,00 100,00
nursery (12960, 8, 5)
-
74,1
72,1
100,00 100,0 100,0
84,1 100,0
43,3 99,3±0,0
-
99,30 98,1−5
99,2−4
97,1
99,0
99,0
pendigits (10992, 16, 10) 1,2
99,4±0,0
-
99,40
99,40
97,4−5
96,7
98,2
98,8
pima (768, 8, 2)
34,3
74,7±0,9 65,2−5 74,40
72,2−5
72,7−5
73,0
76,8
73,2
primary (336, 15, 21)
75,9
31,7±0,8 32,4+1 40,3+5 33,5+4
31,60
-
-
-
satimage (6435, 36, 6)
3,7
91,3±0,1
91,30 91,4+2
87,7−5
86,7
88,5
90,8
segment (2310, 19, 7)
1,7
97,4±0,1 45,3−5 97,5+2 97,3−2
92,1−5
97,3
95,0
96,6
shuttle (58000, 9, 7)
1,3
99,9±0,0
99,90
99,8−5
99,6
97,0
99,7
solar-flare (1066, 10, 8)
70,9
81,2±0,3 81,4+1 82,7+5 78,1−5
81,7+5
82,7
83,5
83,5
splice (3186, 60, 3)
17,3
93,9±0,2
93,90
94,00
94,6+5
94,2
69,7
69,7
wine (178, 13, 3)
10,1 97,2±0,6 40,1−5 97,20
96,90
94,5−5
93,3
95,6
95,5
yeast (1484, 8, 10)
23,0 59,8±0,6 45,9−5 58,1−5 54,9−5
59,1−4
56,1
59,8
54,6
87,3
86,6
86,1
87,1
Total Average
88,7±0,4
-
64,3
99,90
88,7
87,8
and solar-flare the fraction of eliminated nearest neighbours was significant. For other domains the number of consistent objects from the optimal neighbourhood in the RIONA algorithm is close to the number of all objects from the optimal neighbourhood of the k-NN algorithm. Therefore the differences in classification accuracy are small. These observations suggest that the operation of consistency checking in RIONA is not very significant, and making it more restrictive should be considered. On the other hand, the accuracy of RIONA and ONN is comparable to or better than that of well-known classifiers; in particular, their accuracy is generally better than the accuracy of RIA and 3-NN. This suggests that RIONA and ONN may successfully replace both the rule-based algorithm using all minimal rules and a k-NN classifier with a fixed k. It also shows that using a properly selected subset
of rules in rule-based systems gives better results than using all minimal rules. The range of tested data sets indicates that the presented algorithms work well for domains with both numerical and symbolic attributes. In particular, they handle numerical attributes well without preprocessing.

5.1 Further Study
In this section we describe further experiments and conclusions that help to explain important aspects of RIONA. First, we performed experiments to compare the two types of neighbourhood: the radial B(tst, R) and the k-best S(tst, k). For each data set we estimated the optimal value of the radius R and the optimal value of k from a training set and compared classification accuracy for both types of neighbourhood. Looking at the third and the seventh columns in Table 1 one can see that the accuracy of the algorithm with the neighbourhood B(tst, R) is significantly worse than with S(tst, k) on 14 domains (with confidence levels −4, −5) and significantly better on 3 domains (with confidence levels +4, +5). Therefore in further experiments we focused our attention on the neighbourhood S(tst, k).

The setting k_max = 200 preserved the efficiency of RIONA, but an interesting question was how significantly this setting influenced the classification results. Please note that the maximal possible value of k is just the size of the training set. In order to answer this question the following experiment was performed: for the smaller sets (less than 4000 objects) the classification accuracy was measured for all possible values of k, and for the larger sets the maximal value of k was set to k_max = 500 (for the set nursery we made the exception k_max = 1000). The classification accuracy was measured with the leave-one-out method applied to the whole sets. Figures 1 and 2 present the dependence of classification accuracy on the value of k for exemplary domains. For most data sets we observed that as k increases beyond a certain small value the classification accuracy falls (see Figure 1). In particular, comparing the third and the fourth column in Table 1, one can see that
Fig. 1. Accuracy for german
Fig. 2. Accuracy for census-income
for most data sets the results for the total neighbourhood are significantly worse than the results for the neighbourhood found by the algorithm RIONA. For the remaining data sets (breast, census-income, nursery, primary, solar-flare) the accuracy becomes stable beyond a certain value of k (see Figure 2). For the former group we examined the neighbourhood size (the value of k) for which the maximum accuracy was obtained. In the latter case we examined both the value of k beyond which the accuracy remains stable and the fluctuations in accuracy as k increases. For most domains the optimal value of k appeared to be much less than 200. On the other hand, for the domains where the optimal k was greater (australian, census-income and nursery) the loss in accuracy related to this setting was insignificant: it remained within the range of 0.15%. Moreover, the accuracy became stable for values of k much lower than 200. Therefore we could conclude that the setting k_max = 200 preserved good time complexity properties and did not change the results significantly for the tested data sets. For data sets split originally into a training and a testing set (splice, satimage, pendigits, letter, census-income, shuttle) we performed experiments to compare the accuracy in two cases: when the value of k was estimated either from the training set or from the test set (the optimal k). The experiments showed that for pendigits the accuracy obtained by RIONA differs by about half a percent from the accuracy with the optimal value of k, and for the other domains the difference remains within the range of 0.2%. This means that the algorithm finds an almost optimal value of k in terms of the obtained accuracy. Analogous experiments were done for the neighbourhood B(tst, R), and we observed that after the value R exceeded a constant R_max (where R_max was relatively small in comparison to the maximal possible value of R) the accuracy either became worse or did not improve significantly. This suggests a similar conclusion, i.e. the best accuracy is obtained for a small radius.
5.2 Time Complexity of RIONA
First, the learning algorithm performs two phases for each training object. In the first phase it selects the k_max nearest objects among n = |trnSet| objects. On average this is done in linear time. In the second phase the algorithm sorts all k_max selected objects and checks consistency among them, which takes O(k²_max). Finally, for the whole training set the algorithm computes the leave-one-out accuracy for each 1 ≤ k ≤ k_max, which takes O(n·k_max). Summing up, the average complexity of the learning algorithm is O(n(n + k²_max)). In practice the component O(n²) is dominant. Testing is analogous to learning. The classification algorithm finds the k_opt nearest examples and then checks consistency among them. Since k_opt ≤ k_max, the complexity is O(n + k²_max) for a single test object, and the total average complexity of the testing algorithm is O(m(n + k²_max)), where m is the number of test objects. In Table 2 one can see that for all the presented data sets the average time of classification of a single object is less than 0.6 s. Moreover, for larger data sets it is comparable with the single-object test time of the algorithm ONN and is much shorter than the single-object test time of the algorithm RIA.
Table 2. Single object test time (in seconds) for RIONA, RIA and ONN

Domain          t_RIONA   t_RIA    t_ONN     Domain        t_RIONA   t_RIA    t_ONN
australian      0,026     0,087    0,022     breast        0,016     0,021    0,014
breast-wis      0,032     0,063    0,017     bupa-liver    0,009     0,016    0,006
census          0,572     > 5,0    0,568     chess         0,130     0,891    0,126
german          0,047     0,188    0,042     glass         0,010     0,012    0,006
heart           0,019     0,024    0,014     iris          0,003     0,006    0,003
letter          0,236     > 5,0    0,224     lymph         0,017     0,019    0,014
mushroom        0,223     > 5,0    0,219     nursery       0,169     > 5,0    0,167
pendigits       0,133     > 5,0    0,130     pima          0,013     0,055    0,010
primary-tumor   0,018     0,028    0,018     satimage      0,174     > 5,0    0,169
segment         0,046     0,557    0,042     shuttle       0,378     > 5,0    0,376
solar-flare     0,025     0,082    0,023     splice        0,405     3,194    0,393
wine            0,010     0,891    0,007     yeast         0,017     0,104    0,014
In the case when the number of test objects is approximately equal to the number of training objects, taking into account both the learning and the classification phase, the average time complexity of RIONA is in practice O(n²), while the average time complexity of RIA is O(n³), which is quite a significant acceleration.
6 Conclusions and Future Research
The research reported in the paper attempts to bring together the features of rule induction and instance-based learning in a single algorithm. As the empirical results indicate, the presented algorithm obtains accuracy comparable to well-known systems such as 3-NN, C5.0, DeEPs and DeEPsNN. The experiments show that the choice of a metric is very important for the classification accuracy of the algorithm. The combination of the normalised Manhattan metric for numerical attributes and the SVDM metric for symbolic attributes proved to be very successful, and it did not require discretisation of numerical attributes. We have compared two types of neighbourhood: the k nearest neighbours S(tst, k) and the ball B(tst, R). The former type of neighbourhood gave generally better results, although the latter seemed more natural. This may suggest that the topology of the space induced by the used metric is rather complex. We found that the appropriate choice of the neighbourhood size is also an important factor for classification accuracy. It appeared that for all problem domains the optimal accuracy is obtained for a small neighbourhood (a small number of nearest neighbours k in S or a small radius R in the B neighbourhood). This leads us to the conclusion that generally it is enough to consider only a small neighbourhood instead of the maximal neighbourhood related to the whole training set. This is interesting from the classification perspective, because it suggests that usually only a small number of training examples is relevant for accurate classification. It also illustrates the empirical fact that when using rule-based classifiers one can obtain better results by rejecting some rules instead of using
all minimal rules as the algorithm RIA does. We propose using only the rules that are built on the basis of a neighbourhood of the test case. The fact mentioned above is also the key idea that allowed us to make the original algorithm RIA efficient without loss in classification accuracy. In practice the complexity of learning and classification is only quadratically and linearly dependent on the size of the learning sample, respectively. Although a great effort was put into accelerating the algorithm, we think that further acceleration is possible, for instance by means of more specialised data structures and an approximate choice of nearest examples (see e.g. [10]). The facts that the RIONA and ONN algorithms have similar classification accuracy and that the fraction of objects eliminated by the consistency-checking operation is very small indicate that this operation has rather little influence on the accuracy of the algorithm. This suggests that the k-NN component remains the dominant element of RIONA, and shows that either the construction of local rules should be more general or the operation of consistency checking should be more restrictive. In RIONA the selection of the optimal value of k is performed globally. One possible extension of this approach is to apply a local method for selecting the appropriate value of k (see e.g. [12]). Another interesting topic is the dependence of the average number of training examples on the distance to a test case. Empirically we noticed that this dependence was close to linear, which seemed surprising to us.
Acknowledgements The authors are very grateful to professor Andrzej Skowron for his useful remarks on this presentation. This work was supported by the grants 8 T11C 009 19 and 8 T11C 025 19 from the Polish National Committee for Scientific Research.
References

1. Bazan, J. G. (1998). Discovery of decision rules by matching new objects against data tables. In: L. Polkowski, A. Skowron (eds.), Proceedings of the First International Conference on Rough Sets and Current Trends in Computing (RSCTC-98), pages 521-528, Warsaw, Poland.
2. Blake, C. L., Merz, C. J. (1998). UCI Repository of machine learning databases [http://www.ics.uci.edu/~mlearn/MLRepository.html], Department of Information and Computer Science, Irvine, CA: University of California.
3. Cost, S. and Salzberg, S. (1993). A weighted nearest neighbor algorithm for learning with symbolic features. Machine Learning, 10, pages 57-78.
4. Domingos, P. (1996). Unifying instance-based and rule-based induction. Machine Learning, 24(2), pages 141-168.
5. Duda, R. O. and Hart, P. E. (1973). Pattern classification and scene analysis. New York, NY: Wiley.
6. Golding, A. R., Rosenbloom, P. S. (1991). Improving rule-based systems through case-based reasoning. Proceedings of AAAI-91, pages 22-27, Anaheim, CA.
7. Li, J., Ramamohanarao, K. and Dong, G. (2001). Combining the strength of pattern frequency and distance for classification. The Fifth Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 455-466, Hong Kong.
8. Li, J., Dong, G., Ramamohanarao, K. and Wong, L. (2001). DeEPs: A new instance-based discovery and classification system. [http://sdmc.krdl.org.sg:8080/~limsoon/limsoonpapers.html], School of Computing, National University of Singapore.
9. Michalski, R. S., Mozetic, I., Hong, J. and Lavrac, N. (1986). The Multi-Purpose Incremental Learning System AQ15 and its Testing to Three Medical Domains. Proceedings of AAAI-86, pages 1041-1045, San Mateo: Morgan Kaufmann.
10. Mitchell, T. M. (1997). Machine learning. Portland: McGraw-Hill.
11. Schaffer, C. (1994). A conservation law for generalisation performance. Proceedings of the Twelfth International Conference on Machine Learning, pages 259-265, New Brunswick, NJ: Morgan Kaufmann.
12. Wettschereck, D. (1994). A Study of Distance-Based Machine Learning Algorithms. Doctor of Philosophy dissertation in Computer Science, Oregon State University.
Using Hard Classifiers to Estimate Conditional Class Probabilities

Ole Martin Halck

Norwegian Defence Research Establishment (FFI)
P.O. Box 25, NO-2027 Kjeller, Norway
[email protected]
Abstract. In many classification problems, it is desirable to have estimates of conditional class probabilities rather than just "hard" class predictions. Many algorithms specifically designed for this purpose exist; here, we present a way in which hard classification algorithms may be applied to this problem without modification. The main idea is that by stochastically changing the class labels in the training data in a simple way, a classification algorithm may be used for estimating any contour of the conditional class probability function. The method has been tested on a toy problem and a problem with real-world data; both experiments yielded encouraging results.
1 Introduction
Classification is one of the most studied problems in machine learning research. In the simplest case, the task is to use training data to infer a classifier function c : X → {0,1}, where X is the instance or input space from which data points are taken. The usual measure of the quality of a classifier is the proportion of correct classifications it yields when given data not present in the training set. In the following, a classifier that outputs this kind of binary-valued predictions will be called a hard classifier. Hard classifiers are clearly most useful when there is reason to believe that there is a deterministic relationship between each data point's position in input space and its class. In many problems this is not the case; this may be because all relevant information is not encoded in the input representation, or it may be due to real randomness in the problem domain. In such cases, it is desirable to have an indication of the certainty that a data point belongs to a given class. Learning machines that estimate actual probabilities of class membership of input data are particularly useful in this respect, for instance in problems where the costs of misclassification of positive and negative examples are different. Soft two-class classification algorithms of this kind are essentially regression algorithms where a Bernoulli probability model is assumed. Many methods have been devised for the estimation of conditional class probability functions, ranging from classical logistic regression, via neural networks
and Bayesian methods (e.g. [1]), to recent advances in the field of kernel methods and support vector machines (SVMs) (e.g. [2,3,4,5]). In the latter area, we shall return to the work of Platt [2], who proposed fitting a logistic function to the unthresholded output of an SVM classifier, and of Wahba [3,4], who modified the cost function of a kernel method to obtain probabilistic outputs. These previous approaches have in common that they require implementation (or modification) of methods specifically for conditional class probability estimation. In this paper, we present a novel approach where hard classifiers are used for addressing this task. In our method, only the training data need to be modified – in a simple way – while existing (hard) classification software can be used as-is. Our approach is motivated and presented in Section 2 and subjected to experimental evaluation in Section 3. Section 4 discusses some theoretical and practical issues relating to this method, while Section 5 concludes the paper.
2 Problem and Algorithm Formulation
In this section, we formulate the problem of estimating conditional class probabilities, and relate it to the pure classification problem. This serves as a motivation for our algorithm, which is subsequently described. The presentation is kept at a somewhat informal level; a more thorough examination of the simplifications and assumptions made is given in Section 4.

2.1 Conditional Class Probabilities and Hard Classifiers
In the problem we study here, we assume that each data point x is drawn from an input space X according to the probability distribution p(x) and assigned to class y = 1 with probability p(y = 1 | x) and to class y = 0 otherwise. The task is to estimate the true conditional class probability p(y = 1 | x) as a function of x, based on training data {(x_i, y_i)}_{i=1}^L.

What is the relation between hard classification and this task? The goal in hard classification, assuming equal misclassification costs, is usually to obtain a function c : X → {0,1} that minimizes the expected classification error on unseen data. For a given data point x, the expected error is minimized if the equivalence c(x) = 1 ⇔ p(y = 1 | x) > 1/2 holds. Thus, the classification boundary of a hard classifier can be seen as an estimate of the surface in input space where p(y = 1 | x) = 1/2.

2.2 Motivation for Our Algorithm

The main idea of our algorithm is motivated by the simple relationship described above. There, the boundary p(y = 1 | x) = 1/2 is considered; clearly, if we can find the
126
Ole Martin Halck
boundaries p( y = 1 x) = q for other values of q as well, we are able to estimate p( y = 1 x) for a new example x, by checking its place in the instance space in relation to the probability contours for the various values of q. The idea of our algorithm is to s“ hift” these contours by modifying the training data, so that the surface at which p( y = 1 x) = q in the original data corresponds to the surface p ( y = 1 x) = p ( y = 1 x) =
1 2 1 2
for the new data set. A hard classifier estimating the boundary can then be trained on this data set, and thus give an estimate of the
contour p(y = 1 | x) = q in the original data set. Specifically, consider the (unknown) surface in input space where p(y = 1 | x) has a given constant value q < 1/2. Loosely stated, we may then expect that in our training set, a larger proportion than q of the data points on one side of this surface belong to class 1, while this is true of less than a proportion q of the points on the other side.
Fig. 1. Illustration of the idea of the algorithm. Training data are shown by dots (positive examples) and crosses (negative examples). The grey lines show contours of the true probability function; the dashed lines illustrate how a classifier might draw its decision boundary. a) Original training data; the decision boundary is an estimate of the 0.5 contour. b) Negative examples have been flipped to positives with a probability of 1/3; the decision boundary now estimates the 0.25 contour
Now, assume that we create a new data set from the training set, where x is unchanged for each data point (x, y), while the class y is set to 1 with probability s = (1/2 − q)/(1 − q) regardless of its actual class, and otherwise left unchanged. In a region of input space where the original data set contains a proportion r of positive examples, we may now expect that a proportion of approximately r + (1 − r)s = (r + 1 − 2q)/(2(1 − q)) of the modified data points are positive. This
quantity clearly grows monotonically with r, and is equal to 1/2 if and only if r = q.
Thus, the decision boundary of a hard classifier trained on this modified data set may serve as an estimate of the p(y = 1 | x) = q contour surface in input space. Figure 1 illustrates this relationship in the case when q = 1/4. For q > 1/2, a similar operation, where the class labels are set to 0 with a suitable probability, may of course be performed. In this way, we may estimate any contour surface with 0 < p(y = 1 | x) < 1, and by building a collection of estimates of this kind we are able to approximate p(y = 1 | x) by a discrete function having the resolution we require. The following section describes this method in more detail, along with practical solutions to some problems that arise.
2.3 Description of the Algorithm

Given a set D = {d_i}_{i=1}^L of training data, where d_i = (x_i, y_i), the first steps in using hard classifiers to form an estimate of the conditional class probability function p(y = 1 | x) are as follows:

1. Choose the coarseness of the estimating function. For simplicity, we divide the interval (0, 1) into K equally-sized parts at the points {q_k}, where q_k = k/K for k = 0, 1, ..., K.
2. For each k from 1 to K − 1, estimate the contour surface p(y = 1 | x) = q_k:
   • Make a new data set D^(k) = {(x_i^(k), y_i^(k))}_{i=1}^L from D as follows: for each i from 1 to L, set x_i^(k) ← x_i; if q_k < 1/2, set y_i^(k) ← 1 with probability (1/2 − q_k)/(1 − q_k) and y_i^(k) ← y_i otherwise; if q_k > 1/2, set y_i^(k) ← 0 with probability (q_k − 1/2)/q_k and y_i^(k) ← y_i otherwise.
   • Train a hard classifier c_k on the data set D^(k).

Seen in isolation, the interpretation of a single classifier c_k classifying a data point x as positive is that p(y = 1 | x) is estimated to be larger than q_k. A seemingly obvious way of obtaining a full estimate for p(y = 1 | x) is therefore to find k such that c_i(x) = 1 for i ≤ k and c_i(x) = 0 for i > k, and use (q_k + q_{k+1})/2 as our estimate. Unfortunately, due to the stochasticity of the algorithm, it may (and does) happen that a classifier c_k classifies a point as negative, while another classifier c_{k′}, k′ > k, does not. To address this problem, we use a simple remedy that reduces to the procedure above when the outputs of the classifiers are indeed consistent with each other. The final step in the estimation of the conditional class probability function is then:
3. Estimate p(y = 1 | x) by

   p̂_HC(y = 1 | x) = (1/K) Σ_{k=1}^{K−1} c_k(x) + 1/(2K).    (1)

This function clearly ranges from 1/(2K) to 1 − 1/(2K) in increments of 1/K, and constitutes the output of our algorithm.
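A compact Python sketch of steps 1–3 follows, under the assumption that train(dataset) returns a fitted hard classifier object with a predict(x) method returning 0 or 1; both names are placeholders for whatever classification software is used, not part of the algorithm itself.

    import random

    def fit_probability_estimator(data, K, train):
        # data: list of (x, y) pairs with y in {0, 1}
        classifiers = []
        for k in range(1, K):
            q = k / K
            modified = []
            for x, y in data:
                if q < 0.5 and random.random() < (0.5 - q) / (1 - q):
                    y = 1                      # flip towards the positive class
                elif q > 0.5 and random.random() < (q - 0.5) / q:
                    y = 0                      # flip towards the negative class
                modified.append((x, y))
            classifiers.append(train(modified))   # c_k estimates the q = k/K contour
        return classifiers

    def estimate_probability(classifiers, x):
        # Equation (1): average the K-1 hard votes and centre within a 1/K bin.
        K = len(classifiers) + 1
        return sum(c.predict(x) for c in classifiers) / K + 1 / (2 * K)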
2.4 Related Work

The idea of using modified data sets in training is not in itself new. In the field of ensemble methods, machine-learning algorithms are run repeatedly on different data sets resampled from the original one, and the votes of the resulting classifiers are used as a basis for classification. The proportions of votes could also of course be regarded as estimates of class probabilities, but this probabilistic interpretation is not necessarily well-founded. These methods differ from our approach in that only the selection of chosen data points varies; the class labels themselves are not changed. To our knowledge, the only other algorithm that changes class labels of the training examples is MetaCost [6], which operates in a slightly different context, that of cost-sensitive classification. In MetaCost, an ensemble of classifiers is trained using resampled versions of the training set. The training examples are then relabelled with their optimal classifications, given the chosen misclassification costs and votes from the single classifiers. Any classification algorithm can then be trained on this modified set to provide cost-sensitive classification. Thus, another similarity between MetaCost and the algorithm presented here is that both are able to use any hard classification method as a "black box". We shall return to the link between class probability estimation and cost-sensitive classification in connection with the experiments described in Section 3.2.
3 Experiments
In this section, we present results from testing our algorithm on two problems. First, we apply it to a simple illustrative toy problem; the second test employs a data set from the UCI machine learning repository and compares the results achieved with results reported in previous research.
3.1 Toy Problem Experiment

As an illustration of our approach, we defined a problem with input space (−1/2, 1/2) × (−1/2, 1/2) and class probability function

p(y = 1 | x) = (1 + exp(9 ‖x‖²))^{−1}.

A contour plot of this function is shown in Figure 2a. A training set D of L = 250 data points x was drawn with uniform probability over X and labelled with y values in {0, 1} according to the probability function p(y = 1 | x) – see Figure 2b.
As a basis for comparison, we first estimated p(y = 1 | x) by forming a distance-weighted average function using a simple radial basis function (RBF) algorithm, with a Gaussian kernel function

K(‖x_i − x‖) = (1 / (2πσ)) exp(−‖x_i − x‖² / (2σ²))    (2)

around each data point (x_i, y_i). This yields the estimate

p̂_RBF(y = 1 | x) = Σ_{i=1}^{L} y_i K(‖x_i − x‖) / Σ_{i=1}^{L} K(‖x_i − x‖).    (3)
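A sketch of this estimator in Python with NumPy (the array names are illustrative; the kernel's normalisation constant cancels in the ratio, so it is omitted):

    import numpy as np

    def rbf_probability(x, X_train, y_train, sigma=0.15):
        # Distance-weighted average of the training labels with a Gaussian kernel.
        d2 = np.sum((X_train - x) ** 2, axis=1)
        w = np.exp(-d2 / (2 * sigma ** 2))
        return float(np.dot(w, y_train) / np.sum(w))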
Based on some initial experiments, we set σ = 0.15; Figure 2c shows the resulting contour plot of p̂_RBF(y = 1 | x). Next, we estimated the conditional class probability function using only hard classifiers in combination with the algorithm we have described. We used the classification algorithm that results from thresholding p̂_RBF(y = 1 | x) at 1/2. When run on the original data set D, the result is clearly a hard classifier with a decision boundary following the 0.5 contour in Figure 2c. We partitioned the interval (0, 1) into K = 50 parts, so that q_k = k/50 for k = 0, 1, ..., 50, and ran the algorithm as described in Section 2.3. That is, for each q_k, k = 1, 2, ..., 49, we created a modified data set, trained a hard classifier on this
modified set, and used Equation (1) as an estimate p̂_HC(y = 1 | x) of the conditional class probabilities. Figure 2d shows a contour plot of the results. A comparison to Figure 2c shows that our algorithm yields similar results, apart from being less smooth – this latter property is not surprising, given the algorithm's discrete and stochastic nature. A more quantitative assessment of the performance of the two algorithms can be gained from estimating the expected negative log-likelihood, according to the probability estimate, of a new data point drawn from p(x) and p(y = 1 | x). This quantity was estimated using a uniform grid of 51 × 51 points over the input space in the following way:

−E_LL(p̂) = −(1/51²) Σ_{i=−25}^{25} Σ_{j=−25}^{25} Σ_{c=0}^{1} p(y = c | x = (i/50, j/50)) ln p̂(y = c | x = (i/50, j/50)).    (4)
Table 1. Expected negative log-likelihood of a new data point according to each model

p̂         −E_LL(p̂)
p          0.460
p̂_RBF      0.481
p̂_HC       0.482
Fig. 2. Experiments with the toy problem. a) Contours of the true conditional class probability function. b) Randomly generated training data; dots represent positive and crosses negative examples. c) RBF regression estimate of the probability function. d) Estimate using modified data sets and hard RBF classifiers
The results for p̂_RBF and p̂_HC, as well as for the true function p, are given in Table 1; this confirms that the two algorithms have similar performance.

3.2 Experiment on the UCI Adult Benchmark

As a more challenging test for the algorithm, we used the Adult data set from the UCI machine learning repository [7]. Each data point x here consists of fourteen values (six numeric and eight categorical) taken from a census form for a household; the class label indicates whether the household's yearly income was greater or smaller than USD 50,000. This data set, comprising 32,561 training examples and 16,281 test examples, was used in order to enable direct comparison to the work of Platt [2], in which he presents a method where the outputs of a support vector machine (SVM) are mapped to probabilities by fitting a logistic function. Platt's paper compares the negative log-likelihoods of the test data (held out during training) given the conditional class probability estimates resulting from this approach and from a regularized-likelihood (RL) kernel method [3,4]. We evaluate our algorithm by
comparing its performance to these reported results, using the same hold-out set and criterion. For classification, we downloaded the decision tree classification program CRUISE version 1.09 [8] over the Internet. The algorithm used in CRUISE is described in [9]. The motivations for choosing this software package were that it was a) readily available in executable form for the Windows NT platform, and b) free. The point of requiring an executable classification program was that the experiments would, if successful, support our claim that software for hard classification may be used for conditional class probability estimation without modification. Again, we set K = 50, and for each k made a modified version of the training data set as described above. CRUISE was then run on each of these sets, where, for simplicity, the default values were used for all parameters except one. The exception was the number of cross-validation folds that CRUISE uses for pruning after building the full decision tree; this parameter was set to 2 rather than the default of 10 in order to reduce running time. The performance was evaluated by calculating the negative log-likelihood for the hold-out test data set T according to the resulting estimate of the probability function:

−LL(p̂) = − Σ_{(x_i, y_i) ∈ T} ln p̂(y = y_i | x_i).    (5)
Table 2 shows the results, averaged over five runs, alongside those reported in [2], and also gives the equivalent, but more intuitive, geometric mean of the predicted likelihoods of the test examples:

L(p̂) = exp(LL(p̂) / |T|) = ( ∏_{(x_i, y_i) ∈ T} p̂(y = y_i | x_i) )^{1/|T|}.    (6)
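Both evaluation measures can be computed from any probability estimate with a few lines of Python (a minimal sketch; p_hat(x) is assumed to return the estimated p(y = 1 | x), and pairs is the list of test examples (x_i, y_i)):

    import math

    def neg_log_likelihood(p_hat, pairs):
        # Equation (5): minus the summed log-probability of the observed labels.
        return -sum(math.log(p_hat(x) if y == 1 else 1 - p_hat(x)) for x, y in pairs)

    def geometric_mean_likelihood(p_hat, pairs):
        # Equation (6): the equivalent geometric mean of the predicted likelihoods.
        return math.exp(-neg_log_likelihood(p_hat, pairs) / len(pairs))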
Table 2. Negative log-likelihoods and equivalent geometric means of predicted likelihood for three methods, using the test data in the UCI Adult data set
Algorithm            −LL(p̂)    L(p̂)
SVM + logistic        5323      0.721
RL kernel method      5288      0.723
Hard classifiers      5394      0.718
In this experiment, our algorithm shows slightly poorer performance than the other two methods. However, those methods were specifically designed for this kind of problem, and also employed some preprocessing of the input data representation [2]. We simply applied a ready-made hard classification algorithm – with no data preprocessing and hardly any parameter tuning1 – to stochastically modified data sets. 1
The one exception – decreasing the number of cross-validation folds used by CRUISE for tree pruning – should, if anything, probably worsen the performance of the method.
Seen in this light, we consider our results to be remarkably close to those previously obtained. We have mentioned cost-sensitive classification as one context where estimates of class probabilities are useful. This relationship can also be seen the other way – a set of cost-sensitive classifiers can be used for estimating class probabilities. To see this, consider the case when the cost of classifying a negative example as positive is 1, while the cost of misclassifying positive examples is k. Classifying an example as positive then means that the expected cost of this choice is less than the expected cost of a negative classification, that is, p(y = 0 | x) · 1 is less than p(y = 1 | x) · k, or p(y = 1 | x) > 1/(1 + k). Thus, the contour for the probability level p(y = 1 | x) = q can be estimated by running a cost-sensitive classifier (again as a black box), setting the cost of misclassifying positive examples to k = (1 − q)/q. Since our chosen hard classifier supports cost-sensitive classification, we also compared our algorithm with the results achieved when estimating the same 49 probability levels by this method, using the same way of forming the final probability estimates. The result for the cost-sensitive estimation was −LL(p̂) = 5716, which is clearly inferior to the results achieved by our algorithm.
4 Discussion
In the informal exposition in Section 2, we considered the decision boundary of a hard classifier as an estimate of the contour in input space where p(y = 1 | x) = 1/2. In fact, this view is the basis of the motivation for our algorithm. However, although it is clearly always optimal that a given single data point x is classified as positive if and only if p(y = 1 | x) ≥ 1/2, the assumption that a classifier will show this behaviour over the whole input space does not necessarily hold, even given unlimited training data. The reason for this lies in the inductive bias of any practical classification algorithm – a given algorithm is not able to output an arbitrary classification function c : X → {0,1}, but has a limited hypothesis space from which to choose its classifier. This means that the globally best classifier in this hypothesis space, according to classification accuracy, may not classify according to p everywhere. Platt [2] shows, for example, that the decision boundary of a support vector machine trained on the UCI data set used here does not estimate the p = 1/2 contour well. This means that although we do not need to do any modification to a classification algorithm in order to apply it the way we have described, it is important that we know something about the relationship between its classification properties and class probabilities. This point of view sheds some light on the good results obtained here using the CRUISE software – like most other classification algorithms based on decision trees, CRUISE assigns to each of its leaf nodes the class to which the majority of the node's examples belongs. When we consider that the set of tests leading to a given leaf node in effect describes a region in input space, it becomes
clear that this way of assigning class labels makes the resulting decision boundary approximate the p = 1/2 contour. A natural objection to our approach is that it may seem unappealing to introduce more noise into an already noisy data set by modifying the training examples. Our answer to this is that the notion of "noise" is really not the correct one in this probabilistic setting. Unlike in many regression problems, where the observations of the function to be learnt consist of the true values with an added noise component, the data in our setting are simply the natural realisations of an underlying Bernoulli probability function. Thus, flipping the labels in a consistent way should be seen as merely shifting the probability levels of the function, rather than introducing more noise. These considerations also give a hint about which problems the algorithm is likely to solve well. Being based on shifting the probability levels in the instance space, it thus implicitly assumes that such a probability-based model is natural for the problem at hand. Consequently, the method should perform best if the problem is indeed truly probabilistic in nature. The value of our algorithm lies in the fact that it makes classification algorithms applicable to a new class of problems. This approach may thus be useful if a classification algorithm suited to the problem at hand is readily available, but a conditional probability estimation algorithm is not. If the hypothesis representation of the chosen classification algorithm is easily understandable to humans – as is the case for decision trees, for example – this method also has another advantage, namely that it yields descriptions of the regions of input space where the probability of the positive class is greater than a given level. The main practical disadvantage of our method is that it is somewhat expensive in terms of runtime – for each probability level, a new data set must be created and the classification algorithm run on this set. On the other hand, the user is free to choose an appropriate trade-off between runtime efficiency and the resolution of the estimating function, by selecting the number of probability levels to estimate. Another disadvantage is that it does not naturally generalize to multi-class problems. Of course, the process can be repeated for each class, but it is then unlikely that the estimated probabilities for the classes sum to one. In the special case where the classes have a natural order, however, the process can be run repeatedly by considering the upper n classes as positive examples, with n ranging from 1 to the number of classes minus one, each run yielding a probability function estimate p_n. The probability levels for class m can then be estimated by considering the difference between the probability functions p_m and p_{m−1} [10]. Even in this case, though, consistent estimates are not guaranteed, due to the stochasticity of the algorithm – p_m − p_{m−1} may erroneously be negative in parts of the instance space.
5 Conclusion
We have described a method for obtaining probabilistic estimates of class membership in classification problems, using only hard classifiers. The algorithm works by generating a succession of data sets from the original one. In each of these sets, a proportion of the examples have their class labels flipped in a way that allows the decision boundary of a hard classifier, trained on the modified data, to be
interpreted as an estimate of a given probability contour. By collecting the resulting set of hard classifiers, we may form an estimate of the true conditional class probability function. The algorithm has been tested on a toy problem and a problem from the UCI data set repository, with encouraging results.
References

1. MacKay, D. J. C.: The evidence framework applied to classification networks. Neural Computation 4 (1992) 720–736.
2. Platt, J. C.: Probabilities for SV machines. In: Smola, A. J., Bartlett, P., Schölkopf, B., Schuurmans, D. (eds.): Advances in Large Margin Classifiers, MIT Press (2000) 61–74.
3. Wahba, G.: Multivariate function and operator estimation, based on smoothing splines and reproducing kernels. In: Casdagli, M., Eubank, S. (eds.): Nonlinear Modeling and Forecasting, SFI Studies in the Sciences of Complexity, Proc. Vol. XII, Addison-Wesley (1992) 95–112.
4. Wahba, G.: Support vector machines, reproducing kernel Hilbert spaces and the randomized GACV. In: Schölkopf, B., Burges, C. J. C., Smola, A. J. (eds.): Advances in Kernel Methods – Support Vector Learning, MIT Press (1999) 69–88.
5. Tipping, M.: Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research 1 (2001) 211–244.
6. Domingos, P.: MetaCost: A general method for making classifiers cost-sensitive. In: Proceedings of the 5th International Conference on Knowledge Discovery and Data Mining (KDD-99), ACM Press (1999) 155–164.
7. Blake, C. L., Merz, C. J.: UCI Repository of machine learning databases. Irvine, CA: University of California, Department of Information and Computer Science (1998). URL: http://www.ics.uci.edu/~mlearn/MLRepository.html.
8. Loh, W.-Y.: CRUISE v 1.09 web page. URL: http://www.stat.wisc.edu/~loh/cruise.html.
9. Kim, H., Loh, W.-Y.: Classification trees with unbiased multiway splits. Journal of the American Statistical Association 96 (2001) 589–604.
10. Frank, E., Hall, M.: A simple approach to ordinal classification. In: De Raedt, L., Flach, P. (eds.): Proceedings of the 12th European Conference on Machine Learning (ECML 2001). Lecture Notes in Artificial Intelligence vol. 2167, Springer (2001) 145–156.
Evidence that Incremental Delta-Bar-Delta Is an Attribute-Efficient Linear Learner

Harlan D. Harris

University of Illinois at Urbana-Champaign, Department of Computer Science
MC-258, Urbana, IL 61801 USA
[email protected]
Abstract. The Winnow class of on-line linear learning algorithms [10,11] was designed to be attribute-efficient. When learning with many irrelevant attributes, Winnow makes a number of errors that is only logarithmic in the number of total attributes, compared to the Perceptron algorithm, which makes a nearly linear number of errors. This paper presents data that argues that the Incremental Delta-Bar-Delta (IDBD) second-order gradient-descent algorithm [14] is attribute-efficient, performs similarly to Winnow on tasks with many irrelevant attributes, and also does better than Winnow on a task where Winnow does poorly. Preliminary analysis supports this empirical claim by showing that IDBD, like Winnow and other attribute-efficient algorithms, and unlike the Perceptron algorithm, has weights that can grow exponentially quickly. By virtue of its more flexible approach to weight updates, however, IDBD may be a more practically useful learning algorithm than Winnow.
1 Introduction
Linear learning algorithms make predictions by computing linear functions of their inputs. Since linear learners aren't capable of representing or learning nonlinear concepts, practical use often requires either layering multiple linear learners and using a backpropagation algorithm, or generating an expanded feature space. Expanded feature spaces typically involve generating combinations of the original features, resulting in very large numbers of attributes irrelevant to the concept to be learned. The two algorithms most studied for use with this approach have been the Perceptron algorithm and Winnow [10,8,12,7]. Some learning domains such as computer vision and natural language processing naturally provide very large feature spaces with many irrelevant attributes, even without an expanded feature space. In this paper, I provide evidence that the Incremental Delta-Bar-Delta (IDBD) algorithm [14] combines the attribute-efficient properties of Winnow with additional robustness and flexibility, and is particularly useful for learning when many attributes may be irrelevant. In the on-line learning framework, the learner repeatedly performs a prediction task and receives a supervised training signal. Examples x are selected from an instance space, and are labeled by a concept c in a concept space. For each trial, the learner is given the example, predicts the example's label, e.g.
p ∈ {−1, 1}, and then receives the true label, e.g. y ∈ {−1, 1}, where y = c(x). The goal of an on-line learner is to minimize the total number of mistakes (y ≠ p) made in the prediction task while learning concept c. For a linear learner, the concept space is linear functions, as represented by a linear threshold unit, or Perceptron. Given an input vector x of width n, a Perceptron computes the function p = sign(w · x), where w is the weight vector of the Perceptron. Since the hyperplane defined by the weight vector alone always passes through the origin, Perceptrons frequently include a fixed or trainable bias weight. In order to learn non-linearly-separable functions, a common approach is to generate conjunctions of the inputs, either explicitly [12], or using kernel functions [2]. Although arbitrary DNF expressions can then be represented, linear functions being adequate to represent disjunctions of these conjunctions, several issues remain. There is an exponential number of conjunctions of the n input features, requiring exponential time and space to process. Use of kernel functions presents other problems, particularly in domains where the input space is naturally large and mostly irrelevant [7]. In perhaps the most important problem for on-line learning, the number of examples needed to learn may increase linearly with the size of the expanded input space. An attribute-efficient algorithm is one in which the number of required examples increases only logarithmically in the number of irrelevant features. In an engineering setting, these expanded feature space approaches have been used most commonly in natural language processing (e.g., [12]), in which the presence or absence of particular words or word combinations leads to very large numbers of sparse, irrelevant features. Valiant [15] motivates this type of expanded-basis learning by reviewing biological systems, which are able to learn quickly in settings with very large numbers of interconnected neurons. For example, in the cerebellum, the neuronal architecture closely resembles the expanded feature space and linear learner paradigm [5,13].
2 Algorithms

2.1 Perceptron Learning Rule
The Perceptron learning rule trains a linear threshold unit (Perceptron) by adding a fraction of each mis-predicted input vector to the weight vector:

w_i ← w_i + η(y − p)x_i,    (1)
where η ∈ (0, 1) is a learning rate parameter. The Perceptron rule performs incremental gradient descent in weight space. Variations include the Least-Mean-Square (LMS) rule (for non-thresholded linear units) and backpropagation (for multi-layer networks).
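A minimal sketch of one Perceptron training step in Python, with labels in {−1, 1} as above (the function and variable names are illustrative):

    def perceptron_update(w, x, y, eta=0.1):
        # Predict with the current weights, then apply Equation (1) on a mistake.
        p = 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1
        if p != y:
            w = [wi + eta * (y - p) * xi for wi, xi in zip(w, x)]
        return w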
2.2 Winnow
The Winnow algorithms were introduced by Littlestone [10,11] as linear learners that separate (winnow) relevant from irrelevant attributes. Winnow was designed
to efficiently learn monotone (non-negative) disjunctions and r-of-k threshold functions, and many of its proofs and applications have been in those domains. There are a number of variations of Winnow, optimized for different domains and with varying notation. The Winnow2 version of the algorithm, shown below, is typically initialized with all weights set to 1 or 1/n. The threshold is often set to n, giving a ratio of threshold to initial weights of n:1 or n²:1.

wi ← α^xi · wi      if p = −1 ∧ y = 1,
wi ← (1/α)^xi · wi  if p = 1 ∧ y = −1,    (2)
wi ← wi             otherwise,

where xi ∈ {0, 1}. The Balanced Winnow variation learns non-monotone functions. Winnow works by forcing weights on irrelevant features towards zero exponentially quickly for false positives, and by raising weights on relevant features exponentially quickly for false negatives. The use of a multiplicative rather than additive update rule means that fewer changes are needed for convergence. The mistake bounds for the Winnow and Perceptron algorithms have been closely compared [8]. When learning monotone disjunctions of r literals, Winnow makes no more than O(r log n) mistakes, where n is the total number of attributes, and is thus attribute-efficient. In contrast, when learning the same class of functions, the Perceptron learning rule makes a nearly linear (in n) number of mistakes. If n is much larger than r, then Winnow performs much better, while if n is not much larger than r, the Perceptron algorithm does better, particularly if few features are active (non-zero) at any one time.
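A small sketch of one Winnow2 step following equation (2); the function name and the default threshold (n, one of the settings mentioned above) are illustrative assumptions rather than a fixed prescription:

import numpy as np

def winnow2_update(w, x, y, alpha=1.1, theta=None):
    """One Winnow2 step (cf. Eq. 2) for x_i in {0, 1} and labels in {-1, +1}."""
    theta = len(w) if theta is None else theta     # threshold, e.g. n
    p = 1 if np.dot(w, x) >= theta else -1
    if p == -1 and y == 1:                         # false negative: promote active weights
        w = w * (alpha ** x)
    elif p == 1 and y == -1:                       # false positive: demote active weights
        w = w * ((1.0 / alpha) ** x)
    return w, p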
2.3 IDBD
Sutton’s IDBD (Incremental Delta-Bar-Delta) algorithm [14] is a variation on Jacobs’ Delta-Bar-Delta algorithm [6]. Delta-Bar-Delta, in turn, is a variation on the LMS learning rule (and also on backpropagation) that includes heuristics designed to speed up learning, most notably the inclusion of per-weight learning rates. The incremental version, IDBD, is additionally suitable for on-line learning and for learning non-stationary concepts. (Unlike DBD, IDBD is restricted to learning weights for linear units, and no backpropagation variant has apparently been derived. Note also that IDBD was originally derived to learn linear functions, with real-valued inputs and non-thresholded outputs, but I am using it here to learn weights for a linear-threshold unit and Boolean inputs.) As shown below, the particular way that IDBD accelerates learning seems to put it in the same category of attribute-efficient learning algorithms as Winnow. IDBD uses an additive weight update rule with a per-unit modifiable learning rate, rather than a shared, fixed learning rate as in the Perceptron rule. The algorithm can be seen to be performing gradient descent in learning-rate space, as well as in weight-space [14]. Intuitively, if a weight repeatedly changes in the same direction, then the learning rate is increased, since this input’s weight appears not to have converged (it is either too high or too low). If the weight
changes appear to be random, then the learning rate is decreased, since this input’s weight appears to have converged, and the weight is oscillating around the correct value. The update rules are as follows:

βi = βi + θ hi xi (y − p)    (3)
ηi = e^βi    (4)
wi = wi + ηi xi (y − p)    (5)
hi = hi [1 − ηi xi²]+ + ηi xi (y − p)    (6)
Note the per-input learning rate, ηi , in equation 5. Also, observe that β is updated using the Perceptron rule, using θ as a learning rate, with the addition of hi , which represents the recent summed history of the weight’s changes. The [·]+ notation is equivalent to max(·, 0). An interesting aspect of IDBD is that its update rules are defined recursively; β depends on h, which depends on β via η. Sutton [14] gives experimental results showing that IDBD has lower cumulative error on a simple non-stationary tracking task (where the concept periodically changes) than does the LMS rule, is able to distinguish relevant from irrelevant inputs, and weights those inputs approximately optimally. However, he does not address the issue of attribute-efficiency, and no further theoretical or empirical examination of IDBD has been performed to my knowledge.
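The following sketch applies one IDBD step following equations (3)–(6), used, as in this paper, to train a linear threshold unit on Boolean inputs. The function name, the NumPy representation, and the default meta learning rate are assumptions for illustration; the vectors β and h must be initialized by the caller (e.g. to small constants and zeros, respectively).

import numpy as np

def idbd_update(w, beta, h, x, y, theta=0.1):
    """One IDBD step (cf. Eqs. 3-6) for a linear threshold unit."""
    p = 1 if np.dot(w, x) >= 0 else -1
    err = y - p
    beta = beta + theta * err * h * x                              # Eq. 3: per-weight log learning rates
    eta = np.exp(beta)                                             # Eq. 4
    w = w + eta * err * x                                          # Eq. 5
    h = h * np.clip(1.0 - eta * x * x, 0.0, None) + eta * err * x  # Eq. 6: decayed trace of changes
    return w, beta, h, p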
3 Experiments
In this section I present new experimental evidence that compares IDBD to Winnow and the Perceptron algorithm. Experiment 1 tests these algorithms on the l-of-m-of-n threshold function, and shows that IDBD usually performs comparably to Winnow, and much better than the Perceptron learning rule, when learning disjunctive concepts with many irrelevant attributes. Experiment 2 systematically varies the complexity parameter n, with results suggesting that IDBD’s mistake bounds grow logarithmically, like Winnow, and unlike the Perceptron algorithm. Experiment 2b uses the same domain to illustrate IDBD’s low need for parameter tuning. Experiment 3 then looks at learning of random linear functions with few irrelevant attributes, and shows that IDBD performs well even in circumstances when Winnow does poorly, which reinforces the idea that IDBD is not merely a reimplementation of Winnow. (In an earlier experiment, IDBD was shown to be useful for learning complex DNF concepts with incomplete expanded feature spaces [5].) For all results reported below, 10 replications were performed with different random number seeds used to generate the data, and the results were averaged. Initial weights for all three algorithms were set to 1/n, and the thresholds were n/2.
Fig. 1. Cumulative errors after 10,000 examples in an l-of-100-of-1000 learning task. Mean of ten replications, with 95% confidence intervals. Winnow performs best when l = 1 (disjunction) or l = 100 (conjunction), but IDBD performs similarly or better for intermediate values of l
3.1 Experiment 1: Comparing IDBD to Winnow with Irrelevant Attributes
This experiment was designed to compare IDBD to Winnow and the Perceptron algorithm on a task covered by the theoretical results of Kivinen et al. [8]. When learning a concept where the Perceptron rule should make a nearly linear (in irrelevant attributes) number of errors, and Winnow should make a logarithmic (in irrelevant attributes) number of errors, how does IDBD empirically perform? To test this, the algorithms were compared on l-of-m-of-n Boolean concepts. These concepts are defined so that if l or more of the first m attributes are set to 1, the example is positive, and if fewer than l of the first m attributes are set to 1, the example is negative. Data was generated as follows. Each input vector was of width n. The first m attributes were considered “relevant,” while the remaining n − m attributes were “irrelevant,” and were set to 1 with a probability of 0.25. The examples were half positive and half negative. For positive examples, a number r in the interval [l, m] was chosen, while for negative examples, r was in the interval [0, l − 1]. Then, r attributes in the first m were randomly selected and were set to 1, with the other relevant attributes set to 0. 10,000 examples were generated and presented to the algorithms. Based on pilot experiments, the learning rates of the Perceptron rule (η) and IDBD (θ) were set to 0.1, and the learning rate of Winnow (α) was set to 1.1.
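The example generation just described can be sketched as follows; the function name, RNG handling and return convention are illustrative assumptions, not taken from the paper:

import numpy as np

def lmn_example(l, m, n, rng, p_irrelevant=0.25):
    """Draw one example of the l-of-m-of-n concept described above."""
    x = np.zeros(n, dtype=int)
    x[m:] = rng.random(n - m) < p_irrelevant       # irrelevant attributes active with prob. 0.25
    if rng.random() < 0.5:                          # positive example: r in [l, m]
        r, y = rng.integers(l, m + 1), 1
    else:                                           # negative example: r in [0, l-1]
        r, y = rng.integers(0, l), -1
    x[rng.choice(m, size=r, replace=False)] = 1     # r active relevant attributes
    return x, y

# e.g. x, y = lmn_example(10, 100, 1000, np.random.default_rng(0))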
Table 1. Error rate in final 1000 examples of an l-of-100-of-1000 learning task. Mean of ten replications

l            1     10    25    50    75    100
Perceptron   6.8   5.0   3.8   3.0   2.9   4.8
Winnow       1.6   1.0   1.1   1.4   2.6   0.0
IDBD         4.1   2.7   2.1   1.7   2.1   3.2
Moderate changes to these rates do not qualitatively change the results (see also Experiment 2b). Each algorithm was run with various values of l, with m = 100 and n = 1000. The number of errors after presentation of 10,000 examples is shown in Figure 1. The error rate for the last 1000 examples is shown in Table 1. IDBD performed uniformly much better than Perceptron, with fewer cumulative errors and lower final error rates. Compared to Winnow, IDBD usually made a similar number of cumulative errors, but often with a somewhat higher residual error rate.
3.2 Experiment 2: Showing Attribute-Efficient Learning by IDBD
The theoretical results predict that as n grows, the Perceptron algorithm should make errors that increase nearly linearly, while Winnow’s mistakes should increase only logarithmically. If IDBD is attribute-efficient, its results should be similar to Winnow’s. To test this, l and m were set to 10 and 100, respectively, and n was varied between 100 and 1000. Instead of running for a fixed number of examples, each trial was terminated when 200 examples in a row were classified correctly (i.e., the concept had been effectively learned). The learning rates for the Perceptron rule (η), IDBD (θ), and Winnow (α) were set to 0.8, 0.1, and 1.4, respectively (see below). This test shows precisely how well each algorithm scales up with irrelevant attributes, and should illustrate the theoretical predictions discussed above. The results are shown in Figure 2. The Perceptron algorithm, as expected, did much more poorly than Winnow, and its mistake measures increased sharply with n. Winnow made only slightly more errors as n increased, and the curve appears logarithmic. Most interestingly, IDBD shares Winnow’s property of making few additional errors as n increases. Like Winnow, IDBD’s attribute-efficiency curve seems logarithmic, and is qualitatively different from the Perceptron algorithm. These results strongly suggest that IDBD shares the irrelevant-attribute efficiency that Winnow is known for.
Fig. 2. Cumulative errors at convergence (200 correct predictions in a row) for a 10-of-100-of-n learning task. Mean of ten replications, with 95% confidence intervals. IDBD and Winnow show attribute efficiency, unlike the Perceptron algorithm
3.3 Experiment 2b: Showing IDBD Is Relatively Insensitive to Its Parameters
To more carefully explore the strengths and weaknesses of these algorithms, and to replicate one of Sutton’s conclusions about IDBD [14], the same domain was used for an exploration of the learning rate parameter space. As before, the target concepts are 10-of-100-of-n threshold functions. The learning rates tested were 0.001, 0.01, 0.05, 0.1, 0.2, 0.4, and 0.8. (For Winnow, α was set to 1 plus the above numbers.) The variable n took the values 100, 500, and 1000. As before, 10 replications were performed, and the cumulative errors before convergence (200 correct predictions in a row) were counted. The results are shown in Table 2. Clearly, the IDBD algorithm was relatively insensitive to small values of the learning rate, as compared with Winnow and the Perceptron algorithm. However, it was susceptible to convergence failure when the learning rate was very high. Again, note the sharp increase in errors between n = 500 and n = 1000 for the Perceptron algorithm, compared to the small increases for Winnow and IDBD.
3.4 Experiment 3: Showing IDBD Does Well when Winnow Does Poorly
The same theoretical results that predict that Winnow should learn with fewer mistakes when irrelevant attributes are predominant also predict that the Perceptron learning rule should learn with fewer mistakes when all (or nearly all) attributes are relevant. In addition, since Winnow is optimized for disjunctions, it’s reasonable to expect that it will do relatively poorly when learning arbitrary weights. By virtue of its fixed, multiplicative weight update rule, Winnow may find it difficult to set weights with the precision needed for arbitrary, small-margin concepts. Weights can oscillate around a target value, their step size too large to approach it. To test this intuition, the three algorithms were compared on randomly-generated linear threshold functions. These functions are of the form c(x) = w̃ · x, where the elements of w̃ were randomly generated, independent real-valued numbers. The data was generated as follows. For each concept, target weights were selected from a uniform distribution over the interval [0, 10]. Then, 20,000 examples of each concept were generated by randomly setting Boolean inputs with probability 0.5, multiplying by the target concept’s weights, and comparing the result to a threshold equal to the expected value, 10 · n/4. Approximately half of the resulting examples were thus positive, and half were negative. Note that weights near 10 are maximally relevant, while those near zero are essentially irrelevant. Unlike the previous experiments, there is a continuum of relevance, with nearly all weights being somewhat relevant to the target concept. In this experiment, we used learning rates of 0.01 for the Perceptron rule, 0.2 for IDBD, and 1.01 for Winnow, based on informal pilot experiments. The results, for n = 200 (other values of n were similar), can be seen in Figure 3. As expected, the Perceptron algorithm performed better than Winnow.
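A sketch of the random linear threshold concepts described above; the function name and RNG handling are illustrative assumptions:

import numpy as np

def random_linear_concept_example(n, weights, rng):
    """One example of a random linear threshold concept with weights in [0, 10]."""
    x = (rng.random(n) < 0.5).astype(int)        # Boolean inputs, each 1 with prob. 0.5
    threshold = 10 * n / 4.0                     # expected value of weights . x
    y = 1 if np.dot(weights, x) >= threshold else -1
    return x, y

# weights = np.random.default_rng(0).uniform(0, 10, size=200)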
Table 2. Cumulative errors to convergence on a 10-of-100-of-n task, varying learning rates. For each algorithm, * marks the learning rate with the lowest error count at n = 1000. NC means that the algorithm did not always converge even after 100,000 examples

Alg.     n      .001    .01    .05    .1     .2     .4     .8
Percep.  100    11645   1671   382    216    115    63     37
         500    12495   2889   1215   823    434    413    330
         1000   14608   4800   1415   1018   635    562    550*
Winnow   100    15145   1667   358    190    115    107    192
         500    20442   2126   445    240    153    132    278
         1000   22762   2390   502    261    185    164*   480
IDBD     100    380     365    289    223    171    118    180
         500    1185    910    421    358    211    NC     NC
         1000   1374    1097   553    352*   NC     NC     NC
The IDBD algorithm showed performance similar to, and slightly but significantly better than, that of the Perceptron learning rule for this domain.
4 Discussion
These empirical results suggest that IDBD is an attribute-efficient learner, with a logarithmic attribute-efficiency curve. IDBD makes mistakes at rates only somewhat higher than Winnow when irrelevant attributes are plentiful, but significantly lower than Winnow when all attributes are relevant. By being both attribute-efficient and flexible, IDBD should be particularly useful when the number of irrelevant attributes is unknown prior to learning. For example, robotic multi-modal sensory systems, computational linguistic applications [12], and neurological modeling applications [5,15] often naturally have extensive irrelevant attributes and expanded feature spaces. Other linear learning systems with attribute-efficient properties have been mentioned in the literature. The p-norm family of Generalized Perceptron¹ algorithms [4] is able to approximate the Perceptron and Winnow algorithms by choices of the p parameter. For a sufficiently large p, the p-norm algorithm is known to be attribute-efficient. The ALMAp algorithm [3] combines a p-norm-like computation with a decaying learning rate, a specifiable margin, and nor-
¹ The Generalized Perceptron defines Quasi-Additive algorithms by a vector z, updated exactly like the Perceptron, and a function f(z) = w specifying the weight vector as a function of z. The Perceptron, Balanced Winnow, and p-norm algorithms may be defined by f(z) = z, sinh(z), and sign(z) p |z|^(p−1), respectively.
Fig. 3. Cumulative errors on 200-attribute random linear threshold concepts. Average of ten replications, with 95% confidence intervals
malized weight vectors. Although it has not been shown to be attribute-efficient, it seems likely to be so. IDBD has similarities with these two algorithms, but differences and advantages as well. Like p-norm and Winnow, and unlike ALMAp, IDBD’s weights are unbounded, and by having increasing learning rates², can grow exponentially fast. Like ALMAp, but unlike p-norm and Winnow, IDBD uses adjustable learning rates which decrease as the learner converges. However, IDBD has per-weight learning rates, while ALMAp has only a single dynamic learning rate. A practical advantage in real-world learning situations is that IDBD has only a single parameter, the learning rate θ, compared to p-norm with two and ALMAp with three. (IDBD has more variables requiring initial conditions, however.) IDBD is relatively insensitive to the settings of its learning rate parameter [14], allowing IDBD to be used with more confidence given less knowledge of the target concept than other algorithms. A final advantage is that IDBD was designed to be capable of learning drifting concepts, while other attribute-efficient learners may deal poorly with concept change. (But see [1] for a variation on Winnow which does learn drifting concepts well.)
4.1 Analytical Support
Several analytical approaches can be used to complement the empirically-supported assertion that IDBD is attribute-efficient. In this section, I investigate the relative speed of weight changes in various algorithms, and argue that exponential weight increases are sufficient for attribute-efficiency. Then, I show that a variation of IDBD is very similar to the Generalized Perceptron formulation of an algorithm that is known to be attribute-efficient. IDBD’s attribute-efficiency seems to be due to the algorithm’s ability to increase weights exponentially fast on relevant attributes, relative to the other weights. With exponentially fast increases, relevant weights outweigh irrelevant weights quickly enough that a number of errors only logarithmic in the number of irrelevant weights need be made [10]. Consider a simple case of the weight on a relevant Boolean input, where the learner makes repeated false negative predictions. That is, for simplicity, let x = 1 and y − p = 1 (actually y − p = 2, but we can use 1 by doubling the learning rate). How does w change over time? For the Perceptron algorithm in this case, w ← w + η, and w(n) = Σ_{i=1..n} η = O(n). The Perceptron algorithm only increases weights linearly quickly, and it is not attribute-efficient. For Winnow under these assumptions, w ← αw, and w(n) = w(0) α^n = O(α^n). Clearly, Winnow increases weights exponentially quickly, and it is attribute-efficient. Before examining IDBD, let’s first examine a simple second-order gradient-descent relative, in which η = β and h = 1. Then, (using T instead of θ so as not to be confused with the θ(·) of asymptotic notation) we have β ← β + T
² Winnow can be rewritten as an additive learning rule with a per-weight learning rate that is equal to a function of the weight itself.
Table 3. IDBD-h algorithm

             prediction   label   update (if xi = 1)
demotion     1            −1      wi ← wi − e^zi,  zi ← zi − η
promotion    −1           1       zi ← zi + η,  wi ← wi + e^zi
and w ← w + β. Therefore, β(n) = Σ_{i=1..n} T = O(n), and w(n) = Σ_{i=1..n} β(i) = Σ_{i=1..n} Σ_{j=1..i} T = O(n²). This algorithm does not grow weights exponentially, and thus could not be expected to show the logarithmic attribute-efficient behavior of Winnow³. Adding η = e^β back to IDBD, but keeping h = 1, we now have that w ← w + e^β, and w(n) = Σ_{i=1..n} e^{O(n)} = O(e^n). This algorithm, without the decaying sum of recent weight changes, can increase weights exponentially, and thus should be attribute-efficient. It is rather similar to Winnow, in that the effective learning rate is related to all previous weight changes. Re-adding the h of equation 6 to complete IDBD results in an attribute-efficient algorithm that modulates the exponential in the exponentially-increasing learning rate by the extent to which the recent weight changes have been in the same direction. The result is more flexible than Winnow, yet still appears attribute-efficient.

With some modifications, IDBD can be shown to be similar to the Weighted Majority algorithm [11,9]. To see this, first modify IDBD by separating the promotion (false negative) and demotion (false positive) cases, as in the original presentation of Winnow [10]. Then, let hi = 1 as above, and assume that xi ∈ {0, 1}. We then have the algorithm shown in Table 3, which I’ll call IDBD-h. Note that by reversing the order of wi and zi updates in demotions, a sequence of promotions and demotions can be re-ordered arbitrarily and will result in the same final weight vectors. Assuming the initial values of z and w are zero, the weight vector w can be written as a function of the current value of z: wi = Σ_{j=1..zi/η} e^{jη} = (e^η / (e^η − 1)) (e^zi − 1). Note that the ratio e^η / (e^η − 1) is a constant, and thus can be shifted to the bias weight, if there is one, or ignored otherwise. We therefore have a Generalized Perceptron with f(z) = e^z − 1. This is very similar to the Weighted Majority algorithm [9], which can be defined as a G.P. with f(z) = e^z (plus simple transformations of the input representation and learning rate) [4]. Clearly, the IDBD-h algorithm should have mistake bounds that are similar to those of the attribute-efficient (see [11]) Weighted Majority. Unfortunately, direct analysis of IDBD-h has so far failed, as the methods for finding mistake bounds for Generalized Perceptron can’t be easily applied to the f(z) above, and other approaches have yet to be successful.
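For concreteness, one step of the simplified IDBD-h variant of Table 3 can be sketched as follows; the function name and argument layout are illustrative assumptions, and the per-row order of the w and z updates follows the table:

import math

def idbd_h_update(w, z, x, prediction, label, eta=0.1):
    """One IDBD-h promotion/demotion step (cf. Table 3), for x_i in {0, 1}."""
    for i, xi in enumerate(x):
        if xi != 1:
            continue
        if prediction == 1 and label == -1:      # demotion: w first, then z
            w[i] -= math.exp(z[i])
            z[i] -= eta
        elif prediction == -1 and label == 1:    # promotion: z first, then w
            z[i] += eta
            w[i] += math.exp(z[i])
    return w, z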
³ It’s interesting to consider whether this algorithm might have O(√n) mistake bounds. In fact, work in progress shows that a closely related algorithm does.
However, this analysis of a simplified version of IDBD provides further evidence for the notion of IDBD’s attribute-efficiency.
4.2 Conclusions
This paper has provided an empirical basis for believing that the IDBD algorithm has attribute-efficient properties. In addition, it was shown that IDBD has strengths relative to more traditional attribute-efficient learners such as Winnow, and that analytical approaches to establishing IDBD’s attribute efficiency show promise. Future work will analytically explore the space of attribute-efficient linear learners, including IDBD and related algorithms. By more clearly identifying the relationships between attribute-efficient algorithms, and by defining the functional features of each, it will become easier to explain experimental results, and to apply these algorithms to real-world problems.
Acknowledgments I wish to thank Dan Roth, Gary Dell, Jerry DeJong, Sylvian Ray, Jesse Reichler, Dav Zimak, Ashutosh Garg, and several anonymous reviewers for their helpful suggestions on this and earlier versions of this paper. This work was in part supported by NSF grant SBR-98-73450 and NIH grant DC-00191.
References 1. P. Auer and M. K. Warmuth. Tracking the best disjunction. Machine Learning, 32:127–150, 1998. 144 2. N. Christiani and J. Shawe-Taylor. An Introduction to Support Vector Machines. Cambridge University Press, 2000. 136 3. Claudio Gentile. A new approximate maximial margin classification algorithm. Journal of Machine Learning Research, 2:213–242, December 2001. 143 4. Adam J. Grove, Nick Littlestone, and Dale Schurrmans. General convegence results for linear discriminant updates. Machine Learning, 43(3):173–210, 2001. 143, 145 5. Harlan D. Harris and Jesse A. Reichler. Learning in the cerebellum with sparse conjunctions and linear separator algorithms. In Proceedings of the International Joint Conference on Neural Networks 2001, 2001. 136, 138, 143 6. Robert A. Jacobs. Increased rates of convergence through learning rate adaptation. Neural Networks, 1:295–307, 1988. 137 7. Roni Khardon, Dan Roth, and Rocco Servedio. Efficiency versus convergence of boolean kernels for on-line learning algorithms. In Proceedings of Neural Information Processing Systems 2001, 2001. 135, 136 8. J. Kivinen, M. K. Warmuth, and P. Auer. The Perceptron algorithm versus Winnow: Linear versus logarithmic mistake bounds when few input variables are relevant. Artificial Intelligence, 97:325–343, 1997. 135, 137, 139 9. N. Littlestone and M. K. Warmuth. The weighted majority algorithm. Information and Computation, 108:212–261, 1994. 145 10. Nick Littlestone. Learning quickly when irrelevant attributes abound: A new linearthreshold algorithm. Machine Learning, 2:285–318, 1988. 135, 136, 144, 145
11. Nick Littlestone. Mistake bounds and logarithmic linear-threshold learning algorithms. PhD thesis, University of California, Santa Cruz, Technical Report UCSCCRL-89-11, March 1989. 135, 136, 145 12. Dan Roth. Learning to resolve natural language ambiguities: A unified approach. In Proceedings of AAAI-98, 15th Conference of the American Association for Artificial Intelligence, pages 806–813, 1998. 135, 136, 143 13. Nicolas Schweighofer and Michael A. Arbib. A model of cerebellar metaplasticity. Learning and Memory, 4:421–428, 1998. 136 14. Richard S. Sutton. Adapting bias by gradient descent: An incremental version of Delta-Bar-Delta. In Proceedings of the Tenth National Conference on Artificial Intelligence, pages 171–176. MIT Press, 1992. 135, 137, 138, 141, 144 15. Leslie G. Valiant. Projection learning. Machine Learning, 37:115–130, 1999. 136, 143
Scaling Boosting by Margin-Based Inclusion of Features and Relations Susanne Hoche and Stefan Wrobel Otto-von-Guericke University, Magdeburg, Germany {hoche,wrobel}@iws.cs.uni-magdeburg.de
Abstract. Boosting is well known to increase the accuracy of propositional and multi-relational classification learners. However, the base learner’s efficiency vitally determines boosting’s efficiency since the complexity of the underlying learner is amplified by iterated calls of the learner in the boosting framework. The idea of restricting the learner to smaller feature subsets in order to increase efficiency is widely used. Surprisingly, little attention has been paid so far to exploiting characteristics of boosting itself to include features based on the current learning progress. In this paper, we show that the dynamics inherent to boosting offer ideal means to maximize the efficiency of the learning process. We describe how to utilize the training examples’ margins - which are known to be maximized by boosting - to reduce learning times without a deterioration of the learning quality. We suggest to stepwise include features in the learning process in response to a slowdown in the improvement of the margins. Experimental results show that this approach significantly reduces the learning time while maintaining or even improving the predictive accuracy of the underlying fully equipped learner.
1 Introduction
Boosting is a method for enhancing learning algorithms by basing predictions on a group of specialized hypotheses. Instead of searching for one highly accurate prediction rule covering a given set of training examples, an ensemble of rules is constructed by repeatedly calling a base learner with a changing distribution of weights for the training examples. Each rule in the ensemble might cover only a small subset of the examples, and all predictions are combined into one accurate joint prediction. Boosting is a popular technique for increasing the accuracy of classification learners and has been developed into practical algorithms that have demonstrated superior performance on a broad range of application problems in both propositional and multi-relational domains [3,19,17,5,7]. However, the iterative nature of boosting implies an amplification of the underlying learner’s complexity. Boosting’s efficiency is vitally determined by the base learner’s efficiency. A standard approach to deal with the issue of efficiency in the presence of large feature sets would be to use a feature selection method in an a priori fashion, and then run boosting with the small selected feature subset. However, deciding a priori on the number of features to be included in
the learning process might lead to inferior results since it is often difficult to decide just how many features to include. If too many features are included the learner is unnecessarily slow, if too few features are included the learning result might not be sufficiently accurate. Instead, in this paper we suggest to actively determine the right balance between speed and accuracy of a learner based on its learning progress. We propose to monitor the learning success in terms of the development of the training examples’ mean margins - which are known to be maximized by boosting - and to present step-by-step promising features to the learner whenever the improvement of the margins drops below a certain threshold. The margins’ improvement is measured by the ratio of the mean margins’ gradients averaged over several iterations and the current gradient of the training examples’ mean margins. This ratio increases from one iteration to the next as long as the margins increase significantly. As soon as the ratio starts to decrease, an estimate of the slowdown in the margins’ improvements is determined. This estimate predicts the expected decrease of the ratio and is used to determine when to provide a new feature to the learner. Whenever the actual decrease of the ratio is exceeding the predicted decrease by a certain factor, a new feature is included in the learning process. To this end, all features present in the training examples are initially sorted according to their mutual information [25,14] with the examples’ class, and new features are provided to the learner in a demand-driven fashion, starting with the top two features and the relations in which they occur. The evaluation of our approach on various domains shows that our approach significantly reduces learning times while maintaining or even improving predictive accuracy. Although our learner is multi-relational, our experiments indicate that the results apply equally well to boosting in propositional domains. This paper is organized as follows. In section 2, we review boosting. In section 3, we present our approach to include features and relations into the learning process in a demand-driven fashion. Our experimental evaluation of the approach is described and discussed in section 4. In section 5, we discuss related work and conclude in section 6 with some pointers to future work.
2 Boosting
Boosting is a method for improving the predictive accuracy of a learning system by means of combining a set of base classifiers constructed by a base learner into one single hypothesis [22,20,19]. The idea is to “boost” a weak learner performing slightly better than random guessing into an arbitrarily accurate learner by repeatedly calling the learner on varying probability distributions over the training instances. The probability distribution models the weight associated with each training instance and indicates the influence of an instance when building a base classifier. Initially, all instances have equal influence on the construction of a base hypothesis, i.e. the probability distribution is uniform. In each iterative call of the learner, a base hypothesis is learned with a prediction confidence for each example. The weights of misclassified instances are increased and those of cor-
rectly classified instances are decreased according to the confidence of the learned base hypothesis. Thus, correctly classified instances have less and misclassified instances have more influence on the construction of the base hypothesis in the next iteration. That way, in each new round of boosting the learner is confronted with a modified learning task and forced to focus on the examples which have not yet been correctly classified. Finally, all base hypotheses learned are combined into one strong hypothesis. An instance x is classified by the strong hypothesis by adding up the prediction confidence of each base hypothesis covering x, and classifying x according to the sign of this sum.
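As a generic illustration of this confidence-rated scheme (not the specific C2RIB procedure of the next section), the reweighting step and the sign-of-sum combination can be sketched as follows; the function names and data layout are assumptions for illustration:

import numpy as np

def reweight(D, confidences, y):
    """Shrink weights of examples handled correctly, grow weights of mistakes."""
    D = D * np.exp(-y * confidences)   # confidences[i]: signed confidence on example i
    return D / D.sum()

def strong_predict(covering_confidences):
    """Classify by the sign of the summed confidences of covering base hypotheses."""
    return 1 if sum(covering_confidences) >= 0 else -1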
2.1 Constrained Confidence-Rated Boosting
In this paper, we employ a specific form of constrained boosting, C 2 RIB, Constrained Confidence-Rated ILP-Boosting, which we introduced in [7] and which forms the basis of the work presented here (C 2 RIB D ). In Tables 1 and 2, we give a concise description of the proposed algorithm. Components of the base algorithm C 2 RIB in Table 1 are denoted by ’◦’. For a definition of the functions in the following explanation, the reader is referred to Table 1. In C 2 RIB, the training instances are randomly split into two sets used for specialization and pruning of clauses, respectively. Starting with the target predicate, the refinement operator ρ of the relational learner iteratively refines the clause C maximizing the objective function z˜ until either a clause C is found with hitherto maximal z˜(C ) that covers only positive examples, or z˜ can not be further maximized. The resulting clause is subject to overfitting on the training data, and thus immediately considered for pruning. The generated hypothesis is compared to the so called default hypothesis, just comprising the target predicate and satisfying all examples. Whichever of these two hypotheses maximizes the objective function z is chosen as the base classifier of the current iteration and its prediction confidence is used to update the probability distribution for the next iteration.
3 Margin-Based Inclusion of Features and Relations
The objective of our work presented here is to accelerate the learning process of the boosted ILP-learner C 2 RIB without a deterioration of its prediction accuracy. The idea is to equip the learner at all times with the right amount of power needed to “successfully” perform the learning task, i.e. to start the learner with a few features and relations to be considered for refinement, monitor the learning results and include additional features and relations into the learning process by demand. For this purpose, we exploit the dynamics inherent to boosting, namely that it a) is known to maximize training examples’ margins, b) is based on combining classifiers specialized on certain fractions of the instance space, and c) works by repeatedly calling a weak learner. Table 2 and the sections of Table 1 marked with ’•’ give a concise description of the algorithm which is detailed in the following. References to Table 2 will be indicated by “T2. ”.
Table 1. C2RIB_D Algorithm

Let N denote the number of training instances ei = (xi, yi) ∈ E = E+ ∪ E−, with yi = 1 for ei ∈ E+ and yi = −1 for ei ∈ E−. Let p be the target predicate of arity a(p) to be learned, T the total number of iterations of the weak learner, and D a probability distribution over E with D_i^t the probability of ei in the t-th iteration. For a clause C and a set S ⊆ E, let w+, w− be weight functions defined as in 1. and 2. further down this page, and c(C, S) C’s prediction confidence on S defined according to 3.

• Let F be the set of features sorted in descending order with respect to their mutual information with the examples’ class computed according to equation (2), and F' the set of features known to the learner, initially comprising the top two features of F.
◦ Set D_i^1 := 1/N for 1 ≤ i ≤ N
◦ For t = 1 . . . T
  ◦ Split E randomly into G and P according to D^t s.t. Σ_{(xi,yi)∈G} D_i^t ≈ 2/3
  ◦ C := p(X1, · · ·, Xa(p))
  ◦ Z̃ := 0
  ◦ While w−(C, G) > 0
    ◦ Let C' := argmax_{C' ∈ ρ(C)} {z̃(C')}, where z̃ is defined as in 4.
    ◦ Let Z̃' := z̃(C')
    ◦ If Z̃' − Z̃ ≤ 0 exit loop
    ◦ Else C := C', Z̃ := Z̃'
  ◦ Prunes(C) := {p(X1, · · ·, Xa(p)) ← B | C = p(X1, · · ·, Xa(p)) ← BB'}
  ◦ Remove from Prunes(C) all clauses C' where c(C', E) ≤ 0
  ◦ If Prunes(C) = ∅ let Ct := p(X1, · · ·, Xa(p))
  ◦ Else
    ◦ C' := argmin_{C' ∈ Prunes(C)} {loss(C')}, with loss(C') defined as in 5.
    ◦ Let Ct := argmax_{C'' ∈ {C', p(X1,···,Xa(p))}} {z(C'')}, with z defined as in 6.
  ◦ ht : X → R is the function ht(x) = c(Ct, E) if e = (x, y) ∈ E is covered by Ct, and ht(x) = 0 else
  ◦ Update the probability distribution: D_i^t := D_i^t / e^(yi · ht(xi)), D_i^{t+1} := D_i^t / Σ_i D_i^t, 1 ≤ i ≤ N
  • If t > 2 and F' ≠ F
    • Let Ht := {h1, · · ·, ht}, with base classifier hk of iteration 1 ≤ k ≤ t
    • F' = CheckLearningProgress(Ht, t, E, N, F, F') as detailed in Table 2
◦ Construct the strong hypothesis H(x) := sign( Σ_{Ct : (x,y) covered by Ct} c(Ct, E) )
3.1 Margins in the Framework of Confidence-Rated Boosting
In this approach, we monitor the learning success by observing the training examples’ mean margins. The margin of an example ei = (xi , yi ) under an ensemble Ht of classifiers is a real-valued number margin(Ht , ei ) ∈ [−1, 1] indicating
Table 1. (continued) Function Definitions

1. w+(C, S) := Σ_{(xi, 1) ∈ S covered by C} D_i^t
2. w−(C, S) := Σ_{(xi, −1) ∈ S covered by C} D_i^t
3. c(C, S) := (1/2) ln( (w+(C, S) + 1/(2N)) / (w−(C, S) + 1/(2N)) )
4. z̃(C) := w+(C, G) − w−(C, G)
5. loss(C) := (1 − (w+(C, P) + w−(C, P))) + w+(C, P) · e^(−c(C,G)) + w−(C, P) · e^(c(C,G))
6. z(C) := w+(C, E) − 2 w−(C, E)
the amount of disagreement of the classifiers in Ht with respect to ei’s class. For the binary case we deal with here, we can define the margin of ei under Ht as the difference between the sum of the absolute weights of those base classifiers in Ht predicting for ei its correct class yi, and the sum of the absolute weights of those base classifiers in Ht predicting for ei the incorrect class y ≠ yi [24,6]. We define the weight w(hk, ei) of a base classifier hk with respect to an example ei = (xi, yi) as its prediction confidence (as defined in Table 1, 3.) if hk covers ei, and 0 otherwise. We define, following [6], the margin of ei under ensemble Ht = {h1, · · ·, ht} of t classifiers hk with weights w(hk, ei) as

margin(Ht, ei) = Σ_{hk ∈ Ht : hk(xi) = yi} |w(hk, ei)| − Σ_{hk ∈ Ht : hk(xi) ≠ yi} |w(hk, ei)|.    (1)
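A small sketch of equation (1); the ensemble representation (a list of predict/confidence pairs, with None returned for examples a base classifier does not cover) is an assumption for illustration, and the confidences are assumed to be normalised as described below:

def margin(example_x, example_y, ensemble):
    """Margin of one example under an ensemble, following equation (1)."""
    agree = disagree = 0.0
    for predict, conf in ensemble:
        label = predict(example_x)
        if label is None:          # example not covered: weight 0
            continue
        if label == example_y:
            agree += abs(conf)
        else:
            disagree += abs(conf)
    return agree - disagree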
We normalize the prediction confidences of the base classifiers such that the absolute values of the confidences of all base classifiers sum to 1. Consequently, Σ_{hk ∈ Ht : hk(xi) = yi} |w(hk, ei)| ∈ [0, 1], Σ_{hk ∈ Ht : hk(xi) ≠ yi} |w(hk, ei)| ∈ [0, 1], and margin(Ht, ei) ∈ [−1, 1] for all ei ∈ E and ensembles Ht. Large positive margins (close to +1) indicate “confident” correct classification, and small negative margins (close to −1) indicate “confident” incorrect classification. Boosting is known to be especially effective at increasing the margins of the training examples [24,6]. It forces the focus on misclassified instances by increasing their probabilities. Misclassified examples show small or even negative margins. Consequently, the learner is forced to search for base hypotheses which correctly classify these hard examples and thus increase their margins. Since the margins are increasing in the course of iterated calls to the base learner, the gradient of the mean margins can be assumed to be positive and be employed to monitor the quality of the learning process. The repeated calls of the base learner in the boosting framework allow for a stepwise inclusion of features in the course of iterations. If the learning curve indicates that the learner’s current instrumentation is not sufficient any longer,
Table 2. CheckLearningProgress

CheckLearningProgress(Ht, t, E, N, F, F') returns F'
1. Compute for E the examples’ average margin AM_t = (1/N) Σ_{i=1..N} margin(Ht, ei) according to equation (1)
2. Let gradient(t) be the slope of the line determined by the least square fit to the AM_k in k, 1 ≤ k ≤ t
3. Compute trend(t) := (1/T_l) Σ_{j=1..T_l} gradient(t − j) if t > T_l, and trend(t) := (1/(t − 2)) Σ_{j=2..t−1} gradient(j) if t ≤ T_l, where T_l denotes the number of iterations over which the gradients are averaged
4. Compute ratio(t) := trend(t) / gradient(t)
5. If t > 3:
   (a) If ratio(t − 1) exhibits a local maximum, estimate the slowdown in the margins’ improvement in the form of predict(x) := a · 1/ln(x/b), where a, b are chosen such that predict(2) = ratio(t − 1) and predict(3) = ratio(t); offset := t − 3
   (b) If a, b have already been determined, compute predict(t) := a · 1/ln((t − offset)/b)
   (c) Else predict(t) := ratio(t)
   (d) If predict(t)/ratio(t) > α, select the first remaining element F of F, i.e. the feature with the next greatest mutual information with the training examples’ class; F' := F' ∪ {F}
   (e) Else F' := F'
learning can be continued in the next iteration with enhanced equipment. Initially, we provide our learner with the target relation to be learned together with two features with the greatest mutual information [25,14] with the examples’ class and the relations in which these features occur. In each iteration of boosting, the learning success is monitored in terms of the development of the training examples’ mean margins. To this end, we define the gradient gradient(t) of an iteration t as the slope of the line determined by the least square fit to the average margins in each single iteration 1 to t (T2.1, T2.2). We then average the gradients over the last Tl iterations so as to smooth temporary fluctuations in the margins’ development (T2.3), and compute the ratio of the averaged previous gradients and the gradient of the current iteration (T2.4). The margins’ improvement is measured by this ratio which increases from one iteration to the next as long as the margins increase significantly. As soon as the ratio starts to decrease, an estimate for the slowdown in the margins’ improvements is determined (T2.5a). This estimate predicts the expected decrease of the ratio and is used to determine when a new feature has to be presented to the learner. The estimate is chosen to be an inverse-logarithmic function. Whenever the actual decrease of the ratio exceeds the predicted decrease by a certain threshold, a new feature is included into the learning process (T2.5d).
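The monitoring logic can be sketched in Python as follows. This is only an illustration of the gradient/trend/ratio idea under assumed data structures (a list of per-iteration mean margins and an optional fitted inverse-logarithmic estimate passed as a callable), not the exact published procedure; it assumes t > 2 and a non-zero current gradient.

import numpy as np

def should_add_feature(avg_margins, t, T_l=10, alpha=1.01, fit=None):
    """Decide whether to include the next-ranked feature (cf. Table 2); a sketch."""
    ks = np.arange(1, t + 1)
    gradient = np.polyfit(ks, avg_margins[:t], 1)[0]            # least-squares slope up to iteration t
    past = [np.polyfit(np.arange(1, j + 1), avg_margins[:j], 1)[0]
            for j in range(max(2, t - T_l), t)]                 # slopes of earlier iterations
    trend = np.mean(past)                                        # averaged previous gradients
    ratio = trend / gradient
    predicted = ratio if fit is None else fit(t)                 # inverse-logarithmic estimate, if fitted
    return (predicted / ratio) > alpha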
3.2 Mutual Information between Features
Initially, all features in the given training examples are sorted according to their mutual information [25,14] with the examples’ class. The mutual information MI(F1, F2) between two features F1, F2 is defined as the difference between the entropy of F1 and the entropy of F1 given F2 [27], i.e. as the amount of information about the possible values (f11, f12, ...) of feature F1 that is obtained when the value f ∈ {f21, f22, ...} of feature F2 is known. To compute the mutual information between class C and feature Fj, we estimate the probability distributions of C and Fj from the training data, ignoring missing values, as follows:
– The probability p(C = c) of any training example being of class c is estimated as the fraction |E_c|/|E| of training examples from E belonging to c.
– The probability p(Fj = fi) that the nominal feature Fj takes value fi is estimated as the fraction |Fj = fi|/|E| of training examples for which feature Fj takes value fi.
– The joint probability p((C = c) ∧ (Fj = fi)) is derived from the probabilities of the two single events.
The mutual information between a feature Fj and the class C of an example can then be defined as

MI(C, Fj) = E(C) − E(C|Fj) = Σ_{i=1..mj} Σ_{c=1..k} p(C = c, Fj = fi) ln [ p(C = c, Fj = fi) / (p(C = c) p(Fj = fi)) ]    (2)
with k possible classes and mj possible values of feature Fj. For features Fj with continuous values, we estimate the probability distribution by discretizing the values of Fj with an entropy based method [4] and using the resulting interval boundaries [d1, ..., dmi] to estimate the probability of Fj taking a value in the interval Ii := [di, di+1), 1 ≤ i < mi − 1, and Imi−1 := [dmi−1, dmi] respectively, as the fraction |Fj ∈ Ii|/|E| of training examples for which feature Fj takes a value in Ii, 1 ≤ i ≤ mi − 1. Note that this way of sorting features according to their mutual information with respect to classification assumes independence of the features and may thus result in inferior performance in domains with highly correlated features.
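A minimal computation of equation (2) for a nominal feature; handling of missing values and the entropy-based discretization of continuous features described above are omitted, and the function name is an illustrative assumption:

import math
from collections import Counter

def mutual_information(feature_values, class_values):
    """Mutual information between a nominal feature and the class, as in equation (2)."""
    n = len(class_values)
    p_f = Counter(feature_values)
    p_c = Counter(class_values)
    p_fc = Counter(zip(feature_values, class_values))
    mi = 0.0
    for (f, c), joint in p_fc.items():
        p_joint = joint / n
        mi += p_joint * math.log(p_joint / ((p_f[f] / n) * (p_c[c] / n)))
    return mi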
4 Empirical Evaluation
To evaluate our approach, we performed experiments on data sets differing in the number of features and the total number of examples. We determine prediction accuracy and learning time for each dataset for both the base case C 2 RIB and for C 2 RIB D described in this paper, and compare the results to those of other systems (Tables 3 to 5). For C 2 RIB D , we also indicate the average number of features included in the learning process. In all experiments, 1) the
Table 3. Accuracy, standard deviation and learning time in minutes for SLIPPER [3], C2RIB and C2RIB_D on five propositional domains

Domain (Ex., features, eval.)        | SLIPPER Acc | C2RIB Acc ±StdD (Time) | C2RIB_D Acc ±StdD (Time, sel. feat.)
breast-wisc (699, 9, 10CV)           | 95.8        | 96.1 ±1.5 (9)          | 95.4 ±1.7 (5.1, 5.8)
horse-colic (368, 23, 10CV)          | 85.0        | 81.0 ±8.4 (3.6)        | 83.7 ±5.7 (0.9, 2)
hypothyroid (3163, 25, 10CV)         | 99.3        | 95.2 ±0.69 (39.1)      | 96.6 ±2.9 (20.8, 11.8)
mushroom (8124, 22, 10CV)            | 99.8        | 99.3 ±3.0 (144)        | 99.6 ±0.16 (71.4, 4.8)
splice-junction (3190, 60, 10CV)     | 94.1        | 53.13 ±3.0 (289)       | 88 ±4.7 (13.6, 4.4)
(SLIPPER standard deviations and learning times are not available.)
base learner is invoked T = 100 times, 2) the gradients of the examples’ mean margins are averaged over the last Tl = 10 iterations, and 3) the threshold α is set to 1.01 (see 5d in Table 2). The value 1.01 has been empirically determined on the domain of Mutagenicity [26], and has not been modified for subsequent experiments on the other domains in order to ensure proper cross validation results. We chose three different types of domains in order to get an assessment of our learner 1) on propositional tasks, and 2) on general knowledge and data mining tasks and 3) on ILP benchmark and classic Machine Learning problems. The first set of experiments comprises five propositional domains from the UCI-repository [16]. We compare our approach to the propositional constrained confidence-rated booster SLIPPER [3] which served as a basis for C 2 RIB. Predictive accuracies are estimated by 10-fold-cross validation.¹ As can be seen from Table 3, C 2 RIB performs in four domains on par with or slightly weaker than SLIPPER. C 2 RIB D reduces C 2 RIB’s learning time² up to one order of magnitude with a superior predictive accuracy in four domains, and without a significant deterioration of predictive accuracy in the one domain where only few features are present. C 2 RIB shows a poor performance on the splice-junction dataset, most likely due to the great number of features. However, C 2 RIB D clearly outperforms C 2 RIB both in accuracy and learning time. The second set of experiments was conducted on datasets subject of the data mining competitions PKDD Discovery Challenge 2000 [1] (classification of loans, where Task AC is based on all loans, and Task A only on the closed loans from
¹ However, in [3], single training- and test set splits are used for hypothyroid, mushroom and splice-junction.
² Learning times for SLIPPER are not known to us.
Task AC), and KDD Cup 2001, Task2 [2] (prediction of gene functions). The predictive accuracy is estimated by 10-fold-cross validation, and the results are compared to Progol [15] and RELAGGS [13], a transformation-based approach to ILP, combined with SVMlight and C4.5rules, respectively, run on the propositionalized data. For Task AC, Progol was run for 2 days, and discontinued without any results. Prediction accuracies of C 2 RIB and C 2 RIB D are, for Task AC, notably lower than the ones obtained by RELAGGS/C4.5rules, however still in the range of standard deviation of the accuracies obtained by RELAGGS/SVMlight, as holds for Task A. However, learning times of C 2 RIB and C 2 RIB D are lower than the ones of the other systems. For Task AC, C 2 RIB D speeds up C 2 RIB’s learning time by a factor of 2. For Task A, C 2 RIB D seems to be penalized for sorting the features in the presence of few examples. For the gene function
Table 4. Accuracy, standard deviation and learning time for Progol [15], RELAGGS [13], C2RIB and C2RIB_D on some data mining competition domains

Domain (Ex., features, eval.)            | Progol Acc ±StdD (Time) | RELAGGS+SVMlight Acc ±StdD (Time) | RELAGGS+C4.5rules Acc ±StdD (Time) | C2RIB Acc ±StdD (Time) | C2RIB_D Acc ±StdD (Time, sel. feat.)
PKDD DS 2000, AC (682, 24, 10CV)         | n/a (2 days)            | 90.8 ±3.2 (23 min)                | 94.1 ±3.2 (23 min)                 | 88.9 ±3.4 (20 min)     | 88.9 ±3.4 (9.5 min, 10)
PKDD DS 2000, A (234, 24, 10CV)          | 45.7 ±10.5 (hrs)        | 88.0 ±5.3 (10 min)                | 88.0 ±6.5 (10 min)                 | 86.3 ±6.1 (3.6 min)    | 86.7 ±6.6 (4.2 min, 10.2)
KDD Cup 2001, Task2 (1243, 49, 862/381)  | 92.2 (24 min)           | 92.2 (≈ 2 min)                    | n/a                                | 91.1 (53 min)          | 91.5 (27 min, 5)
prediction task, C 2 RIB and C 2 RIB D were run on the original KDD Cup 2001 training-test-data partition and the results were compared to Progol³ and RELAGGS/SVMlight.⁴,⁵ Again, learning time is reduced by a factor of 2 in the demand-driven approach C 2 RIB D. It slightly improves C 2 RIB’s predictive accuracy which is on par with the other systems’ accuracies. Finally, we evaluated our approach on the two ILP benchmark problems Mutagenicity [26] (prediction of mutagenic activity of 188 molecules (description B4)) and QSARs, Quantitative Structure Activity Relationships, [9,10] (prediction of a greater-activity relationship between pairs of compounds based on
³ L. Peña Castillo, unpublished, 2002
⁴ M.-A. Krogel, unpublished, 2002
⁵ RELAGGS won Task 2 of KDD Cup 2001.
Table 5. Accuracy, standard deviation and learning time in minutes for C2RIB and C2RIB_D in comparison to other systems on two ILP benchmark and one artificial domain

Domain (Ex., features, eval.)       | FOIL Acc ±StdD (Time)  | Fors Acc ±StdD (Time) | Progol Acc ±StdD (Time)  | C2RIB Acc ±StdD (Time) | C2RIB_D Acc ±StdD (Time, sel. feat.)
Mutagenicity (188, 18, 10CV)        | 82.0 [26] ±3.0 (n/a)   | 89.0 [8] ±6.0 (n/a)   | 88.0 [26] ±2.0 (307)     | 88.0 ±3.4 (7)          | 88.8 ±5.2 (1.53, 6)
QSARs (2788, 12, 5CV)               | 82.9 ±2.7 (0.7)        | n/a                   | 79.8 ±3.7 (372)          | 83.4 ±2.9 (91)         | 83.3 ±1.9 (70, 11.8)
Eastbound Trains (55/6 split, 9)    | n/a                    | n/a                   | 77.78 [18] ±6.43 (1.15)  | 83.3 ±0 (0.44)         | 89.6 ±8.6 (0.1, 6.25)
their structure), and on the artificial problem of Eastbound Trains⁶ proposed by Ryszard Michalski (prediction of trains’ directions based on their properties). For the two ILP domains, predictive accuracy is estimated by 10- and 5-fold cross validation, respectively, and results are compared to FOIL [21], Fors [8] and Progol. For the Eastbound Trains, the data is split into one training and test set partition, and the results are averaged over 8 iterations of the experiment. Predictive accuracy of C 2 RIB is higher than or on par with the one of the other learners. C 2 RIB D significantly outperforms C 2 RIB both in terms of predictive accuracy and learning time in two of the three domains, indicating that our approach seems to be superior in classical, highly structured ILP domains.
5 Related Work
The idea of selecting smaller feature subsets and shifting the bias to a more expressive representation language is common in multi-relational learning. The work probably most related to our work is [12], where AdaBoost [22] is combined with molfea, an inductive database for the domain of biochemistry [11]. In [12], AdaBoost is employed to identify particularly difficult examples for which molfea constructs new special purpose structural features. AdaBoost re-weighting episodes and molfea feature construction episodes are alternated. In each iteration, a new feature constructed by molfea is presented to a propositional learner, the examples are re-weighted in accordance to the base classifier learned by it, and a new feature is constructed by molfea based on the modified weights. In contrast, our approach actively decides when to include new features 6
The examples were generated with the Random Train Generator available at http://www-users-cs-york.ac.uk/∼stephen/progol.html
from the list of ranked existing features with the central goal of including new features only when absolutely necessary in order to be maximally efficient. This means that in principle the two approaches could be easily combined, for example by calling a generator of new features whenever the list of existing features has been exhausted. [23] propose a wrapper model utilizing boosting for feature selection. In their approach, alternative feature subsets are assessed based on the underlying booster’s optimization criterion. The feature subset optimal according to this criterion is then presented as a whole to a learner. In contrast, we use a criterion of mutual information once before we start the boosting process to establish a feature ranking, and utilize the characteristics of our boosted learner to actively decide when to include a new feature. However, it would be interesting to combine both approaches.
6 Conclusion
In this paper, we have proposed an approach to boosting a weak relational learner which starts off with a minimal set of features and relations and is - by demand - stepwise strengthened. Our work is based on C 2 RIB [7], a fast weak ILP-learner in a constrained confidence-rated boosting framework. The quality of the current learning results is measured in terms of the gradient of the training examples’ mean margins, and the learner is strengthened whenever the learning curve drops under a certain threshold. To that purpose, features occurring in the training examples are sorted according to their mutual information with the examples’ class and provided to the learner one by one together with the relation in which they occur. We showed that learning times are significantly reduced while the predictive accuracy is comparable to those of other learning systems and, in the majority of cases, superior to those of the “fully equipped” learner C 2 RIB. These results are encouraging, especially since all experiments were conducted without optimizing parameters. One question for further work is whether one could expect to even gain a higher predictive accuracy by repeatedly evaluating the features’ ordering and taking into account the examples’ weights under the current probability distribution. In each iteration, the learner is presented a different training set, emphasizing the hard examples more and more. A stronger influence of so far misclassified examples on the feature ranking could support the induction of correct classifiers for those examples that are particularly difficult to learn. Another question for further research is whether it is possible to determine automatically for every domain a) an optimal threshold to which the deviation of the current from the expected decrease of the ratio of the average and the current gradient should be compared and b) the number of iterations over which the gradients should be averaged. It is also part of the future work to investigate other approaches to feature selection, and make use of the accelerated learning time to incorporate more standard elements of “full-blown” ILP-learners and to determine the right balance between speed and accuracy of the learning system.
This work was partially supported by DFG (German Science Foundation), project FOR345/1-1TP6. We would like to thank L. Peña Castillo and M. Krogel for providing their results on the KDD Cup 2001, L. Peña Castillo for reviewing previous versions of this paper, and J. Kaduk for many inspiring discussions.
References 1. P. Berka. Guide to the financial Data Set. In: A. Siebes and P. Berka, editors, PKDD2000 Discovery Challenge, 2000. 155 2. J. Cheng, C. Hatzis, H. Hayashi, M.-A. Krogel, Sh. Morishita, D. Page, and J. Sese. KDD Cup 2001 Report. ISIGKDD Explorations, 3(2):47-64, 2002. 156 3. W. Cohen and Y. Singer. A Simple, Fast, and Effective Rule Learner. Proc. of 16th National Conference on Artificial Intelligence, 1999. 148, 155 4. U. M. Fayyad, and K. B. Irani. Multi-Interval Discretization of Continuous-Valued Attributes for Classification Learning. Proc. of 13th Int. Joint Conf. on AI, 1993. 154 5. Y. Freund, and R. E. Schapire. Experiments with a New Boosting Algorithm. Proc. of 13th International Conference on Machine Learning, 1996. 148 6. A. J. Grove, and D. Schuurmans. Boosting in the limit: Maximizing the margin of learned ensembles. Proc. of 15th National Conf. on AI, 1998. 152 7. S. Hoche, and S. Wrobel. Relational Learning Using Constrained Confidence-Rated Boosting. Proc. 11th Int. Conf. on Inductive Logic Programming (ILP), 2001. 148, 150, 158 8. A. Karalic. First Order Regression. PhD thesis, University of Ljubljana, Faculty of Computer Science, Ljubljana, Slovenia, 1995. 157 9. R. D. King, S. Muggleton, R. A. Lewis, and M. J. E. Sternberg. Drug design by machine learning: The use of inductive logic programming to model the structure activity relationships of trimethoprim analogues binding to dihydrofolate reductase. Proc. of the National Academy of Sciences of the USA 89(23):11322-11326, 1992. 156 10. R. D. King, A. Srinivasan, and M. Sternberg. Relating chemical activity to structure: An examination of ILP successes. New Generation Computing, Special issue on Inductive Logic Programming 13(3-4):411-434, 1995. 156 11. S. Kramer, and L. De Raedt. Feature construction with version spaces for biochemical applications. Proc. of the 18th ICML, 2001. 157 12. S. Kramer. Demand-driven Construction of Structural Features in ILP. Proc. 11th Int. Conf. on Inductive Logic Programming (ILP), 2001. 157 13. M.-A. Krogel , and S. Wrobel. Transformation-Based Learning Using Multirelational Aggregation. Proc. 11th Int. Conf. on Inductive Logic Programming (ILP), 2001. 156 14. W. J. McGill. Multivariate information transmission. IRE Trans. Inf. Theory, 1995. 149, 153, 154 15. S. Muggleton. Inverse Entailment and Progol. New Gen. Computing, 13, 1995. 156 16. P. M. Murphy, and D. W. Aha. UCI repository of machine learning databases. University of California-Irvine, Department of Information and Computer Science, 1994. http://www1.ics.uci.edu/ mlearn/MLRepository.html 155 17. D. Opitz, and R. Maclin. Popular Ensemble Method: An Empirical Study. Journal of Artificial Intelligence Research 11, pages 169-198, 1999. 148
18. L. Peña Castillo and S. Wrobel. On the Stability of Example-Driven Learning Systems: A Case Study in Multirelational Learning. Proceedings of MICAI 2002, 2002.
19. J. R. Quinlan. Bagging, boosting, and C4.5. Proc. of 14th Nat. Conf. on AI, 1996.
20. J. R. Quinlan. Boosting First-Order Learning. Algorithmic Learning Theory, 1996.
21. J. R. Quinlan and R. M. Cameron-Jones. FOIL: A Midterm Report. In P. Brazdil, editor, Proc. of the 6th European Conference on Machine Learning, 667:3-20, 1993.
22. R. E. Schapire. Theoretical views of boosting and applications. Proceedings of the 10th International Conference on Algorithmic Learning Theory, 1999.
23. M. Sebban and R. Nock. Contribution of Boosting in Wrapper Models. In: J. M. Zytkow and J. Rauch, eds, Proc. of PKDD'99, 1999.
24. R. E. Schapire, Y. Freund, P. Bartlett, and W. S. Lee. Boosting the Margin: A New Explanation for the Effectiveness of Voting Methods. The Annals of Statistics, 26(5):1651-1686, 1998.
25. C. E. Shannon. A mathematical theory of communication. Bell Syst. Techn. J., 27:379-423, 1948.
26. A. Srinivasan, S. Muggleton, M. J. E. Sternberg, and R. D. King. Theories for mutagenicity: A study in first-order and feature-based induction. Artificial Intelligence, 1996.
27. D. Wettschereck. A Study of Distance-based Machine Learning Algorithms. PhD thesis, Oregon State University, Computer Science Department, Corvallis, USA, 1994.
Multiclass Alternating Decision Trees

Geoffrey Holmes, Bernhard Pfahringer, Richard Kirkby, Eibe Frank, and Mark Hall

Department of Computer Science, University of Waikato, Hamilton, New Zealand
{geoff,bernhard,rkirkby,eibe,mhall}@cs.waikato.ac.nz
Abstract. The alternating decision tree (ADTree) is a successful classification technique that combines decision trees with the predictive accuracy of boosting into a set of interpretable classification rules. The original formulation of the tree induction algorithm restricted attention to binary classification problems. This paper empirically evaluates several wrapper methods for extending the algorithm to the multiclass case by splitting the problem into several two-class problems. Seeking a more natural solution we then adapt the multiclass LogitBoost and AdaBoost.MH procedures to induce alternating decision trees directly. Experimental results confirm that these procedures are comparable with wrapper methods that are based on the original ADTree formulation in accuracy, while inducing much smaller trees.
1 Introduction
Boosting is now a well established procedure for improving the performance of classification algorithms. AdaBoost [8] is the most commonly used boosting procedure, but others have gained prominence [3,10]. Like many classification algorithms, most boosting procedures are formulated for the binary classification setting. Schapire and Singer generalize AdaBoost to the multiclass setting, producing several alternative procedures, of which the best (empirically) is AdaBoost.MH [14]. This version of AdaBoost covers the multilabel setting, where an instance can have more than one class label, as well as the multiclass setting, where an instance can have a single class label taken from a set of (more than two) labels. Alternating decision trees are induced using a real-valued formulation of AdaBoost [14]. At each boosting iteration three nodes are added to the tree: a splitter node, which attempts to split sets of instances into pure subsets, and two prediction nodes, one for each of the splitter node's subsets. The position of this new splitter node is determined by examining all predictor nodes and choosing the position resulting in the globally best improvement of the purity score. Essentially, an ADTree is an AND/OR graph. Knowledge contained in the tree is distributed, as multiple paths must be traversed to form predictions. Instances that satisfy multiple splitter nodes have the values of the prediction nodes that they reach summed to form an overall prediction value. A positive sum represents one class and a negative sum the other in the two-class setting. The result
is a single interpretable tree with predictive capabilities that rival a committee of boosted C5.0 trees [7]. An additional attractive feature of ADTrees, one that is not possible with conventional boosting procedures, is their ability to be merged together. This is a particularly useful attribute in the context of multiclass problems as they are often re-formulated in the two-class setting using one or more classes against the others. In such a setting ADTrees can be combined into a single classifier. In their original exposition on ADTrees, Freund and Mason [7] note that because alternating trees can be defined as a sum of simple base rules it is a simple matter to apply any boosting algorithm to the problem of inducing ADTrees. For the multiclass setting one possible candidate is AdaBoost.MH. In this paper we also explore and compare two other solutions. The first is to adapt the original two-class ADTree algorithm to the multiclass setting using a variety of wrapper methods. The second is to use the multiclass LogitBoost [10] procedure as the underlying boosting algorithm. This algorithm is a natural choice as it is directly applicable to multiclass problems. The paper is organized as follows. In Section 2 we review ADTrees and the LogitBoost procedure. Section 3 describes our attempts to cast ADTrees to the multiclass setting. Section 4 describes the new algorithm that induces ADTrees using LogitBoost. Section 5 contains experimental results that compare both the LogitBoost and AdaBoost.MH methods with the best of the adaptations of the original algorithm on some benchmark datasets. Section 6 summarizes the contributions made in this paper.
2 Background
In this section we first summarize the original algorithm for inducing ADTrees. As Freund and Mason [7] argue that any boosting method is applicable to ADTree induction, it is natural to suppose that AdaBoost.MH would provide a good setting for the multiclass extension (given that AdaBoost works so well in the two-class setting). A similar argument can be made for an alternative framework based on LogitBoost, and this is discussed in the final part of this section.

2.1 ADTrees
Alternating decision trees provide a mechanism for combining the weak hypotheses generated during boosting into a single interpretable representation. Keeping faith with the original implementation, we use inequality conditions that compare a single feature with a constant as the weak hypotheses generated during each boosting iteration. In [7] some typographical errors and omissions make the algorithm difficult to implement so we include below a more complete description of our implementation. At each boosting iteration t the algorithm maintains two sets, a set of preconditions and a set of rules, denoted Pt and Rt , respectively. A further set C of weak hypotheses is generated at each boosting iteration.
1. Initialize. Set the weights $w_{i,0}$ associated with each training instance to 1. Set the first rule $R_1$ to have a precondition and condition which are both true. Calculate the prediction value for this rule as $a = \frac{1}{2}\ln\frac{W_+(c)}{W_-(c)}$, where $W_+(c)$, $W_-(c)$ are the total weights of the positive and negative instances that satisfy condition $c$ in the training data. The initial value of $c$ is simply True.

2. Pre-adjustment. Reweight the training instances using the formula $w_{i,1} = w_{i,0}\,e^{-a y_t}$ (for two-class problems, the value of $y_t$ is either +1 or -1).

3. Repeat for $t = 1, 2, \ldots, T$:
   (a) Generate the set $C$ of weak hypotheses using the weights $w_{i,t}$ associated with each training instance.
   (b) For each base precondition $c_1 \in P_t$ and each condition $c_2 \in C$ calculate
       $$Z_t(c_1, c_2) = 2\left(\sqrt{W_+(c_1 \wedge c_2)\,W_-(c_1 \wedge c_2)} + \sqrt{W_+(c_1 \wedge \neg c_2)\,W_-(c_1 \wedge \neg c_2)}\right) + W(\neg c_1)$$
   (c) Select $c_1, c_2$ which minimize $Z_t(c_1, c_2)$ and set $R_{t+1}$ to be $R_t$ with the addition of the rule $r_t$ whose precondition is $c_1$, whose condition is $c_2$, and whose two prediction values are
       $$a = \frac{1}{2}\ln\frac{W_+(c_1 \wedge c_2) + \varepsilon}{W_-(c_1 \wedge c_2) + \varepsilon}, \qquad b = \frac{1}{2}\ln\frac{W_+(c_1 \wedge \neg c_2) + \varepsilon}{W_-(c_1 \wedge \neg c_2) + \varepsilon}$$
   (d) Set $P_{t+1}$ to be $P_t$ with the addition of $c_1 \wedge c_2$ and $c_1 \wedge \neg c_2$.
   (e) Update the weights of each training example according to the equation $w_{i,t+1} = w_{i,t}\,e^{-r_t(x_i)\,y_t}$.

4. Output the classification rule that is the sign of the sum of all the base rules in $R_{T+1}$:
   $$\mathrm{class}(x) = \mathrm{sign}\left(\sum_{t=1}^{T} r_t(x)\right)$$
In terms of parameter settings for implementations described in this paper, we set the value of $\varepsilon$ to 1, and vary the value of $T$ for stopping the induction in fixed increments (namely, 10, 20, 50 and 100). Determining an optimal setting for $T$ is still an open research question.
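As a concrete illustration of the induction loop above, the following is a minimal Python sketch of two-class ADTree learning. It assumes numeric features, labels in {-1, +1}, and threshold tests of the form x[j] <= v as the weak hypotheses; the brute-force search and all helper names are our own illustrative choices, not the authors' implementation.

```python
import math

EPSILON = 1.0  # smoothing constant in the prediction values; set to 1 as in the text

def split_weights(X, y, w, pred):
    """Total weight of positive / negative instances for which pred(x) holds."""
    wp = sum(wi for xi, yi, wi in zip(X, y, w) if yi > 0 and pred(xi))
    wn = sum(wi for xi, yi, wi in zip(X, y, w) if yi < 0 and pred(xi))
    return wp, wn

def train_adtree(X, y, T=10):
    """X: list of numeric feature vectors, y: labels in {-1, +1}. Returns a list of rules."""
    n, k = len(X), len(X[0])
    w = [1.0] * n
    wp, wn = split_weights(X, y, w, lambda x: True)
    a0 = 0.5 * math.log((wp + EPSILON) / (wn + EPSILON))
    rules = [(lambda x: True, lambda x: True, a0, 0.0)]       # first rule: both conditions True
    w = [wi * math.exp(-a0 * yi) for wi, yi in zip(w, y)]     # pre-adjustment
    preconds = [lambda x: True]
    for _ in range(T):
        # weak hypotheses: single-feature threshold tests taken from the observed values
        tests = [(j, v) for j in range(k) for v in sorted({xi[j] for xi in X})]
        best = None
        for c1 in preconds:
            for j, v in tests:
                c2 = lambda x, j=j, v=v: x[j] <= v
                wp1, wn1 = split_weights(X, y, w, lambda x: c1(x) and c2(x))
                wp2, wn2 = split_weights(X, y, w, lambda x: c1(x) and not c2(x))
                w_rest = sum(wi for xi, wi in zip(X, w) if not c1(xi))
                z = 2 * (math.sqrt(wp1 * wn1) + math.sqrt(wp2 * wn2)) + w_rest
                if best is None or z < best[0]:
                    a = 0.5 * math.log((wp1 + EPSILON) / (wn1 + EPSILON))
                    b = 0.5 * math.log((wp2 + EPSILON) / (wn2 + EPSILON))
                    best = (z, c1, c2, a, b)
        _, c1, c2, a, b = best
        rules.append((c1, c2, a, b))
        preconds += [lambda x, c1=c1, c2=c2: c1(x) and c2(x),
                     lambda x, c1=c1, c2=c2: c1(x) and not c2(x)]
        r = lambda x, c1=c1, c2=c2, a=a, b=b: ((a if c2(x) else b) if c1(x) else 0.0)
        w = [wi * math.exp(-r(xi) * yi) for wi, xi, yi in zip(w, X, y)]
    return rules

def predict(rules, x):
    """Sign of the summed contributions of all rules reached by x."""
    score = sum((a if c2(x) else b) if c1(x) else 0.0 for c1, c2, a, b in rules)
    return 1 if score >= 0 else -1
```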
2.2 LogitBoost
As mentioned above, the underlying learning algorithm for ADTrees is AdaBoost. Friedman et al. [10] analyze AdaBoost from a statistical perspective
and find that it can be viewed as a stage-wise estimation procedure for fitting an additive logistic regression model according to an exponential loss function. This finding enables them to derive a stage-wise boosting procedure, implementing an adaptive Newton algorithm, that optimizes the (more standard) binomial likelihood instead of the exponential loss function used in AdaBoost. They call this algorithm LogitBoost. They also describe a generalized version of LogitBoost that optimizes the multinomial likelihood. This algorithm is directly applicable to multiclass problems. Compared to AdaBoost.MH (see Section 3.2), the general form of LogitBoost (which we call LT1PC later) has the advantage that it can be wrapped around any numeric predictor without any modifications. AdaBoost.MH, on the other hand, requires serious modification to the weak learner so that it can produce a separate prediction for each class value and also deal with class specific weights.
3 Multiclass ADTrees
When extending any algorithm from binary to multiclass classification there are two options. The simplest approach is to transform the multiclass problem into several binary classification problems. This general approach can be applied to any classification algorithm, resulting in a set of voting models. Typically, this approach leads to a large number of models. Alternatively, we can attempt to induce a single tree capable of predicting each of the class labels directly.

3.1 Multiclass as Multiple Two-Class Problems
Transforming ADTrees to map multiple class labels to two classes can be approached in several ways. As ADTrees can be merged, the resulting multiclass model can be a single tree derived from the set of two-class voting trees. A standard method [6] is to treat a subset of class labels as class A, and the set of remaining labels as class B, thus reducing the problem to two classes from which a model can be built. This is then repeated for different subsets and the models vote towards the class labels they represent. Provided there is sufficient class representation and separation between the subsets, the vote tallies for individual class labels can be collected to form a reasonable prediction. We experimented with a number of subset generation schemes:

1-against-1 [9,1]: generate a tree for every pair of classes, where subset A contains only the first class and subset B contains only the second. An advantage of this approach is that each tree need only be trained with a subset of the data, resulting in faster learning [11].

1-against-rest: one tree per class, where subset A contains the class, and subset B contains the remaining classes.

random: randomly generate a unique subset, creating twice as many trees as there are classes. Random codes have good error-correcting properties [13].

exhaustive: every unique subset possible.
Note that the exhaustive method is not computationally practical as class numbers increase (in our experiments this occurs when there are more than 16 class labels).
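The following Python sketch shows one way the subset generation schemes listed above might be realised. All names are illustrative; the random scheme simply draws label subsets until enough distinct ones have been found.

```python
import random
from itertools import combinations

def one_vs_one(classes):
    """One binary problem per pair of classes: (subset A, subset B)."""
    return [({a}, {b}) for a, b in combinations(classes, 2)]

def one_vs_rest(classes):
    """One binary problem per class: the class against all remaining classes."""
    return [({c}, set(classes) - {c}) for c in classes]

def random_subsets(classes, seed=0):
    """Up to twice as many problems as classes, each defined by a random, unique subset."""
    rng = random.Random(seed)
    target = min(2 * len(classes), 2 ** len(classes) - 2)
    problems, seen = [], set()
    while len(problems) < target:
        a = frozenset(c for c in classes if rng.random() < 0.5)
        if 0 < len(a) < len(classes) and a not in seen:
            seen.add(a)
            problems.append((set(a), set(classes) - a))
    return problems
```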
3.2 Direct Induction
The AdaBoost.MH algorithm is almost identical to AdaBoost. The major difference is that instead of generating weak hypotheses ht that map the input space X to either a discrete set {−1, +1} or, by extension, R, the weak hypotheses map X × Y to R, where Y is a finite set of class labels. It would appear that the correct interpretation of AdaBoost.MH is not immediately obvious; for example, Friedman et al. [10] interpret the method as a variant of 1-against-rest and build a distinct classifier per class. Many of the criticisms of AdaBoost.MH in [10] are based on this misinterpretation. Our results suggest that AdaBoost.MH and LogitBoost actually share much in common in terms of both predictive performance and computational complexity. In fact, AdaBoost.MH constructs a single tree per iteration. To construct an ADTree using AdaBoost.MH we need to change predictor nodes to handle a vector of predictions (one per class) and splitter nodes to compute a Z value per class label. At each iteration the test that minimises the sum of Z scores over all class labels is added to the tree. To perform prediction using this tree we sum all contributions at each predictor node that is satisfied by the example, to form a prediction vector containing a single prediction per class. We choose the maximum value from this vector as the single output class.
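A small sketch of how prediction with such vector-valued nodes can be organised: every prediction node reached by the example contributes its per-class vector, and the class with the largest total is returned. The node structure used here is an assumption made for illustration, not the authors' data structure.

```python
class PredictionNode:
    def __init__(self, predictions, splitters=None):
        self.predictions = predictions        # one prediction value per class
        self.splitters = splitters or []      # list of (test_fn, yes_node, no_node) triples

def predict_class(root, x, num_classes):
    """Sum the prediction vectors of all nodes reached by x and return the argmax class."""
    totals = [0.0] * num_classes
    stack = [root]
    while stack:
        node = stack.pop()
        totals = [t + p for t, p in zip(totals, node.predictions)]
        for test, yes_node, no_node in node.splitters:
            stack.append(yes_node if test(x) else no_node)
    return max(range(num_classes), key=totals.__getitem__)
```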
4 LADTree Algorithm
We follow Friedman et al. [10] in defining the multiclass context. Namely, for an instance $i$ and a $J$-class problem, there are $J$ responses $y^*_{ij}$, each taking values in {−1, 1}. The predicted values, or indicator responses, are represented by the vector $F_j(x)$, which is the sum of the responses of all the ensemble classifiers on instance $x$ over the $J$ classes. The class probability estimate is computed from a generalization of the two-class symmetric logistic transformation to be:

$$p_j(x) = \frac{e^{F_j(x)}}{\sum_{k=1}^{J} e^{F_k(x)}}, \qquad \sum_{k=1}^{J} F_k(x) = 0 \qquad (1)$$
The LogitBoost algorithm can be fused with the induction of ADTrees in two ways, which will be explained in the following subsections. In the first, more conservative approach called LT1PC we grow separate trees for each class in parallel. In the second approach called LT, only one tree is grown predicting all class probabilities simultaneously.
4.1 LT1PC: Inducing One Tree per Class
The LADTree learning algorithm applies the logistic boosting algorithm in order to induce an alternating decision tree. As with the original algorithm, a single attribute test is chosen as the splitter node for the tree at each iteration. Stored with each training instance is a working response and weights on a per-class basis. The aim is to fit the working response to the mean value of the instances, in a particular subset, by minimising the least-squares value between them. When choosing tests to add to the tree we look for the maximum gain, that is, the greatest drop in the least-squares calculation. Note that in the algorithm below the $f_{mj}(x)$ vector is equivalent to the single prediction weight of a predictor node in the original ADTree algorithm. The algorithm is as follows:

1. Initialize. Create a root node with $F_j(x) = 0$ and $p_j(x) = \frac{1}{J}$ for all $j$.
2. Repeat for $m = 1, 2, \ldots, T$:
   (a) Repeat for $j = 1, \ldots, J$:
       (i) Compute working responses and weights in the $j$th class:
           $$z_{ij} = \frac{y^*_{ij} - p_{ij}}{p_{ij}(1 - p_{ij})}, \qquad w_{ij} = \frac{y^*_{ij} - p_{ij}}{z_{ij}}$$
       (ii) Add the single test to the tree that best fits $f_{mj}(x)$ by a weighted least-squares fit of $z_{ij}$ to $x_i$ with weights $w_{ij}$.
   (b) Add prediction nodes to the tree by setting
       $$f_{mj}(x) \leftarrow \frac{J-1}{J}\Big(f_{mj}(x) - \frac{1}{J}\sum_{k=1}^{J} f_{mk}(x)\Big), \qquad F_j(x) \leftarrow F_j(x) + f_{mj}(x)$$
   (c) Update $p_j(x)$ via Equation 1 above.
3. Output the classifier $\arg\max_j F_j(x)$.

With this algorithm, trees for the different classes are grown in parallel. Once all of the trees have been built, it is then possible to merge them into a final model. If the structure of the trees is such that few tests are common, the merged tree will mostly contain subtrees affecting only one class. The size of the tree cannot outgrow the combined size of the individual trees. The merging operation involves searching for identical tests on the same level of the tree. If such tests exist then the test and its subtrees can be merged into one. The additive nature of the trees means that the prediction values for the same class can be added together when merged.
4.2 LT: Directly Inducing a Single Tree
We can make a simple adjustment to this algorithm within Step 2 by moving Step (a)(ii) out to become Step (b). We then obtain a single directly induced tree, as follows:
2. Repeat for $m = 1, 2, \ldots, T$:
   (a) Repeat for $j = 1, \ldots, J$:
       (i) Compute working responses and weights in the $j$th class:
           $$z_{ij} = \frac{y^*_{ij} - p_{ij}}{p_{ij}(1 - p_{ij})}, \qquad w_{ij} = \frac{y^*_{ij} - p_{ij}}{z_{ij}}$$
   (b) Add the single test to the tree that best fits $f_{mj}(x)$ by a weighted least-squares fit of $z_{ij}$ to $x_i$ with weights $w_{ij}$.
   (c) Add prediction nodes to the tree by setting
       $$f_{mj}(x) \leftarrow \frac{J-1}{J}\Big(f_{mj}(x) - \frac{1}{J}\sum_{k=1}^{J} f_{mk}(x)\Big), \qquad F_j(x) \leftarrow F_j(x) + f_{mj}(x)$$
   (d) Update $p_j(x)$ via Equation 1 above.

The major difference to LT1PC is that in LT we attempt to simultaneously minimise the weighted mean squared error across all classes when finding the best weak hypothesis for the model.
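To make the per-class bookkeeping concrete, the following Python sketch computes the quantities used in one boosting iteration of LT/LT1PC, following the formulas above. Probabilities are assumed to lie strictly between 0 and 1 (implementations typically clip them), and all function names are ours.

```python
import math

def class_probabilities(F_row):
    """Equation (1): symmetric multiclass logistic transform of the scores F_j(x)."""
    exps = [math.exp(f) for f in F_row]
    total = sum(exps)
    return [e / total for e in exps]

def working_responses(y_star, p):
    """Per-class working responses z_ij and weights w_ij for one instance (Step (a)(i))."""
    z = [(ys - pj) / (pj * (1.0 - pj)) for ys, pj in zip(y_star, p)]
    w = [pj * (1.0 - pj) for pj in p]       # equal to (y*_ij - p_ij) / z_ij
    return z, w

def centre_predictions(f_m):
    """Symmetrise the fitted per-class predictions so that they sum to zero (Step (b)/(c))."""
    J = len(f_m)
    mean = sum(f_m) / J
    return [((J - 1) / J) * (fj - mean) for fj in f_m]
```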
5 Experimental Results
The datasets and their properties are listed in Table 1. The first set of ten datasets is used to compare ADTrees with LT as an algorithm for solving two-class problems. The remainder are used in multiclass experiments, ordered incrementally from the smallest number of classes (3) to the largest (26). Most of the datasets are from the UCI repository [2], with the exception of half-letter. Half-letter is a modified version of letter, where only half of the class labels (A-M) are present. In the case of the multiclass datasets, accuracy estimates on the first nine, which have fewer than eight classes, were obtained by averaging the results from 10 separate runs of stratified 10-fold cross-validation. In other words, each scheme was applied 100 times to generate an estimate for a particular dataset. For these datasets, we speak of two results as being "significantly different" if the difference is statistically significant at the 5% level according to a paired two-sided t-test, each pair of data points consisting of the estimates obtained in one ten-fold cross-validation run for the two learning schemes being compared. On the datasets with more than eight classes, a single train and test split was used. Statistical significance was measured by the McNemar [5] test. NA (for not available) in the results table signifies that the learning scheme did not finish training. If learning could not complete within the time period of a week then it was terminated and marked NA. It is not surprising that the exhaustive method did not finish above 16 classes when one considers the number of permutations required. Due to the presence of these unfinished experiments, the averages for all methods listed in this table exclude the last four datasets. Thus a fair comparison is possible.
Table 1. Datasets and their characteristics

Dataset         Classes  Instances (train/test)  Attributes  Numeric  Nominal
breast-wisc     2        699                     9           9        0
cleveland       2        303                     13          6        7
credit          2        690                     15          6        9
hepatitis       2        155                     19          6        13
ionosphere      2        351                     34          34       0
labor           2        57                      16          8        8
promoters       2        106                     57          0        57
sick-euthyroid  2        3163                    25          7        18
sonar           2        208                     60          60       0
vote            2        435                     16          0        16
iris            3        150                     4           4        0
balance-scale   3        625                     4           4        0
hypothyroid     4        3772                    29          7        22
anneal          6        898                     38          6        32
zoo             7        101                     17          1        16
autos           7        205                     25          15       10
glass           7        214                     9           9        0
segment         7        2310                    19          19       0
ecoli           8        336                     7           7        0
led7            10       1000/500                7           0        7
optdigits       10       3823/1797               64          64       0
pendigits       10       7494/3498               16          16       0
vowel           11       582/462                 12          10       2
half-letter     13       8000/1940               16          16       0
arrhythmia      16       302/150                 279         206      73
soybean         19       307/176                 35          0        35
primary-tumor   22       226/113                 17          0        17
audiology       24       200/26                  69          0        69
letter          26       16000/4000              16          16       0
Given the large number of options for solving multiclass problems using ADTrees we designed the following experiments to provide useful comparisons. First, we determine the best multiclass ADTree method by treating the induction as a two-class problem (Table 3). Second, we compare this method with AdaBoost.MH and the two LADTree methods described in the last section. Generally, it is difficult to compare all of these methods fairly in terms of the number of trees produced. For example, the 1-against-1 method produces J(J−1)/2 trees, 1-against-rest J, random 2·J, and exhaustive 2^(J−1) trees. LT1PC produces J trees while AdaBoost.MH and LT induce a single tree. Thus, it can be the case that the number of trees is greater than the number of boosting iterations; for example, the average number of trees produced by 1-against-1 over all nineteen multiclass datasets is 79. Unless otherwise stated, in all tables we compare methods against a fixed number of boosting iterations (10).
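The counting argument above can be restated in a few lines; the formulas are taken directly from the text, while the function itself is only a convenience for illustration.

```python
def trees_per_scheme(J):
    """Number of two-class trees each method builds for a J-class problem."""
    return {
        "1-against-1": J * (J - 1) // 2,
        "1-against-rest": J,
        "random": 2 * J,
        "exhaustive": 2 ** (J - 1),
        "LT1PC": J,
        "AdaBoost.MH / LT": 1,
    }

# For example, half-letter has J = 13 classes, so 1-against-1 builds 78 trees;
# boosted for 10 iterations each, that amounts to 780 tests in total.
```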
Table 2. Two-class problems: ADTree vs. LT

dataset         ADTree(10)  LT(10)
breast-wisc     95.61       95.65
cleveland       81.72       80.36
credit          84.86       85.04
hepatitis       79.78       77.65
ionosphere      90.49       89.72
labor           84.67       87.5
promoters       86.8        87.3
sick-euthyroid  97.71       97.85
sonar           76.65       74.12
vote            96.5        96.18

+, − statistically significant difference
Table 3 allows us to compare the results of each method on an overall basis through the average and on a pair-wise basis through the significance tests. Note that the significance tests are all performed with respect to the first column in the table. On both scales the exhaustive method is the best. As the exhaustive method is not practical for large class datasets we chose the 1-against-1 method to compare against LADTrees, as this method is very similar in overall performance.

Table 3. Wrapping two-class ADTree results

dataset        1vs1   1vsRest   Random    Exhaustive
iris           95.13  95.33     95.33     95.33
balance-scale  83.94  85.06 +   85.06 +   85.06 +
hypothyroid    99.61  99.63     99.64     99.64
anneal         99.01  98.96     99.05     99.19 +
zoo            90.38  93.45 +   95.05 +   95.94 +
autos          78.48  77.51     77.98     79.99 +
glass          75.90  74.33 −   73.79 −   76.76
segment        96.74  95.94 −   95.91 −   96.62
ecoli          83.31  83.96     84.69 +   85.95 +
led7           75.40  74.40     76.40     75.60
optdigits      92.49  90.26 −   92.21 −   93.82
pendigits      94.11  91.48 −   86.16 −   89.54 −
vowel          47.40  41.13 −   48.48 +   50.65
half-letter    88.71  80.77 −   76.13 −   80.98 −
arrhythmia     68.00  66.00 −   66.00     68.00
soybean        89.36  89.10     89.36     NA
primary-tumor  46.90  43.36 −   46.90     NA
audiology      76.92  80.77     84.62     NA
letter         85.98  70.63 −   65.20 −   NA
average        84.57  83.21     83.46     84.87

+, − statistically significant difference to 1vs1

Table 4 compares the "winner" of Table 3 (1-against-1) to AdaBoost.MH and both versions of LADTrees of various sizes. It demonstrates the improvements that can be made by increasing the number of boosting iterations for the single tree methods LT and AdaBoost.MH as they generate tree sizes closer to the number generated by 1-against-1. The 1-against-1 method defeats each of the small tree methods at 10 boosting iterations. But when the number of iterations is increased to 100 tests each, we notice a dramatically different picture: all methods outperform the 1-against-1 method. Consider the 100 iteration case: 1-against-1 is boosted 10 times but produces J(J−1)/2 trees, which represents an average tree size of 790 (tests). LT and AdaBoost.MH outperform this method on average after 100 iterations (i.e. using trees with 100 tests). Table 4 shows that LT(100) outperforms 1-against-1 on most of the early datasets (class sizes 3-13) but struggles against two of the later datasets. For soybean 1-against-1 uses a tree of size 1710, and for primary-tumor it uses a tree of size 2310. Perhaps the most remarkable result is for half-letter, where 1-against-1 using 780 tests has an accuracy of 88.71% whereas LT(100) achieves 92.16% using only 100 tests. Clearly, both on an overall average and on a per dataset basis, AdaBoost.MH and LT are comparable methods. There are no obvious performance differences between these methods at 10 and 100 iterations. Table 4 also compares the two logistic methods. Due to the number of trees used by LT1PC it outperforms LT both on average and on pairwise tests. But these differences seem to disappear as the number of iterations increases: at 10 boosting iterations LT1PC wins on 11 datasets and has 4 losses; at 100 boosting iterations LT1PC has only 4 significant wins and 3 losses.
Table 4. LADTree and AdaBoost.MH results

dataset        1PC(10)   LT(10)    MH(10)    1PC(100)  LT(100)  MH(100)
iris           95.07     94.20 −   94.93     95.13     95.13    95.13
balance-scale  88.80 +   84.50     84.21     86.53 +   90.40    90.82
hypothyroid    99.49 −   99.59     99.57 −   99.55 −   99.62    99.63
anneal         99.44 +   98.50 −   97.41 −   99.62 +   99.66    99.72
zoo            92.95 +   94.34 +   94.55 +   92.35 +   94.53    94.34
autos          81.12 +   64.57 −   69.92 −   82.71 +   82.43    82.69
glass          71.81 −   67.95 −   66.65 −   77.05     75.51    73.97
segment        96.68     92.27 −   93.14 −   97.99 +   97.84    97.72
ecoli          82.44     84.64 +   84.40     84.27 +   83.54    83.99
led7           75.20     77.60     72.80     75.00     73.60    74.00
optdigits      91.32     78.63 −   77.69 −   95.77 +   94.94    94.49
pendigits      91.65 −   78.53 −   78.24 −   96.74 +   96.51    96.00
vowel          39.61 −   34.85 −   34.85     48.05     46.54    46.54
half-letter    83.92 −   66.80 −   65.36 −   95.00 +   92.16    91.65
arrhythmia     70.00     64.67     64.67     68.67     66.67    67.33
soybean        90.43     81.38 −   79.79 −   85.90     83.51    92.82
primary-tumor  34.51 −   43.36     42.48     33.63 −   42.48    45.13
audiology      80.77     80.77     88.46     76.92     76.92    80.77
letter         76.78 −   50.53 −   44.25 −   93.25 +   86.78    84.80
average        81.16     75.67     75.44     83.38     83.09    83.77

+, − statistically significant difference to 1vs1

6 Conclusions

This paper has presented new algorithms for inducing alternating decision trees in the multiclass setting. Treating the multiclass problem as a number of binary classification problems and using the two-class ADTree method produces accurate results from large numbers of trees. Although ADTrees can be merged, the size of the combined tree prohibits its use as a practical method, especially if interpretable models are a requirement. Using AdaBoost.MH for multiclass problems was thought to be problematic. The theoretical objections to this method presented in [10] appear to be based on a misinterpretation of AdaBoost.MH. Our experimental results demonstrate that this method is competitive with LogitBoost in the multiclass setting, at least for ADTrees. Two new algorithms, LT1PC and LT, for inducing ADTrees using LogitBoost are presented. One method induces a single tree per class, the other a single tree, optimised across all classes. In experimental results comparing these methods to
1-against-1, the best of the wrapper methods, both LADTree methods LT1PC and LT and AdaBoost.MH show significant promise, especially when we consider the relative sizes of the induced trees. From a different point of view one can also argue that the LADTree and AdaBoost.MH methods are the first direct induction methods for multiclass option trees, a hitherto unsolved problem. Previous attempts [4,12] were plagued by the need to specify multiple parameters, and also seemed to contradict each other in their conclusion of why and where in a tree options (i.e. alternatives) were beneficial. Contrary to these attempts, the LADTree and AdaBoost.MH methods have only a single parameter, the final tree size, and automatically add options where they seem most beneficial. A research problem that deserves attention is the determination of the stopping condition T for boosting methods. Freund and Mason [7] use cross-validation with some success but this method is impractical for large datasets. One possible solution is to use out-of-bag samples to determine if adding new tests will continue to increase performance. This will be a topic of future work.

Acknowledgements

We would like to thank the anonymous referees for making us re-address the results we had earlier achieved with our first implementation of AdaBoost.MH.
This uncovered what appears to be a common misunderstanding of how to implement this method.
References

1. Erin Allwein, Robert Schapire, and Yoram Singer. Reducing multiclass to binary: A unifying approach for margin classifiers. Journal of Machine Learning Research, 1:113–141, 2000.
2. C. Blake, E. Keogh, and C. J. Merz. UCI repository of machine learning databases. Technical report, University of California, Department of Information and Computer Science, Irvine, CA, 1998. [www.ics.uci.edu/~mlearn/MLRepository.html]
3. Leo Breiman. Arcing classifiers. The Annals of Statistics, 26(3):801–849, 1998.
4. Wray Buntine. Learning classification trees. Statistics and Computing, 2:63–73, 1992.
5. Thomas G. Dietterich. Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation, 10(7):1895–1923, 1998.
6. Thomas G. Dietterich and Ghulum Bakiri. Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research, 2:263–286, 1995.
7. Yoav Freund and Llew Mason. The alternating decision tree learning algorithm. In Proc. 16th Int. Conf. on Machine Learning, pages 124–133. Morgan Kaufmann, 1999.
8. Yoav Freund and Robert E. Schapire. Experiments with a new boosting algorithm. In Proc. 13th Int. Conf. on Machine Learning, pages 148–156. Morgan Kaufmann, 1996.
9. Jerome Friedman. Another approach to polychotomous classification. Technical report, Stanford University, Department of Statistics, 1996.
10. Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Additive logistic regression: a statistical view of boosting. The Annals of Statistics, 28(2):337–374, 2000.
11. Johannes Fürnkranz. Round robin classification. Journal of Machine Learning Research, 2:721–747, 2002.
12. Ron Kohavi and Clayton Kunz. Option decision trees with majority votes. In Proc. 14th Int. Conf. on Machine Learning, pages 161–169. Morgan Kaufmann, 1997.
13. Robert E. Schapire. Using output codes to boost multiclass learning problems. In Proc. 14th Int. Conf. on Machine Learning, pages 313–321. Morgan Kaufmann, 1997.
14. Robert E. Schapire and Yoram Singer. Improved boosting algorithms using confidence-rated predictions. In Proc. 11th Conf. on Computational Learning Theory, pages 80–91. ACM Press, 1998.
Possibilistic Induction in Decision-Tree Learning

Eyke Hüllermeier

Department of Mathematics and Computer Science, University of Marburg, Germany
[email protected]
Abstract. We propose a generalization of Ockham’s razor, a widely applied principle of inductive inference. This generalization intends to capture the aspect of uncertainty involved in inductive reasoning. To this end, Ockham’s razor is formalized within the framework of possibility theory: It is not simply used for identifying a single, apparently optimal model, but rather for concluding on the possibility of various candidate models. The possibilistic version of Ockham’s razor is applied to (lazy) decision tree learning.
1 Introduction
Inductive reasoning – by its very nature – is inseparably connected with uncertainty [4]. To begin with, the data presented to learning algorithms is imprecise, incomplete or noisy most of the time, a problem that can badly mislead a learning procedure. But even if observations are perfect, the generalization beyond that data is still afflicted with uncertainty. For example, observed data can generally be explained by more than one candidate theory, which means that one can never be sure of the truth of a particular model. In fact, the insight that inductive inference can never produce ultimate truth can be traced back at least as far as Francis Bacon's epistemology. In his Novum Organum (published in 1620), Bacon advocates a gradualist conception of inductive enquiry and proposes to set up degrees of certainty. Thus, from experience one may at best conclude that a theory is likely to be true – not, however, that it is true with full certainty. In machine learning and mathematical statistics, uncertainty is often handled by means of probabilistic methods. In Bayesian approaches, for example, the data-generating process is modeled by means of a probability distribution which depends on the true model. Given the data S, a (posterior) probability (density) can thus be assigned to each model M ∈ M, where M is the class of candidate models. The specification of a probability distribution, µ, over that class of models allows one to take the uncertainty related to the learning (prediction) task into account. For example, rather than making a single prediction $y_0 = M^*(x_0)$ on the basis of a particular model M* (and a given query $x_0$), one can derive a probability $\Pr(y) = \mu(\{M \in \mathcal{M} \mid M(x_0) = y\})$ for each potential outcome y. Probabilistic approaches are not always applicable, however, and they do not capture every kind of uncertainty relevant to machine learning. Particularly, this
appears to be true for the uncertainty or, say, unreliability connected to heuristic principles of inductive inference such as Ockham’s razor. Such principles usually suggest one particular model M ∗ ∈ M, thereby disregarding the aspect of uncertainty. Our aim in this paper is to alleviate this drawback by means of a possibilistic approach to inductive inference. More specifically, we shall propose a formalization of Ockham’s razor within the framework of possibility theory. In its generalized version, Ockham’s razor specifies the possibility of alternative models rather than selecting one particular model. Section 2 recalls some basic principles of decision tree learning. In Section 3, the possibilistic version of Ockham’s razor is introduced. The application of this generalized principle to classical decision tree learning and to a lazy variant thereof are discussed, respectively, in Sections 4 and 5. Finally, Section 6 presents some experimental results.
2 Decision Tree Learning
We proceed from the common framework for learning from examples: X denotes the instance space, where an instance corresponds to the description x of an object in attribute–value form. That is, each object x is characterized through attribute values $\alpha_i(x) \in A_i$, $1 \le i \le k$, where $A_i = \mathrm{dom}(\alpha_i)$ is the (finite) domain of the i-th attribute $\alpha_i$; the set of all attributes is denoted A. $L = \{\lambda_1, \ldots, \lambda_m\}$ is a set of labels, and $\langle x, \lambda_x \rangle$ is called a labeled instance or an example. S denotes a sample that consists of n labeled instances $\langle x_i, \lambda_{x_i} \rangle$, $1 \le i \le n$. Finally, a new instance (query) $x_0 \in X$ is given, whose label $\lambda_{x_0}$ is to be estimated. The basic principle underlying most decision tree learners, well-known examples of which include the ID3 algorithm [12] and its successor C4.5 [13] as well as the CART system [2], is that of partitioning the set of given examples, S, in a recursive manner. Each inner node η of a decision tree τ defines a partition of a subset $S_\eta \subset S$ of examples assigned to that node. This is done by classifying elements $x \in S_\eta$ according to the value of a specific attribute α. The attribute is selected according to a measure of effectiveness in classifying the examples, thereby supporting the overall objective of constructing a small tree. A widely applied "goodness of split" measure is the information gain, $G(S, \alpha)$, which is defined as the expected reduction in entropy (impurity) which results from partitioning S according to α:

$$G(S, \alpha) = \mathrm{ent}(S) - \sum_{u \in \mathrm{dom}(\alpha)} \frac{|S_u|}{|S|} \cdot \mathrm{ent}(S_u), \qquad (1)$$

where $S_u = \{\langle x, \lambda_x \rangle \in S \mid \alpha(x) = u\}$. The entropy of a set S is given by

$$\mathrm{ent}(S) = -\sum_{\lambda \in L} q_\lambda \cdot \log_2(q_\lambda), \qquad (2)$$

where $q_\lambda = \mathrm{card}(\{\langle x, \lambda_x \rangle \in S \mid \lambda_x = \lambda\}) \cdot \mathrm{card}(S)^{-1}$. Besides, a number of other selection measures have been devised. See [11] for an empirical comparison of such measures.
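For illustration, the two measures can be computed as follows. This is a minimal Python sketch; representing the sample as (x, label) pairs with attribute-value dictionaries is an assumption made only for the example.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Equation (2): class entropy of a sample, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(sample, attribute):
    """Equation (1): expected entropy reduction when splitting `sample` on `attribute`."""
    labels = [lab for _, lab in sample]
    partitions = defaultdict(list)
    for x, lab in sample:
        partitions[x[attribute]].append(lab)
    remainder = sum(len(part) / len(sample) * entropy(part) for part in partitions.values())
    return entropy(labels) - remainder
```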
Since decision tree induction is by now a well-known method, we shall restrict ourselves to a concise exposition of the basic algorithm underlying ID3 and C4.5. This algorithm derives a decision tree in a top-down fashion by means of the following heuristic (greedy) strategy:

– The complete set of training samples, S, is assigned to the root of the tree.
– A node η becomes a leaf (answer node) of the tree if all associated samples $S_\eta$ belong to the same class λ. In this case, η is assigned the label λ. (In the case of noisy data, it may happen that all attributes have already been used along the path from the root of the tree to η, though not all samples have the same label.)
– Otherwise, node η becomes a decision node: It is split by partitioning the associated set $S_\eta$ of examples. This is done by selecting an attribute (among those that have not been used so far) as described above and by classifying the samples $x \in S_\eta$ according to the values α(x). Each element of the resulting partition defines one successor node.

Once the decision tree has been constructed, each path can be considered as a rule. The antecedent of a rule is a conjunction of conditions of the form $\alpha_i(x) = u_i$, where $\alpha_i$ is an attribute and $u_i \in \mathrm{dom}(\alpha_i)$ a specific value thereof. The conclusion part determines a value for the class variable. New examples are then classified on the basis of these rules, i.e. by looking at the class label of the leaf node whose attribute values match the description of the example.
3 A Possibilistic Version of Ockham's Razor

3.1 Possibility Theory
Here we briefly review some aspects of possibility theory without going into technical detail. Possibility theory [7] is an alternative calculus for modeling and processing uncertainty or, more generally, partial belief. Possibility theory makes a distinction between the concepts of certainty (necessity) and plausibility (possibility) of an event. As opposed to probability theory, it does not claim that the confidence in an event is determined by the confidence in the complement of that event. Consequently, possibility theory is non-additive. In fact, the basic axiom of possibility theory involves the maximum-operator rather than the arithmetic sum: $\Pi(A \cup B) = \max(\Pi(A), \Pi(B))$. In plain words, the possibility of the union (disjunction) of two events A and B is the maximum of the respective possibilities of the individual events. A possibility measure Π on $2^X$ (satisfying Π(X) = 1 and Π(∅) = 0) is related to a possibility distribution $\pi : X \to V$ via $\Pi(A) = \sup_{x \in A} \pi(x)$. V is a totally ordered scale which is usually taken as the unit interval [0, 1]. However, V can also be a purely qualitative scale, in which case π(x) < π(y) simply means that y is more plausible than x. A so-called necessity measure N, defined by $N(A) = 1 - \sup_{x \in X \setminus A} \pi(x)$ for all A ⊆ X, is associated with a possibility measure Π. A necessity measure satisfies $N(A \cap B) = \min(N(A), N(B))$.
Where does a possibility distribution come from? Originally, the idea of Zadeh [14] was to induce a possibility distribution from vague linguistic information, as represented by a fuzzy set. For example, the uncertainty related to the vague statement that "x is a small positive integer" translates into a distribution which lets x = 1 appear fully plausible (π(1) = 1), whereas, say, 5 is regarded as more or less plausible (π(5) = 1/2) and 10 as impossible (π(10) = 0); the specific definition of π clearly depends on the context. More generally, a possibility distribution can be induced by a flexible constraint: Consider a set A of alternatives and suppose information about an element $a_0 \in A$ of interest to be given, expressed in the form of a constraint. Usually, a constraint completely excludes some alternatives a ∈ A and can hence be identified with a subset C ⊆ A of still admissible candidates. A flexible constraint may exclude alternatives to a certain extent. A possibility degree π(a) is then understood as the plausibility that remains of alternative a given the constraint. Note that two constraints are naturally combined by intersection. The possibilistic counterpart to this kind of conjunctive operation is the (pointwise) minimum, i.e. the combination of two possibility distributions $\pi_1$ and $\pi_2$ into a new distribution $\pi : x \mapsto \min\{\pi_1(x), \pi_2(x)\}$. In the following section, we shall look at Ockham's razor as a flexible constraint. More generally, our view of a heuristic inductive reasoning principle is that of a constraint which may exclude a model (from the class of candidate models) to a certain degree.

3.2 Ockham's Possibilistic Razor
According to Ockham's razor, a simple model is to be preferred to a more complex one. In the context of decision trees, simplicity is usually equated with size and, hence, one tries to find the smallest tree among those consistent with the data. Note that the heuristic divide and conquer algorithm outlined in Section 2 only finds an approximation to this tree. Of course, what we actually desire is the true model, and the assumption underlying Ockham's razor is that a simple model is more likely to be true than a complex one if both explain the data equally well. Even though this assumption is not very well settled from a theoretical point of view, it is intuitively appealing and has proved its worth in practice [5]. Now, consider two decision trees τ* and τ, where τ is only slightly more complex than τ*. In such a case, one would generally not completely reject τ. Indeed, when taking the "more likely to" in the above formulation of Ockham's razor seriously, then τ should be assigned a certain degree of possibility as well. This, in turn, should be taken into account when making inferences about new objects. More generally, this possibilistic interpretation of Ockham's razor suggests defining a possibility distribution $\pi_M$ over the class of models M, where the possibility $\pi_M(\tau)$ depends on the simplicity of τ in comparison to the simplicity of the simplest (and hence most plausible) model τ*:

$$\pi_M(\tau) = \pi_M(\tau \mid S) = \begin{cases} 0 & \text{if } \tau \text{ is not consistent} \\ f(|\tau|, |\tau^*|) & \text{otherwise} \end{cases} \qquad (3)$$

where |τ| denotes the complexity of τ (a model τ is consistent if $\tau(x) = \lambda_x$ for all instances $\langle x, \lambda_x \rangle \in S$). Letting $\pi_M(\tau^*) = 1$ for at least one τ* ∈ M means that at least one model is fully plausible; this can be seen as a kind of closed world assumption. More generally, one might allow that $\pi_M(\tau) < 1$ for all τ ∈ M, suggesting that none of the candidate models is fully plausible. A possibilistic prediction, that is a possibility distribution over the class of labels L, can then be obtained by applying the well-known extension principle:

$$\pi_L(\lambda) = \pi_L(\lambda \mid x_0) = \sup\{\pi_M(\tau) \mid \tau(x_0) = \lambda\}. \qquad (4)$$

Needless to say, the computation of the possibility measure (3) is generally not tractable, as it requires the consideration of all (consistent) models. Apart from that, one will often not be interested in the possibility degrees of all models, but only in those models with a high degree of possibility. In the following section, we shall propose a heuristic approach which is a generalization of recursive partitioning: The problem of inducing a decision tree is decomposed into sub-problems in a hierarchical way, and the possibility of a tree τ is derived from the possibilities of its sub-trees.
4 Generalized Decision Tree Learning
Recall that the selection of an attribute in decision tree learning is made on the basis of a measure such as (1). Now, suppose that $G(S_\eta, \alpha^*)$ is quite large for the apparently optimal attribute α*, whereas $G(S_\eta, \alpha)$ is rather small for all remaining attributes. Taking the adequacy of the decision tree approach for granted, one can then be quite sure that α* is indeed the "correct" selection (problem decomposition) at this place. However, if $G(S_\eta, \alpha)$ is close to $G(S_\eta, \alpha^*)$ for some alternative attribute α, it is reasonable to say that α appears possible to a certain extent as well. More specifically, one might define a degree of possibility $\pi_A(\alpha \mid S_\eta)$ for each attribute α on the basis of the set of measures $\{G(S_\eta, \alpha) \mid \alpha \in A\}$, for example

$$\pi_A(\alpha) = \pi_A(\alpha \mid S_\eta) = \max\{0,\; 1 - c\,(G(S_\eta, \alpha^*) - G(S_\eta, \alpha))\}, \qquad (5)$$

where c > 0. In order to guarantee a meaningful interpretation of the difference $G(S_\eta, \alpha^*) - G(S_\eta, \alpha)$, the measure G(·) is assumed to be normalized such that 0 ≤ G(·) ≤ 1, with 1 being the best evaluation. This idea suggests the following generalization of the algorithm for decision tree induction: At a node η, a recursive partitioning is not only made for the best attribute α* but rather for all attributes in the set

$$A^*_\eta = \{\alpha \in A_\eta \mid \pi_A(\alpha) > \Delta\} \qquad (6)$$
of candidates whose possibility exceeds a lower threshold ∆. More precisely, a possibilistic branching is realized as follows: For each attribute $\alpha \in A^*_\eta$ and each value u ∈ dom(α), one outgoing edge is added to η. This edge is marked with the test α = u and the possibility degree $\pi_A(\alpha)$. Thus, one obtains a possibilistic tree or, say, a meta-tree T in which an instance can branch at a node in different directions. T actually consists of several ordinary trees τ. In fact, an ordinary tree is obtained by retaining at each (meta-)node η only those edges associated with a single attribute and by deleting all other edges. The possibility of a tree, $\pi_M(\tau)$, is determined by the smallest possibility of its edges.
4.1 Classification with Possibilistic Trees
Now, suppose that a new query $x_0$ is to be classified. Given the possibility distribution $\pi_M(\cdot)$ as defined above, a possibilistic prediction of the label $\lambda_{x_0}$ can be derived from (4). However, a more efficient approach is to propagate possibility degrees in the meta-tree T directly. To this end, define possibility distributions $\pi^\eta_L$ for nodes η in a recursive way as follows: If η is a leaf node, then $\pi^\eta_L$ is defined by

$$\pi^\eta_L : \lambda \mapsto \begin{cases} 1 & \text{if } \eta \text{ is labeled with } \lambda \\ 0 & \text{otherwise} \end{cases}$$

Otherwise, let $\eta_1, \ldots, \eta_r$ be the successor nodes of η, and suppose the edge leading from η to $\eta_i$ be marked with the possibility degree $p_i$. The distribution associated with η is then given by

$$\pi^\eta_L : \lambda \mapsto \max_{1 \le i \le r} \min\{\pi^{\eta_i}_L(\lambda), p_i\}. \qquad (7)$$
The possibility distribution $\pi_L = \pi_L(\cdot \mid x_0)$ is defined to be the possibility distribution $\pi^{\eta_0}_L$ associated with the root $\eta_0$ of the meta-tree. Proposition 1. The propagation of possibility degrees in the meta-tree yields the same possibilistic prediction $\pi_L(\cdot \mid x_0)$ as the extension principle (4). Proof. Let $\pi_L$ be the possibility distribution derived from the propagation of possibility degrees in the meta-tree T. Moreover, consider a label λ ∈ L and let $p = \pi_L(\lambda)$. If p = 0 then none of the leaf nodes in T is labeled with λ, and the proposition is obviously correct. Now, let p > 0. The definition (7) of distributions associated with nodes entails the existence of a path $\rho^* = (\eta_1, \ldots, \eta_k)$ in T such that the following holds: (1) $\eta_1$ is the root of T and $\eta_k$ is a leaf node with label λ. (2) The possibility $\pi(\rho^*)$ of the path $\rho^*$, that is the minimum of the possibility degrees assigned to the edges $(\eta_i, \eta_{i+1})$, $1 \le i < k$, is given by p. Moreover, $\pi(\rho) \le p$ for all other paths ρ in the meta-tree whose leaf nodes are labeled with λ. Now, it is easily verified that the path $\rho^*$ can be completed to an ordinary decision tree τ such that $\pi_M(\tau) = p$. In fact, at each node η in the meta-tree T there is an attribute α such that all edges associated with that attribute are
labeled with the possibility degree 1. Thus, the path $\rho^*$ can be extended to a tree τ such that each edge of τ which is not an edge of $\rho^*$ is labeled with a possibility degree of 1. Therefore, $\pi_M(\tau) = p$, which means that the possibility of λ according to (4) is at least p. Clearly, (4) cannot be larger than p, since this would imply the existence of a tree τ which assigns $x_0$ the label λ and whose edges all have possibility degrees larger than p. This tree therefore contains a path ρ whose leaf node is labeled with λ and such that π(ρ) > p, a contradiction to the definition of $\rho^*$. Therefore, the possibility of λ according to (4) is also given by p. ✷

Using the classification scheme outlined above, a single estimated class label $\lambda_0$ as predicted by an ordinary decision tree is replaced by a prediction in the form of a possibility distribution $\pi_L$ over the set of labels. This distribution is normalized in the sense that $\max_{\lambda \in L} \pi_L(\lambda) = 1$. Note that the label $\lambda^*_0$ with $\pi_L(\lambda^*_0) = 1$ is unique unless there is an exact equivalence $G(S_\eta, \alpha_i) = G(S_\eta, \alpha_j)$ for a node η and two attributes $\alpha_i \ne \alpha_j$. If $\lambda^*_0$ is unique, it is just the label predicted by the classical approach to decision tree induction. The distribution $\pi_L$ reflects the uncertainty related to the classification: $\lambda^*_0$ is the most plausible classification and will generally be chosen if a definite decision must be made. However, there might be further possible candidates as well, and the related possibility degrees indicate the reliability of $\lambda^*_0$. Formally, reliability is reflected by the necessity degree of $\lambda^*_0$, given by $1 - \max_{\lambda \ne \lambda^*_0} \pi_L(\lambda)$: If there is at least one other label with a rather high degree of possibility, the situation is ambiguous. A classification (on the basis of a decision tree) might then be rejected. More generally, one might take action on the basis of a set-valued prediction including the maximally plausible labels, or take this set as a point of departure for the acquisition of further information. The approach proposed here is related to other extensions of decision tree learning. Especially, the idea of option decision trees [3,9], which also provide a compact representation of a class of candidate decision trees, is worth mentioning in this connection. There are, however, some important differences between the two methods. For example, the outcomes at an option node are combined to a unique choice, e.g. by means of a majority vote. As opposed to this, our approach considers different choices with different degrees of possibility.
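The following Python sketch illustrates the two ingredients of the classification scheme just described: attribute possibilities in the spirit of Equation (5) and the max-min propagation of Equation (7). The MetaNode structure and all names are illustrative assumptions, not the paper's implementation.

```python
class MetaNode:
    def __init__(self, label=None, edges=None):
        self.label = label            # set for leaf (answer) nodes
        self.edges = edges or []      # list of (test_fn, possibility, child) triples

def attribute_possibilities(gains, c=1.0):
    """Equation (5): possibility of each attribute, relative to the best normalised gain."""
    best = max(gains.values())
    return {a: max(0.0, 1.0 - c * (best - g)) for a, g in gains.items()}

def propagate(node, x, labels):
    """Equation (7): max-min propagation of possibility degrees for a query x."""
    if node.label is not None:
        return {lam: (1.0 if lam == node.label else 0.0) for lam in labels}
    dist = {lam: 0.0 for lam in labels}
    for test, poss, child in node.edges:
        if test(x):                                   # follow only edges whose test x satisfies
            child_dist = propagate(child, x, labels)
            for lam in labels:
                dist[lam] = max(dist[lam], min(child_dist[lam], poss))
    return dist
```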
4.2 Alternative Aggregation Procedures
Consider a meta-tree T and let $P = P_{x_0}$ denote the class of paths ρ in T that are matched by the new query $x_0$ (where $x_0$ matches a path if it satisfies all tests $\alpha_i(x_0) = u_i$ along that path). In agreement with the common max-min calculus of possibility theory we have defined the possibility of a path $\rho = (\eta_1, \ldots, \eta_k)$ as

$$\pi_P(\rho) = \min_{1 \le i < |\rho|} \mathrm{poss}((\eta_i, \eta_{i+1})), \qquad (8)$$
where $\mathrm{poss}((\eta_i, \eta_{i+1}))$ denotes the possibility degree assigned to the edge $(\eta_i, \eta_{i+1})$. Moreover, the possibility of a label λ was determined as

$$\pi_L(\lambda) = \max_{\rho \in P : \, l(\rho) = \lambda} \pi_P(\rho), \qquad (9)$$
where l(ρ) is the label of ρ's leaf node (max ∅ = 0 by definition). The minimum in (8) and the maximum in (9) are special types of aggregation operators. In fact, the minimum actually serves as a kind of conjunctive aggregation function, whereas the maximum is a special type of disjunctive operator. These aggregation functions can be replaced by more general operators, namely by a generalized (logical) conjunction, called a t-norm, and a generalized disjunction, called a t-conorm. A t-norm is a binary operator $\otimes : [0,1]^2 \to [0,1]$ which is commutative, associative, monotone increasing in both arguments and which satisfies the boundary conditions $x \otimes 0 = 0$ and $x \otimes 1 = x$. An associated t-conorm is defined by the mapping $(\alpha, \beta) \mapsto 1 - (1-\alpha) \otimes (1-\beta)$. As can be seen, ⊗ = min is a special t-norm with associated t-conorm ⊕ = max. Other important operators include the product $\otimes_P : (\alpha, \beta) \mapsto \alpha\beta$ with related t-conorm $\oplus_P : (\alpha, \beta) \mapsto \alpha + \beta - \alpha\beta$ and the Lukasiewicz t-norm $\otimes_L : (\alpha, \beta) \mapsto \max\{0, \alpha + \beta - 1\}$ with related t-conorm $\oplus_L : (\alpha, \beta) \mapsto \min\{1, \alpha + \beta\}$. Replacing min and max by a t-norm ⊗ and a t-conorm ⊕ yields

$$\pi_P(\rho) = \bigotimes_{1 \le i < |\rho|} \mathrm{poss}((\eta_i, \eta_{i+1})), \qquad (10)$$

$$\pi_L(\lambda) = \bigoplus_{\rho \in P : \, l(\rho) = \lambda} \pi_P(\rho). \qquad (11)$$
As opposed to max and min, which are in agreement with the interpretation of possibility distributions as generalized constraints, most other operators are compensatory. For example, the possibility of a path is completely determined by its weakest edge according to (8), whereas several strong edges might compensate for this edge when using (10). Likewise, a label supported by several moderately possible paths might be preferred to a label supported by one very plausible path when using an operator such as the probabilistic sum $\oplus_P : (\alpha, \beta) \mapsto \alpha + \beta - \alpha\beta$. Note that the label $\lambda^*_0$ estimated by an ordinary decision tree τ will always have a possibility degree of $\pi_L(\lambda^*_0) = 1$ in the possibilistic extension. In fact, the path ρ in τ which is matched by $x_0$ has a possibility degree of 1 in the meta-tree T. Thus, $\pi_L(\lambda^*_0) = 1$ follows immediately from $\alpha \oplus 1 = 1$, which holds true for every t-conorm ⊕ and all 0 ≤ α ≤ 1. Now, however, it may happen that a label λ is also regarded as fully possible, even though there is no completely plausible path (classification sequence) that yields λ as a label. For example, suppose λ to be supported by at least k paths ρ with possibility $\pi_P(\rho) \ge 1/k$. When using the Lukasiewicz t-conorm as an aggregation, one then obtains $\pi_L(\lambda) = 1$.
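As a small illustration, the operators mentioned above and the generalized aggregations (10) and (11) can be written as follows; the possibility degrees of the edges and paths are assumed to be given.

```python
from functools import reduce

def t_norm_product(a, b):        return a * b
def t_conorm_product(a, b):      return a + b - a * b        # probabilistic sum
def t_norm_lukasiewicz(a, b):    return max(0.0, a + b - 1.0)
def t_conorm_lukasiewicz(a, b):  return min(1.0, a + b)

def path_possibility(edge_degrees, t_norm=min):
    """Equation (10): combine the possibility degrees along one path."""
    return reduce(t_norm, edge_degrees, 1.0)

def label_possibility(path_degrees, t_conorm=max):
    """Equation (11): combine the possibilities of all paths ending in the same label."""
    return reduce(t_conorm, path_degrees, 0.0)

# Three paths of possibility 1/3 each already yield full possibility under the
# Lukasiewicz t-conorm: label_possibility([1/3, 1/3, 1/3], t_conorm_lukasiewicz) == 1.0
```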
5 Lazy Decision Tree Learning
Needless to say, a generalized (possibilistic) decision tree T as outlined in Section 4 can become quite awkward. In fact, passing from an ordinary tree to a
possibilistic tree might easily result in a doubling or trebling of the (average) branching factor. In this connection, the idea of a lazy decision tree learner as outlined in [8] is quite interesting: In classical decision tree learning, test attributes are chosen so as to minimize the average impurity of the children of a node, thereby supporting the overall objective of maximizing average performance. However, a decision tree thus induced might not be optimally adapted to a specific query x0 . For example, the entropy of the child relevant for x0 might well increase, even though the average entropy decreases. Lazy decision tree induction applies the idea of lazy learning [1] to decision trees. Roughly speaking, only a single path of an imaginary decision tree is generated, namely the path which is matched by the query x0 . This allows for selecting the test attributes in a manner which is most favorable for the specific instance x0 .5 More precisely, the method proposed in [8] – called LazyDT by the authors – works as follows: As usual, the complete set of training samples, S, is assigned to the root of the tree. A node η becomes a leaf (answer node) if all associated samples Sη belong to the same class or if all attributes have already been used along the path from the root to η. Otherwise, the sample Sη associated with η is split according to the values of an attribute. As an evaluation measure for attributes α, a modified version G∗ of the information gain (1) is proposed: Firstly, G∗ is computed for the sub-sample Su with u = α(x) alone, not as a weighted average over all sub-samples. Secondly, the instances at a node η are weighted such that each class has equal weight, which means that the parent node has maximal entropy (see [8] for a justification of this approach). Once having identified an optimal attribute α∗ , the procedure is called recursively for the sub-sample Sα∗ (x0 ) . Apart from conceptual advantages in comparison to classical decision tree learning, this approach is interesting in our context since it avoids the generation of a complete (meta-)tree: Even though the individual path generated by the lazy learner becomes a “possibilistic path”, that is an ordinary tree, within our approach, it can be handled much more efficiently than a meta-tree. The possibilistic version of LazyDT – call it PLazyDT – performs in the same way as the original approach, with the following exceptions: At a node η, a degree of possibility πA (α) is derived for all (still available) attributes α. This is done as in Section 4, using a normalized version of the G∗ measure. Then, one successor node ηα is defined for each attribute α ∈ A∗η . The subsample assigned to ηα is the set of samples x, λx ∈ Sη such that α(x) = α(x0 ). While generating a path ρ, the possibility degrees πA (α) (assigned to edges of that path) are accumulated using the minimum operator or, more generally, a t-norm as proposed in Section 4.2. When reaching the leaf node of ρ, one thus obtains a predicted label l(ρ) along with a possibility degree πP (ρ). Finally, the possibility πL (λ) of a label λ is obtained by combining the possibility degrees of all paths ρ with l(ρ) = λ, using the maximum or an alternative t-conorm. 5
5 Note that a lazy learner needs to store all observations.
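The following rough Python sketch (ours, not the authors' implementation) illustrates the PLazyDT expansion just described. Instances are assumed to be dictionaries mapping attributes to values; `possibility_of_attribute` stands in for the normalized G∗ measure of Section 4, and an absolute threshold replaces the selection of A∗η.

```python
# Rough, simplified sketch of the PLazyDT idea described above (assumptions:
# instances are dicts attribute -> value; `possibility_of_attribute` plays the
# role of the normalized G* measure; an absolute threshold replaces A*_eta).

def plazy_classify(x0, samples, attributes, possibility_of_attribute,
                   tnorm=min, tconorm=max, threshold=0.0):
    label_poss = {}                      # pi_L(label), accumulated over paths

    def expand(sample, remaining, path_poss):
        if not sample:
            return
        labels = {lab for _, lab in sample}
        if len(labels) <= 1 or not remaining:
            # leaf reached: combine this path's possibility into its label
            label = max(labels, key=lambda l: sum(1 for _, y in sample if y == l))
            label_poss[label] = tconorm(label_poss.get(label, 0.0), path_poss)
            return
        # degree of possibility for every still available attribute
        poss = {a: possibility_of_attribute(a, sample, x0) for a in remaining}
        for a, pa in poss.items():
            if pa <= threshold:
                continue
            sub = [(x, y) for (x, y) in sample if x[a] == x0[a]]
            if sub:
                expand(sub, remaining - {a}, tnorm(path_poss, pa))

    expand(samples, set(attributes), 1.0)
    return label_poss                    # possibility degree for each label
```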
6 Experimental Results
As already explained above, the label estimated by an ordinary decision tree is also fully supported by the possibilistic generalization. Thus, the two approaches will in principle yield the same final decisions. A difference can only occur if the distribution πA assigns full support to several labels. We shall turn to this aspect in Section 6.3 below. Still, the main motivation underlying the possibilistic approach is the idea of indicating the uncertainty related to a decision. This point will be investigated in Section 6.2. In this section, we restrict ourselves to the lazy versions of decision tree induction, as we obtained quite similar results for the classical approaches (apart from the runtime of the algorithms).
6.1 Experimental Setup: Generation of Synthetic Data
An individual experiment is parameterized by the number of attributes, k, the number of labels, m, the size of the training sample, n, and a complexity parameter γ:
– An underlying “true” decision tree τ is generated at random (see the sketch below). This is done in a recursive manner by starting with the root of the tree and flipping a (biased) coin to decide whether the current node is an inner node or a leaf.6 The probability of a node to become an inner node is specified by a fixed parameter 0 < γ < 1 (the larger γ, the more complex the tree will be on average). Here, we restrict ourselves to binary trees, i.e. we only consider binary attributes. Once a leaf node has been generated, it is assigned a class label at random.7 Likewise, inner nodes are assigned attributes.
– A random sample is generated based on a uniform distribution over the instance space. The sample is labeled using the decision tree τ.
– Decision trees τ1 and τ2 are induced, respectively, by LazyDT and PLazyDT based on the random sample.
– A new query x0 is generated at random and classified by the two trees, which yields an estimation λ∗1 = τ1(x0) and a possibilistic prediction πA with related decision λ∗2 = arg maxλ∈L πA(λ). The correct label is λx0 = τ(x0).
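A compact Python sketch of the random-tree generation (ours, for illustration only) is given below. The maximum-depth guard and the reuse of attributes along a path are our simplifications, and the constraint of footnote 7 (not all successors with the same label) is not enforced here.

```python
# Illustrative sketch of the synthetic-data generation described above
# (assumptions: max_depth guard added by us; attribute reuse not prevented;
# footnote 7's constraint on sibling labels not enforced).

import random

def random_tree(k, m, gamma, depth=0, max_depth=None):
    if max_depth is None:
        max_depth = k
    is_inner = depth == 0 or (depth < max_depth and random.random() < gamma)
    if not is_inner:                                   # leaf: random class label
        return {"label": random.randrange(m)}
    return {"attribute": random.randrange(k),          # inner node: random attribute
            0: random_tree(k, m, gamma, depth + 1, max_depth),
            1: random_tree(k, m, gamma, depth + 1, max_depth)}

def classify(tree, x):
    while "label" not in tree:
        tree = tree[x[tree["attribute"]]]
    return tree["label"]

tau = random_tree(k=6, m=3, gamma=0.8)
x0 = [random.randrange(2) for _ in range(6)]           # uniform random instance
print(classify(tau, x0))                               # the "correct" label of x0
```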
6.2 Representation of Uncertainty
To capture the aspect of uncertainty representation, let p1 denote the expected degree of possibility assigned by πL to the correct label λx0 given that this label is not predicted (λ∗2 ≠ λx0). Moreover, let p2 denote the expected possibility of the most possible incorrect label λ ≠ λx0 given that the decision is correct, that is, 1 minus the degree of necessity of λx0. Ideally, p1 is large and p2 is small: Wrong decisions are accompanied by a large degree of uncertainty, reflected by
6 The root is never a leaf.
7 We pay attention that not all successors of a node have the same label.
considerable support of the actually correct label (and hence a low degree of necessity for λ∗2), whereas correct decisions appear reliable at the same time. We have derived approximations to these expected values by taking averages over 10,000 experiments. The table below shows results (r denotes the classification rate) for different setups with k = 6, γ = 0.8. For PLazyDT we used max and min as aggregation operators, the function (5) with c = 1/3 for assigning basic possibility degrees, and the threshold ∆ = 0 in (6).

             m = 2                    m = 3                    m = 4
  n      r     p1     p2          r     p1     p2          r     p1     p2
 10   0.720  0.700  0.423      0.632  0.653  0.381      0.581  0.533  0.249
 20   0.784  0.810  0.363      0.742  0.754  0.336      0.649  0.717  0.244
 30   0.838  0.871  0.331      0.762  0.814  0.261      0.734  0.726  0.192
 40   0.855  0.886  0.265      0.799  0.839  0.214      0.786  0.767  0.162
As can be seen, the reliability of a prediction is reflected extremely well by the possibilistic estimation. As was to be expected, both the classification rate and the quality of the possibility distribution (as indicated by p1, p2) increase with sample size (as already explained above, the larger p1 and the smaller p2, the better the quality of the distribution). For other setups (values of k, γ) the results were qualitatively very similar. We do not present them here for reasons of space.
6.3 Classification Performance
One may obtain πA(λ) = 1 for several labels λ ∈ L when making use of more general t-norms and t-conorms. In such a case, there are different options to make a final decision. Here, we simply choose one among these labels at random. The following results were again derived for k = 6, γ = 0.8, using the t-norm (α, β) → αβ and the related t-conorm (α, β) → α + β − αβ (r1 and r2 denote the classification rate for LazyDT and PLazyDT, respectively).

          m = 2             m = 3             m = 4
  n      r2     r1         r2     r1         r2     r1
 10   0.731  0.707      0.646  0.652      0.587  0.556
 20   0.797  0.792      0.711  0.708      0.675  0.656
 30   0.851  0.812      0.760  0.757      0.734  0.730
 40   0.864  0.844      0.808  0.782      0.789  0.776
As can be seen, PLazyDT is slightly superior, though – as was to be expected – the difference in classification performance is not very significant. We obtained quite similar results for several real-world data sets from the UCI repository, which are, again for reasons of space, not presented here. These results confirm that aggregating over possible models might indeed be better than relying completely on the supposedly optimal one.
7 Concluding Remarks
Inductive reasoning based on Ockham’s razor or, more generally, on heuristic principles is always afflicted with uncertainty. The major concern of the method proposed in this paper is to capture this type of uncertainty, which appears to be non-probabilistic by nature. Therefore, our formalization employs the alternative framework of possibility theory (flexible constraints). Let us mention that a related possibilistic formalization has already been developed for the heuristic principle underlying instance-based learning [6]. Of course, one might deplore the lack of a sound theoretical basis for the possibilistic approach. It should be noted, however, that the same remark already applies to the underlying heuristic principle itself. In fact, what we introduced here is an alternative formalization of Ockham’s razor which – in our opinion – extends the original version in a reasonable way. As the experimental results confirm, the possibilistic approach represents the reliability of a prediction in a thorough way and may even (slightly) improve classification performance. Apart from the uncertainty connected to inductive inference, one usually has to cope with other types of uncertainty as well, such as noisy data. Extending the method proposed here by combining these different types of uncertainty is one of the challenges for future work.
References
1. D. W. Aha, editor. Lazy Learning. Kluwer Academic Publishers, 1997.
2. L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth International Group, Belmont, CA, 1984.
3. W. Buntine. Learning classification trees. Statistics and Computing, 2(2), 1992.
4. L. J. Cohen. An Introduction to the Philosophy of Induction and Probability. Clarendon Press, Oxford, 1989.
5. P. Domingos. The role of Occam’s razor in knowledge discovery. Data Mining and Knowledge Discovery, 3:409–425, 1999.
6. D. Dubois, E. Hüllermeier, and H. Prade. Fuzzy set-based methods in instance-based reasoning. IEEE Transactions on Fuzzy Systems. To appear.
7. D. Dubois and H. Prade. Possibility Theory. Plenum Press, 1988.
8. J. H. Friedman, R. Kohavi, and Y. Yun. Lazy decision trees. In Proceedings AAAI-96. Morgan Kaufmann, 1996.
9. R. Kohavi and C. Kunz. Option decision trees with majority votes. In Proceedings ICML-97.
10. J. Mingers. An empirical comparison of pruning methods for decision tree induction. Machine Learning, 4:227–243, 1989.
11. J. Mingers. An empirical comparison of selection measures for decision-tree induction. Machine Learning, 3:319–342, 1989.
12. J. R. Quinlan. Discovering rules by induction from large collections of examples. In D. Michie, editor, Expert Systems in the Micro Electronic Age. 1979.
13. J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA, 1993.
14. L. A. Zadeh. Fuzzy sets as a basis for a theory of possibility. Fuzzy Sets and Systems, 1:3–28, 1978.
Improved Smoothing for Probabilistic Suffix Trees Seen as Variable Order Markov Chains
Christopher Kermorvant1 and Pierre Dupont2
1 EURISE, Université Jean Monnet, Saint-Etienne, France
2 INGI, University of Louvain, Louvain-la-Neuve, Belgium
Abstract. In this paper, we compare Probabilistic Suffix Trees (PST), which were recently proposed, to a specific smoothing of Markov chains and show that they both induce the same model, namely a variable order Markov chain. We point out a weakness of PST in terms of smoothing and propose to use an enhanced smoothing instead. On a protein domain detection task on public databases, we show that the model based on the enhanced smoothing outperforms the PST while needing fewer parameters.
1 Introduction
In many application domains like spoken or written natural language recognition and biological sequence analysis, Markovian models are widely used to model probability distributions over sequences of events. However, for the most general models in this family (hidden Markov models), the training and decoding procedures can be computationally heavy. Therefore, in many cases, the use of Markov chains, a subclass of Markovian models, is considered. Recently, a model based on variable order Markov chains, called probabilistic suffix trees (PST), has been proposed. On a computational biology task (protein domain detection), its performance is competitive with models based on hidden Markov models, while being of lower algorithmic complexity both for its learning procedure and for domain detection [3]. This model can be trained from raw sequences, whereas HMM-based models may need aligned sequences, meaning that it is independent from an alignment procedure. This is an important point since multiple alignment of protein sequences is a computationally expensive task and is prone to errors. Moreover, the EM algorithm used to train HMMs can get stuck in a local optimum. However, the recently proposed PST model provides only a very simple solution to a problem pointed out as very important for models based on Markov chains: probability smoothing. This classical problem occurs in probability estimation when the number of possible events is very large compared to the number of observed events. In this case, many non-observed yet possible events are estimated with a null probability. Smoothing probability estimation consists in estimating the probability of non-observed events, while correcting the probability of observed events.
Fig. 1. A Probabilistic Suffix Tree on Σ = {0, 1}. The probability assigned to the sequence 100110 by this tree is P(100110) = γε(1) γ1(0) γ0(0) γ00(1) γ001(1) γ11(0) = 0.4 · 0.5 · 0.7 · 0.2 · 0.7 · 0.4 = 7.84 × 10−3.

The structure of this article is as follows. We first recall the definition of probabilistic suffix trees, present the link with Markov chains, give the inference algorithm and point out the weakness of the model regarding probability smoothing. Then, we present a smoothing technique which has been proven to be efficient and show that this technique induces a model which is equivalent to PST but with a better smoothing. Finally, we test the models on a protein domain detection task and show that using the improved smoothing technique yields better results with fewer parameters.
2 Variable Order Markov Chains
2.1 Prediction Suffix Trees
Probabilistic suffix trees (PST), also known as prediction suffix trees, are a model of probability distributions over discrete events occurring in sequences. They can be used either to predict the next event in the sequence, as in language modeling, or to assign a probability to a whole sequence for classification or detection, as in protein domain detection. The basic assumption underlying the PST is that the probability of an event in the sequence depends at most on the k preceding events, for some fixed k. The difference between PST and classical Markov chains is that for Markov chains the dependency relies on exactly the k preceding events, whereas in PST the dependency can be based on a number of preceding events varying from 0 to k. In that way, PST are a generalization of Markov chains. Formally, let Σ be a set of discrete symbols (events). A PST is a rooted tree of maximum degree |Σ| for which
– for each node, there is at most one outgoing edge labeled by each symbol of Σ,
– each node is labeled by the sequence of labels of the edges needed to go up from this node to the root (suffix labeling),
 1  begin
 2    T has only one node, labeled by ε (the empty suffix)
 3    S = {σ | σ ∈ Σ, P(σ) ≥ Pmin}
 4    while S ≠ ∅, select s in S
 5      remove s from S
 6      if ∃σ ∈ Σ such that P(σ|s) ≥ γmin and P(σ|s)/P(σ|suf(s)) ≤ 1/r or ≥ r then
 7        add to T the node labeled by s and all the nodes needed to go to this node
          from the node in the tree labeled by the largest suffix of s
 8      end if
 9      if |s| < L then
10        S = S ∪ {σs | σ ∈ Σ, P(σs) > Pmin}
11      end if
12    end while
13    for all s labeling a node in T, smooth the probability such that
          ∀σ ∈ Σ, γ̂s(σ) = (1 − |Σ| γmin) P(σ|s) + γmin
14  end

Fig. 2. The PST inference algorithm
– for each node q, every outgoing transition on a symbol σ is associated with a probability γq(σ) such that Σσ∈Σ γq(σ) = 1.
Let X = (xn)n∈{1,...,n} be a sequence of events. The probability assigned by the PST to the sequence is given by P(x1, · · · , xn) = ∏(i=1..n) γsi(xi), where si is the node of the tree corresponding to the largest suffix of x1, · · · , xi−1 stored in the tree. An example of a suffix tree is presented in Figure 1. Ron et al. [14] proposed an algorithm for PST inference and studied the learnability properties of the model. Given a target PST with n states and with maximum depth L, and for every given ε > 0 and 0 < δ < 1, their algorithm returns with confidence 1 − δ, in time polynomial in n, L, |Σ|, 1/δ, 1/ε, a PST whose per-symbol Kullback-Leibler distance from the target is at most ε. The PST inference algorithm is presented in Figure 2. Given a sequence s = x0, · · · , xi, we denote by suf(s) = x1, . . . , xi the largest suffix of s different from s. The algorithm starts from a 0th order Markov chain (line 2) and with S, the set of suffixes to be examined, containing all the symbols with probability larger than the threshold Pmin (line 3). Then, for each element s in S, if there exists a symbol for which the probability conditioned by s is significantly different from the probability conditioned by the suffix of s (line 6), then the node labeled by s is added to the tree, together with all the nodes needed to go to this node from the node in the tree labeled by the largest suffix of s. If the length of s is strictly smaller than the maximum order of the tree, then all the sequences built from s extended with a symbol σ, such that the probability of the sequence sσ is larger than the threshold Pmin, are added to S (line 10). The last step of the algorithm is a very simple smoothing procedure. The maximum likelihood probability estimator is modified so that no symbol is predicted with probability 0, whatever its suffix is (line 13).
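The following short Python sketch (ours, not the authors' code) illustrates how a PST assigns a probability to a sequence, using the example tree of Figure 1; the distributions of the two nodes not pinned down by the caption are illustrative guesses.

```python
# Illustrative sketch: computing the probability a PST assigns to a sequence,
# using the example of Figure 1. Each node is keyed by its suffix label and
# stores the next-symbol distribution gamma_s.

pst = {
    "":    {"0": 0.6, "1": 0.4},
    "0":   {"0": 0.7, "1": 0.3},
    "1":   {"0": 0.5, "1": 0.5},
    "00":  {"0": 0.8, "1": 0.2},
    "01":  {"0": 0.4, "1": 0.6},   # illustrative; not needed for the example below
    "11":  {"0": 0.4, "1": 0.6},
    "001": {"0": 0.3, "1": 0.7},
    "101": {"0": 0.2, "1": 0.8},   # illustrative; not needed for the example below
}

def pst_probability(pst, sequence):
    prob, history = 1.0, ""
    for symbol in sequence:
        # longest suffix of the history that is stored in the tree
        context = history
        while context not in pst:
            context = context[1:]
        prob *= pst[context][symbol]
        history += symbol
    return prob

print(pst_probability(pst, "100110"))   # 0.00784, i.e. 7.84e-3 as in Fig. 1
```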
However, the smoothing procedure using the modified maximum likelihood estimator has two main drawbacks. First, the same constant value γmin is added to every probability, whatever the observed frequency and the probability of the event are. Second, the same floor probability γmin is assigned to all unseen events, whatever their suffix is. Many other smoothing procedures, which do not suffer from these problems, have been proposed. We present one of them in the next section.
2.2 Back-Off Smoothing of Markov Chains
When using Markovian models in real applications, even if a large amount of data is available to estimate the model, the problem of predicting events which were not observed during the estimation procedure occurs. This is particularly true for high order Markovian models, since the number of possible contexts for a kth order Markovian model is exponential in k. The probabilities of a large number of events are estimated on only a few occurrences, leading to poor probability estimation. Many rare but possible events are not observed, and are wrongly estimated with a null probability. The problem of predicting unseen events, also known as the zero-frequency problem, is due to the fact that the maximum likelihood estimator attributes the whole probability mass to the events seen during the estimation. Several solutions have been proposed to this problem: succession laws [12], linear interpolation of the maximum likelihood estimator with another estimator, such as an a priori distribution [8] or a more general distribution [7], and discounting of a certain amount of the probability mass of seen events using the Good-Turing formula [9] or absolute discounting [13]. When using discounting, the discounted probability mass is redistributed to all unseen events according to another probability distribution. This is the back-off smoothing method, proposed by Katz [9]. For Markov chains, back-off smoothing is based on the following idea: discount a certain amount dC from the probability mass of events which have been observed in a context of length k and redistribute this amount to all unseen events according to their probability in a context of length k−1. This probability can in turn be recursively smoothed. Formally, recalling the notation introduced in the previous section, if σ is a symbol and s a suffix (context), we have:

P(σ|s) = (c(s, σ) − dC) / Σσ'∈Σ c(s, σ')    if c(s, σ) > 0
P(σ|s) = α(s) β(s, σ)                       otherwise
where c(s, σ) is the number of times σ was seen after the suffix s and dC is the discount parameter, which may depend on c(s, σ), α(s) is a normalization factor and β(s, σ) is the back-off distribution, generally estimated on shorter suffixes. This distribution can in turn be smoothed, inducing a recursive process which ends, in the worst case, with the unconditional probability of the symbol P(σ). In this case, the back-off distribution is used only if the main distribution is null (shadowing). Kneser and Ney [10] showed that using the back-off distribution even if the main distribution is not null (non-shadowing) leads to a better model. We have then:

P(σ|s) = (c(s, σ) − dC) / Σσ'∈Σ c(s, σ') + α'(s) β(s, σ)    if c(s, σ) > 0
P(σ|s) = α'(s) β(s, σ)                                       otherwise
where α'(s) is a normalization factor. This method is also named non-linear interpolation, since it can be defined as

P(σ|s) = max( (c(s, σ) − dC) / Σσ'∈Σ c(s, σ') , 0 ) + α'(s) β(s, σ)

if we suppose that dC ≤ c(s, σ) for all c(s, σ) > 0. The normalization factor is then

α'(s) = dC · |{σ ∈ Σ | c(s, σ) > 0}| / Σσ∈Σ c(s, σ).

Kneser and Ney propose to estimate the back-off probability in the following way:
β(s, σ) = c(•, suf(s), σ) / Σσ'∈Σ c(•, suf(s), σ')

with c(•, suf(s), σ) = |{σ' | c(σ', suf(s), σ) > 0}|. This estimation is not based on the observed frequency of the sequence (s, σ) but on the number of different contexts in which σ has been observed after suf(s). The back-off probability can also be null, leading to a recursive smoothing using the same formula. Therefore, we see that using a recursive smoothing on a kth order Markovian model leads to building a variable order Markov chain. There are three differences between a variable order Markov chain built by recursive Kneser-Ney back-off smoothing (denoted KN-chain) and a variable order Markov chain represented by a PST inferred by Ron’s algorithm:
– There is no pruning in KN-chains: if a sequence of length lower than the maximum order of the KN-chain is observed, its probability is estimated and stored, whereas in PST the estimated probability is stored only if it is above a threshold (Pmin). For a given maximum order, PST may have fewer parameters than KN-chains.
– In KN-chains, a different and enhanced estimation scheme is used for the back-off probability estimation, whereas in PST, for all the orders, probability estimation is based on the modified maximum likelihood estimator.
– In KN-chains, both the enhanced estimation scheme and the modified maximum likelihood estimator are used (non-shadowing), whereas in PST, only one modified maximum likelihood estimator is used.
A sketch of this recursive back-off scheme is given below. In the next section, we show that KN-chains significantly outperform PST on a protein domain detection task.
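The following compact Python sketch (ours, under simplifying assumptions: a single discount dC, string contexts, and the Katz-style shadowing variant rather than the non-shadowing Kneser-Ney estimate of the lower-order distribution) illustrates the recursive back-off just described.

```python
# Hedged sketch of recursive absolute-discounting back-off (Katz-style
# shadowing variant for readability; the Kneser-Ney context-count estimate of
# the lower-order distribution is omitted). Counts c[(context, symbol)] are
# collected from training sequences; d_c is the discount.

from collections import defaultdict

def make_counts(sequences, max_order):
    c = defaultdict(int)
    for seq in sequences:
        for i, sym in enumerate(seq):
            for k in range(0, max_order + 1):
                if i - k >= 0:
                    c[(seq[i - k:i], sym)] += 1
    return c

def backoff_prob(c, alphabet, context, symbol, d_c=0.5):
    if context == "":
        uni = sum(c[("", s)] for s in alphabet)          # base case: unigram
        return c[("", symbol)] / uni if uni else 1.0 / len(alphabet)
    total = sum(c[(context, s)] for s in alphabet)
    if total == 0:                                       # unseen context: back off
        return backoff_prob(c, alphabet, context[1:], symbol, d_c)
    count = c[(context, symbol)]
    if count > 0:
        return (count - d_c) / total
    # redistribute the discounted mass via the shorter context suf(context)
    seen = [s for s in alphabet if c[(context, s)] > 0]
    discounted_mass = d_c * len(seen) / total
    beta = backoff_prob(c, alphabet, context[1:], symbol, d_c)
    unseen_mass = sum(backoff_prob(c, alphabet, context[1:], s, d_c)
                      for s in alphabet if c[(context, s)] == 0)
    return discounted_mass * beta / unseen_mass if unseen_mass else 0.0

counts = make_counts(["0010011010", "1100101"], max_order=3)
print(backoff_prob(counts, "01", "001", "1"))
```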
3 Application to Protein Domain Detection
Many databases have been created to gather information concerning proteins. Researchers can find in these databases not only the amino-acid sequence of proteins but also information about their functions, structure, related diseases and
bibliographical pointers. These databases are used to help the analysis of newly sequenced proteins, for which no function or structure is known yet. They serve as a basis for learning models which are used to detect sub-sequences (called domains or motifs) which are known to be related to a particular biochemical function. Such models range from complex probabilistic models based on hidden Markov models [11,5] to purely syntactic models, like regular expressions, describing characteristic sub-sequences [1]. However, since the databases are constantly growing and being updated, the learning procedure of these models must be easy and of low complexity.
3.1 Protein Domain Detection with Variable Order Markov Chains
Automatic analysis of newly sequenced proteins, for which neither structure nor biochemical functions are known yet, is now very important since the number of newly sequenced proteins is increasing daily. To a certain extent, hypotheses concerning the function of a protein can be made by searching, in its amino-acid sequence, sub-sequences which are known to be related to a function in other proteins. Many such sub-sequences, called domains, have been identified and are stored in databases like PFAM [15]. However, the sequence of a given domain is not constant across species. Substitutions, deletions and insertions occur, which makes domain detection more complex than simple exact sub-sequence detection. Domain models, like HMMs [5], are trained on these sub-sequences and used to detect domains in complete protein sequences. Variable order Markov chains may also be used to detect domains in protein sequences [3]. A variable order Markov chain is associated to each domain to be detected and is estimated on a set of examples of that domain. Then the likelihood of a new protein sequence given a domain model is related to the presence or not of the corresponding domain in the protein. A high likelihood is a sign of the probable presence of the corresponding domain in the protein.
3.2 Experimental Setup
We used two databases to test our models: the SWISSPROT database [2], which contains protein sequences from several living organisms, and the PFAM database [15], which contains alignments of functional domains, grouped in families, extracted from SWISSPROT with a semi-automatic procedure. We labeled the SWISSPROT sequences with the names of the domains they contain, according to the PFAM families. In order to compare with recently published results [6,3], we used PFAM release 1.0. This release contains 22307 domains grouped in 175 families.
3.3 Training the Models
For each domain family, we estimated the models on 80% of the domain sequences extracted from the alignments available in PFAM. We trained probabilistic suffix trees with the software and parameters given as optimal by Bejerano [3]. The maximal order of the PST is 20. We also trained variable order Markov chains with Kneser-Ney smoothing, with maximum order ranging from 0 to 4.

                                 Kneser-Ney Smoothing                              PST
Maximal order             0          1          2          3          4            20
Correct detection rate   13.9       53.0       81.3       89.5       90.0         85.8
Number of parameters   2.2 × 10^1  4.0 × 10^2  3.3 × 10^3  9.9 × 10^3  1.8 × 10^4  5.1 × 10^4

Fig. 3. Correct detection rate on the complete SWISSPROT database and number of parameters for variable order Markov chains with Kneser-Ney smoothing and PST

3.4 Testing the Models
All the models were tested for domain detection on the protein sequences of the SWISSPROT database corresponding to the complete PFAM database. In order to measure a correct detection rate, we used the iso-point detection criterion [3,6]. For each family model, an iso-point is computed on the complete set of SWISSPROT sequences. The iso-point is defined as the value v for which the number of protein sequences not containing the domain with a likelihood above v is equal to the number of protein sequences containing the domain with a likelihood under v. For a given model, a sequence containing the domain with a likelihood above the iso-point is considered correctly detected. The correct detection rate is defined as the ratio of the number of proteins correctly detected to the number of proteins containing the domain. Note that in order to compute the iso-point, the likelihood of each sequence is normalized by its length.
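The short Python sketch below gives our reading of the iso-point criterion (not the authors' code); the scores are made-up, length-normalized likelihoods of sequences that do (`pos`) or do not (`neg`) contain the domain.

```python
# Hedged sketch of the iso-point criterion described above (our reading of it).
# pos/neg are length-normalized log-likelihoods of sequences with / without
# the domain; the toy values below are for illustration only.

def iso_point(pos, neg):
    # value v such that #(negatives above v) == #(positives below v);
    # found here by scanning candidate thresholds taken from the scores
    best_v, best_gap = None, float("inf")
    for v in sorted(pos + neg):
        false_pos = sum(1 for s in neg if s > v)
        false_neg = sum(1 for s in pos if s < v)
        gap = abs(false_pos - false_neg)
        if gap < best_gap:
            best_v, best_gap = v, gap
    return best_v

def detection_rate(pos, neg):
    v = iso_point(pos, neg)
    return sum(1 for s in pos if s > v) / len(pos)

pos = [-1.9, -2.1, -2.4, -3.0]
neg = [-2.6, -3.1, -3.3, -3.5, -3.8]
print(detection_rate(pos, neg))   # fraction of domain sequences above the iso-point
```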
3.5 Results
Figure 3 shows the correct detection rate on all the SWISSPROT database sequences and the number of parameters for PST and variable order Markov chains with Kneser-Ney smoothing. Smoothed variable order Markov chains outperform PST as soon as the maximum order of the chains is greater than or equal to 3. From the 4th-order up to the 9th-order Markov chain, the detection rate is stationary. Considering the domain detection problem as a binary classification problem (“does a sequence contain a given domain or not”), the performance difference between variable 4th order Markov chains and PST was tested with a McNemar test [4]. The H0 hypothesis “the variable 4th order Markov chains and PST have the same classification performance” was rejected (p-value < 10−15). The performance difference is thus significant. Figure 4 shows the detection rate on the part of the SWISSPROT database corresponding to the PFAM domains which were not used for training (named the SWISSPROT test set). Results are given when the size of the training set varies from 20% to 100%. Even on small training sets, the 4th-order smoothed variable order Markov chains outperform PST.
Fig. 4. Learning curve: correct detection rate on the SWISSPROT test set versus the size of the PFAM training set, for Markov chains with Kneser-Ney smoothing with order ranging from 0 to 4 (0th KN-chain to 4th KN-chain) and PST with maximum order 20 (PST-20)
Fig. 5. Detection rate on the SWISSPROT test set versus the number of parameters needed, for Markov chains with Kneser-Ney smoothing with order ranging from 0 to 4 (0th KN-chain to 4th KN-chain) and PST with maximum order ranging from 1 to 20 (PST-1 to PST-20)
Finally, Figure 5 shows the detection rate with respect to the number of parameters needed by the model. The 3rd-order and 4th-order Markov chains outperform PST while needing significantly fewer parameters.
4 Conclusion
We have shown that PST and smoothed Markov chains can be seen as equivalent variable order Markov models, differing only in the smoothing technique. As the quality of the back-off technique has been shown to be important for Markov chains in other application domains, we proposed to enhance the smoothing technique used in PST by using a non-shadowing back-off smoothing to lower order Markov chains, estimated as proposed by Kneser and Ney [10]. With this improved smoothing, we showed that the maximum order of the Markov chain can be drastically reduced, with a performance increase on a protein domain detection task. By reducing the maximum order of the Markov chain, we also reduce the number of parameters needed.
References
1. A. Bairoch. PROSITE: A dictionary of sites and patterns in proteins. Nucleic Acids Research, 19:2241–2245, 1991.
2. A. Bairoch and R. Apweiler. The SWISS-PROT protein sequence data bank and its new supplement TrEMBL. Nucleic Acids Res., 24:21–25, 1996.
3. Gill Bejerano and Golan Yona. Variations on probabilistic suffix trees: statistical modeling and prediction of protein families. Bioinformatics, 17(1):23–43, Jan 2001.
4. Thomas G. Dietterich. Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation, 10(7):1895–1923, 1998.
5. S. Eddy. HMMER user’s guide: biological analysis using profile hidden Markov models. Department of Genetics, Washington University School of Medicine, 1998. http://hmmer.wustl.edu/.
6. E. Eskin, W. Grundy, and Y. Singer. Protein family classification using sparse Markov transducers. In Proc. Int. Conf. on Intelligent Systems for Molecular Biology, August 2000.
7. F. Jelinek and R. Mercer. Interpolated estimation of Markov source parameters from sparse data. In E. Gelsema and L. Kanal, editors, Pattern Recognition in Practice, pages 381–397, Amsterdam, 1980. North-Holland.
8. W. E. Johnson. Probability: deductive and inductive problems. Mind, 41:421–423, 1932.
9. Slava M. Katz. Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Trans. on Acoustics, Speech and Signal Processing, ASSP-35(3):400–401, March 1987.
10. R. Kneser and H. Ney. Improved backing-off for M-gram language modeling. In Proc. Int. Conf. on Acoustics, Speech and Signal Processing, pages 181–184, Detroit, MI, May 1995.
11. A. Krogh, M. Brown, I. Mian, K. Sjolander, and D. Haussler. Hidden Markov models in computational biology: Applications to protein modeling. Journal of Molecular Biology, 235:1501–1531, 1994.
12. G. Lidstone. Note on the general case of the Bayes-Laplace formula for inductive or a posteriori probabilities. Trans. Fac. Actuar., 8:182–192, 1920.
13. Hermann Ney, Ute Essen, and Reinhard Kneser. On structuring probabilistic dependencies in stochastic language modelling. Computer Speech and Language, 8:1–38, 1994.
14. D. Ron, Y. Singer, and N. Tishby. The power of amnesia. In Jack D. Cowan, Gerald Tesauro, and Joshua Alspector, editors, Advances in Neural Information Processing Systems, volume 6, pages 176–183. Morgan Kaufmann Publishers, Inc., 1994.
15. E. L. L. Sonnhammer, S. R. Eddy, and R. Durbin. Pfam: a comprehensive database of protein domain families based on seed alignments. Proteins, 28(3):405–420, 1997.
Collaborative Learning of Term-Based Concepts for Automatic Query Expansion
Stefan Klink, Armin Hust, Markus Junker, and Andreas Dengel
German Research Center for Artificial Intelligence (DFKI, GmbH)
P.O. Box 2080, 67608 Kaiserslautern, Germany
{stefan.klink,armin.hust,markus.junker,andreas.dengel}@dfki.de
http://www.dfki.de/klink
Abstract. Information Retrieval Systems have been studied in Computer Science for decades. The traditional ad-hoc task is to find all documents relevant to a given ad-hoc query, but the accuracy of ad-hoc document retrieval systems has plateaued in recent years. At DFKI, we are working on so-called collaborative information retrieval (CIR) systems which unintrusively learn from their users’ search processes. In this paper, a new approach called term-based concept learning (TCL) is presented which learns conceptual description terms occurring in known queries. A new query is expanded term by term using the previously learned concepts. Experiments have shown that TCL and its combination with pseudo relevance feedback result in notable improvements in retrieval effectiveness, measured in recall/precision, in comparison to the standard vector space model and to pseudo relevance feedback. This approach can be used to improve the retrieval of documents in Digital Libraries, in Document Management Systems, in the WWW, etc.
1 Introduction
With the explosive growth of information on the Internet and in Digital Libraries, an acute problem has arisen, called information overload. Typical search engines index billions of pages across a variety of categories, and return results ranked by expected topical relevance. But only a small percentage of these pages may be of specific interest. Nowadays, there is an acute need for search engine technology to help users exploit such an extremely valuable resource. In weighted Information Retrieval (IR), the number of retrieved documents is related to the number of appropriate search terms. Retrieval with short queries is typical in Web search [6], but it is much harder compared to retrieval with long queries. This is because shorter queries often provide less information for retrieval. Modern IR systems therefore integrate thesaurus browsers. They help to find additional search terms [13]. But the keywords used in short queries are not always good descriptors of
contents. Nevertheless, most existing search engines still rely solely on the keywords contained in queries to search and rank relevant documents. This is one of the key reasons that affect the precision of search engines. In many cases, the answer documents are not relevant to the user’s information need, although they do contain the same keywords as the query. Another problem which is typical for the Web and for Digital Libraries is that the terminology used in defining queries is often different from the terminology used in representing documents. Even if some users have the same information need, they rarely use the same terminology in their queries. Many intelligent retrieval approaches [2, 8, 12] have tried to bridge this terminological gap. Research on automatic query expansion (or modification) was already under way before the 60’s, when initial requests were enlarged on the grounds of statistical evidence [14]. The idea was to obtain additional relevant documents through expanded queries based on the co-occurrence of terms. However, this kind of automatic query expansion has not been very successful. The retrieval effectiveness of the expanded queries was often not greater than, or even less than, the effectiveness of the original queries [10, 11, 16]. One idea involves the use of a relevance feedback environment where the system retrieves documents that may be relevant to a user’s query. The user judges the relevance of one or more of the retrieved documents, and these judgments are fed back to the system to improve the initial search result. This cycle of relevance feedback can be iterated until the user is satisfied with the retrieved documents. In this case, we can say that the more feedback is given to the system, the better the search effectiveness of the system. This behavior is verified by [1], who has shown that the recall-precision effectiveness is proportional to the log of the number of relevant feedback documents. But in a traditional relevance feedback environment the user-voted documents are appropriate to the complete query. That means that the complete query is adapted to the user’s needs. If another user has the same intention but uses a different terminology, or just one word more or less in his query, then the traditional feedback environment doesn’t recognize any similarities between these situations.
2 Query Expansion
The crucial point in query expansion is the question: Which terms (or phrases) should be included in the query formulation? If the query formulation is to be expanded by additional terms, there are two problems to be solved, namely how these terms are selected and how the parameters are estimated for these terms. Many terms used in human communication are ambiguous or have several meanings [12]. But in most cases these ambiguities are resolved automatically, without noticing the ambiguity. The way this is done by humans is still an open problem of psychological research, but it is almost certain that the context in which a term occurs plays a central role. Most attempts at automatically expanding queries failed to improve retrieval effectiveness, and it was often concluded that automatic query expansion based on statistical data was unable to improve retrieval effectiveness substantially [11].
But this could have several reasons. Term-based query expansion approaches mostly use hand-made thesauri or just plain co-occurrence data. They do not use learning technologies for the query terms. On the other hand, those which use learning technologies (Neural Networks, Support Vector Machines, etc.) are query-based. That means these systems learn concepts (or additional terms) for the complete query. The vital advantage of using term-based concepts and not learning the complete query is that other users can profit from the learned concepts. A statistical evaluation of internet logging files has shown that the probability that a searcher uses exactly the same query as a previous searcher is much lower than the probability that parts of the query (phrases or terms) occur in other queries. So, even if a web searcher never used the given search term, the probability that another searcher has used it is very high, and then he can profit from the learned concept.
3 Traditional Document Retrieval
The task of traditional document retrieval is to retrieve documents which are relevant to a given query from a fixed set of documents. Documents as well as queries are represented in a common way using a set of index terms (called terms from now on). Terms are determined from words of the documents in the database, usually during pre-processing phases where some noise reduction procedures are incorporated, e.g. stemming and stop-word elimination. In the following, a term is represented by ti (1 ≤ i ≤ M) and a document by dj (1 ≤ j ≤ N), respectively, where M is the number of different terms and N is the number of documents in the database.
3.1 Vector Space Model
One of the most popular and indeed the simplest retrieval models is the vector space model (VSM) [2]. In the VSM, each document dj is represented as an M-dimensional vector
dj = (w1j, …, wMj)T, 1 ≤ j ≤ N    (1)
where T indicates the transpose and wij is the weight of term ti in document dj. A query is likewise represented as
qk = (w1q, …, wMq)T, 1 ≤ k ≤ L    (2)
where wiq is the weight of term ti in query qk and L is the number of queries contained in the document collection (e.g. in the web-log file). The weights above can be processed in various ways. In our approach, we use the standard normalized tf · idf weighting scheme [14] defined as follows:
wij = tfij · idfi    (3)
where tfij is the weight calculated using the term frequency fij and idfi is the weight calculated using the inverse of the document frequency. The result of the retrieval is represented as a list of documents ranked according to their similarity to the given query. The similarity sim(dj, qk) between a document dj and a query qk is measured by the standard cosine of the angle between the M-dimensional vectors dj and qk:
sim(dj, qk) = djT qk / (||dj|| ||qk||)    (4)
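The following small Python sketch (ours) illustrates the tf · idf weighting and cosine ranking just described; since the exact tf/idf normalization of [14] is not spelled out above, a common variant (raw tf, logarithmic idf) is assumed.

```python
# Minimal VSM sketch of the scheme described above. The exact tf/idf
# normalization used by the authors is not given here, so a common variant
# (raw tf, log idf, cosine similarity) is assumed.

import math
from collections import Counter

def build_index(docs):                       # docs: list of token lists
    df = Counter(t for d in docs for t in set(d))
    idf = {t: math.log(len(docs) / df[t]) for t in df}
    vectors = [{t: tf * idf[t] for t, tf in Counter(d).items()} for d in docs]
    return idf, vectors

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0

def rank(query_tokens, idf, vectors):
    q = {t: tf * idf.get(t, 0.0) for t, tf in Counter(query_tokens).items()}
    return sorted(range(len(vectors)), key=lambda j: cosine(q, vectors[j]), reverse=True)

docs = [["protein", "domain", "detection"],
        ["query", "expansion", "retrieval"],
        ["document", "retrieval", "query"]]
idf, vectors = build_index(docs)
print(rank(["query", "retrieval"], idf, vectors))   # [1, 2, 0]: retrieval docs first
```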
where ||·|| is the Euclidean norm of a vector. In the case that the vectors are already normalized (i.e. have unit length), the similarity is just the dot product of the two vectors. The VSM is one of the methods we applied to compare with our own methods.
3.2 Pseudo Relevance Feedback
Everybody would agree that documents may be relevant to a query even if they do not share any word with the query. Unfortunately, the standard VSM will always return zero similarity in this case. So-called query expansion techniques try to overcome this problem by expanding the user-given query q to a new enriched query q’, which is then used in the standard VSM. A very well-known technique is pseudo relevance feedback (PRF) [8]. PRF enriches the original query q by the terms of the top-ranked documents with respect to q. We are using a variation of PRF described in [7]: Let D+ be the set of document vectors used for expansion, given by
D+ = { d+j | sim(d+j, q) / maxi sim(di, q) ≥ θ }    (5)
where q is the user’s query vector and θ is a similarity threshold. The sum ds of the document vectors in D+,
ds = Σ(d+j ∈ D+) d+j    (6)
can be considered as enriched information about the original query. Within the VSM, this sum of documents is again a vector which contains the weights of all terms of the summed documents. Hence, all terms of the documents in D+ are used to expand the user’s query. The expanded query vector q’ is obtained by
q’ = q/||q|| + α · ds/||ds||    (7)
where α is a parameter for controlling the weight of the newly incorporated terms. Finally, the documents are ranked again according to the similarity sim(dj, q’) to the expanded query.
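A short Python sketch of this PRF variant (ours, covering equations (5)-(7)) is given below; it reuses the cosine() helper and the tf-idf document vectors from the VSM sketch above, and theta/alpha are the threshold and weight parameters just introduced.

```python
# Sketch of the PRF variant described above (equations (5)-(7)); assumes the
# cosine() helper and tf-idf document vectors from the earlier VSM sketch.

import math

def expand_query_prf(q, vectors, theta=0.5, alpha=1.0):
    sims = [cosine(q, d) for d in vectors]
    max_sim = max(sims) if sims else 0.0
    if max_sim == 0.0:
        return dict(q)
    # (5): documents whose relative similarity reaches the threshold
    d_plus = [d for d, s in zip(vectors, sims) if s / max_sim >= theta]
    # (6): sum of the selected document vectors
    ds = {}
    for d in d_plus:
        for t, w in d.items():
            ds[t] = ds.get(t, 0.0) + w
    # (7): combine the normalized query and the normalized sum
    def normalized(v):
        n = math.sqrt(sum(w * w for w in v.values()))
        return {t: w / n for t, w in v.items()} if n else v
    qn, dsn = normalized(q), normalized(ds)
    expanded = dict(qn)
    for t, w in dsn.items():
        expanded[t] = expanded.get(t, 0.0) + alpha * w
    return expanded
```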
4 Term-Based Concept Learning (TCL)
A problem of the standard VSM is that a query is often too short to rank documents appropriately. To cope with this problem, the approach is to enrich the original query with terms occurring in the documents of the collection. Our method uses feedback information and information globally available from previous queries. Feedback information in our environment is available within the ground truth data provided by the test document collections. The ground truth provides relevance information, i.e. for each query a list of relevant documents exists. Relevance information for each query is represented by an N-dimensional vector
rk = (r1k, …, rNk)T, with rjk = 1 if document dj is relevant to query qk, and rjk = 0 otherwise.    (8)
In contrast to traditional pseudo relevance feedback methods, where the top j ranked documents are assumed to be relevant and then their terms are incorporated into the expanded query, we use a different technique to compute the relevant documents. Our method is divided into two phases: The learning phase for each term works as follows: • Select the old queries in which the specific query term occurs • From these selected old queries get the sets of relevant documents from the ground truth data • From each set of relevant documents compute a new document vector and use these document vectors to build the term concept. The expansion phase for each term is easy: • Select the appropriate concept of the current term • Use a weighting scheme to enrich the new query with the concept For the formal description of the learning phase we need the following definitions: • D = {d1 ,…, dN}: the set of all documents • Q = {q1 ,…, qL}: the set of all known queries with • qk = (w1k ,…,wik ,…, wMk)T represented within the vector space model. For each term of the query the appropriate weight wik is between 0 an 1. • R+(qk) = {dj ∈ D | rij = 1 }: the set of all documents relevant to the query qk Now, the first step of the learning phase collects all queries having the i-th term in common: Qi = { qk ∈ Q | wik ≠ 0 }
(9)
If the i-th term doesn’t occur in any query qk then Qi is empty. The second step collects all documents which are relevant to these collected queries: Dik = {dj | dj ∈ R+(qk) ∧ qk ∈ Qi }
(10)
200
Stefan Klink et al.
In the last step of the learning phase the concept of each i-th term is build as the sum of all documents (i.e. vectors of term weights) which are relevant to the known queries which have the term in common: Ci =
∑ dj dj ∈ Dik
(11)
As queries and documents, a concept is represented by a vector of term weights. If no query qk contains term i, the corresponding concept Ci is represented as (0,…,0)T. Now, where the term-based concepts are learned, the user query q can be expanded term by term. The expanded query vector q’ is obtained by
q’ = q +
M
∑ ωi Ci
(12)
i=1
where ωi are parameters for weighting the concepts. In the experiments described below ωi is set to 1. Before applying the expanded query, it is normalized by q’’ =
q’ q’
(13)
For this approach, the complete documents (all term weights wij of the relevant documents) are summed up and added to the query. Although, in some papers it is reported that using just the top ranked terms is sufficient or sometimes better. But experiments with this approach on the collections have shown that the more words are used to learn the concepts the better the results are. So, the decision was made to use always the complete documents and not only some (top ranked) terms. If no ground truth of relevant documents is available, relevant feedback techniques can be used and the concepts are learned by adding terms from the retrieved relevant documents.
5
Combination of PRF with TCL
Additionally to the approach described above, we made some experiments with a linear combination of our approach with the pseudo relevance feedback. For each query we applied PRF method in parallel to our method and the new query is build by (cmp. (7) and (12)): q’ = q + β ds +
M
∑ ωi Ci
i=1
Before applying the expanded query, it is normalized by equation (13).
(14)
Collaborative Learning of Term-Based Concepts for Automatic Query Expansion
6
201
Experiments and Results
6.1 Test Collections For our comparison we used four standard test collections: CACM (collection of titles and abstracts from the journal ‘Communications of the ACM’), CR (congressional report), FR88 (federal register), NPL (also known as the VASWANI) and ZF3 ('Computer Select' of Ziff-Davis Publishing Co.). These collections are contained in the TREC disks [4]. All collections are provided with queries and their ground truth (for each query a list of relevant documents). For these collections, terms used for document representation were obtained by stemming and eliminating stop words. Table 1. Statistics about collections after stemming and stop words elimination
# documents # queries # different terms avg doc length [terms] avg query length [terms]
CACM 3204 52 3029 25.8 10.9
CR 27922 34 45717 672.8 3.1
FR88 19860 199 43765 869.1 3.5
NPL 11429 93 4415 21.8 6.6
ZF3 161021 50 67108 155.6 7.7
In addition to the number of documents, a significant difference is the length of the documents: CACM and NPL consists of abstracts, while CR, FR88, and ZF3 contain (much) longer documents. Queries in the TREC collections are mostly provided in a structured format with several fields. In this paper, the “title” (the shortest representation) is used for the CR and NPL collection whereas the “desc” (description; medium length) is used for the CACM, FR88, and ZF3 collection. The short queries in FR88 are obtained by stop word elimination 6.2 Evaluation The following paragraphs describe some basic evaluation methods used in this paper. For further information and a more detailed description see Kise et al [7]. 6.2.1
Average Precision
A common way to evaluate the performance of retrieval methods is to compute the (interpolated) precision at some recall levels. This results in a number of recall/precision points which are displayed in recall-precision graphs [2]. However, it is sometimes convenient for us to have a single value that summarizes the performance. The average precision (non-interpolated) over all relevant documents [2, 3] is a measure resulting in a single value. The definition is as follows: As described in section 3, the result of retrieval is represented as the ranked list of documents. Let r(i) be the rank of the i-th relevant document counted from the top of the list. The precision for this document is calculated by i/r(i). The precision values for all documents relevant to a query are averaged to obtain a single value for the
202
Stefan Klink et al.
query. The average precision over all relevant documents is then obtained by averaging the respective values over all queries. 6.2.2
Statistical Test
The next step for the evaluation is to compare the values of the average precision obtained by different methods [7]. An important question here is whether the difference in the average precision is really meaningful or just by chance. In order to make such a distinction, it is necessary to apply a statistical test. Several statistical tests have been applied to the task of information retrieval [5,17]. In this paper, we utilize the test called “macro t-test” [17] (called paired t-test in [5]). The following is a summary of the test described in [7]: Let ai and bi be the scores (e.g., the average precision) of retrieval methods A and B for a query i and define di = ai - bi. The test can be applied under the assumptions that the model is additive, i.e., di = µ + εi where µ is the population mean and εi is an error, and that the errors are normally distributed. The null hypothesis here is µ = 0 (A performs equivalently to B in terms of the average precision), and the alternative hypothesis is µ > 0 (A performs better than B). It is known that the Student’s t-statistic ¯
t=
d s2 / n
¯
with d =
1 n
n
1
n
¯ ∑ di and s2 = n -1 ∑ (di - d)2 i=1 i=1
(15)
follows the t-distribution with the degree of freedom of n – 1, where n is the number ¯ of samples (queries), d and s2 are the sample mean and the variance. By looking up the value of t in the t-distribution, we can obtain the P-value, i.e., the probability of observing the sample results di (1 ≤ i ≤ n) under the assumption that the null hypothesis is true. The P-value is compared to a predetermined significance level σ in order to decide whether the null hypothesis should be rejected or not. As significance levels, we utilize 0.05 and 0.01. 6.3 Results and Comparison to the Standard 6.3.1
Recall and Precision
The results of the pseudo relevance feedback are depending on two parameters α (weight) and θ (similarity threshold). To get the best results, we were varying α from 0 to 5.0 with step 0.1 and θ from 0.0 to 1.0 with step 0.05. For the combined approach we calculated the best β by varying from 0 to 1.0 with step 0.1. For each collection, the best individual α, θ, and β are calculated and used for the comparison. Table 2 shows the best values for each collection: The results of our concept-based expansion are also depending on weights. But due to time restrictions, we had not enough time to vary the weights in a range for each collection. We just used the default value: ωi = 1, which means that the original term and the learned concept are weighted equal.
Collaborative Learning of Term-Based Concepts for Automatic Query Expansion
203
Table 2. Best values for pseudo relevance feedback and the combined method parameters
CACM
CR
FR88
NPL
α / θ (PRF)
1.70 / 0.35
0.6 / 0.75
0.60 / 0.00 2.00 / 0.45 0.80 / 0.50
β (comb.)
0.4
0.6
0.4
0.9
ZF3 0.2
Figure 1 shows the recall/precision results of the original query with the standard vector space model (VSM), the pseudo relevance feedback (PRF), the expanded query using TCL (Concepts) and the combination (Concepts+PRF):
Fig. 1. Recall / precision of CACM, CR, FR88, NPL, and ZF3
The recall/precision graphs in figure 1 indicate that the automatic query expansion method based on learned concepts yields a considerable improvement in the retrieval effectiveness in mostly all collections over all recall points compared to the standard vector space model and to the pseudo-relevance-feedback method (except with the NPL collection). There is no indication that the improvement is depending on the size of the collection, the number of documents nor on the number or size of the queries. On the small ones the method performs good on CACM but only somewhat better
204
Stefan Klink et al.
than the VSM on the NPL and on the FR it performs better than on the CR collection. On a closer look at the figures the impression could arise that our approach performs better on longer queries. But experiments with the CR collection have shown that ‘title’ queries result a better precision than ‘description’ or ‘narrative’ queries. This behavior is in contrast to the first impression of the figures. 6.3.2
Statistical Tests
To be sure that these exciting results are really meaningful and not just by chance, it is necessary to apply a statistical test. As described above, we used the “macro t-test”. The results of this test for all pairs of methods are shown in table 3. The meaning of the symbols such as “≅”, “>” and “~” is summarized at the bottom of the table. For example, the symbol “<” was obtained in the case of the concept method compared to VSM for the NPL collection. This indicates that (at significance σ = 0.05) the null hypothesis “concept method performs equivalently to the VSM” is rejected and the alternative hypothesis “concept method performs better than the VSM” is accepted. (At σ = 0.01, however, the null hypothesis cannot be rejected.) Roughly speaking, “A ≅(?) B”, “A > (<) B” and “A~B” indicate that “A is almost guaranteed to be better (worse) than B”, “A is likely to be better (worse) than B” and “A is equivalent to B”, respectively. Table 3. Results of the macro t-test
Methods (A vs. B) PRF vs. VSM Concepts vs. VSM Concepts vs. PRF Concepts+PRF vs. VSM Concepts+PRF vs. PRF Concepts+PRF vs. Concepts ≅,? >,< ~
: : :
CACM ≅ ≅ ≅ ≅ ≅ > 0.01 ≤ 0.05 ≤
CR ~ ≅ > ≅ > ~
FR88 ≅ ≅ ~ ≅ ≅ ≅
NPL ≅ > < ≅ ~ ≅
ZF3 ≅ ≅ ≅ ≅ ≅ ~
P-value ≤ 0.01 P-value ≤ 0.05 P-value
The macro t-tests prove our results. Our new method for expanding queries based on term-based concepts outperforms the standard VSM and outperforms or is equivalent to the pseudo relevance feedback (except at the NPL collection) and the results are not obtained by chance. Additionally, the combination of our method with the pseudo-relevance feedback outperforms the VSM and the pseudo-relevance feedback. Thus the term-based concepts are capable to improve the PRF just by adding the learned term weights.
Collaborative Learning of Term-Based Concepts for Automatic Query Expansion
7
205
Conclusions and Outlook
We have described a new approach for bridging the gap of different terminology within the user query and the searched documents by using term-based concepts. Each term of the query corresponds to a concept which is learned from the documents given by the ground-truth of other users. The selection relies on the similarity between the query terms and the learned concept rather than on the similarity between the terms of the collection nor on collection-based or hand-made thesauri. This approach can be used to improve the retrieval of documents in all domains like Digital Libraries, Document Management Systems, WWW etc. The approach performs the better the more user queries (users) are involved. Additionally, we combined our method with the pseudo relevance feedback by adding the term weights of our learned term-based concepts. Our experiments made on five standard test collections with different sizes and different document types have shown considerable improvements vs. the original queries in the standard vector space model and vs. the pseudo relevance feedback (except at the NPL collection). The improvements seem to be not depending on the type nor the size of the collection and they are not obtained by chance. In contrast to the relevance feedback, this approach is not relying on critical thresholds which are dangerous and mostly differ from collection to collection. Furthermore, this approach can be perfectly used in search machines where new queries with their appropriate relevant (user-voted) documents can be easily added to the ‘collection’, for example in Digital Libraries, Document Management Systems or the WWW. These new queries can be used to build an increasing approach and for a constant learning of the stored concepts. The vital advantage is that each user can profit from the concepts learned by other users. The more queries are learned (by the same or by other users) the better our approach will perform. Some experiments are planed to use user-voted relevance feedback instead of collection-given ground-truth to test the performance on ‘real-life’ data. Furthermore, it is planed to make some experiments on the influence of ωi for each term. An approach on passage-based retrieval by Kise [7] has shown good improvements vs. LSI and Density Distribution. An interesting idea for the future is not using the complete relevant documents for expanding the query and not using the N top ranked terms but using terms of relevant passages within the documents. With this idea just the relevant passages are used to learn the concepts. This should increase the quality of the expanded queries and we will be able to do a further evaluation of each concept in great detail, i.e. on the term level.
8 Acknowledgements
This work was supported by the German Ministry for Education and Research, bmb+f (Grant: 01 IN 902 B8).
References
1. Buckley C., Salton G., Allen J.: The effect of adding relevance information in a relevance feedback environment. In Proceedings of the Annual Int. ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 292-300, 1994
2. Baeza-Yates R., Ribeiro-Neto B.: Modern Information Retrieval. Addison-Wesley Pub. Co., 1999. ISBN 020139829X
3. ftp://ftp.cs.cornell.edu/pub/smart/
4. http://trec.nist.gov/
5. Hull D.: Using Statistical Testing in the Evaluation of Retrieval Experiments. In Proceedings of the 16th Annual Int. ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 329-338, 1993
6. Jansen B.J., Spink A., Bateman J., Saracevic T.: Real Life Information Retrieval: A Study of User Queries on the Web. In SIGIR Forum, Vol. 31, pp. 5-17, 1998
7. Kise K., Junker M., Dengel A., Matsumoto K.: Passage-Based Document Retrieval as a Tool for Text Mining with User's Information Needs. In Proceedings of the 4th International Conference on Discovery Science, pp. 155-169, Washington, DC, USA, November 2001
8. Manning C.D., Schütze H.: Foundations of Statistical Natural Language Processing. MIT Press, 1999
9. McCune B.P., Tong R.M., Dean J.S., Shapiro D.G.: RUBIC: A System for Rule-Based Information Retrieval. IEEE Transactions on Software Engineering, Vol. SE-11, No. 9, September 1985
10. Minker J., Wilson G.A., Zimmerman B.H.: An evaluation of query expansion by the addition of clustered terms for a document retrieval system. Information Storage and Retrieval, Vol. 8(6), pp. 329-348, 1972
11. Peat H.J., Willett P.: The limitations of term co-occurrence data for query expansion in document retrieval systems. Journal of the ASIS, Vol. 42(5), pp. 378-383, 1991
12. Pirkola A.: Studies on Linguistic Problems and Methods in Text Retrieval: The Effects of Anaphor and Ellipsis Resolution in Proximity Searching, and Translation and Query Structuring Methods in Cross-Language Retrieval. PhD dissertation, Department of Information Studies, University of Tampere. Acta Universitatis Tamperensis 672. ISBN 951-44-4582-1; ISSN 1455-1616. June 1999
13. Qiu Y.: ISIR: an integrated system for information retrieval. In Proceedings of the 14th IR Colloquium, British Computer Society, Lancaster, 1992
14. Salton G., Buckley C.: Term weighting approaches in automatic text retrieval. Information Processing & Management 24(5), pp. 513-523, 1988
15. Sparck-Jones K.: Notes and references on early classification work. In SIGIR Forum, Vol. 25(1), pp. 10-17, 1991
16. Smeaton A.F., van Rijsbergen C.J.: The retrieval effects of query expansion on a feedback document retrieval system. The Computer Journal, Vol. 26(3), pp. 239-246, 1983
17. Yang Y. and Liu X.: A Re-Examination of Text Categorization Methods. In Proceedings of the 22nd Annual Int. ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 42 - 49, Berkeley, CA, August 1999
Learning to Play a Highly Complex Game from Human Expert Games Tony Kråkenes and Ole Martin Halck Norwegian Defence Research Establishment (FFI) P.O. Box 25, NO-2027 Kjeller, Norway {tony.krakenes,ole-martin.halck}@ffi.no
Abstract. When the number of possible moves in each state of a game becomes very high, standard methods for computer game playing are no longer feasible. We present an approach for learning to play such a game from human expert games. The high complexity of the action space is dealt with by collapsing the very large set of allowable actions into a small set of categories according to their semantic intent, while the complexity of the state space is handled by representing the states of collections of pieces by a few relevant features in a location-independent way. The state–action mappings implicit in the expert games are then learnt using neural networks. Experiments compare this approach to methods that have previously been applied to this domain.
1 Introduction
This paper describes the application of machine learning techniques to the problem of making a software agent that plays a highly complex stochastic game. The game we consider, Operation Lucid, belongs to the class of two-person zero-sum perfect-information stochastic games. It has been designed as a simplified military land combat model, with rules representing central concepts such as movement (and uncertainty in movement), logistics, and of course combat itself, including the asymmetry between attacking and defending a location. Our studies are concerned with the application of artificial intelligence techniques to decision making in combat models, and in this research Operation Lucid is being used as an environment that captures the important general properties of such models, while allowing us not to get bogged down in unnecessary detail. The insights and results gained in this way can then be used in the development and improvement of full-scale combat models. The problem of game playing has been extensively studied in machine learning research. A number of papers describing state-of-the-art developments in this field are collected in [1]; this reference also contains a survey of machine learning in games [2]. However, in most of the games studied in this body of research, the main challenges are different from those posed by Operation Lucid, making several of the standard techniques useless for our problem.
If we regard Operation Lucid as a decision-making problem in a combat simulation context, related research is somewhat thinner on the ground. Recent work includes [3], in which a genetic algorithm is applied to a force allocation problem not entirely dissimilar to ours, and [4], which describes how a knowledge-intensive agent is used for evaluating military courses of action. The remainder of the paper is organized as follows: Section 2 describes our problem domain. In Section 3, we describe our approach in dealing with the high complexity of the problem in order to make a game-playing agent; a more detailed description of the implementation of the agent is the subject of Section 4. Section 5 presents some experimental results, and Section 6 concludes the paper.
2 The Game of Operation Lucid
In this section we present our problem environment – the game of Operation Lucid – and describe some of the properties that make it both interesting and very challenging. A fuller description of the game is given in [5].
2.1 Definition and Rules of Operation Lucid
In short Operation Lucid is a two-person stochastic board game where the two players start off with their pieces in opposing ends of the board, as shown in Figure 1. One player, named Blue, is the attacker. Blue starts the game with fifteen pieces; his aim is to cross the board, break through – or evade – his opponent's defence, and move his pieces off the board into the goal node. The defending player, Red, starts with ten pieces; his task is to hinder Blue from succeeding. The result of the game is the number of pieces that Blue manages to get across the board and into the goal node; thus, there is no "winner" or "loser" of a single game. The rest of this section describes the rules of the game. The game of Operation Lucid is played in 36 turns. At the start of each turn, the right to move pieces is randomly given to either Blue or Red, with equal probabilities. The side winning this draw is allowed to move each piece to one of the neighbouring nodes (that is, to a node that is connected to the piece's current node by an edge) or leave it where it is. The side losing the draw naturally does not get to move any pieces in that turn. The movement of the pieces is subject to two restrictions:
• When the move is finished, no node can have more than three pieces of the same colour.
• Pieces cannot be moved from nodes where the player is defined as the attacker (see below).
Whenever Blue and Red pieces are in the same location at the end of a turn, combat ensues, and one of the pieces in that node is lost and taken out of the game. A weighted random draw decides which side loses a piece. In a node having combat, the player last entering is defined as the attacker of that location, while the other party is defined as the defender of the location. The weighted random draw is specified by the probability that Blue wins, that is, that Red loses a piece. This probability is given
by the fraction (Blue strength)/(Blue strength + Red strength), where a player's strength in a node equals the number of own pieces in that node, modified in accordance with two rules. Firstly, the defending player in the node gains one extra point of strength. Secondly, if the Blue player does not have an unbroken path of nodes with only Blue pieces leading from the combat node to one of Blue's starting positions (a supply line), Blue loses one point of strength. The game ends when the 36 turns are completed, or when Blue has no pieces left on the board. The result of the game is the number of Blue pieces that have reached the goal node.
Fig. 1. The board of the game Operation Lucid, with the pieces placed in their initial positions
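As an illustration of the combat rule, the following sketch (our own illustrative code, not part of the game implementation) computes Blue's winning probability in a contested node under the strength modifiers described above.

```python
# Illustrative sketch of the combat rule: Blue's win probability in a contested
# node is Blue strength / (Blue + Red strength), with +1 strength for the
# defender and -1 for Blue when no supply line is available.
def blue_win_probability(blue_pieces, red_pieces, blue_is_defender, blue_has_supply_line):
    blue_strength = blue_pieces + (1 if blue_is_defender else 0) \
                    - (0 if blue_has_supply_line else 1)
    red_strength = red_pieces + (0 if blue_is_defender else 1)  # defender bonus to Red otherwise
    blue_strength = max(blue_strength, 0)
    return blue_strength / (blue_strength + red_strength)

# Three attacking Blue pieces without supply against two defending Red pieces:
# strengths 2 vs. 3, so Blue wins the draw with probability 0.4.
print(blue_win_probability(3, 2, blue_is_defender=False, blue_has_supply_line=False))
```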
2.2 A Combat Modelling Interpretation of the Game
Operation Lucid was designed to capture important aspects of military land combat modelling; in particular, it represents a scenario where the goal of one side is to break through enemy defence to reach a certain location. Movement of the pieces on the board naturally represents the movement of force units in a terrain; the stochastic ordering of Blue and Red moves is intended to capture the uncertainty inherent in manoeuvring in possibly unfamiliar territory. The rules for determining the result of combat naturally take into account the numerical strength of each side in the area of combat. In addition, they represent the advantage of being the defender of a location; this advantage is due to the defender's opportunity to prepare himself and the environs for resisting attacks. The rule regarding Blue supply lines models the effects of logistics; an invading force will need a functioning line of supplies back to his home base to be able to perform well in enemy territory.
2.3 The Complexity of the Problem
A seemingly obvious way of playing Operation Lucid is by evaluating (in some way) each legal move in the current game state, and then choosing the one with the best
evaluation. This would reduce the problem of constructing a player agent to the problem of evaluating moves in given game states. This method is generally not feasible, however, as the number of possible moves in each state tends to be huge. A player may allocate one of at most five actions (stand still or move in any of the four directions) to each of at most fifteen pieces, so an upper bound on the number of legal moves is 5^15 ≈ 3·10^10. If we assume that a computer generates one thousand possible moves each second (a reasonable assumption according to our experience), it might take up to a year to enumerate all legal moves in one state. In typical game states the number is usually far lower than this – the player may have fewer than fifteen pieces left, and each of the pieces may not be free to perform all five actions. Also, a lot of the legal moves are equivalent, as all pieces of the same side are interchangeable. From the initial position, for instance, the number of possible non-equivalent moves for Blue is 60,112 (disregarding equivalence by symmetry). In intermediate states of the game the number of legal moves increases quickly, and is generally far too large for an exhaustive enumeration. Thus classical methods based on enumeration and evaluation of moves are infeasible in this domain.
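The bound can be checked with a few lines of arithmetic (illustrative only):

```python
# Quick check of the branching-factor estimate: at most 5 actions for each of
# at most 15 pieces, and roughly one year to enumerate them at 1000 moves/s.
upper_bound = 5 ** 15                      # = 30,517,578,125 ≈ 3e10 candidate moves
seconds = upper_bound / 1000               # at 1000 generated moves per second
print(f"{upper_bound:.2e} moves, {seconds / (3600 * 24 * 365):.1f} years to enumerate")
```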
3 Handling the Complexity of the Problem
How can the problem of the game's high complexity be dealt with efficiently? In this section, we briefly mention some previous work on designing player agents for Operation Lucid, and describe the considerations that led to the agent design that is the main subject of this paper. Our efforts so far have mostly been focused on developing Blue agents, and this is the case in the present work as well. Building good Blue agents tends to be a more challenging task than building Red ones, since the nature of the game requires Blue to be the more creative and active side. However, it is usually a minor task to adjust the algorithms to fit a Red player as well.
3.1 Previous Approaches
As explained above, the usual way of making computers play games, that is, generating all possible moves and evaluating which is the best, is not feasible in Operation Lucid. We therefore looked to the way humans play games in order to create good player agents for the game. Humans generally do not test all available moves; rather, we decide on a goal, and form a plan we believe will help us reach this goal. Limiting the Set of Evaluated Moves. In one main approach we have followed, we kept part of the evaluative approach to game-play, but limited the number of moves to be evaluated to a tractable level. This was achieved by imposing constraints on the desired number of pieces in various areas of the board – these constraints could be seen as a representation of a plan for how to play. The challenging part of this procedure was deciding on which constraints to impose, so that the problem became tractable in size without eliminating good candidate moves. This approach in general, and in particular a Blue agent that used self-trained neural networks for evaluating moves, is described further in [6]. The constraint-based approach was successful in
that it yielded agents with fairly good performance. The main disadvantage of this method was that the constraints that had to be imposed in order to keep runtime at a reasonable level limited the range of play of the agents. In effect, this approach meant that the main strategy of play was entered by hand through the move constraints, while the move evaluator performed limited tactical optimizations. Simple Imitation Learning. In our second main approach, we took the idea of playing like a human literally, and designed an agent that played by lazy imitation learning. We made a lookup-table of game states and corresponding expert moves; the agent used this database by trying to imitate the action taken in the stored game state that was most similar to the one at hand. The challenging part of this method was to define a suitable similarity metric between states, and especially to deal with the unwanted side effects arising from states not being exactly equal. Our work with this kind of pure imitation learning produced rather disappointing results. The main reason for this was that the player very quickly found itself in game states with no very close match in the expert database, so that the move taken in the most similar state was not applicable in the present game state. It was realized that these problems were due to the fact that similarity, both of states and of moves, was seen at what can be called a syntactical level. At this level, the actual position and movement of the pieces in each single node were the basis of the agent's behaviour, without any semantical notion of the role of the pieces in the game. The result was that this syntactical imitation would often require some pieces to perform unfeasible moves, while other pieces were left idle. One obvious fix to this problem would be to increase the size of the expert database, hoping to cover a greater range of board positions. This approach, however, requires man-hours, and in our case doubling the database showed only marginal improvement during play at the cost of a considerable increase in runtime.
3.2 Current Approach
Our experiments with pure imitation learning showed that the semantics of the problem domain would have to be addressed to some degree in order to achieve good and efficient performance. One way of doing this could be by following the methodology used in case-based reasoning (see e.g. [7]), where retrieval of previous cases from a database is combined with symbolic reasoning based on a semantic model of the domain1. Case-based reasoning has previously been applied to games such as chess [9] and Othello [10], as well as many real-world problems. The main disadvantages of this approach are that constructing such a domain model is a difficult and time-consuming task, and that frequent case retrievals – as is the case in game playing – may be costly in terms of runtime. These considerations led us to the conclusion that we required an agent design that imitates the expert games on a more abstract level than as single positions and movements of pieces, while not depending on a full semantic model of the game. In
1 We follow the terminology of Mitchell [8], where the term "case-based" is reserved for a subset of the broader set of "instance-based" methods, namely those using richer instance representations than simple feature vectors.
the following, we describe how this was done in the case of moves and game states respectively, and explain how the agent chooses its moves based on the expert games. Moves. As in the simple imitation learning above, our current approach to handling the complexity of the problem is to have an agent capable of mapping directly from game states into the proper moves to take. In order to make this work well, the action space must be reduced considerably in size. To this end, we collapsed the action space from the syntactical level consisting of all possible combinations of single-piece moves to a semantic level featuring only a few move categories. The move categories are symbols (e.g. attack, outflank, breakSupply) describing the overall character or intent of the move. The move categories are not mutually exclusive – they are defined so that a move may be labelled with more than one category. Game States. Reducing the action space in the manner described above raises another question: which pieces are to perform which types of action? Different pieces in a given expert game may be used for different intentions; this made it desirable to work with sub-collections of pieces sharing common intentions, rather than with all the pieces collectively. We call these sub-collections sharing common intentions force groups (FGs). Common intentions usually coincide with co-location of pieces on the board; this led us to define a FG as a collection of pieces of the same colour interconnected with each other but not with other friendly pieces. The pieces in a FG should act together pursuing the intentions of the group; should differing intentions occur within a FG, it may break up into smaller FGs that individually regain a common intention. The main advantage of introducing the concept of FGs is that it allows collapsing the game state space considerably. Previously, the game state was represented by the number of pieces for each side in each individual node, the number of remaining rounds and which side defended each combat node. We now employ a FG-centric state representation where individual pieces and nodes, even the board geography itself, are no longer of direct interest. What is of interest is a set of aggregated features like path lengths to the goal node, resistance along these paths, the number of own pieces in the FG and in total, the number of enemy pieces, the distance to neighbouring FGs (if any), the combat strength of the FG (including supply and defender status) and the number of remaining rounds. Each FG in a game state has its individual, location-independent state perception, on which it bases its actions. This means that FGs located on different parts of the board are considered similar if their environs are similar in terms of the FG-centric state representation. The representation we use is illustrated in more detail in Figure 2. State–Action Mapping. Using the state and move representation just presented, we could repeat the lookup-table method described in Section 3.1 for mapping observed game states into actions to perform. The problem of finding a good distance metric for comparing FG states in order to choose the most similar one would then still remain. Instead, we chose to leave the lazy-learning design, and used the expert games to train a set of neural network classifiers for state–action mapping.
In this way, the mappings implicit in the expert games may also generalize better to new FG states; an added advantage is that a full scan through the database at each decision point is no longer necessary, so that runtime is decreased. A more detailed look at the making of these classifiers, along with the rest of the player agent, is the subject of the next section.
(Distance1, resistance1)    (2, 1)
(Distance2, resistance2)    (2, 2)
(Distance3, resistance3)    (3, 1)
(Distance4, resistance4)    (4, 3)
(Distance5, resistance5)    (7, 2)
Units in this group         5
Total Blue units            12
Total Red units             9
Distance to closest FG      3
Blue force in combat        0
Red force in combat         0
Proportion supplied         0
Proportion of attackers     0
Rounds left                 27
Fig. 2. Representation of the game state as seen from the perspective of a force group. The first ten attributes give the distance to each of the five exit nodes, together with the number of Red pieces on the path to each node. These are sorted in ascending order. The remaining nine attributes describe own and opposing forces, the combat situation, and the number of rounds left
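A hypothetical encoding of this FG-state as a 19-component feature vector is sketched below; the parameter names are ours and merely mirror the attributes listed in Figure 2.

```python
# Hypothetical encoding of the 19-attribute FG-state of Figure 2: five
# (distance, resistance) pairs to the exit nodes, sorted by distance, followed
# by the nine scalar attributes. Names are illustrative, not from the paper.
def fg_state_vector(exit_paths, units_in_group, total_blue, total_red,
                    dist_closest_fg, blue_in_combat, red_in_combat,
                    prop_supplied, prop_attackers, rounds_left):
    pairs = sorted(exit_paths)                     # [(distance, resistance), ...] ascending
    features = [value for pair in pairs for value in pair]
    features += [units_in_group, total_blue, total_red, dist_closest_fg,
                 blue_in_combat, red_in_combat, prop_supplied,
                 prop_attackers, rounds_left]
    return features                                # length 19

# The example values shown in Figure 2:
state = fg_state_vector([(2, 1), (2, 2), (3, 1), (4, 3), (7, 2)],
                        5, 12, 9, 3, 0, 0, 0, 0, 27)
print(len(state), state)
```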
4 Implementing the Agent Design
Our goal is to construct game-playing software agents. Such an agent should be able to get a game state as input, and, from this state and the rules of the game, generate a move as output. The move describes where each of the own pieces should be placed when the turn is finished. In accordance with the design choices described in the previous section, our agent selects its move as illustrated in Figure 3. Upon receiving the current game state, the agent identifies FGs and gives each FG a self-centric and simplified perception of the game state. Each FG then uses the move type classifiers in conjunction with this state representation in order to select which class (or classes) of moves is appropriate in its current situation. Finally, the agent should of course be able to translate these pieces of semantic move advice into actual board movement of the pieces in a suitable way. The remainder of this section details the steps involved in building the agent. In the following, game state or simply state refers to the overall game state (i.e. positioning of pieces, rounds left and attacking sides in the case of combat), while FG-state refers to the simplified, egocentric state perception of each FG. Similarly, move refers to the collective action of all Blue’s pieces in a turn, while FG-move refers to the action taken by a single FG only.
(Diagram: the input game state is split by force group extraction into FGs; each FG-state is fed to the move type classifiers (approach? outflank? ..., each answering Y/N), giving the chosen group move types; the move implementation turns these into group moves, which together form the output full move.)
Fig. 3. Agent design
4.1 Database of Expert Games
A database of 20 human expert games for Blue against a fixed Red opponent was compiled. On each of Blue’s turns in these games, one or more FGs would be present; the total number of Blue FG-states throughout the game series (not necessarily distinct) was 646. The expert labelled each FG-move into one or more of twelve move categories according to the intent of the move. The move categories are given, along with brief descriptions of their meaning, in Table 1. The Red opponent employed in the expert games, named AxesRed, is an automatic playing agent adopting two main strategies of play. Firstly, it will never advance from the home row (i.e. the northernmost row) into the field, but stay home and wait for Blue to attack. Secondly, it attempts to position its pieces within the home row in a manner that proportionally mirrors the perceived threat on each vertical axis of nodes. For instance, if Blue has 13 pieces left, and 4 of these are located on the B axis (see Figure 1 for references to parts of the board), Red will to the best of his ability attempt to position 4/13 of his remaining pieces on this axis, i.e. in B5. A few simple rules apply to ensure that Red keeps a sound level of play in special cases. Red can of course not have more than 3 pieces in any node, and should its calculations require more than this, it will ensure that backup pieces are kept in the neighbourhood. Red will not reinforce in a combat node – this entails losing the defender’s advantage – if this is not more than compensated for by the strength of the extra pieces. Although AxesRed is a rather simple player, it has proved to serve well as a benchmark opponent for evaluating Blue agents.
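The following sketch illustrates our reading of AxesRed's proportional placement rule; it is not the actual agent code, and the handling of rounding surpluses and the special-case rules is simplified away.

```python
# Rough sketch of AxesRed's placement rule: mirror the proportion of Blue
# pieces threatening each axis, subject to the three-pieces-per-node limit on
# the home row. Rounding may over- or under-allocate by a piece or two; the
# real agent keeps surplus pieces as backup in neighbouring nodes.
def axes_red_allocation(blue_per_axis, red_pieces_left, max_per_node=3):
    total_blue = sum(blue_per_axis.values())
    allocation = {}
    for axis, blue_count in blue_per_axis.items():
        desired = round(red_pieces_left * blue_count / total_blue)
        allocation[axis] = min(desired, max_per_node)
    return allocation

# 13 Blue pieces left, 4 of them on axis B: Red aims to place 4/13 of its
# 10 remaining pieces (about 3) in node B5.
print(axes_red_allocation({"A": 2, "B": 4, "C": 3, "D": 2, "E": 2}, 10))
```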
Table 1. Move categories for Blue force groups
Category            Description
Approach            Move pieces closer to the goal node
EnsureSupply        Keep some pieces back, with the intent of ensuring a supply line for present or future combat
BreakSupply         Break the supply line, i.e. advance pieces previously withheld for supply purposes from the southernmost rows
Attack              Move pieces into a node containing only Red pieces
ReinforceCombat     Move additional pieces into a combat node
ContinueCombat      Neither exit from nor reinforce existing combat
Outflank            Perform an evading manoeuvre, sideways or backwards, aiming at a different exit node
GoToGoal            Move pieces from the northernmost row into the goal node
ConcentrateForces   Move pieces within a FG closer, i.e. occupying fewer nodes
SplitIntoGroups     Divide a FG into two or more FGs
LinkUpGroups        Join a FG to another, creating a larger FG
StayInPosition      Leave all pieces in the FG unmoved

4.2 Neural Network Classifiers for Force Groups
Having assembled the database of FG-states and corresponding expert semantic FG-moves, we trained an ensemble of neural networks (NNs) to serve as FG-state classifiers for the agent. One NN was trained for each label, using the database of FG-states as input data and the presence (0 or 1) of the label in question as target values. Each network was a standard feedforward NN featuring 19 input nodes, 36 hidden nodes and 1 output node (ranging from 0 to 1), sigmoid activation functions, and weights initially randomised from –0.2 to 0.2. Training was done by backpropagation. The input vector for each FG-state was scaled so that the magnitude ranges of the components were similar. About 1/3 of the data set was initially reserved and used for validating the training procedure, and the NNs were trained by repeatedly picking random examples from the training data. The learning of the NNs was evaluated by the proportion of correctly classified examples over the validation data set. The ensemble quickly attained a classification performance of 0.9, and after further training reached about 0.95. We noted that performance on the validation set did not start to decrease, even if training was continued. Taking this as an indication that the data set presented little danger of overfitting – even with the large number of network weights used – we restarted training using all available data. The total classification accuracy on the full training set reached about 0.99; the individual label-specific NNs showed minor deviations from this average. As explained above, the trained NNs are used in the game-playing agent. In a given game state, each FG inputs its FG-state into the twelve nets, each of which outputs a number between 0 and 1. Output close to 1 indicates that the FG should perform a move corresponding to the category in question. The output value of 0.5 was used as the limit for choosing move categories.
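As an illustration, the ensemble could be approximated with off-the-shelf components as sketched below. The original agent uses its own backpropagation implementation; this scikit-learn-based sketch only mirrors the architecture described above and assumes that the 646 × 19 FG-state matrix and the 646 × 12 label matrix are available and that every category occurs at least once in the data.

```python
# Sketch of the classifier ensemble: one 19-36-1 sigmoid network per move
# category, trained on the labelled FG-states (scikit-learn used as a stand-in
# for the paper's own backpropagation networks).
from sklearn.neural_network import MLPClassifier

MOVE_CATEGORIES = ["Approach", "EnsureSupply", "BreakSupply", "Attack",
                   "ReinforceCombat", "ContinueCombat", "Outflank", "GoToGoal",
                   "ConcentrateForces", "SplitIntoGroups", "LinkUpGroups",
                   "StayInPosition"]

def train_move_classifiers(fg_states, labels):
    """fg_states: (646, 19) array of scaled FG-states; labels: (646, 12) 0/1 matrix."""
    ensemble = {}
    for k, category in enumerate(MOVE_CATEGORIES):
        net = MLPClassifier(hidden_layer_sizes=(36,), activation="logistic",
                            max_iter=2000, random_state=k)
        net.fit(fg_states, labels[:, k])
        ensemble[category] = net
    return ensemble

def chosen_categories(ensemble, fg_state, threshold=0.5):
    """Return the categories whose output exceeds 0.5, ranked by output value."""
    probs = {c: net.predict_proba(fg_state.reshape(1, -1))[0, 1]
             for c, net in ensemble.items()}
    return sorted((c for c, p in probs.items() if p >= threshold),
                  key=lambda c: -probs[c])
```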
In this procedure, the decision-making task of the player agent can be regarded as delegated to its constituent FGs, which choose and perform actions based on their own perceptions of the game state. An interpretation of the NNs when this view is adopted is as a shared overall doctrine for how to act in given situations.
4.3 Implementing the Acting Module
With the classification module of our agent properly in place, we turned to the task of actually designing and implementing the acting module. This module receives as input one or more chosen move categories for each FG, and returns the resulting movement of the individual pieces in the FG. Due to space restrictions, we are unable to go into details of this module here. Instead, we mention a number of difficulties that arose in the implementation of this part of the agent. In particular, the set of chosen move categories may be inconsistent, in which case not all of the move types may be performed, e.g. if both breakSupply and ensureSupply are chosen. Another inconsistent set of move types is the empty set – since we have specified stayInPosition as a category of its own, and this category was not selected, this is not an order to simply stand still. In cases when more than one move category is specified, we must decide which pieces should move according to which categories, or alternatively which categories should be given precedence. At the current stage, we use the simple strategy of using the numerical outputs of the respective neural nets for ranking the categories. Pieces are moved according to the first category, and if after this some pieces have not been assigned to an action, moves for the next category are implemented, and so on. The problem with empty move sets mentioned above was dealt with by defining the approach category as a default action; this category was chosen because advancing across the board is a reasonable baseline course of action for Blue – at any rate, it is almost always better than standing still.

Table 2. Results and approximate runtimes for various Blue agents playing against AxesRed

Player             Average score   Approx. time (s/game)
Human              9.41            –
SimpleBlue         3.88            0.2
OneAxisBlue        5.34            0.4
ConstraintNNBlue   6.60            90
ImitationBlue      4.04            30
Present agent      6.23            12

5 Experiments
We measure the success of a Blue agent by how many pieces it manages to move into the goal node in play against the AxesRed agent. Therefore, we need to have an idea of what constitutes a good result when playing against this particular Red opponent. The result from the expert games is a natural measure of the potential of our agent, since after all it is this expert behaviour we are trying to learn from. What then is a
bad result? This is difficult to say, but we can at least get a notion of a mediocre result by letting some rather naive Blue players challenge the AxesRed player. Two such benchmark Blue players have been designed. The first, SimpleBlue, employs the simple strategy of moving all its pieces forward when receiving the turn. This results in a full-breadth simultaneous attack, where three Blue pieces take on two Red pieces in each of the northernmost nodes. The second player, OneAxisBlue, initially decides upon an axis of attack, and advances as many of its pieces as possible along this axis for the rest of the game. This results in a focused attack on one of the northernmost nodes. Neither of these two players actively keeps a supply line, although the design of the game ensures that OneAxisBlue’s supply line happens to be intact in the first phase of the attack. Furthermore, it is interesting to compare the performance of our new agent with the best results from the two approaches described in Section 3.1. The agent called ConstraintNNBlue – the highest scoring agent we have managed to make during our previous work – is the constraint-based agent with NN move evaluation, while ImitationBlue is the rather less successful lazy learner. The average results obtained by the human expert and the four agents mentioned are given in Table 2, along with the best result obtained by the agent treated in this paper. For each automatic agent, 1000 games were played; the human played 20. The approximate average runtime per game is also reported. As we can see, our present agent outperforms the two benchmark players and the imitating agent, but still has some way to go to reach the human expert – this latter fact is nothing more than could be expected2. Comparing the agent to ConstraintNNBlue shows that it fails to set a new record; on the other hand, it is not discouragingly far behind, while being almost an order of magnitude quicker. Moreover, we expect the agent to have considerable potential for improvement within the limits of the current design; a larger database of expert games and better move implementations are two of the more obvious measures that can be taken.
6 Conclusion
We have presented the design and implementation of an agent playing a highly complex stochastic game. The complexity of the game makes it impossible to use standard game-playing methods; instead, the agent uses neural networks to learn to play from a database of human expert games. The high complexity is handled by collapsing the huge action space into a few categories representing the semantic intentions of moves, and representing the game states of subsets of the agent’s playing pieces by a few relevant features. An experimental evaluation of this approach shows promising results.
2 Indeed, our experience with the game leads us to suspect that the human must have been rather lucky in these 20 games to achieve this score.
References
1. Fürnkranz, J., Kubat, M. (eds.): Machines That Learn to Play Games, Nova Science Publishers (2001).
2. Fürnkranz, J.: Machine learning in games: A survey. In: Fürnkranz, J., Kubat, M. (eds.): Machines That Learn to Play Games, Nova Science Publishers (2001) 11–59.
3. Schlabach, J. L., Hayes, C. C., Goldberg, D. E.: FOX-GA: A genetic algorithm for generating and analyzing battlefield courses of action. Evolutionary Computation 7 (1999) 45–68.
4. Boicu, M., Tecuci, G., Marcu, D., Bowman, M., Shyr, P., Ciucu, F., Levcovici, C.: Disciple-COA: From agent programming to agent teaching. In: Langley, P. (ed.): Proceedings of the 17th International Conference on Machine Learning (ICML-2000), Morgan Kaufmann (2000) 73–80.
5. Dahl, F. A., Halck, O. M.: Three games designed for the study of human and automated decision making. Definitions and properties of the games Campaign, Operation Lucid and Operation Opaque. FFI/RAPPORT-98/02799, Norwegian Defence Research Establishment (FFI), Kjeller, Norway (1998).
6. Sendstad, O. J., Halck, O. M., Dahl, F. A.: A constraint-based agent design for playing a highly complex game. In: Proceedings of the 2nd International Conference on the Practical Application of Constraint Technologies and Logic Programming (PACLP 2000), The Practical Application Company Ltd (2000) 93–109.
7. Aamodt, A., Plaza, E.: Case-based reasoning: Foundational issues, methodological variations, and system approaches. AI Communications 7 (1994) 39–59.
8. Mitchell, T. M.: Machine Learning. WCB/McGraw-Hill (1997).
9. Kerner, Y.: Learning strategies for explanation patterns: Basic game patterns with application to chess. In: Veloso, M., Aamodt, A. (eds.): Proceedings of the 1st International Conference on Case-Based Reasoning (ICCBR-95). Lecture Notes in Artificial Intelligence Vol. 1010, Springer-Verlag (1995) 491–500.
10. Callan, J. P., Fawcett, T. E., Rissland, E. L.: CABOT: An adaptive approach to case-based search. In: Proceedings of the 12th International Conference on Artificial Intelligence, Morgan Kaufmann (1991) 803–809.
Reliable Classifications with Machine Learning Matjaˇz Kukar and Igor Kononenko University of Ljubljana, Faculty of Computer and Information Science Trˇzaˇska 25, SI-1001 Ljubljana, Slovenia {matjaz.kukar,igor.kononenko}@fri.uni-lj.si
Abstract. In the past decades Machine Learning algorithms have been successfully used in numerous classification problems. While they usually significantly outperform domain experts (in terms of classification accuracy or otherwise), they are mostly not being used in practice. A plausible reason for this is that it is difficult to obtain an unbiased estimate of a single classification's reliability. In this paper we propose a general transductive method for estimating the reliability of classifications of single examples that is independent of the applied Machine Learning algorithm. We compare our method with existing approaches and discuss its advantages. We perform extensive testing on 14 domains and 6 Machine Learning algorithms and show that our approach can frequently yield more than 100% improvement in reliability estimation performance.
1 Introduction

Usually, Machine Learning algorithms output only bare classifications for new, unclassified examples. While for almost all Machine Learning algorithms there are ways to at least partially provide a quantitative assessment of a classification in question, so far there is no general method to assign reliability to a single classification. Note that we are interested in the classifier's performance on a single example and not in its average performance on an independent dataset. Let us define the reliability of a classification as the estimated probability that the (single) classification is in fact the correct one. Some authors [16, 21] use the statistical term confidence for this purpose. We, however, have decided to use the term reliability, since its calculation and interpretation are not always strictly statistical. For a given example description xi we define the reliability of its predicted class yi as follows:

Rel(yi) = P(yi is the true class of example xi)    (1)
There have been numerous attempts to assign probabilities to the outputs of Machine Learning classifiers (decision trees and rules, Bayesian classifiers, neural networks, nearest neighbour classifiers, . . . ) in order to interpret their decisions as a probability distribution over all possible classes. In fact, we can trivially convert every Machine Learning classifier's output to a probability distribution by assigning the predicted class the probability 1, and 0 to all other possible classes. The posterior probability of the predicted class can be viewed as the classifier's trust in its prediction (reliability) [3, 19]. However, such estimations may not be good due to the applied algorithm's language and representational biases.
There is some ongoing work on constructing classifiers that divide the data space into regions that are reliable and regions that are not reliable [1]. Such meta-learning approaches have also been used for picking the most reliable prediction from the outputs of an ensemble of classifiers [14, 17]. We propose a different approach based on a general transductive method for reliability estimation. Our approach differs from the above in the following:
• it does not divide the data space into reliable and unreliable regions, but works instead on single data points (examples),
• it does not induce a meta-classifier at all, but instead uses a transductive framework to generate a reliability estimate for each single example.
Our approach is independent of the applied Machine Learning algorithm and requires only that it is able to represent its classifications as probability distributions. The core idea is to compare the classification's probability distributions between the inductive and the transductive step and to use their difference to assess the reliability of single points (examples) in the data space. Such assessments are very useful, especially in risk-sensitive applications (medical diagnosis, financial and critical control applications), because there it often matters how much one can rely upon a given prediction. In such cases a general reliability measure of a classifier (e.g. classification accuracy, mean squared error, . . . ) with respect to the whole input distribution would not provide the desired guarantee. Another use of reliability estimations is in ensembles, for selecting or combining answers from different classifiers [8]. The paper is organized as follows. In Sec. 2 we describe the basic ideas of transductive inference and outline the reasons why transductive reliability estimation should work well. In Sec. 3 we develop our idea into a general and efficient implementation of transductive reliability estimation. In Sec. 4 we evaluate our approach on 14 domains with 6 Machine Learning algorithms. In Sec. 5 we present some conclusions and directions for future work.
2 Transduction Principle for Reliability Estimation

Transduction is an inference principle that takes a training sample and aims at estimating the values of a discrete or continuous function only at given unlabelled points of interest from the input space, as opposed to the whole input space as in induction. In the learning process the unlabelled points are suitably labelled and included into the training sample. The usefulness of unlabelled data [12] has, among others, been advocated in the context of co-training. It has been shown that the performance of a better-than-random [2] classifier can be significantly boosted by using only additional unlabelled data. It has been suggested [20] that when solving a given problem one should avoid solving a more general problem as an intermediate step. The reasoning behind this principle is that, in order to solve a more general task, resources may be wasted or compromises made which would not have been necessary for solving only the problem at hand (i.e. function estimation only on given points). This common-sense principle reduces a more general problem of inferring a functional dependency on the whole input space (inductive inference) to the problem of estimating the values of a function only at given points (transductive inference).
Let X be a space of attribute descriptions of points in a training sample, and Y a space of labels (continuous or discrete) assigned to each point. Given a probability distribution P defined on the input space X × Y, a training sample

S = {(x1, y1), . . . , (xl, yl)}    (2)

consisting of l points is drawn i.i.d. (independently and identically distributed) according to P. Additional m data points (the working sample)

W = {xl+1, . . . , xl+m}    (3)

with unknown labels are drawn in the same manner. The goal of transductive inference is to label all the points from the sample W using a fixed set H of functions f : X → Y in order to minimize an error functional both in the training sample S and in the working sample W (effectively, in S ∪ W) [5, 16]. In contrast, inductive inference (excluding ensembles of classifiers) aims at choosing a single function f ∈ H that is best suited to the unknown probability distribution P. At this point the question arises how to calculate the labels for the working sample. This can be done by labelling every point from the working sample with every possible label value; however, given m working points and n possible class labels this leads to a combinatorial explosion yielding n^m possible labellings. For each possible labelling, an induction process on S ∪ W is run, and an error functional (error rate) is calculated. By leveraging the i.i.d. sampling assumption and transductive inference, one can estimate for each labelling its reliability (the probability that it is correct). If the i.i.d. assumption holds, the training sample S as well as the joint correctly labelled sample S ∪ W should both reflect the same underlying probability distribution P. If one could measure a degree of similarity between the probability distributions P(S) and P(S ∪ W), this could be used as a measure of reliability of the particular labelling. Unfortunately, this problem is non-computable [11], so approximation methods have to be used [21, 9].

2.1 Why does Transduction Work?

There is a strong connection between the transduction principle and algorithmic (Kolmogorov) complexity. Let the sets S and S ∪ W be represented as binary strings u and v, respectively. Let l(v) be the length of the string v and C(v) its Kolmogorov complexity. We define the randomness deficiency of the string v as follows [11, 21]:

δ(v) = l(v) − C(v)    (4)
Randomness deficiency measures how random the respective binary string – and therefore the set it represents – is. The larger it is, the more regular the string (and the set). If we could calculate the randomness deficiency (but we cannot, since it is not computable), we could do so for all possible labellings of the set S ∪ W and select the labelling of W that results in the largest randomness deficiency of the joint set S ∪ W as the most probable one [21]. We could also construct a universal Martin-Löf test for randomness [11]:

∑{P(x | l(x) = n) : δ(x) ≥ m} ≤ 2^{−m}    (5)
That is, for all binary strings of fixed length n, the probability of their randomness deficiency δ being greater than m is less than 2^{−m}. The value 2^{−δ(x)} is therefore a p-value function for our randomness test [21]. Unfortunately, the definition of randomness deficiency is based on the Kolmogorov complexity and is not computable. Therefore we need feasible approximations to use this principle in practice. Extensive work has been done using Support Vector Machines [5, 16, 21]; however, no general approach exists so far.

2.2 A Machine Learning Interpretation

In Machine Learning terms, the sets S and S ∪ W are represented by induced models MS and MS∪W. The randomness of the sets is reflected in the (Kolmogorov) complexity of the respective models. If for the set S ∪ W the labelling with the largest randomness deficiency is selected, it follows from the definition (Eq. 4) that, since the uncompressed description length l(v) is constant, the Kolmogorov complexity C(MS∪W) is minimal. This implies that the respective labelling of W is most consistent with the training data S, since minimal Kolmogorov complexity implies most regularities in the data. This in turn implies that our Machine Learning algorithm will induce a model MS∪W that is most similar to MS.1 Ideally, if the training data S is sufficient for inducing a perfect model, there is no difference between MS and MS∪W. This greatly simplifies our view of the problem: it suffices to compare the (finite) models MS and MS∪W. A greater difference means that the set S ∪ W is more random than the set S and (under the assumption that S is sufficient for learning an effective model) that W consists of (at least some) improperly labelled, untypical examples. Although the problem seems easier now, it is still a computational burden to calculate changes between model descriptions (assuming that they can be efficiently coded; black-box methods are thus out of the question). However, there exists another way. Since transduction is an inference principle that aims at estimating the values of a function only at given points of interest from the input space (the set W), we are interested only in the model change concerning these examples. Therefore we can compare the classifications (or, even better, probability distributions) of the models MS and MS∪W. Obviously, the labelling of W that would minimally change the model MS is the one given by MS. We will examine this approach in more detail in the next section.
3 Efficient Transductive Reliability Estimations The prerequisite for a Machine Learning algorithm to be used in a transductive reliability framework is to represent its classifications as a probability distribution over all possible classes, although these distributions may not be very good estimates. The transductive reliability estimation process is basically a two-step process, featuring an inductive step followed by a transductive step. 1
Actually, here it would be more appropriate to use a prefix Kolmogorov complexity K( ) instead of C( ), and two-part MDL-style (model+exceptions) descriptions of the sets, since the Kolmogorov complexity C( ) itself is non-monotonic [11] wrt. the string length.
(Diagram: (a) Inductive step – Machine Learning on the training set yields a classifier, which classifies an example from an independent set; (b) Transductive step – the training set with the added example yields a second classifier, and the difference (distance) between the two classifications is measured.)
Fig. 1. Transductive reliability estimation
– An inductive step is just like an ordinary inductive learning process in Machine Learning. A Machine Learning algorithm is run on the training set, inducing a classifier. A selected example is taken from an independent dataset and classified using the induced classifier. The same example is duplicated, labelled with its assigned class, and finally included into the training set (Fig. 1a).
– A transductive step is almost a repetition of an inductive step. A Machine Learning algorithm is run on the changed training set, transducing a classifier. The same example as before is taken from the independent dataset and again classified, now using the transduced classifier (Fig. 1b).
Both classifications (represented by probability distributions) of the same example are compared and their difference (distance) is calculated, thus approximating the randomness deficiency. A brief algorithmic sketch is given in Fig. 2.
3.1 Calculating the Difference between Probability Distributions

Since a prerequisite for a Machine Learning algorithm is to represent its classifications as a probability distribution over all possible classes, we need a method to measure the difference between two probability distributions. The difference between two probability distributions (over discrete item sets of size N < ∞) can be viewed as a distance between two vectors in R^N. In principle, any metric can be used; however, not all strict metric properties are required. We require only that the difference measure D between probability distributions P and Q satisfies the following:
1. D(P, Q) ≥ 0 (non-negativity)
2. 0 ≤ D(P, Q) ≤ ∞, where D(P, Q) = 0 ⇔ P = Q
3. D(P, Q) = D(Q, P) (symmetry).
In our case P is the probability distribution after the inductive step, and Q is the probability distribution after the transductive step. For calculating the difference between probability distributions, the Kullback-Leibler divergence is frequently used [18]. In our experiments we use a symmetric Kullback-Leibler divergence.
Requires: Machine Learning classifier, a training set and an unlabelled test example
Ensures: Estimation of the test example's classification reliability
1: Inductive step:
   • train a classifier from the provided training set
   • select an unlabelled test example and classify this example with the induced classifier
   • label this example with the predicted class
   • temporarily add the newly labelled example to the training set
2: Transductive step:
   • train a classifier from the extended training set
   • select the same unlabelled test example as above and classify this example with the transduced classifier
3: Calculate a randomness deficiency approximation as the difference between the inductive and the transductive classification.
4: Calculate the reliability of the classification as 2^{−difference}.
Fig. 2. The algorithm for transductive reliability estimation
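A minimal sketch of this procedure is given below; it assumes a scikit-learn-style classifier with fit/predict_proba and uses the symmetric Kullback-Leibler divergence introduced in the next subsection as the difference measure. It is an illustration of the idea, not our actual implementation.

```python
# Minimal sketch of the two-step procedure of Fig. 2, assuming any classifier
# with fit/predict_proba; the difference measure is the J-divergence of Sect. 3.2.
import numpy as np
from sklearn.base import clone

def j_divergence(p, q, eps=1e-12):
    p, q = np.clip(p, eps, 1.0), np.clip(q, eps, 1.0)
    return float(np.sum((p - q) * np.log2(p / q)))

def transductive_reliability(classifier, X_train, y_train, x):
    # Inductive step: train on the original data and classify x.
    inductive = clone(classifier).fit(X_train, y_train)
    p = inductive.predict_proba([x])[0]
    y_hat = inductive.classes_[np.argmax(p)]
    # Transductive step: add (x, predicted label) to the training set and retrain.
    X_ext = np.vstack([X_train, [x]])
    y_ext = np.append(y_train, y_hat)
    transduced = clone(classifier).fit(X_ext, y_ext)
    q = transduced.predict_proba([x])[0]
    # Reliability approximated as 2^(-difference) between the two distributions.
    return y_hat, 2.0 ** (-j_divergence(p, q))
```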
3.2 Kullback-Leibler Divergence

Kullback-Leibler divergence, also frequently referred to as relative entropy or I-divergence, is defined between probability distributions P and Q as

I(P, Q) = − ∑_{i=1}^{n} p_i log2 (q_i / p_i)    (6)

Symmetric Kullback-Leibler divergence, or J-divergence, is defined between probability distributions P and Q as

J(P, Q) = I(P, Q) + I(Q, P) = ∑_{i=1}^{n} (p_i − q_i) log2 (p_i / q_i)    (7)
J(P, Q) is limited to the interval [0, ∞], with J(P, P) = 0. Similarly to the p-values of the universal Martin-Löf randomness test (Eq. 5), we calculate our reliability estimation as

Rel(P, Q) = 2^{−J(P,Q)}    (8)
However, measuring the difference between probability distributions does not always perform well. There are at least a few exceptional classifiers (albeit trivial ones) where our original approach utterly fails.

3.3 The Curse of Trivial Models

So far we have implicitly assumed that the model used by the classifier is good (at the very least better than random). Unsurprisingly, our approach works very well with
random classifiers (whose probability distributions are randomly calculated) by effectively labelling their classifications as unreliable [8]. On the other hand, there also exist simple constant and majority classifiers. A constant classifier classifies all examples into the same class Ck with probability 1. In such cases our approach always yields reliability 1, since there is no change in the probability distribution. A majority classifier classifies all examples into the same class Ck that is the majority class in the training set. The probability distribution is always the same and corresponds to the distribution of classes in the training set. In such cases our approach yields reliability very close to 1, since there is almost no change in the probability distribution (only for the example in question), that is, at most 1/N, where N is the number of training examples. In large datasets this change is negligible. Note that such extreme cases do occur in practice and even in real life. For example, a physician who always diagnoses an incoming patient as ill is a constant classifier. On the other hand, a degenerated – overpruned – decision tree (one leaf only) is a typical majority classifier. In both cases all classifications are seemingly completely reliable. Obviously we also need to take into account the quality of the classifier's underlying model and appropriately change our definition of reliability. If we review our original definition of reliability (Eq. 1), it is immediately obvious that we assumed that the model was good. Our reliability estimations actually estimate the conditional reliability with respect to the model M:

Rel(yi | M) = P(yi is the true class of xi | model M is good)    (9)

To calculate the required unconditional reliability we apply the conditional probability theorem for the whole model,

Rel′(yi) = P(model M is good) ∗ P(yi is the true class of xi | model M is good)    (10)

or, even better, for the partial models for each class yi,

Rel′(yi) = P(model M is good for yi) ∗ P(yi is the true class of xi | model M is good for yi)    (11)

Now we only need to estimate the unconditional probabilities

P(model is good)    or    ∀i : P(model is good for yi)    (12)
In Machine Learning we have many methods to estimate the quality of the induced model; e.g., a cross-validation computation of classification accuracy is suitable for estimating Eq. 12. However, it may be better to calculate it in a less coarse way, since at this point we already know the predicted class value (yi). We propose a calculation of the (Bayesian) probability that the classification into a certain class is correct. Our approach is closely related to the calculation of post-test probabilities in medical diagnostics [3, 13]. The required factors can easily be estimated from the confusion matrix (Def. 1) with internal testing.

Definition 1. A confusion matrix (CM) is a matrix of classification errors obtained with an internal cross validation or leave-one-out testing on the training dataset. The ij-th element cij stands for the number of classifications to the class i that should belong to the class j.
Definition 2. Class sensitivity and specificity are a generalization of the sensitivity (true positives ratio) and specificity (true negatives ratio) values for multi-class problems. Basically, for N classes we have N two-class problems. Let Cp be the correct class in a certain case, and C the class predicted by the classifier in the same case. For each of the possible classes Ci, i ∈ {1..N}, we define its class sensitivity Se(Ci) = P(C = Ci | Cp = Ci) and its class specificity Sp(Ci) = P(C ≠ Ci | Cp ≠ Ci) as follows:

Se(Ci) = P(C = Ci | Cp = Ci) = cii / ∑_j cij    (13)

Sp(Ci) = P(C ≠ Ci | Cp ≠ Ci) = ∑_{j≠i} cji / ∑_{j≠i} ∑_k cjk    (14)
Class conditional probability is calculated for each class Ci, given its prior probability P(Ci), approximated with the prevalence of Ci in the training set, its class specificity (Sp) and sensitivity (Se):

Pcond(Ci) = P(Ci)Se(Ci) / [P(Ci)Se(Ci) + (1 − P(Ci))(1 − Sp(Ci))]    (15)
For a fixed model and a fixed class Ci, its class sensitivity and specificity are typically interdependent according to the ROC (receiver operating characteristic) curve (Fig. 3). An important advantage of the class conditional probability over classification accuracy is that it takes into account both the classifier's characteristics and the prevalence of each class individually (Fig. 3). It is non-monotonic over the classes and therefore better describes the classifier's performance in its problem space. To calculate the reliability estimation we therefore need the probability distributions P and Q, and the index i = argmax P that determines the class with maximal probability (Ci). According to Eq. 11 we calculate the reliability estimation as

Rel(P, Q; Ci) = Pcond(Ci) × 2^−J(P,Q)   (16)
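A minimal sketch of how Eqs. 13–16 can be evaluated in code is given below; it assumes that J(P, Q) denotes the symmetric Kullback-Leibler divergence between the class probability distributions before and after the transductive step (as used earlier in the paper), and it transcribes Eqs. 13–14 exactly as printed:

import numpy as np

def class_se_sp(cm, i):
    # Class sensitivity and specificity (Eqs. 13-14), with cm[i, j] the number of
    # cases classified into class i whose true class is j (Def. 1).
    other = np.arange(len(cm)) != i
    se = cm[i, i] / cm[i, :].sum()
    sp = cm[other, i].sum() / cm[other, :].sum()
    return se, sp

def p_cond(cm, prior, i):
    # Post-test probability of class i (Eq. 15); prior[i] approximates P(Ci).
    se, sp = class_se_sp(cm, i)
    return prior[i] * se / (prior[i] * se + (1 - prior[i]) * (1 - sp))

def symm_kl(p, q, eps=1e-12):
    # Symmetric Kullback-Leibler divergence; assumed to be the J(P, Q) used
    # earlier in the paper (base-2 logarithms assumed).
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    return float(np.sum(p * np.log2(p / q) + q * np.log2(q / p)))

def reliability(p_before, p_after, cm, prior):
    # Transductive reliability estimate of Eq. 16 for the predicted class Ci.
    i = int(np.argmax(p_before))
    return p_cond(cm, prior, i) * 2.0 ** (-symm_kl(p_before, p_after))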
Multiplication by class conditional probabilities accounts for basic domain characteristics (prevalence of classes) as well as the classifier's performance. This includes class sensitivity and specificity, and it is especially useful in an automatic setting for detecting possible anomalies such as default classifiers (either majority or constant classifiers) that – of course – cannot be trusted. It is easy to see that in this case we have one class with sensitivity 1 and specificity 0, whereas for all other classes we have sensitivity 0 and nonzero specificity. In the first case, the class post-test probability is equal to its prior probability, whereas in the second case it is 0.

3.4 Reliable and Unreliable Classifications

Since the datasets used for training classifiers vary in their representativeness and noise levels, and Machine Learning algorithms vary in the strength and assumptions of their underlying models, it is hard to obtain absolute thresholds for reliable classifications. In our experiments they varied between 0.20 and 0.70 for different domains and Machine Learning algorithms. Therefore it is useful to calibrate our criteria in advance by
utilizing the training dataset. On the training set, internal cross-validation or (better) leave-one-out testing is performed. For each training example a reliability estimation is made, and the predicted as well as the exact class is known. In fact, we now have a new dataset with two possible classes {incorrectly-classified, correctly-classified} and a single numeric attribute {reliability-estimation}. On this meta-problem we perform binary discretization of the reliability estimation attribute by maximizing the information gain of the split [4], with the goal of obtaining subsets that are as pure as possible. The best threshold T for the dataset split is calculated by maximizing Eq. 19:

H(S) = entropy of the set S   (17)
H(S; T) = (|S1|/|S|) H(S1) + (|S2|/|S|) H(S2)   (entropy after the split)   (18)
Gain(S, T) = H(S) − H(S; T)   (19)
The set S1 contains the unreliable examples {x : Rel(x) < T}, whereas the set S2 contains the reliable examples {x : Rel(x) ≥ T}. An experimental result for a dataset split is presented in Fig. 4. Note that internal testing needs to be done only once, during the preparation for transductive reliability estimation. During this calculation we may also conveniently compute the frequencies needed for the model quality estimations (Def. 1).
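The following sketch illustrates the binary discretization of Eqs. 17–19: it searches all candidate thresholds over the reliability estimates obtained by internal testing and returns the split with maximal information gain (the function and variable names are ours, not the authors'):

import numpy as np

def entropy(labels):
    # H(S) for a vector of 0/1 'correctly classified' flags (Eq. 17)
    labels = np.asarray(labels, dtype=int)
    if len(labels) == 0:
        return 0.0
    p = np.bincount(labels, minlength=2) / len(labels)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def best_threshold(rel, correct):
    # Binary discretization: the split S1 = {Rel < T}, S2 = {Rel >= T}
    # maximizing the information gain of Eq. 19.
    rel = np.asarray(rel, dtype=float)
    correct = np.asarray(correct, dtype=int)
    order = np.argsort(rel)
    rel, correct = rel[order], correct[order]
    h_s = entropy(correct)
    best_gain, best_t = -1.0, None
    for k in range(1, len(rel)):
        if rel[k] == rel[k - 1]:
            continue                             # only split between distinct values
        t = 0.5 * (rel[k] + rel[k - 1])
        s1, s2 = correct[:k], correct[k:]
        h_split = (len(s1) * entropy(s1) + len(s2) * entropy(s2)) / len(rel)  # Eq. 18
        gain = h_s - h_split                                                  # Eq. 19
        if gain > best_gain:
            best_gain, best_t = gain, t
    return best_t, best_gain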
Fig. 3. Class conditional probabilities with respect to the ROC curve and the prior probability (P) of the class
Fig. 4. Reliability estimations in domain “Diabetes” using Backpropagation neural networks. To the left of the possible two boundaries are unreliable classifications, to the right are the reliable classifications
4 Experiments

To validate the proposed methodology we performed extensive experiments with 6 different Machine Learning algorithms – the naive and semi-naive Bayesian classifiers [7], a backpropagation neural network [15], K-nearest neighbours, the locally naive Bayesian classifier (a combination of KNN and the naive Bayesian classifier) [8], and Assistant (ID3-like decision trees) [6] – on 14 well-known benchmark datasets (Tab. 1a and 1b). All algorithms were modified to represent their classifications as probability distributions. As a reference method the classifier's assigned probability was used. We performed two comparisons. Firstly, we tested how well the original populations can be split into the subpopulations of correctly and incorrectly classified examples. We applied the Kolmogorov-Smirnov and χ2 statistical tests. In all cases the difference between the two populations was significant with p < 0.05, in most cases even with p < 0.01. So
Table 1. Experimental results with transductive reliability estimation on 14 domains and 6 ML algorithms, obtained with leave-one-out testing

(a) Average results on different domains

Domain          Inf. gain     Inf. gain      Relative      Kolmogorov-   χ2-test
                (Symm. K-L)   (class prob.)  improvement   Smirnov test
Mesh            0.32          0.18           87.97%        < 0.01        < 0.01
Breast cancer   0.14          0.06           142.76%       < 0.01        < 0.01
Nuclear         0.11          0.06           88.48%        < 0.01        < 0.01
Diabetes        0.23          0.09           195.44%       < 0.01        < 0.01
Heart           0.13          0.12           11.45%        < 0.01        < 0.01
Hepatitis       0.15          0.10           52.43%        < 0.01        < 0.01
Iris            0.18          0.15           33.98%        < 0.01        < 0.01
Chess endgame   0.07          0.04           145.28%       < 0.01        < 0.01
LED             0.08          0.06           10.93%        < 0.01        < 0.01
Lymphography    0.13          0.10           30.66%        < 0.01        < 0.01
Primary tumor   0.22          0.13           78.54%        < 0.01        < 0.01
Rheumatology    0.29          0.15           105.28%       < 0.01        < 0.01
Soybean         0.17          0.11           83.05%        < 0.01        < 0.01
Voting          0.11          0.09           20.31%        < 0.01        < 0.01

(b) Average results of different Machine Learning algorithms

ML algorithm          Inf. gain     Inf. gain      Relative      Kolmogorov-   χ2-test
                      (Symm. K-L)   (class prob.)  improvement   Smirnov test
Naive Bayes           0.18          0.11           82.31%        < 0.01        < 0.01
Semi-naive Bayes      0.16          0.10           56.31%        < 0.01        < 0.01
Neural network        0.20          0.08           169.38%       < 0.01        < 0.05
K-nearest neighbour   0.13          0.09           55.19%        < 0.05        < 0.01
KNN + Naive Bayes     0.16          0.12           43.10%        < 0.01        < 0.01
Assistant             0.15          0.11           32.26%        < 0.01        < 0.01
the splitting criterion introduced in Sec. 3.4 really produces statistically significantly different subpopulations. Secondly, we measured the improvement of our methodology over the classifier's assigned probability. For both methods we compared the information gains (Sec. 3.4), which directly correspond to the (im)purity of the split subpopulations. The results are summarized by domains (Tab. 1a) and by Machine Learning algorithms (Tab. 1b). As the results clearly show, the relative improvements were always in favour of transductive reliability estimation. After the split, the subpopulations were much purer than the original one; the information gain (Eq. 19) increased on average by 75%, ranging between 11% and 195%. All improvements were statistically significant using a two-tailed t-test with p < 0.05. We also performed an in-depth comparison of transductive reliability estimations and physicians' reliability estimations on the nuclear dataset (nuclear diagnostics of Coronary Artery Disease), where expert physicians were available for cooperation [10]. Our method increased the number of classifications correctly marked as reliable by 22.5%, while the number of classifications incorrectly marked as reliable remained the same [9]. It is estimated that such results, if applicable in practice, would reduce the costs of the diagnostic process by 10%.
5 Discussion

We propose a new methodology for transductive reliability estimation of classifications within the Machine Learning framework. We provide a theoretical framework for our methodology and an efficient implementation in conjunction with any Machine Learning algorithm that can represent its predictions as probability distributions. We show that in certain extreme cases our basic approach fails, and we provide improvements that account for such anomalous cases. We argue that, especially in risk-sensitive applications, any serious Machine Learning tool should use a similar methodology for the assessment of single-classification reliability. Another use of reliability estimations is in combining answers from different predictors, weighted according to their reliability. Our experiments in benchmark domains show that our approach is significantly better than evaluating the classifier's posterior probabilities. Experimental results of reliability estimation in Coronary Artery Disease diagnostics also show the enormous potential of our methodology. The potential improvements in the diagnostic process are so big that the physicians are seriously considering introducing this approach into everyday diagnostic practice. There are several things that can be done to further develop our approach. Currently we aim to replace the discretization of reliability estimation values for obtaining a threshold. We intend to replace it with proprietary population statistics that would hopefully eliminate the impact of differently representative datasets and of model weaknesses on the resulting quantitative reliability estimation values.
Acknowledgements

We thank Dr. Ciril Grošelj from the Nuclear Medicine Department, University Medical Centre Ljubljana, for his work in collecting the nuclear data and interpreting the results, and the anonymous reviewers for their insightful comments. This work was supported by the Slovenian Ministry of Education, Science and Sports.
References
[1] S. D. Bay and M. J. Pazzani. Characterizing model errors and differences. In Proc. 17th International Conf. on Machine Learning, pages 49–56. Morgan Kaufmann, San Francisco, CA, 2000. 220
[2] A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. In Proceedings of the 11th Annual Conference on Computational Learning Theory, pages 92–100, 1998. 220
[3] G. A. Diamond and J. S. Forester. Analysis of probability as an aid in the clinical diagnosis of coronary artery disease. New England Journal of Medicine, 300:1350, 1979. 219, 225
[4] J. Dougherty, R. Kohavi, and M. Sahami. Supervised and unsupervised discretization of continuous features. In Proc. ICML'95, pages 194–202. Morgan Kaufmann, 1995. 227
[5] A. Gammerman, V. Vovk, and V. Vapnik. Learning by transduction. In Proceedings of the 14th Conference on Uncertainty in Artificial Intelligence, pages 148–155, Madison, Wisconsin, 1998. 221, 222
[6] I. Kononenko, E. Šimec, and M. Robnik-Šikonja. Overcoming the myopia of inductive learning algorithms with ReliefF. Applied Intelligence, 7:39–55, 1997. 228
[7] I. Kononenko. Semi-naive Bayesian classifier. In Y. Kodratoff, editor, Proc. European Working Session on Learning-91, pages 206–219, Porto, Portugal, 1991. Springer-Verlag. 228
[8] M. Kukar. Estimating classifications' reliability. PhD thesis, University of Ljubljana, Faculty of Computer and Information Science, Ljubljana, Slovenia, 2001. In Slovene. 220, 225, 228
[9] M. Kukar. Making reliable diagnoses with machine learning: A case study. In Silvana Quaglini, Pedro Barahona, and Steen Andreassen, editors, Proceedings of Artificial Intelligence in Medicine Europe, AIME 2001, pages 88–96, Cascais, Portugal, 2001. Springer. 221, 229
[10] M. Kukar, I. Kononenko, C. Grošelj, K. Kralj, and J. Fettich. Analysing and improving the diagnosis of ischaemic heart disease with machine learning. Artificial Intelligence in Medicine, 16(1):25–50, 1999. 229
[11] M. Li and P. Vitányi. An introduction to Kolmogorov complexity and its applications. Springer-Verlag, New York, 2nd edition, 1997. 221, 222
[12] K. Nigam, A. K. McCallum, S. Thrun, and T. Mitchell. Text classification from labeled and unlabeled documents using EM. Machine Learning, 39(2/3):103–134, 2000. 220
[13] M. Olona-Cabases. The probability of a correct diagnosis. In J. Candell-Riera and D. Ortega-Alcalde, editors, Nuclear Cardiology in Everyday Practice, pages 348–357. Kluwer, 1994. 225
[14] J. Ortega, M. Koppel, and S. Argamon. Arbitrating among competing classifiers using learned referees. Knowledge and Information Systems Journal, 3:470–490, 2001. 220
[15] D. E. Rumelhart and J. L. McClelland. Parallel Distributed Processing, volume 1: Foundations. MIT Press, Cambridge, 1986. 228
[16] C. Saunders, A. Gammerman, and V. Vovk. Transduction with confidence and credibility. In Proceedings of the International Joint Conference on Artificial Intelligence, Stockholm, Sweden, 1999. 219, 221, 222
[17] A. Seewald and J. Fürnkranz. An evaluation of grading classifiers. In Proc. 4th International Symposium on Advances in Intelligent Data Analysis, pages 115–124, 2001. 220
[18] I. J. Taneja. On generalized information measures and their applications. Adv. Electron. and Elect. Physics, 76:327–416, 1995. 223
[19] K. M. Ting. Decision combination based on the characterisation of predictive accuracy. Intelligent Data Analysis, 1:181–206, 1997. 219
[20] V. Vapnik. Statistical Learning Theory. John Wiley, 1998. 220
[21] V. Vovk, A. Gammerman, and C. Saunders. Machine learning application of algorithmic randomness. In Proceedings of the 16th International Conference on Machine Learning (ICML'99), Bled, Slovenia, 1999. 219, 221, 222
Robustness Analyses of Instance-Based Collaborative Recommendation

Nicholas Kushmerick

Computer Science Department, University College Dublin
[email protected]
Abstract. Collaborative recommendation has emerged as an effective technique for personalized information access. However, there has been relatively little theoretical analysis of the conditions under which the technique is effective. We analyze the robustness of collaborative recommendation: the ability to make recommendations despite (possibly intentional) noisy product ratings. We formalize robustness in machine learning terms, develop two theoretically justified models of robustness, and evaluate the models on real-world data. Our investigation is both practically relevant for enterprises wondering whether collaborative recommendation leaves their marketing operations open to attack, and theoretically interesting for the light it sheds on a comprehensive theory of collaborative recommendation.
1 Introduction
Collaborative recommendation has emerged as an effective personalization technique for a diverse array of electronic commerce and information access scenarios (eg, [10,5]). Such systems keep track of their customers' preferences, and use these data to offer new suggestions. Many variations have been explored, but the basic idea is as follows: to recommend items to a target customer, the system retrieves similar customers, and then recommends items that were liked by the retrieved customers but not yet rated by the target. Collaborative recommendation has been empirically validated for many domains (eg, [3]), and has been successfully deployed in many commercial settings. However, despite some interesting efforts [7,4,2], there is no general theoretical explanation of the conditions under which a particular collaborative recommendation application will succeed or fail. Our goal is to complement existing theoretical work by investigating the robustness of collaborative recommendation. Informally, robustness measures how sensitive the technique is to changes in the customer/product rating matrix. In particular, we analyze the situation in which a malicious agent attacks a recommender system by posing as one or more customers and submitting bogus product ratings. Our analysis is designed to rigorously quantify the extent to which a malicious agent can force the recommender system to give poor recommendations to its "genuine" customers. The theoretical results reported in this paper build on an ongoing empirical investigation of this issue [9,8].
Our primary motivation is to gain deeper insights into the principles underlying effective collaborative recommendation. However, our work is also relevant for a second, more practical reason: recommender systems can represent an insecure back-door into an enterprise's marketing operations. To lock this door, some enterprises impose substantial charges on customers to submit ratings (eg, a bookstore might only accept ratings for books that have been purchased). However, many collaborative recommenders are open Web services that malicious agents could easily attack. How much damage can they inflict? We make three contributions. First, we formalize robustness in machine learning terms, and introduce a novel form of class noise that models an interesting suite of attacks (Sec. 2). Second, we develop two models that predict the change in accuracy as a function of the number of fake ratings that have been inserted into the customer/product matrix (Secs. 3–4). Third, we empirically evaluate our predictions against real-world collaborative recommendation data (Sec. 5).
2 Definitions
Our analysis of collaborative recommendation assumes the standard k-NN learning algorithm. Other (eg, model-based) approaches have been tried, but k-NN is accurate, widely used and easily analyzed. In this approach, each customer is represented as a vector of product ratings (many of which will be empty for any particular customer). Unlike traditional machine learning settings, the "class" is not a distinguished attribute, but corresponds to the product that the system is contemplating for recommendation. We model an attack as the addition of noise to the training data. In general, this noise could be associated with either the attributes, the class or both. We focus exclusively on class noise, and defer attribute noise to future work. We are not concerned with malicious noise as defined by [6], because we can safely assume that the attacking agent is not omniscient (eg, the agents cannot directly inspect any ratings except their own). We model attacks with a relatively benign noise model that we call biased class noise. This model is characterized by the following parameters: the noise rate β, and the class bias µ. Noise is added according to the following process. First, an instance is generated according to the underlying distribution. With probability 1 − β, the instance is noise-free and labeled by the target concept. Otherwise, with probability βµ the instance is labeled 1 and with probability β(1 − µ) the instance is labeled 0. The biased class noise model is useful because it can represent a variety of stereotypical attacks. For example, a book's author could try to force recommendations of his book by pretending to be numerous customers who all happen to like the book. We call this a "push" attack and it corresponds to µ = 1. Alternatively, the author's arch-enemy could insert fake customer profiles that all dislike the book; this "nuke" attack is modeled with µ = 0. We are interested in robustness: the ability of the recommender to make good recommendations in spite of an attack. There are two aspects to robustness.
First, we may be concerned with accuracy: are the products recommended after the attack actually liked? The second issue is stability: does the system recommend different products after the attack (regardless of whether customers like them)? While stability and accuracy are distinct, they are not orthogonal. For example, if a recommender has perfect accuracy for a given task both with and without noise, then it must be perfectly stable. On the other hand, consider a product that no-one likes. The recommendation policies "recommend to no-one" and "recommend to everyone" are both perfectly stable, yet the first is always correct and the second is always wrong. Our analysis of robustness focuses exclusively on accuracy. Before proceeding, we introduce some additional notation. We assume a d-dimensional instance space X^d. Without loss of generality, we assume X = [0, 1]. In the context of recommendation, each dimension corresponds to one of d products, and the value on the dimension is a numeric rating. The nearest neighbor approach requires a function dist(·, ·) defined over X^d × X^d, but our analysis does not depend on any particular distance metric. Finally, let H be a hypothesis, C a concept, and D a probability distribution over X^d. The error rate of H with respect to C and D is defined as E(H, C, D) = Pr_{x∼D}(H(x) ≠ C(x)).
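A minimal sketch of the biased class noise process (and thus of "push" and "nuke" attacks) might look as follows; the function name and the use of NumPy are assumptions made for illustration:

import numpy as np

def biased_class_noise(y_true, beta, mu, seed=None):
    # With probability 1 - beta an instance keeps its true label; otherwise it is
    # labelled 1 with probability mu and 0 with probability 1 - mu.
    # mu = 1 corresponds to a 'push' attack, mu = 0 to a 'nuke' attack.
    rng = np.random.default_rng(seed)
    y = np.array(y_true, copy=True)
    noisy = rng.random(len(y)) < beta
    y[noisy] = (rng.random(int(noisy.sum())) < mu).astype(y.dtype)
    return y

# Example: a push attack corrupting roughly 30% of the labels.
y = np.random.default_rng(0).integers(0, 2, size=20)
print(y)
print(biased_class_noise(y, beta=0.3, mu=1.0, seed=0))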
3 Absolute Accuracy
The first model extends Albert and Aha's noise-free PAC results for k-NN [1] to handle biased class noise. We first review these noise-free results, and then state our model as Theorem 1. The key idea behind Albert and Aha's (hereafter: AA) analysis is that of a "sufficiently dense" sample from the instance space. Informally, a subset S ⊂ X^d is dense if most of the points in the entire space X^d are near many points in the sample S. The terms "most", "near" and "many" are formalized as follows. Let D be a distribution over X^d. A subset S ⊆ X^d is (k, α, γ)-dense if, except for a subset with probability less than γ under D, for every x ∈ X^d there exist at least k distinct points x1, ..., xk ∈ S such that dist(x, xi) ≤ α for each i. Given this definition, AA derive [1, Lemma 2.2] a lower bound Υd(k, α, γ, |S|) on the probability that a sample S of X^d is (k, α, γ)-dense:

Υd(k, α, γ, |S|) = 1 − m^d ∑_{0 ≤ k′ < k} Φ2(ρ, k′, |S|),  where  Φ2(ρ, s, t) = \binom{t}{s} B(max{s/t, ρ}, s, t),

B(p, s, t) = p^s (1 − p)^{t−s}, m = ⌈√d/α⌉ and ρ = γ/m^d. AA consider concepts C in a class CL whose boundaries have perimeter at most L, and distributions D in a class DB that assign probability at most B to any instance.
These constraints mean that AA's results hold only for a restricted class of learning tasks (due to the parameter L), and are not truly distribution-free (due to B). However, we will see below that in fact the model's predictions for real-world tasks are not very sensitive to the values of L and B. Let C ∈ CL be a concept to be learned, and D ∈ DB be a distribution over X^d. AA's central result [1, Theorem 3.2] is that the probability that the error of k-NN exceeds ε when trained on a sample S is at most 1 − Υd(k, ε/4LB, ε/2, |S|).¹

¹ We have departed from AA in several ways. First, AA use the notation (k, α, γ)-net; we refer to "denseness" because our biased class noise analysis involves an analogous notion of sparseness. Second, our proof is somewhat different and therefore our bound on the denseness probability differs slightly from AA's. Most importantly, as is standard in PAC analysis, AA introduce an additional confidence parameter δ and solve Υd(k, ε/4LB, ε/2, |S|) > 1 − δ for |S|, in order to show that k-NN can PAC-learn efficiently. Since robustness is orthogonal to efficiency, we ignore this part of their analysis.

The intuition behind AA's results is that k-NN is accurate when trained on a sufficient number of instances. We can extend this intuition to handle biased class noise by requiring that, in addition to the sample containing enough "good" (noise-free) instances, it must also not contain too many "bad" (noisy) instances. We begin by defining sparse subsets analogously to the definition of dense subsets. Informally, a subset S ⊂ X^d is sparse if most of the points in the entire space X^d are near few points in the sample S. More precisely, a subset S ⊆ X^d is (k, α, γ)-sparse if, except for a subset with probability less than γ, for every x there exist at most k points xi such that dist(x, xi) ≤ α. Appendix A proves the following lower bound Υs(k, α, γ, |S|) on the probability that a sample S of X^d is (k, α, γ)-sparse:

Υs(k, α, γ, |S|) = 1 − m^d ∑_{k′,k″ ∈ K} Φ3(ρ, k′, k″, |S|),

where m and ρ are as defined above, Φ3(ρ, r, s, t) = \binom{t}{r} \binom{t−r}{s} T(max{r/t, ρ}, r, s/t, s, t), T(p, r, q, s, t) = p^r q^s (1 − p − q)^{t−r−s}, and K = {k′, k″ | 0 ≤ k′, k″ ≤ |S| ∧ k < k′ + k″ ≤ |S|}.

To complete our analysis, we observe that the accuracy of k-NN with training data S is equal to the probability that S contains enough "good" instances and not too many "bad" instances. For k-NN, "enough" and "not too many" mean that most of the instances should have at least ⌈k/2⌉ good neighbors and at most ⌊k/2⌋ bad neighbors. Let Sreal ⊂ S be the examples corresponding to genuine users, and Sattack = S \ Sreal be the examples comprising the attack. Of course, we cannot know Sreal and Sattack exactly, but we do know that |Sreal| = (1 − β)|S| and |Sattack| = β|S| (where β is the size of the attack), which is sufficient for our analysis. Furthermore, some of the noisy instances may in fact be correctly labelled. Let f be the fraction of the instance space labelled 1, and let µ be the class noise bias. Then a fraction fµ + (1 − f)(1 − µ) of the noisy instances are in fact correctly labelled. Let Sgood ⊇ Sreal be the instances that are actually labelled
correctly, and Sbad = S \ Sgood ⊆ Sattack be the instances that are actually labelled incorrectly. Again, we cannot know Sgood or Sbad, but we do know that the number of noise-free instances is |Sgood| = (1 − β)|S| + β|S|(fµ + (1 − f)(1 − µ)) ≥ |Sreal|, and the number of noisy instances is |Sbad| = β|S|(1 − fµ − (1 − f)(1 − µ)) ≤ |Sattack|. If λ = β(µ + f − 2µf) is the effective attack size, then |Sgood| = (1 − λ)|S| and |Sbad| = λ|S|. We require both that Sgood be (⌈k/2⌉, α, γ)-dense and that Sbad be (⌊k/2⌋, α, γ)-sparse. Since these events are independent, the probability of their conjunction is the product of their probabilities. Therefore, if we can determine appropriate values for the distance thresholds α1 and α2 and the probability thresholds γ1 and γ2, then the accuracy of k-NN when trained on S with biased class noise is at least Υd(⌈k/2⌉, α1, γ1, |Sgood|) · Υs(⌊k/2⌋, α2, γ2, |Sbad|). In Appendix A we prove the following theorem.

Theorem 1 (Absolute accuracy). The following holds for any ε, β, µ, d, k, L, B, C ∈ CL and D ∈ DB. Let S be a sample of X^d according to D with biased class noise rate β, and let H = k-NN(S). Then we have that

Pr[E(H, C, D) < ε] ≥ Υd(⌈k/2⌉, ε/4LB, ε/4, (1 − λ)|S|) · Υs(⌊k/2⌋, ε/4LB, ε/4, λ|S|),

where λ = β(µ + f − 2µf), and f is the fraction of X^d labeled 1 by C.

To summarize, Theorem 1 yields a worst-case lower bound on the accuracy of k-NN under biased class noise. On the positive side, this bound is "absolute" in the sense that it predicts (a probabilistic bound on) the actual error E(k-NN(S), C, D) as a function of the sample size |S|, the noise rate β, and the other parameters. In other words, the term "absolute" draws attention to the fact that this model takes account of the actual position along the learning curve. Unfortunately, like most PAC analyses, its bound is very weak (though still useful in practice; see Sec. 5).
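To make the structure of Theorem 1 concrete, the following sketch evaluates the bound numerically for toy parameters. It is not part of the original paper: it assumes the reconstructed forms of Υd, Υs, Φ2 and Φ3 given in Appendix A and rounds (1 − λ)|S| and λ|S| to integers.

from math import ceil, comb, sqrt

def B(p, s, t):
    # B(p, s, t) = p^s (1 - p)^(t - s), as in Lemma 1
    return p ** s * (1 - p) ** (t - s)

def T(p, r, q, s, t):
    # T(p, r, q, s, t) = p^r q^s (1 - p - q)^(t - r - s), clamped at 0 for safety
    return p ** r * q ** s * max(0.0, 1 - p - q) ** (t - r - s)

def phi2(rho, s, t):
    return comb(t, s) * B(max(s / t, rho), s, t)

def phi3(rho, r, s, t):
    return comb(t, r) * comb(t - r, s) * T(max(r / t, rho), r, s / t, s, t)

def upsilon_dense(k, alpha, gamma, n, d):
    m = ceil(sqrt(d) / alpha)
    rho = gamma / m ** d
    return 1 - m ** d * sum(phi2(rho, kp, n) for kp in range(k))

def upsilon_sparse(k, alpha, gamma, n, d):
    m = ceil(sqrt(d) / alpha)
    rho = gamma / m ** d
    total = sum(phi3(rho, k1, k2, n)
                for k1 in range(n + 1) for k2 in range(n + 1)
                if k < k1 + k2 <= n)
    return 1 - m ** d * total

def theorem1_bound(eps, beta, mu, f, d, k, L, B_bound, n):
    lam = beta * (mu + f - 2 * mu * f)            # effective attack size
    alpha, gamma = eps / (4 * L * B_bound), eps / 4
    n_good, n_bad = round((1 - lam) * n), round(lam * n)
    return (upsilon_dense(ceil(k / 2), alpha, gamma, n_good, d) *
            upsilon_sparse(k // 2, alpha, gamma, n_bad, d))

# Toy parameters only; the bound is typically extremely loose, which is why
# the paper reports it only in ratio form (Sec. 5).
print(theorem1_bound(eps=0.5, beta=0.2, mu=1.0, f=0.5, d=2, k=10, L=1.0, B_bound=1.0, n=200))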
4 Approximate Relative Accuracy
In contrast, the second model does not rely on a worst-case analysis and so makes tighter predictions than the first model. On the other hand, the model is only "approximate" because it makes two assumptions. First, it assumes that the training sample is large enough that the learning curve has "flattened out". Second, it assumes that, at this flat part of the learning curve, k-NN achieves perfect accuracy except possibly on the boundary of the target concept. We call this second model "approximate" to draw attention to these assumptions, and "relative" to note specifically that it does not predict error on an absolute scale. To formalize these assumptions, let S be a training sample drawn from the distribution D over X^d, and let C be the target concept. Let S′ be the fraction 1 − β of the instances in S that were (correctly) labeled by C during the biased class noise process. Let D′ be the distribution that is proportional to D except
that D′[x] = 0 for all points x on the boundary between C and X^d \ C. The assumptions of the second model can be expressed as:

E(k-NN(S′), C, D′) = 0   (1)
Given this assumption, we can predict the error of k-NN as follows. To classify an instance x using a training set S, k-NN predicts the majority class of the k instances x1, ..., xk ∈ S that are closest to x. To classify x correctly, at least ⌈k/2⌉ of these k instances must have the correct class. If we randomly draw from D a point x ∈ X^d, there are two cases: either C(x) = 1 (which happens with probability f), or C(x) = 0 (which happens with probability 1 − f), where as above f is the probability under D that C(x) = 1. In the first case, we need at least ⌈k/2⌉ successes out of k trials in a Bernoulli process where the probability of success is equal to the probability that a neighbor xi of x will be labeled 1. We can calculate this probability as (1 − β) + βµ. The first term is the probability that xi is labeled 1 and xi ∈ S′; by (1), we know that this probability is 1 − β. The second term is the probability that xi is labeled 1 and xi ∉ S′; by definition of the biased class noise process we know that this probability is βµ. In the second case, again we need at least ⌈k/2⌉ successes, but with success probability (1 − β) + β(1 − µ), the probability that a neighbor xi of x will be labeled 0. The first term is the probability that xi is labeled 0 and xi ∈ S′, and by (1) this happens with probability 1 − β. The second term is the probability that xi is labeled 0 and xi ∉ S′, which occurs with probability β(1 − µ). The following theorem follows from this discussion.

Theorem 2 (Approximate relative accuracy). The following holds for any β, µ, d, k, C and D. Let S be a sample of X^d according to D with biased class noise rate β. Let S′ and D′ be as defined above. If assumption (1) holds, then

E(k-NN(S), C, D′) = 1 − f · ∑_{k′=⌈k/2⌉}^{k} \binom{k}{k′} B(1 − β(1 − µ), k′, k) − (1 − f) · ∑_{k′=⌈k/2⌉}^{k} \binom{k}{k′} B(1 − βµ, k′, k),

where f is the fraction of X^d labeled 1 by C.

Without more information, we cannot conclude anything about E(k-NN(S), C, D) (which is what one can measure empirically) from E(k-NN(S), C, D′) (the model's prediction) or from E(k-NN(S′), C, D′) = 0 (the assumption underlying the model). For example, if D just so happens to assign zero probability to points on C's boundary, then E(k-NN(S′), C, D) = E(k-NN(S′), C, D′) and so in the best case E(k-NN(S), C, D) = 0. On the other hand, if all of D's mass is on C's boundary then in the worst case E(k-NN(S), C, D) = 1. Furthermore, it is generally impossible to know whether E(k-NN(S′), C, D′) = 0. Despite these difficulties, we will evaluate the model on real-world data by simply assuming E(k-NN(S′), C, D′) = 0 and D′ = D, and comparing the predicted and observed error.
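Theorem 2 reduces to two binomial tail probabilities, so its prediction is easy to compute; the sketch below is ours and assumes only the formula as stated:

from math import ceil, comb

def predicted_error(beta, mu, f, k):
    # Theorem 2: predicted error of k-NN under biased class noise with rate beta
    # and bias mu, where f is the fraction of the instance space labelled 1.
    def at_least_half(p):
        # probability of >= ceil(k/2) correctly-labelled neighbours (Bernoulli(p))
        return sum(comb(k, j) * p ** j * (1 - p) ** (k - j)
                   for j in range(ceil(k / 2), k + 1))
    return 1 - f * at_least_half(1 - beta * (1 - mu)) - (1 - f) * at_least_half(1 - beta * mu)

# Example: a 'push' attack (mu = 1) of size beta = 0.3 with k = 10 and balanced classes.
print(predicted_error(beta=0.3, mu=1.0, f=0.5, k=10))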
Fig. 1. MUSHROOM: Empirical (left) and predicted (right) absolute accuracy (curves for µ = 1, 0, 0.5 at k = 10, plotted against noise rate β)
Fig. 2. PTV: Empirical (left) and predicted (right) absolute accuracy (curves for µ = 1 and µ = 0 at k = 10, plotted against noise rate β)
5 Evaluation
We evaluated the two models against two real-world learning tasks:
– The MUSHROOM data-set from the UCI repository contains 8124 instances with 23 attributes, with no missing values.
– The PTV collaborative recommendation data for television listings [www.ptv.com] contains 2344 instances (people) and 8199 attributes (television programs), and only 0.3% of the matrix entries are non-null. We discarded people who rated fewer than 0.05% of the programs, and programs rated by fewer than 0.05% of the people. The resulting 241 people and 570 programs had a sparseness of 15.5%. The original ratings (values from 1–4) were converted into binary attributes ('like'/'dislike').
We used the standard k-NN algorithm with no attribute or vote weighting. Distance was measured using the Euclidean metric (ignoring non-null attributes). All experiments use k = 10 neighbors. Our experiments use a variation on the standard cross-validation approach. We repeat the following process many times. First, we randomly partition the entire set of instances into a real set R, a fake set F, and a testing set T. To implement the biased class noise model, a noisy set N containing β|R|/(1 − β)
instances is then randomly drawn from F. The class attributes of the instances in N are then modified to 1 with probability µ and 0 with probability 1 − µ. The k-NN learning algorithm is then trained on R ∪ N (so the noise rate is |N|/|R ∪ N| = β). We measure accuracy as the fraction of correct predictions for the test instances in T. For MUSHROOM, noise is added only to the class attribute defined by the data-set's authors. For PTV the class attribute (i.e., the program to attack) is selected randomly.

Fig. 3. MUSHROOM: Empirical (left) and predicted (right) relative accuracy
Absolute accuracy model. The absolute accuracy model predicts Pr[E(k-NN(Sβ), C, D) < ε], the probability that the accuracy exceeds 1 − ε, where Sβ is a sample with biased class noise rate β. Let the model's predicted absolute accuracy from Theorem 1 be Aabs(β). Our empirical estimate Âabs(β) of this probability is simply the fraction of trials for which ε exceeds the error. Recall that Theorem 1 requires a bound L on the perimeter of the target concept, and a bound B on the probability of any instance under the distribution D. Thus our model is not completely general, and furthermore it is difficult to estimate these parameters for a given learning task. However, it is easily shown that for small values of k, Aabs(β) does not depend on L and B, and thus we do not need to tune these parameters of our model for each learning task. Due to the worst-case analysis, Aabs(β) typically takes clearly absurd values. However, for the purposes of analysing robustness, such values are useful, because we are interested in the increase in error at noise rate β compared to β = 0. We therefore report results using the ratios (L − Aabs(β))/(L − Aabs(0)) and (L − Âabs(β))/(L − Âabs(0)), where L = Aabs(1) is a constant chosen to scale the ratios to [0, 1]. The results for MUSHROOM with ε = 0.25 are shown in Fig. 1. The predicted and observed accuracies agree reasonably well, even accounting for the fact that
Fig. 4. PTV: Empirical (left) and predicted (right) relative accuracy
the data have been scaled to [0, 1]. The fit is by no means perfect, but we are satisfied with these results, since worst-case PAC-like analyses are usually so weak as to be incomparable to real data. Fig. 2 shows the results for PTV with ε = 0.3. Here the fit is worse: PTV appears to be much more robust in practice than predicted, particularly as β increases. We conjecture that this is due to the fact that the PTV data is highly noisy, but further analysis is needed to explain these data.

Relative accuracy model. The relative accuracy model predicts E(k-NN(Sβ), C, D), the error of k-NN when trained on a sample Sβ with biased class noise rate β. Let Arel(β) be the model's prediction from Theorem 2. Our empirical estimate Ârel(β) of this probability is simply the fraction of incorrectly classified test instances. As before, we scale all data to [0, 1]. The results for MUSHROOM are shown in Fig. 3 and the PTV results are shown in Fig. 4. The model fits the observed data quite well in both domains, though as before PTV appears to be inherently noisier than MUSHROOM.
6 Related Work
Collaborative recommendation has been empirically validated in numerous standard "customer/product" scenarios [3]. However, there is relatively little theoretical understanding of the conditions under which the technique is effective. Our work is highly motivated by ongoing empirical investigations of the robustness of collaborative filtering [9,8]. The ideas underlying Theorem 1 borrow heavily from Albert and Aha's seminal PAC analysis of noise-free k-NN [1]. There has been substantial theoretical algorithmic work on collaborative filtering [7,2,4]. For example, Azar et al. [2] provide a unified treatment of several information retrieval problems, including collaborative filtering, latent semantic analysis and link-based methods such as hubs/authorities. They cast these problems as matrix reconstruction: given a matrix of objects and their attributes (eg, for collaborative filtering, the objects are products, the attributes are customers, and matrix entries store customers' ratings) from which some entries have been
deleted, the task is to reconstruct the missing entries (eg, predict whether a particular customer will like a specific product). Azar et al. prove that the matrix entries can be efficiently recovered as long as the original data has a good low-rank approximation. The fundamental difference between all of these results and ours is that our biased class noise model is more malicious than simple random deletion of the matrix entries. It remains an open question whether these results can be extended to accommodate this model.
7 Discussion
Collaborative recommendation has been demonstrated empirically, and widely adopted commercially. Unfortunately, we do not yet have a general predictive theory for when and why collaborative filtering is effective. We have investigated one particular facet of such a theory: an analysis of robustness, a measure of a recommender system's resilience to potentially malicious perturbations in the customer/product rating matrix. This investigation is both practically relevant for enterprises wondering whether collaborative filtering leaves their marketing operations open to attack, and theoretically interesting for the light it sheds on a comprehensive theory of collaborative filtering. We developed and evaluated two models for predicting the degradation in predictive accuracy as a function of the size of the attack and other parameters. The first model uses PAC-theoretic techniques to predict a bound on accuracy. This model is "absolute" in that it takes account of the exact position of the system along the learning curve, but as a worst-case model it is problematic to evaluate its predictions. In contrast, the second model makes tighter predictions, but is "relative" in the sense that it assumes perfect prediction in the absence of the malicious attack. Our preliminary evaluation of the model against two real-world data-sets demonstrates that our model fits the observed data reasonably well.
Acknowledgments I thank M. O’Mahony, N. Hurley, G. Silvestre and M. Keane for helpful discussion, B. Smyth for the PTV data, and the Weka developers. This research was funded by grant N00014-00-1-0021 from the US Office of Naval Research, and grant SFI/01/F.1/C015 from Science Foundation Ireland.
References
1. M. Albert and D. Aha. Analyses of instance-based learning algorithms. In Proc. 9th Nat. Conf. Artificial Intelligence, 1991. 234, 235, 240, 243
2. Y. Azar, A. Fiat, A. Karlin, F. McSherry, and J. Saia. Spectral analysis of data. In Proc. 32nd ACM Symp. Theory of Computing, 2001. 232, 240
3. J. Breese, D. Heckerman, and C. Kadie. Empirical analysis of predictive algorithms for collaborative filtering. In Conf. Uncertainty in Artificial Intelligence, 1998. 232, 240
4. P. Drineas, I. Kerenidis, and P. Raghavan. Competitive recommender systems. In Proc. 32nd ACM Symp. Theory of Computing, 2002. 232, 240
5. D. Goldberg, D. Nichols, B. Oki, and D. Terry. Using collaborative filtering to weave an information tapestry. C. ACM, 35(12):61–70, 1992. 232
6. M. Kearns and M. Li. Learning in the presence of malicious errors. In Proc. ACM Symp. Theory of Computing, 1988. 233
7. R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tompkins. Recommender systems: A probabilistic analysis. In Proc. 39th IEEE Symp. Foundations of Computer Science, 1998. 232, 240
8. M. O'Mahony, N. Hurley, N. Kushmerick, and G. Silvestre. Collaborative recommendation: A robustness analysis. Submitted for publication, 2002. 232, 240
9. M. O'Mahony, N. Hurley, and G. Silvestre. Promoting recommendations: An attack on collaborative filtering. In Proc. Int. Conf. on Database and Expert System Applications, 2002. 232, 240
10. U. Shardanand and P. Maes. Social information filtering: Algorithms for automating "word of mouth". In Proc. Conf. Human Factors in Computing Systems, 1994. 232
A Proof of Theorem 1
The following two lemmas are easily proven.

Lemma 1. Consider a binomial process where outcome o1 has probability at least ρ and outcome o2 consumes the remaining probability mass. The probability of exactly s o1-outcomes in t trials is at most Φ2(ρ, s, t) = \binom{t}{s} B(max{s/t, ρ}, s, t), where B(p, s, t) = p^s (1 − p)^{t−s}.

Lemma 2. Consider a trinomial process where outcome o1 has probability at least ρ and outcomes o2 and o3 consume the remaining probability mass. The probability of exactly r o1-outcomes and s o2-outcomes in t trials is at most Φ3(ρ, r, s, t) = \binom{t}{r} \binom{t−r}{s} T(max{r/t, ρ}, r, s/t, s, t), where T(p, r, q, s, t) = p^r q^s (1 − p − q)^{t−r−s}.

The following lemma bounds the probability that a subset is sparse or dense.

Lemma 3. The following holds for any d, α, γ, k, and distribution D. Let m = ⌈√d/α⌉ and ρ = γ/m^d. The probability that a sample S of X^d drawn according to D is (k, α, γ)-dense is at least

Υd(k, α, γ, |S|) = 1 − m^d ∑_{0 ≤ k′ < k} Φ2(ρ, k′, |S|),   (2)

and the probability that S is (k, α, γ)-sparse is at least

Υs(k, α, γ, |S|) = 1 − m^d ∑_{k′,k″ ∈ K} Φ3(ρ, k′, k″, |S|),   (3)

where K = {k′, k″ | 0 ≤ k′, k″ ≤ |S| ∧ k < k′ + k″ ≤ |S|}.
Proof. First consider (2). Partition X^d into m^d squares, where m is chosen large enough so that any two points in one square are at most distance α apart. By the Pythagorean Theorem, we require that m ≥ √d/α, so choose m = ⌈√d/α⌉. Let F be the set of frequent squares: those with probability at least ρ = γ/m^d. Since there are at most m^d squares not in F, the total probability of the non-frequent squares is at most m^d ρ = m^d · γ/m^d = γ. If at least k sample points lie in each of the frequent squares, then the sample will be sufficiently dense for the points in the frequent squares. The probability of selecting a point in some particular frequent square is at least ρ, and the probability of not selecting a point in this square is at most 1 − ρ. By Lemma 1, the probability of selecting exactly k′ out of |S| points in some particular frequent square is at most Φ2(ρ, k′, |S|). Therefore the probability of selecting fewer than k points in some particular frequent square is at most ∑_{0 ≤ k′ < k} Φ2(ρ, k′, |S|).
Therefore, assuming that Sgood is (⌈k/2⌉, α, γ)-dense and Sbad is (⌊k/2⌋, α, γ)-sparse, we have that E(k-NN(S), C, D) < 2γ + 2αLB. The first term counts instances whose noisy neighbours outvoted noise-free neighbours, and the second term counts mistakes that might occur on the boundary of C. To complete the proof, we must ensure that 2γ + 2αLB < ε. Since we seek a lower bound on the probability that E(k-NN(S), C, D) < ε, we can simply split the total permissible error equally between the two causes: 2γ = ε/2 and 2αLB = ε/2, or γ = ε/4 and α = ε/4LB.
iBoost: Boosting Using an Instance-Based Exponential Weighting Scheme

Stephen Kwek and Chau Nguyen

Computational Learning Group, Department of Computer Science, University of Texas at San Antonio, San Antonio, TX 78249
{kwek,cnguyen}@cs.utsa.edu
Abstract. Recently, Freund, Mansour and Schapire established that using an exponential weighting scheme in combining classifiers reduces the problem of overfitting. Also, Helmbold, Kwek and Pitt showed that, in the framework of prediction using a pool of experts, an instance-based weighting scheme improves performance. Motivated by these results, we propose here an instance-based exponential weighting scheme in which the weights of the base classifiers are adjusted according to the test instance x. Here, a competency classifier ci is constructed for each base classifier hi to predict whether the base classifier's guess of x's label can be trusted, and the weight of hi is adjusted accordingly. We show that this instance-based exponential weighting scheme enhances the performance of AdaBoost.
1 Introduction
Recent research in classification problems has placed an emphasis on ensemble methods that construct a set of base classifiers instead of a single classifier. An unlabeled instance is then classified by taking a vote of the base classifiers' predictions of its class label. Ensemble methods like Bagging [1] and AdaBoost [5] have been shown to outperform the individual base classifiers when the base inducer that produces the base classifiers is unstable. An inducer is said to be unstable if a slight change in the training examples results in a very different classifier being constructed. Further, the idea of ensemble methods also gives rise to the use of the error-correcting output code technique for enhancing accuracy in multi-class classification problems. In ensemble methods, the vote of each base classifier either receives the same weight (e.g. Bagging and Arcing) or is weighted according to the estimated error rate (e.g. AdaBoost). Based on recent work of Helmbold, Pitt and the first author [6], we propose an instance-based approach to assigning weights to the base classifiers. Instead of assigning a weight to a base classifier that is fixed for all instances, we attempt to assign a weight that is based upon how well we think the base classifier is going to predict the label of the test instance. Intuitively, given an unlabeled instance x, if there is some indication that the base classifier's
prediction is correct, then we should increase its weight. Otherwise, we should reduce its weight. This paper proposes a new version of the boosting algorithm, iBoost, that uses such an instance-based weighting scheme. We demonstrate that iBoost enhances the performance of AdaBoost. Another difference between iBoost and AdaBoost is that iBoost adopts the exponential weighting scheme, as the work of Helmbold et al. [6] is based on the weighted majority framework (see discussion in Section 3.1). We believe that the technique employed here can be applied to ensemble methods in general.
2 The AdaBoost Algorithm
Before elaborating further, we shall define some notation that we use in this paper. A labeled instance is a pair ⟨x, y⟩ where x is an element from some instance space X and y comes from a set Y of nominal values. We assume a probability distribution D over the space of labeled instances. A sample S is a set of labeled instances S = {⟨x1, y1⟩, ⟨x2, y2⟩, ..., ⟨xm, ym⟩} drawn independently and identically from the probability distribution D. A classifier or hypothesis is a mapping from X to Y. An inducer or learner takes a sample S as training data and constructs a classifier. In ensemble methods, multiple base classifiers are created by calling a base inducer over different training examples. Boosting was originally introduced by Schapire to address the question, posed by Kearns and Valiant [8], of whether a weak PAC (Probably Approximately Correct [17]) learner (which outputs a hypothesis that is slightly better than a random guess) can be turned into a strong PAC learner of arbitrary accuracy¹. Schapire [14] came up with the first provable polynomial time algorithm to 'boost' a weak learner into a strong learner. A year later, Freund [3] presented a much more efficient boosting algorithm. Unfortunately, the strong theoretical results assume the availability of a (polynomially) large training sample (depending on the desired accuracy and confidence). However, in most practical applications, the size of the training sample is very limited, which severely curbs the usefulness of both algorithms. In 1995, Freund and Schapire [5] attempted to resolve this difficulty by introducing AdaBoost (Adaptive Boosting), as shown in Figure 1. The input to the algorithm is a set of labeled examples S = {(x1, y1), ..., (xm, ym)}. As in Bagging, AdaBoost calls a base inducer repeatedly in a sequence of iterations t = 1, ..., T to produce a set of weak base classifiers {h1, ..., hT}. However, unlike Bagging, where the training sets are drawn uniformly in each iteration, AdaBoost maintains a distribution or a set of weights over the training set. The weight of this distribution on training example i on the t-th iteration is denoted by Dt(i). Initially, all the weights are set equally to 1/m. On each iteration, the weights of the incorrectly classified examples are increased while the weights of the correctly classified examples are decreased. The base inducer's task is to construct a base classifier that minimizes the error εt = Pr_{i∼Dt}[ht(xi) ≠ yi]. The effect of the
¹ Assuming a 'polynomial' number of labeled examples is available and the target concept to be learned is in the hypothesis class.
Given: S = ⟨(x1, y1), ..., (xm, ym)⟩ where xi ∈ X, yi ∈ Y = {−1, +1}, and the number of iterations T
Initialize D1(i) = 1/m
For t = 1, ..., T:
    Train the base learner using distribution Dt
    Get base classifier ht : X → R
    Compute εt = Pr_{i∼Dt}[ht(xi) ≠ yi]
    Update:
        Dt+1(i) = Dt(i) (εt/(1 − εt))^{yi ht(xi)} / Zt
    where Zt is a normalization factor that ensures Dt+1 is a probability distribution.
Output the final classifier:
    H(x) = sign( ∑_{t=1}^{T} αt ht(x) ),  where  αt = (1/2) ln((1 − εt)/εt).

Fig. 1. The boosting algorithm AdaBoost for the binary-class problem.
reweighting is that in the next iteration the base inducer is forced to concentrate more on the difficult examples. For each labeled example ⟨xi, yi⟩ that is incorrectly labeled by ht, we increase its weight Dt+1(i) by a factor of (1 − εt)/εt (relative to Dt(i)). Those that are correctly labeled have their weights reduced by a factor of εt/(1 − εt). The final or combined classifier H is a weighted majority vote of the predictions made by the T base classifiers, where the prediction of each ht is assigned a weight of αt = (1/2) ln((1 − εt)/εt).
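For readers who prefer executable code to pseudocode, a minimal sketch of Fig. 1 might look as follows; the use of scikit-learn decision stumps as the base inducer is an assumption, not part of the original algorithm:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=50):
    # Minimal AdaBoost as in Fig. 1; y must take values in {-1, +1}.
    y = np.asarray(y)
    m = len(y)
    D = np.full(m, 1.0 / m)                        # D_1(i) = 1/m
    classifiers, alphas = [], []
    for _ in range(T):
        h = DecisionTreeClassifier(max_depth=1)    # decision stump (our choice)
        h.fit(X, y, sample_weight=D)
        pred = h.predict(X)
        eps = D[pred != y].sum()                   # weighted training error eps_t
        if eps <= 0.0 or eps >= 0.5:               # degenerate base classifier
            break
        alpha = 0.5 * np.log((1 - eps) / eps)
        # increase weights of misclassified examples, decrease the rest, renormalize
        D = D * ((1 - eps) / eps) ** (-(y * pred))
        D = D / D.sum()
        classifiers.append(h)
        alphas.append(alpha)
    return classifiers, alphas

def adaboost_predict(classifiers, alphas, X):
    votes = sum(a * h.predict(X) for h, a in zip(classifiers, alphas))
    return np.sign(votes)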
3 iBoost

3.1 Inspiration
The inspiration behind this work arises from earlier computational learning theory (COLT) research on learning from a pool of experts. In [11], Littlestone and Warmuth studied the problem of making on-line predictions using a pool of experts. In their model, the learner faces a (possibly infinite) sequence of trials, with a boolean prediction to be made in each trial. The goal of the learner is to make a minimum number of mistakes. The learner is allowed to make its prediction by observing how a given pool of experts predict. The
underlying assumption is that at least one of these experts will perform well, but the learner does not know which one. They propose the weighted majority algorithm, which works as follows. A weight is associated with each expert, and is initially set to one. The learner predicts negative if the sum of the weights of all the experts that predict negative is greater than that of the experts that predict positive; otherwise the learner predicts positive. When a mistake is made, the learner simply multiplies the weights of those experts that predicted wrongly by some fixed non-negative constant β smaller than one. They showed that if the best expert makes at most η mistakes, then on any sequence of possibly infinite trials, the weighted majority algorithm makes at most O(log |E| + η) mistakes. Various alternative weighting schemes have also been proposed. In the weighted majority algorithm and its variants, the weight of an expert is β^m, where m is the number of errors made by the expert so far. Such a weighting scheme is known more generally as exponential gradient descent, where the number of mistakes or the error rate appears in the exponent of the weight. Although most work is in the online learning model, Freund et al. [4] recently established a theoretical result in the offline batch model. In their work, the final prediction is made by taking the weighted average of all hypotheses, weighted exponentially w.r.t. their training errors. They showed that this weighting scheme can protect against the overfitting problem commonly encountered by algorithms that predict with the best (single) hypothesis, and hence is more stable. Out of curiosity, we decided to investigate whether adopting an exponential weighting scheme in AdaBoost improves prediction accuracy. Unfortunately, we found a decrease in prediction accuracy. This preliminary study and an earlier work [6] of the first author (see discussion below) suggest that a better exponential weighting scheme variant of AdaBoost may have to take the unlabeled test instance into consideration when assigning the hypotheses' weights. Going back to the framework of prediction using a pool of experts, suppose the experts' predictions and the actual outcome depend on some input (instance) in each trial. Notice that the weighted majority algorithm and its variations do not make use of information specific to the input when calculating the weights of the experts. This may turn out to be a missed opportunity, as this information may help to determine which experts are likely to predict correctly. To illustrate this, consider the following example with a boolean instance space {0, 1}^n and two experts, E0 and E1, that always give opposite predictions. If E0 makes at most a small number, m0, of mistakes when one crucial component of the instance is set to 0, and E1 makes at most m1 mistakes when that crucial component is set to 1, then the weighted majority and similar schemes can be forced to make a mistake on almost every point in the instance space (more precisely, 2^n − |m0 − m1| mistakes). This remains true even if the learner uses table lookup to remember all the previous mistakes. However, if the learner uses E0's predictions when the crucial input component was set to 0 and E1's predictions otherwise, then the learner makes at most m0 + m1 mistakes when it maintains a table of labeled counterexamples. In other words, although neither expert is
very competent, they become competent collectively if we consider restricting the use of each expert to the appropriate subset of the instance space. Unfortunately, algorithms that employ the weighted majority technique do not take advantage of the above situation. Hence, Helmbold, Pitt and the first author [6] proposed a theoretical framework to capture this notion of 'the whole is often greater than the sum of its parts' by considering the regions in which the experts are competent. Within this framework, we established various positive theoretical results. The next natural step is to perform experimental work to empirically verify that the notion of competency regions is superior to simply taking a weighted vote of the experts. In trying to do so, we realized that on-line prediction using a pool of experts bears great similarity to the ensemble methods in that both use a simple weighting scheme where each expert, or base classifier, has the same weight for all instances. Further, the base classifiers in AdaBoost are trained using different distributions, hence we expect their competency regions to be different. Thus, we decided to apply the idea of competency regions to AdaBoost and also to adopt the exponential weighting scheme.
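A small simulation of the two-expert example above (taking m0 = m1 = 0 for simplicity) illustrates the point; even on a randomly ordered stream — rather than the adversarial ordering behind the 2^n − |m0 − m1| bound — the weighted majority learner already errs on roughly half the trials, while the competency-region learner errs on none. All concrete choices below (the target labelling, β = 0.5, the stream length) are ours:

import numpy as np

rng = np.random.default_rng(1)
n = 2000
X = rng.integers(0, 2, size=(n, 5))
crucial = X[:, 0]                    # the 'crucial component' of each instance
target = X[:, 1] ^ X[:, 2]           # an arbitrary target labelling (assumption)

# E0 is correct whenever the crucial component is 0, E1 whenever it is 1,
# and the two experts always disagree.
e0 = np.where(crucial == 0, target, 1 - target)
e1 = 1 - e0

w, beta = np.ones(2), 0.5            # weighted majority with halving updates
wm_mistakes = 0
for t in range(n):
    preds = np.array([e0[t], e1[t]])
    vote = 1 if (w * preds).sum() > (w * (1 - preds)).sum() else 0
    if vote != target[t]:
        wm_mistakes += 1
    w[preds != target[t]] *= beta

# A learner restricted to the competent expert in each region never errs here.
region_pred = np.where(crucial == 0, e0, e1)
print(wm_mistakes, int((region_pred != target).sum()))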
3.2 The Algorithm iBoost
We started by experimenting with AdaBoost, but with the weight of each base classifier adjusted according to whether there is evidence suggesting that it may predict well for that specific test instance (in addition to its overall training error εt). The experiment was implemented on top of the open source machine learning software Weka 3 [18], provided by the University of Waikato in New Zealand. We call our algorithm, which is shown in Figure 2, iBoost (instance-based boosting). We use the decision tree inducer J48 in the Weka software package (which implements a version of C4.5) to construct the base classifiers. We modify AdaBoost so that in the t-th iteration, after the base decision tree classifier ht has been constructed, we label each example as '√' or '×' depending on whether it is labeled correctly by ht. We then use the decision tree inducer J48 to learn from this newly labeled sample a competency predictor ct for predicting whether ht's prediction can be trusted on a given unlabeled instance. Following the spirit of the weighted majority paradigm and the work of Freund et al. [4], we adopt the exponential weighting scheme by setting the initial weight of each base classifier ht to e^{−εt}. Given an unlabeled instance x, the weight of the base classifier ht does not depend solely on ht's estimated error rate εt, but also on whether ct(x) is '√' or '×'. If ct(x) = '√' then there is evidence that the prediction of ht on x's label can be trusted. Thus, ht's weight should be increased. In this case, we treat ct as another expert that makes the same prediction on x as ht. It receives a weight of e^{−ε′t}, where ε′t is ct's estimated error rate on S. Thus, iBoost sets ht(x)'s weight αt(x) to ηe^{−εt} + (1 − η)e^{−ε′t}. Here, η is a parameter between 0 and 1 that sets the relative importance of the two experts ct and ht. Unless explicitly stated otherwise, we shall assume throughout this paper that η = 0.5. On the other hand, if ct(x) = '×' then it suggests that ht may not be competent in predicting x's label. Here, we need to reduce the weight of ht. We do this by
Given: S = {(x1, y1), ..., (xm, ym) : xi ∈ X, yi ∈ Y = {−1, +1}} and the number of iterations T.
Same code as the original AdaBoost, but with the following addition in each iteration t:
  Create a sample St from S by labeling each example xi as '✓' or '×' depending on whether ht(xi) = yi.
  Train competency predictor ct using St.
  Compute ct's estimated error rate, ε't, on S.
Output the final classifier:
  H(x) = sign( Σ_{t=1}^{T} αt(x) ht(x) ),
where
  αt(x) = ηe^(−εt) + (1 − η)e^(−ε't)                              if ct(x) = '✓'
  αt(x) = max( 0.0001 e^(−εt), ηe^(−εt) − (1 − η)e^(−ε't) )       if ct(x) = '×'

Fig. 2. iBoost for binary class problems
treating ct as another expert that predicts ¬ht(x), the opposite of ht's prediction. As before, ct gets a weight of e^(−ε't), and the overall weight of ht is set to ηe^(−εt) − (1 − η)e^(−ε't). We avoid using negative weights because our preliminary investigation indicates that having negative weights decreases prediction accuracy. Thus, if this modified weight is zero or negative, we set it to 0.0001e^(−εt). We choose not to set the weight to 0 in the latter situation just in case all the ct's predict '×'.
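As a concrete illustration of the weighting in Figure 2, the sketch below (our own Python, not the authors' Weka/J48 implementation) computes the instance-dependent weight αt(x) and the resulting weighted vote; base_clfs, comp_preds and the two error-rate lists are assumed to come from an AdaBoost-style training loop that also fits one competency predictor per iteration.

```python
import math

def alpha_t(ct_says_correct, eps_t, eps_ct, eta=0.5):
    """Instance-dependent weight of base classifier h_t, following Figure 2.

    ct_says_correct : True if the competency predictor c_t marks the instance as trustworthy
    eps_t           : estimated error rate of h_t on the training sample
    eps_ct          : estimated error rate of c_t on the training sample
    eta             : relative importance of h_t vs. c_t (0.5 unless stated otherwise)
    """
    w_h = math.exp(-eps_t)    # exponential weight of the base classifier
    w_c = math.exp(-eps_ct)   # exponential weight of the competency predictor
    if ct_says_correct:
        return eta * w_h + (1.0 - eta) * w_c
    # c_t votes against h_t: subtract, but never let the weight drop to zero or below
    return max(0.0001 * w_h, eta * w_h - (1.0 - eta) * w_c)

def iboost_predict(x, base_clfs, comp_preds, eps_h, eps_c, eta=0.5):
    """H(x) = sign(sum_t alpha_t(x) h_t(x)) for labels in {-1, +1}."""
    total = sum(alpha_t(c(x), e_h, e_c, eta) * h(x)
                for h, c, e_h, e_c in zip(base_clfs, comp_preds, eps_h, eps_c))
    return 1 if total >= 0 else -1
```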
3.3 Previous Related Work
The idea of adjusting weights based on the unlabeled test instance was pioneered earlier in neural network research by Jordan and Jacobs [7]. In their work, the predictions of the experts (i.e., base classifiers) are combined by a tree-architecture neural network. The weights in the neural network are obtained by gating networks that take the test instance as input. Thus, the test instance effectively determines the weights. A closely related idea is that of RegionBoost, proposed by Maclin [12]. In his work, he aimed to predict, for each base classifier, the probability that the test instance is misclassified. He proposed to replace the base classifier's estimated (overall) error rate by this probability when assigning a weight to the base classifier. He studied the use of k-nearest neighbor and neural network approaches for estimating this probability. Using UCI data sets, he illustrated that RegionBoost has
a slight edge over AdaBoost. iBoost is very similar to RegionBoost in the sense that both try to establish a better assignment of base classifiers' weights according to the test instance. However, the exponential weighting scheme is motivated by the work of Freund et al. [4] and earlier COLT work on the weighted majority weighting scheme. More importantly, the improvements obtained by iBoost are larger than those of RegionBoost. Further, while RegionBoost is even more susceptible to overfitting than AdaBoost [12], iBoost seems to better reduce the problem of overfitting (see Section 4). This is not surprising since Freund et al. [4] have established theoretically that (weighted) averaging of experts' predictions using an exponential weighting scheme protects against the overfitting problem. Besides taking a weighted vote, another popular way of combining base classifiers' predictions is stacking [19]. In stacking, an inducer is used to learn how to combine the predictions of the base classifiers. In this stacking framework, Todorovski and Džeroski [16] proposed constructing a decision tree that selects a (single) base classifier to use. As in their work, iBoost invokes a decision tree inducer to learn to combine the base classifiers' predictions. However, iBoost constructs multiple decision trees, one for each base classifier. Instead of selecting a single predictor, we take a weighted vote of the predictions. Further, their work is concerned with combining heterogeneous classifiers produced by different inducers, while iBoost is concerned with combining homogeneous classifiers obtained by bootstrap sampling (or reweighting) and using the same inducer.
4
Results
Improved Prediction Accuracy. At first we ran the algorithm through 32 UC Irvine data sets for classification problems over 10 iterations (see Table 1). Our iBoost algorithm beats AdaBoost on 19 data sets, loses to AdaBoost on 6 data sets and draws on 6 data sets. The average improvement based on 10-fold cross-validation is about 0.6%, which is not extremely impressive. However, upon closer inspection, there is a significant number of data sets where AdaBoost predicts with error less than 5%, probably very close to the actual noise rate. This leaves very little room for improvement. Nevertheless, on the 12 data sets where AdaBoost achieves accuracy in the 90% range (see Table 1(C)), we still manage to squeeze in an improvement of 0.31% after 100 iterations, with iBoost beating AdaBoost on 6 data sets and losing on 2 data sets. In the 70+% range, the improvement is more noticeable (see Table 1). Among these 10 data sets, iBoost consistently improves over the performance of AdaBoost, except for one or two data sets. Further, the difference in their average performance widens from 0.80% at 10 iterations to 1.14% at 100 iterations. We performed Student's one-sided t-test on the improvements obtained at 50 iterations, where AdaBoost achieves its best performance. The levels of significance are tabulated in Table 2(A), which shows that for many data sets, the confidence that iBoost outperforms AdaBoost is near or above 90%. The improvements are slightly more significant if we allow the mixing rate η to vary in increments of 0.1. Here, for 6 out of the 10 data sets, the best performance is still
Table 1. The performance of iBoost vs. AdaBoost. Those numbers in bold face (in the original) indicate the winner between AdaBoost and iBoost with the same number of iterations

(A) Data Sets in the 70+% range
                         Iter = 10        Iter = 25        Iter = 50        Iter = 100
Data set                 Ada    iBoost    Ada    iBoost    Ada    iBoost    Ada    iBoost
A1  Balance-Scale        78.08  78.24     76.00  77.60     75.36  76.80     74.56  76.16
A2  Breast-Cancer        69.58  72.38     70.28  73.78     70.28  74.81     70.28  74.48
A3  Credit (G)           72.60  74.90     75.00  75.70     74.40  76.00     74.10  74.70
A4  Diabetes             73.31  73.70     73.70  73.96     72.79  73.57     73.57  74.58
A5  Glass                73.83  74.77     76.17  75.23     77.57  78.04     78.04  79.44
A6  Heart-C              77.89  77.23     78.55  79.87     78.55  79.87     79.54  80.63
A7  Heart-H              76.19  76.87     77.21  77.21     77.89  78.57     78.23  77.99
A8  Heart-Statlog        80.00  80.37     81.85  81.48     82.59  82.59     81.11  81.95
A9  Sonar                79.33  80.29     84.62  83.65     84.62  85.10     83.65  84.72
A10 Vehicle              76.48  76.60     78.25  77.54     78.72  78.37     77.42  77.31
    Average              75.73  76.53     77.16  77.60     77.28  78.37     77.05  78.19
    Win                  1      9         4      5         1      8         2      8

(B) Data Sets in the 80+% range
                         Iter = 10        Iter = 25        Iter = 50        Iter = 100
Data set                 Ada    iBoost    Ada    iBoost    Ada    iBoost    Ada    iBoost
B1  Audiology            84.07  84.96     84.96  84.96     84.96  84.96     84.96  85.50
B2  Autos                82.44  83.90     84.39  84.88     85.37  87.32     86.34  86.93
B3  Horse Colic          83.97  83.97     82.07  84.78     81.79  85.87     81.79  85.43
B4  Credit (A)           84.93  85.80     86.38  87.10     85.94  86.52     87.10  87.59
B5  Heart-Statlog        80.00  80.37     81.85  81.48     82.59  82.59     81.11  81.95
B6  Hepatitis            83.87  81.94     85.16  81.94     84.52  85.16     85.81  85.46
B7  Lymphography         83.78  81.08     82.43  82.43     84.46  83.11     83.78  83.78
B8  Waveform             81.08  81.54     82.92  84.06     84.58  84.60     84.94  84.90
    Average              83.02  82.94     83.39  83.95     84.28  85.02     84.48  85.18
    Win                  2      5         2      4         1      5         2      5

(C) Data Sets in the 90+% range
                         Iter = 10        Iter = 25        Iter = 50        Iter = 100
Data set                 Ada    iBoost    Ada    iBoost    Ada    iBoost    Ada    iBoost
C1  Anneal               99.67  99.78     99.67  99.67     99.67  99.67     99.67  99.67
C2  Breast-W             95.85  95.57     95.99  96.28     95.99  96.28     96.42  96.57
C3  Hypothyroid          99.73  99.73     99.68  99.68     99.71  99.71     99.71  99.71
C4  Ionosphere           93.73  93.73     95.16  94.59     94.87  94.30     94.87  94.87
C5  Iris                 94.67  94.67     94.67  94.00     94.00  94.67     94.67  94.67
C6  Kr-Vs-Kp             99.50  99.47     99.50  99.47     99.53  99.53     99.47  99.50
C7  Segment              98.57  98.61     98.70  98.70     98.57  98.57     98.79  98.74
C8  Sick                 99.15  99.07     99.02  98.99     99.15  99.05     98.99  98.91
C9  Soybean              92.83  92.97     92.39  94.00     92.53  94.14     92.39  94.29
C10 Vote                 94.94  95.17     94.94  95.86     94.94  95.40     94.94  95.40
C11 Vowel                92.22  93.03     95.25  96.06     96.26  96.57     96.77  97.17
C12 Zoo                  95.05  95.05     95.05  96.04     95.05  96.04     95.05  96.04
    Average              96.33  96.40     96.67  96.94     96.69  96.99     96.81  97.13
    Win                  3      5         4      5         2      6         2      6
η = 0.5 (i.e., placing equal importance on both base classifiers and competency predictors). The relative performance of iBoost and AdaBoost on the 80+% data sets is somewhat in between the performance on the 70+% and 90+% data sets (see Table 1(B)). Although the average improvements on the 80+% and 90+% data
Table 2. (A) Test of significance using Student's t-test. The average improvement is computed based only on those data sets on which iBoost performs better both with η = 0.5 and with η optimized; these data sets are marked with an asterisk. (B) Using the exponential weighting scheme alone (i.e., η = 1) does not increase accuracy

(A)
                 η = 0.5                    best η
data set   −∆error   t-test          η    −∆error   t-test
A1          1.44     99.2% *        0.5    1.44     99.2% *
A2          4.53     96.0% *        0.5    4.53     96.0% *
A3          1.60     99.6% *        0.5    1.60     99.6% *
A4          0.78     89.2% *        0.5    0.78     89.2% *
A5          0.47     69.8% *        0.1    0.93     82.7% *
A6          1.32     94.9% *        0.6    1.65     97.5% *
A7          0.68     65.0% *        0.5    0.66     65.0% *
A8          0.00     50%            0.7    0.37     83.0%
A9          0.48     62.2% *        0.5    0.48     62.2% *
A10        −0.35     25.0%          0.9   −0.12     38.8%
Average *   1.41     84.5%                 1.51     86.0%

(B)
            accuracy
data set   η = 1    η = 0.5
A1         75.36    76.80
A2         69.93    74.81
A3         74.28    76.00
A4         72.79    73.57
A5         77.10    78.04
A6         78.22    79.87
A7         77.89    78.57
A8         82.22    82.59
A9         85.10    85.10
A10        78.61    78.37
Average    77.14    78.37
sets are smaller than those of the 70+% data sets, the t-test results yield similar levels of confidence². Lastly, the improvement is also more pronounced if we consider only the binary-class data sets².
Competency Predictors Help. It is possible that the improvements obtained by iBoost are not so much due to the use of the competency predictors but rather to the use of the exponential weighting scheme. To show that this is not the case, we compare iBoost with boosting using the exponential weighting scheme but without adjusting the weights using the competency predictors. We can do this by simply setting η = 1. Table 2(B) shows the results obtained for those data sets with accuracy in the 70% range, where iBoost continues to outperform AdaBoost. The results indicate that the improvement is due to the use of the competency predictors.
Margin Distribution. The performance of AdaBoost has been studied in terms of the margins of the training examples [15]. The margin of an example (x, y) is defined to be

margin(x, y) = y Σ_{t=1}^{T} αt ht(x) / Σ_{t=1}^{T} |αt| .

The magnitude of the margin can be interpreted as a measure of confidence in the prediction. It ranges from −1 to +1 and is positive if and only if AdaBoost predicts correctly. Schapire et al. [15] proved that larger margins on the training set translate into a superior upper bound on the generalization error.
² Details omitted due to page limitation. More details will be provided in a full version of this paper.
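The margin of each example under a weighted vote can be computed directly from the definition above. The following sketch is our own and assumes the AdaBoost case of constant weights αt (for iBoost the weights would additionally depend on x, as in Figure 2); it collects the margins whose cumulative distribution is plotted in Figure 3 below.

```python
def margins(examples, base_clfs, alphas):
    """margin(x, y) = y * sum_t alpha_t * h_t(x) / sum_t |alpha_t|, labels in {-1, +1}."""
    denom = sum(abs(a) for a in alphas)
    result = []
    for x, y in examples:
        num = sum(a * h(x) for a, h in zip(alphas, base_clfs))
        result.append(y * num / denom)
    return sorted(result)   # sorted margins make plotting the cumulative distribution easy
```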
[Two panels ('10 iterations' and '50 iterations') plotting the cumulative margin distribution (x-axis: margins from −1 to 1; y-axis: distribution) for AdaBoost and iBoost.]
Fig. 3. The margin distributions of iBoost vs. AdaBoost on the Colic data set

Figure 3 shows the cumulative distributions of the margins for both AdaBoost and iBoost on the H. Colic data set at the 10th and 50th iterations. The negative margins near zero tend to increase, and many become positive. Further, the positive margins increase further towards one. We believe that this shift in margins gives rise to the improvement in accuracy. The observation is typical of almost all data sets on which iBoost performs well.
Minimize Overfitting. Opitz and Maclin [13] illustrated that AdaBoost suffers from overfitting in certain situations. They illustrated this by using artificial data sets with one-sided noise. The data sets are defined over two relevant attributes and four irrelevant attributes. The target concept is a simple linear halfspace based on the relevant attributes. Points are generated within the target concept with 10% of the points being mislabeled. They showed that AdaBoost often produces a significant increase in error (i.e., overfits). Maclin's RegionBoost algorithm worsens this overfitting problem [12]. However, our experiment (see Figure 4) with 250 randomly generated data points indicates that iBoost tends not to overfit as badly. This is not surprising since Freund et al. [4] have shown theoretically that the exponential weighting scheme reduces overfitting.
Bias-Variance Analysis. Another way to understand how iBoost improves prediction accuracy is to decompose the estimated error rate into the bias and variance components [2,9]. The bias component measures how closely the learning algorithm's average guess (over all possible training sets of the same size as the given labeled sample) matches the target, while the variance component measures how much the learning algorithm's guess fluctuates for different training sets of a given size. Experiments performed using Dietterich and Kong's definition of bias and variance [10] suggest that most of iBoost's improvements are due to the reduction of the variance component, while the bias component
[Plot of error (%) vs. number of iterations (5–45) for AdaBoost and iBoost on the one-sided-noise data.]
Fig. 4. iBoost seems to minimize overfitting
remains fairly stable. This is somewhat surprising since in AdaBoost the bias component is much larger than the variance component. However, it is also somewhat unfortunate, since it places a limitation on how much iBoost can outperform AdaBoost. Due to page limitations, we defer further discussion to a later full version of this paper.
5
Conclusion
In this paper, we introduced a variant of boosting, iBoost, that employs the exponential weighting scheme and uses competency predictors to predict whether the base classifiers can be trusted. The latter allows us to adjust the weights of the base classifiers by taking into consideration how confident we are in their predictions on the unlabeled test instance. Although iBoost allows us to adjust the relative importance of the base classifiers and competency predictors, our experiments suggest that it is often best to place equal emphasis on both (i.e., setting η = 0.5). This also implies that the improvement is not solely due to the use of the exponential weighting scheme (i.e., ignoring the competency predictors totally). However, due to the use of the exponential weighting scheme, which has been shown theoretically to reduce overfitting, iBoost tends to overfit less than AdaBoost and RegionBoost. Thus, on the UCI benchmark data sets, iBoost exhibits a significant improvement in accuracy over AdaBoost. A bias-variance decomposition of the error rates obtained by AdaBoost and iBoost suggests that the improvement of iBoost is due to the reduction of the variance component of the error rate. An inspection of the margin distributions on those data sets where iBoost performs
well indicates that iBoost does so by shifting the positive margins, and the negative margins close to zero, towards 1.
Acknowledgement
This work is partially supported by NSF grant CCR-0208935. We would like to thank the reviewers for valuable comments and references to some relevant work. We also want to thank Tom Bylander for helpful suggestions. This project would have taken a longer time to complete if not for the open source code of Weka.
References
1. L. Breiman. Bagging predictors. In Machine Learning, volume 24, pages 123–140, 1996. 245
2. T. G. Dietterich and E. B. Kong. Machine learning bias, statistical bias, and statistical variance of decision tree algorithms. Technical report, Department of Computer Science, Oregon State University, 1995. 254
3. Y. Freund. Boosting a weak learning algorithm by majority. Inform. Comput., 121(2):256–285, September 1995. Also appeared in COLT90. 246
4. Yoav Freund, Yishay Mansour, and Robert Schapire. Why averaging classifiers can protect against overfitting. In Proc. of the 8th International Workshop on Artificial Intelligence and Statistics, 2001. 248, 249, 251, 254
5. Yoav Freund and Robert E. Schapire. Experiments with a new boosting algorithm. In Proc. 13th International Conference on Machine Learning, pages 148–156. Morgan Kaufmann, 1996. 245, 246
6. David Helmbold, Stephen Kwek, and Leonard Pitt. Learning when to trust which experts. In Computational Learning Theory: EuroColt '97, pages 134–149. Springer-Verlag, 1997. 245, 246, 248, 249
7. Michael I. Jordan and Robert A. Jacobs. Hierarchical mixtures of experts and the EM algorithm. CBCL Paper 83, M.I.T. Center for Biological and Computational Learning, August 1993. 250
8. Michael Kearns and Leslie Valiant. Cryptographic limitations on learning Boolean formulae and finite automata. J. ACM, 41(1):67–95, 1994. 246
9. Ron Kohavi and David H. Wolpert. Bias plus variance decomposition for zero-one loss functions. In Proc. 13th International Conference on Machine Learning, pages 275–283. Morgan Kaufmann, 1996. 254
10. Eun Bae Kong and Thomas G. Dietterich. Error-correcting output coding corrects bias and variance. In Proc. 12th International Conference on Machine Learning, pages 313–321. Morgan Kaufmann, 1995. 254
11. N. Littlestone and M. K. Warmuth. The weighted majority algorithm. Inform. Comput., 108(2):212–261, 1994. 247
12. Richard Maclin. Boosting classifiers regionally. In Proceedings of the 15th National Conference on Artificial Intelligence (AAAI-98) and of the 10th Conference on Innovative Applications of Artificial Intelligence (IAAI-98), pages 700–705, Menlo Park, July 26–30 1998. AAAI Press. 250, 251, 254
13. D. Opitz and R. Maclin. Popular ensemble methods: An empirical study. CBCL Paper UMD CS TR 98-1, University of Maryland, 1998. 254
14. Robert E. Schapire. The strength of weak learnability. Machine Learning, 5(2):197–227, 1990. 246
15. Robert E. Schapire, Yoav Freund, Peter Bartlett, and Wee Sun Lee. Boosting the margin: a new explanation for the effectiveness of voting methods. In Proc. 14th International Conference on Machine Learning, pages 322–330. Morgan Kaufmann, 1997. 253
16. Ljupčo Todorovski and Sašo Džeroski. Combining classifiers with meta decision trees. Machine Learning Journal, 2002. To appear. 251
17. L. G. Valiant. A theory of the learnable. Commun. ACM, 27(11):1134–1142, November 1984. 246
18. I. H. Witten and E. Frank. Nuts and bolts: Machine learning algorithms in Java. In Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, pages 265–320. Morgan Kaufmann, 2000. 249
19. D. Wolpert. Stacked generalization. Neural Networks, 5(2):241–260, 1992. 251
Towards a Simple Clustering Criterion Based on Minimum Length Encoding
Marcus-Christopher Ludl (1) and Gerhard Widmer (1, 2)
(1) Austrian Research Institute for Artificial Intelligence, Vienna
(2) Department of Medical Cybernetics and Artificial Intelligence, University of Vienna, Austria
Abstract. We propose a simple and intuitive clustering evaluation criterion based on the minimum description length principle which yields a particularly simple way of describing and encoding a set of examples. The basic idea is to view a clustering as a restriction of the attribute domains, given an example’s cluster membership. As a special operational case we develop the so-called rectangular uniform message length measure that can be used to evaluate clusterings described as sets of hyper-rectangles. We theoretically prove that this measure punishes cluster boundaries in regions of uniform instance distribution (i.e., unintuitive clusterings), and we experimentally compare a simple clustering algorithm using this measure with the well-known algorithms KMeans and AutoClass.
1
Introduction and Motivation
Clustering has long been recognized as a major research topic in various scientific fields, such as Artificial Intelligence, Statistics, Psychology, Sociology and others. Also termed unsupervised learning in Machine Learning, it represents a fundamental technique for Knowledge Discovery in Databases, useful, e.g., for data compression, data understanding, component recovery or class separation. Various methods for grouping a set of objects have been devised. The different approaches can roughly be categorized as follows:
– Linkage-based: hierarchical methods (top-down/dividing, bottom-up/agglomerating), graph partitioning, ...
– Density-based: kernel-density estimation, grid clustering, ...
– Model-based: partitional clustering techniques, k-means, self-organizing maps, MML/MDL, mixture modeling, ...
– Spectral-based: clusterings based on the eigenvectors of the (normalized) affinity matrix
[4] gives an excellent overview of clustering techniques which can be adopted for use with large data sets. For the quite recent spectral approaches to clustering and segmentation, we refer the reader to [7] and [14]. MDL-based approaches are attractive because they provide a non-parametric way of automatically deciding on the optimal number of clusters [8], by trading
off model complexity and fit on the data. However, up to now MDL (or MML)1 criteria have been used in clustering predominantly in mixture modelling algorithms (e.g., Snob [13] or AutoClass [3]), where they proved to be very effective. The downside, however, is that mixture models are extremely hard to interpret directly. Most clustering approaches make no effort of abstracting from the training instances and giving a description of the clusters in terms of the spaces covered (e.g. by their shapes). Overlapping clusters, probability distributions over cluster memberships (e.g. AutoClass) or centroid-based models (e.g., KMeans, which implicitly assumes that each cluster is modeled by a spherical Gaussian distribution [2]) make it difficult to develop a global viewpoint of the dependencies and subspaces created by the clusters. For the non-technical person, on the other hand, a clustering is simply a partition, possibly described by simple conditions on properties of cluster members. Thus, in this paper we focus on a novel MDL-based clustering criterion which has two main advantages: Firstly, it should produce easily interpretable clusterings based on space descriptions (here, we focus on the special case of hyperrectangles), while achieving the same quality of results as state-of-the-art clustering algorithms. Secondly, as an MDL-based method, it yields a non-parametric way of estimating the number of clusters. The basic idea is extremely simple. For the moment, we restrict ourselves to a simple attribute-value framework, where each example is a vector of values. We will view a clustering as an implicit restriction on the values allowed for attributes, given information about the cluster membership of an instance. Under this view, it is straightforward to formulate a generic MDL-type measure for the encoding length of a clustering. For the special case of clusters represented by hyper-rectangles and instances distributed uniformly within clusters (not within the instance space), we will derive an operational instantiation of this criterion and prove theoretically that this criterion behaves in an ‘intuitively correct’ way; that is to say, it punishes cluster boundaries in regions of uniform instance distribution (i.e., unintuitive clusterings). In experiments with synthetic and real-world datasets, the clusterings selected by our measure will be compared to the results of two ‘standard’ clustering algorithms (AutoClass and KMeans). The results produced by our measure are at least competitive (with respect to the quality measures we will apply in the experiments), even if the data to be clustered do not satisfy the uniform distribution condition, and the ‘true’ clusters are not generated by hyper-rectangles. 1
Although related concepts, there are subtle differences between MML and MDL [1]. In this paper, we will only be interested in the basic idea of selecting the hypothesis which yields the smallest two-part encoding, i.e. minimizes the sum of the length of the description of the model and the description of the data, relative to the model. We will not try to approximate an unknown probability distribution. In the following we will be using the term minimum description length (MDL) simply to mean this basic idea, not to relate our approach to either MDL or MML.
2
Minimum Length Encoding
To apply the MDL principle to the clustering task, we have to quantify exactly what amount of information a certain grouping of the examples contains, i.e., in what way and how much it reduces the cost for storing and/or transmitting the data in question. This can be answered by regarding a clustering as a restriction on what values are allowed for certain attributes, or as a narrowing of the involved intervals. An intuitive example should make the underlying idea clear: Consider an "object" o, which we know is a living being (animal or human). Lacking more information, we would probably assume the number of legs to be an integer value from, say, [0, 8], the height to be a real value from ]0m, 10m], and binary values for attributes like canfly or sex and so on. Now, saying that o is a dog would maybe restrict the attributes as follows: legs = 4, height ∈ [0.1m, 1m] and canfly = NO. This classification would not affect the attribute sex. So, tagging the object with an additional class attribute, which in this case has the value dog, gives information about the object by restricting the possibilities.
2.1 Clustering Message Length
Definition 1. Let R be the relation scheme of a relational database r, n the number of tuples (i.e., examples) and c the number of clusters (i.e., distinguished classes). The clustering message length of the two-part clustering encoding, consisting of theory and data, is defined as follows:

c.m.length(clustering) = ld n + c · length(clusterdescription) +
    Σ_{ci ∈ Clusters} |elements(ci)| · length_{ci}(element) + n ld c
The first line of this definition captures the theory part of the encoding: ld n is the cost for specifying an integer number in [0, n] (the number of clusters: c ≤ n). The second line specifies the code length: |elements(ci)| is the number of examples in ci, while length_{ci} encodes the cost for specifying an exact position in ci, i.e., one example in this cluster. The term n ld c is important: it accounts for the tagging, i.e., the mapping of an example to one of the clusters (note: this is the worst case cost). Note that we distinguish between message length (the length of the complete two-part encoding) and data code length (the length of the data encoded relative to the theory).
2.2 Uniform Distribution over Rectangles
To illustrate the basic principle, as a special case of the aforementioned clustering message length, we now only consider rectangular clusters (parallel to the axes) and a uniform distribution of examples within these clusters. I.e. we assume each position in a cluster to be equally likely and encode the cost for specifying one such position: This measures the cost for encoding one example relative
to the restrictions of the corresponding cluster. Note that we restrict ourselves to the numerical case here (with the instance space "discretized" according to the resolution of measurement – see below).
Definition 2. With the specifications of Definition 1 holding, let m be the number of attributes, vj the number of values that attribute j (j ∈ [1, m]) can assume and v_{i,j} the number of values that attribute j can assume in cluster i (i ∈ [1, c]). The rectangular uniform message length of the two-part clustering encoding is defined as follows:

r.u.m.length(clustering) = ld n + c · 2 · Σ_{j=1}^{m} ld vj +
    Σ_{i=1}^{c} |elements(ci)| · Σ_{j=1}^{m} ld v_{i,j} + n ld c
Again, the first line encodes the theory: Specifying rectangles parallel to the axes necessitates 2 values (min and max) from the domain of each attribute. The second line encodes the data, i.e., the examples relative to the (restricted) domains of the attributes. This definition holds only for numerical domains, of course. In such a case, we can define vj = (bj − aj)/ρj, where [aj, bj] is the domain of attribute j and ρj the resolution of measurement; v_{i,j} and ρ_{i,j} can be defined accordingly. Note that this does not mean that we discretize the data – we merely calculate the number of possible positions to be encoded within each cluster by taking into account the resolution of measurement (as given by the data). Symbolic attributes would necessitate, e.g., a description length of one bit per value (if we allow all possible combinations). Furthermore, we may not need min and max values for all attributes in the numerical case. We chose this particular encoding for the sake of simplicity and to illustrate the basic principle.
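Definition 2 can be transcribed almost literally into code. The following sketch is our own formulation; it assumes purely numeric attributes, a rectangular bounding box per cluster, and a single resolution of measurement ρj per attribute (rather than a per-cluster ρ_{i,j}).

```python
import math

def rum_length(n, cluster_sizes, domain_widths, cluster_widths, resolutions):
    """Rectangular uniform message length of a clustering (sketch of Definition 2).

    n              : total number of examples
    cluster_sizes  : |elements(c_i)| for each cluster i
    domain_widths  : (b_j - a_j) for each attribute j
    cluster_widths : cluster_widths[i][j] = extent of cluster i along attribute j
    resolutions    : resolution of measurement rho_j for each attribute j
    """
    ld = math.log2
    c = len(cluster_sizes)
    # theory: number of clusters plus two values (min and max) per attribute and cluster
    theory = ld(n) + c * 2 * sum(ld(w / r) for w, r in zip(domain_widths, resolutions))
    # data: each example is encoded relative to the restricted domains of its cluster
    data = sum(size * sum(ld(w / r) for w, r in zip(widths, resolutions))
               for size, widths in zip(cluster_sizes, cluster_widths))
    # tagging: mapping each example to one of the c clusters (worst case)
    return theory + data + n * ld(c)
```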
2.3 The R.U.M.Length is Well-Behaved
Clearly, the theory length contributes only an additive constant to the overall message length if we vary the number of examples, but not the number, sizes and positions of the clusters. So, with an increasing sample size from the same source, the exact way of encoding the clusters becomes less and less important. Instead, it is the data code length that must satisfy an important property, which we prove below.
Lemma 1. The rectangular uniform data code length (the data code part of the r.u.m.length) of a sequence of uniformly distributed values, split into two clusters by one inner split point, is equal to or higher than the rectangular uniform data code length of the complete cluster.
Proof: Let n be the number of uniformly distributed examples in an interval [a, b]. Let s be an inner split point, i.e., a < s < b, which splits the values into two regions. Let p = (s − a)/(b − a), and let n1 and n2 be the numbers of examples in these two regions, respectively. Because the examples are uniformly distributed, we may assume that n1 = pn and n2 = (1 − p)n. Furthermore, we will use the abbreviation h = b − a. Now, to prove the lemma, we have to show that

n ld(h/ρ) + n ld 1 ≤ np ld(hp/ρ) + n(1 − p) ld(h(1 − p)/ρ) + n ld 2
ld h^n ≤ ld (hp)^{np} + ld (h(1 − p))^{n(1−p)} + ld 2^n
h^n ≤ h^{np} p^{np} h^{n(1−p)} (1 − p)^{n(1−p)} 2^n
h ≤ h p^p (1 − p)^{1−p} 2
p^p (1 − p)^{1−p} ≥ 1/2

To prove this last inequality, we show that f(p) = p^p (1 − p)^{1−p} has an extreme value at (1/2, 1/2) and that this is a minimum in ]0; 1[. First of all, note that f(1/2) = 1/2. To differentiate f(p), we use the product rule and differentiate the logarithmic forms of the factors.² This yields

f'(p) = p^p (1 − p)^{1−p} (ln p − ln(1 − p))
f''(p) = p^p (1 − p)^{1−p} [ (ln p − ln(1 − p))² + 1/p + 1/(1 − p) ]

Setting f'(p) = 0 yields p = 1/2, thus (1/2, 1/2) is the only extremum in ]0; 1[. Also, f''(1/2) > 0, which means that (1/2, 1/2) is a minimum. This means that f(p) = p^p (1 − p)^{1−p} ≥ 1/2 for p ∈ ]0; 1[. Note that this proof works only because we assume a uniform distribution of values in the interval [a; b]. QED
Theorem 1. The rectangular uniform code length of a sequence of uniformly distributed values, split into two or more clusters, is equal to or higher than the rectangular uniform code length of the complete cluster.
Proof: This follows by induction from Lemma 1.
Remark: By this theorem we know that the rectangular uniform message length punishes "unnecessary" and "unintuitive" splits within a uniformly distributed region. Even more than that: We may expect the best clustering, i.e., the one with the lowest r.u.m.length, to be one where regions of different density are separated from each other.
² Logarithmic differentiation is a not too well known technique for finding the derivative of a form such as f(x) = g(x)^{h(x)}. Taking the logarithm, differentiating both sides and solving for f'(x) yields: f'(x) = f(x) [ h'(x) ln g(x) + h(x) g'(x)/g(x) ].
3
Experimental Evaluation
For evaluating the capabilities of such an MDL-based clustering measure, we tested the rectangular uniform message length, together with a brute-force stochastic algorithm, on several artificial datasets and two real-world datasets.
3.1 Evaluation Methodology
Recall: For computing recall values we made use of the methodology introduced in [5], yet slightly modified the definitions. In the following we will briefly explain the necessary terms; for further details, we refer the reader to the original paper. A component is a set of related entities; in our context it may suffice to regard a component as a set of examples, i.e., a single cluster. Let References be the set of reference components in a clustering (the "true" clusters) and Candidates the set of candidate components (the clustering produced by the algorithm). The basic tool for comparing candidate and reference components is the degree of overlap between a candidate and a reference:

overlap(R, C) = |elements(R) ∩ elements(C)| / |elements(R) ∪ elements(C)|
The accuracy of a matching between two sets of components is then based on the degree of overlap:

acc({R1, ..., Rm}, {C1, ..., Cn}) = overlap( ∪_{i=1}^{m} elements(Ri), ∪_{j=1}^{n} elements(Cj) )
Furthermore, in order to identify corresponding subcomponents (e.g., candidates that are similar only to a part of a reference component), the following partial subset relationship is used (0.5 < p ≤ 1.0 is a tolerance parameter, so no candidate can be involved in more than one such matching):

C ⊆p R ⇔ |elements(R) ∩ elements(C)| / |elements(C)| ≥ p
Often a clustering algorithm might not be able to identify the reference components exactly. In such cases one reference component might be (partly) covered by more than one candidate component. Thus, to account for different "granularities", for each reference component we compute the set of candidate components which at least partly cover it (note however that p > 0.5):

matchings(R) = {C | C ⊆p R}

Based on these concepts and abstracting from granularity, we can now define recall, which provides a summarizing value for how well the learned clusters "reconstruct" the original classes (i.e., coincide with the reference clustering).
recall = Σ_{R ∈ References} acc({R}, matchings(R)) / |References|
Entropy: Additionally we also computed entropy values for comparing candidate clusterings, using the definitions from [12]. Briefly: For each cluster C within a candidate clustering and each cluster R in the reference clustering we first calculate the probability that a member of candidate cluster C belongs to reference cluster R. Let this be p_{C,R}. The entropy of each candidate cluster is then calculated as follows:

E_C = − Σ_R p_{C,R} log p_{C,R}
The entropy of a complete candidate clustering is then calculated as the weighted sum over the entropies of the single clusters (n being the number of instances):

E = Σ_C (|elements(C)| / n) · E_C
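Both evaluation measures can be computed directly from the cluster memberships. The sketch below is our own; it represents every component as a set of example identifiers, uses base-2 logarithms, and takes the tolerance parameter p as above.

```python
import math

def overlap(r, c):
    return len(r & c) / len(r | c)

def recall(references, candidates, p=0.7):
    """Recall of a candidate clustering w.r.t. the reference components (sets of example ids)."""
    total = 0.0
    for r in references:
        # matchings(R): candidates that are partially contained in r (C subset_p R)
        matches = [c for c in candidates if len(r & c) / len(c) >= p]
        merged = set().union(*matches)
        total += overlap(r, merged) if merged else 0.0
    return total / len(references)

def entropy(references, candidates, n):
    """Weighted entropy of the candidate clustering w.r.t. the reference clustering."""
    e = 0.0
    for c in candidates:
        e_c = 0.0
        for r in references:
            p_cr = len(c & r) / len(c)
            if p_cr > 0:
                e_c -= p_cr * math.log2(p_cr)
        e += len(c) / n * e_c
    return e
```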
3.2 Comparative Evaluation
As [5] note, "for comparable evaluations of automatic clustering techniques, a common reference corpus [...] is needed for which the actual components are known." Thus, for comparative experiments we used several synthetic and two real-world datasets with different properties. The first synthetic dataset, introduced by Ripley [10], represents a two-dimensional two-class problem. We used the 250 examples from the training set. The class information was ignored in the learning phase, but was used as the reference clustering for the calculation of the recall values. Figure 1 shows the Ripley training set. Furthermore, we generated two two-dimensional datasets, one of which consisted of four rectangular clusters (two of them slightly overlapping), with 500 instances uniformly distributed within these clusters (Figure 2), while the other dataset contained 1000 instances, normally distributed around three centers (Figure 3). Finally we generated a dataset with 1000 instances and three classes similar to Figure 3, but extended it into four dimensions and used mixed distributions: Two of the classes were normally distributed, the third one uniformly within the bounds of a hyperrectangle. In addition to the aforementioned synthetic datasets we also used two real-world datasets from the UCI machine learning repository (4 and 13 dimensions, respectively), which were originally intended for classification purposes. In this context we simply interpreted the class labels as reference clusters.3
For the purpose of these experiments we assumed that the class distribution within the instance space correlates with a possible clustering of the dataset. Of course, there is no theoretical argument supporting the validity of this assumption.
Fig. 1. The Ripley-dataset: two dimensions, two classes

Fig. 2. The Uniform-dataset: two dimensions, four classes

Fig. 3. The Normal-dataset: two dimensions, three classes
Table 1. Comparative results for several synthetic datasets (see text). Listed are recall values, entropies and number of classes found

dataset          att.  cl.  KMeans                AutoClass             R.U.M.
ripley (250)     2     2    71.07% / 0.32 / 8     76.18% / 0.30 / 6     71.18% / 0.43 / 4
uniform (500)    2     4    97.53% / 0.06 / 7     93.90% / 0.09 / 7     98.73% / 0.03 / 5
normal (1000)    2     3    88.52% / 0.24 / 3     86.89% / 0.22 / 5     90.47% / 0.21 / 3
mixed (1000)     4     3    96.65% / 0.08 / 3     96.67% / 0.08 / 7     94.43% / 0.12 / 4
iris (150)       4     3    80.74% / 0.29 / 3     79.74% / 0.15 / 7     77.14% / 0.27 / 4
wine (178)       13    3    43.33% / 0.65 / 4     94.87% / 0.09 / 4     90.63% / 0.19 / 4
To be able to judge the performance of the MDL evaluation scheme without having to devise a clustering algorithm, we used a stochastic procedure: We generated 100000 random rectangular clusterings by first generating a random number of rectangles (3 to 6), then using these rectangular regions as classifiers⁴ and finally evaluating the resulting clusterings by the r.u.m.length. The best so-achieved classification was then compared against the results of AutoClass and KMeans⁵. KMeans is known to converge to a local optimum of its quality measure [11], so it is reasonable to assume that the final result will be a near-optimal clustering according to this measure. We initialized KMeans with the number of initial centroids ranging from 2 to 10 and accepted the best result on the datasets. For AutoClass we used the default parameters. In all of the evaluations we set the tolerance level p to 0.7, as suggested by [5] (see Section 3.1). Refer to Table 1 for the results. As can be seen, the granularity (number of clusters found) of the best rectangular clustering according to the r.u.m.length is in most cases better (i.e., lower) than the one achieved by AutoClass or KMeans. In addition, the r.u.m.length yields clusterings with about the same recall and entropy levels (no significance tests applied) – in our rather small experimental setting, these are quite promising results.
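The brute-force procedure just described can be sketched as follows. This is our own illustrative code, not the implementation used in the experiments; in particular, the handling of overlapping rectangles (which were treated as separate clusters) is left to the hypothetical score function.

```python
import random

def random_rectangle(bounds):
    """Draw a random axis-parallel rectangle inside the given attribute bounds."""
    rect = []
    for lo, hi in bounds:
        a, b = sorted(random.uniform(lo, hi) for _ in range(2))
        rect.append((a, b))
    return rect

def best_random_clustering(data, bounds, score, trials=100000):
    """Keep the random rectangular clustering with the lowest score (e.g. r.u.m.length)."""
    best, best_score = None, float("inf")
    for _ in range(trials):
        rects = [random_rectangle(bounds) for _ in range(random.randint(3, 6))]
        s = score(data, rects)   # score turns the rectangles into clusters and evaluates them
        if s < best_score:
            best, best_score = rects, s
    return best, best_score
```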
3.3 Comparing the Measures
Even under the assumption that both KMeans and AutoClass produce clusterings which are (locally) optimal relative to a specific clustering criterion, it might be argued that the effects from the first set of experiments could be due to the massive "search" performed by the stochastic method (versus the deterministic procedures used by KMeans and AutoClass). Thus, in a second set of experiments, we abstracted from specific algorithms and aimed at a direct comparison of the best clusterings according to
⁴ Overlapping regions were treated as separate clusters, so that the number of theoretically possible clusters was much higher than 6.
⁵ We used the freely available version by Dan Pelleg [9].
Table 2. Results for comparing the measures r.u.m.length against distortion (as used by KMeans). Listed are recall values, entropies and number of classes found

instances  noise level   distortion            R.U.M.
100        05%           86.31% / 0.13 / 4     86.31% / 0.13 / 4
100        15%           79.38% / 0.30 / 5     86.25% / 0.24 / 4
100        30%           09.93% / 0.91 / 5     23.00% / 0.92 / 2
1000       05%           70.52% / 0.17 / 5     69.25% / 0.20 / 3
1000       15%           81.43% / 0.28 / 4     80.08% / 0.33 / 4
1000       30%           74.14% / 0.40 / 5     80.36% / 0.38 / 4
the r.u.m.length and the measure used by KMeans. After all, what we would like to compare are the clusterings which a suitable algorithm could theoretically produce using the respective measures. KMeans uses the following quality measure [9]:

distortion_Φ = (1/R) · Σ_{x ∈ Instances} d²(x, Φ(x))
with R being the total number of points and Φ representing a clustering, i.e., a mapping which associates a centroid with every instance. To compare this measure against our MDL-based criterion, we again used our mixed-dataset from Section 3.2. In addition we applied background noise (attribute noise) of varying intensity to the data: At a 30% noise level, e.g., 30 percent of the instances were randomly (uniformly) distributed within the instance space. Such examples were not classified into any of the three reference clusters, so that actually there were four different classes now, one of which represented noise. By the same process as before, we randomly generated a small number of hyperrectangles (3 to 6) within the bounds of the instance space and used these rectangular regions to classify the instances. This time, however, we evaluated the resulting clusterings both by the r.u.m.length and by distortion as used by the KMeans algorithm. We repeated this process 100000 times and accepted the best clusterings according to each of the two measures as the respective results. Thus, each criterion was tested on exactly the same set of random clusterings. As can be seen in Table 2, both measures achieved good recall values and entropies for low noise settings and gradually deteriorated with increasing noise – unintuitive results (compare the values of 1000/05 against 1000/15) are probably due to the stochastic procedure. In two cases distortion achieved a slightly better fit, whereas in most cases r.u.m.length was able to select a better candidate clustering. In all of the cases, however, the granularity (number of clusters) produced by r.u.m.length was lower than or equal to the one produced by distortion, which leads us to the assumption that this measure produces simpler clusterings.
3.4
Scalability
In our opinion, the experiments conducted so far provide evidence that the clustering message length does indeed yield good clusterings in terms of simplicity compared to state-of-the-art methods, while hardly losing on the quality in terms of recall or entropy. We are, however, aware of the fact that in the context of the special case we evaluated in this paper (the rectangular uniform message length), scalability to high dimensions is problematic for two reasons: Firstly, there is no constructive algorithm yet which uses this criterion, and the stochastic procedure we used in the experiments is, of course, not suitable for dealing with very high dimensions, because the number of necessary trials would explode exponentially. Secondly, the restriction to hyperrectangles yields vastly inefficient clusterings, when the reference clusters have more complex shapes. Thus, in higher dimensions, efficiently determining complex cluster shapes (e.g., rotated bounding boxes, halfspaces etc.) becomes more and more important. This is one of our future research directions.
4
Summary and Further Work
Again, we would like to make it clear that what we presented is not a complete clustering algorithm itself, but an MDL-based clustering criterion which could enable a suitable learner to find the theoretically best clustering based on space descriptions. Additionally, the clustering message length yields a non-parametric criterion for the appropriate number of clusters in a database. Experiments on synthetic and real-world datasets show that the best clustering according to a special case of this criterion in most cases outperforms the clusterings found by AutoClass and KMeans in terms of granularity, while achieving about the same recall and entropy values. In addition, we directly compared our criterion against distortion (the measure used by KMeans) and achieved favorable results. Furthermore, we provided a theoretical proof that this measure – the rectangular uniform message length – is well-behaved. Thus, in our – admittedly – rather small experimental setting we could produce quite encouraging results. One of our current research topics is the theoretical and practical evaluation of the clustering message length in larger settings, the extension to different cluster shapes and the development of a clustering algorithm which optimizes exactly this criterion.6 Finally, we would like to mention that – in contrast to measures based on the normal distribution – the r.u.m.length assigns lower (i.e., better) ratings to clusterings whose boundaries coincide with "edges" in the instance space, where significant changes in the density distribution occur. This criterion could therefore be used as a non-parametric way of detecting "edges" between areas of different densities. An application in a multivariate discretization algorithm,
The basic idea could be to start from an estimated centroid (density center) of the dataset and linearly extend the scope of the cluster to include more and more instances, while calculating the clustering message length for each step.
where such a module is of crucial importance [6] is another one of our current research topics.
Acknowledgements
This research is supported by the Austrian Fonds zur Förderung der Wissenschaftlichen Forschung (FWF) under grant no. P12645-INF. We would like to thank Johannes Fürnkranz and Johann Petrak for helpful discussions and Markus Mottl for his help on the programming language OCaml.
References 1. R. A. Baxter and J. Oliver. MDL and MML: Similarities and differences. Technical report, Dept. of Computer Science, Monash University, Clayton, 1994. (TR 207). 259 2. P. S. Bradley and U. M. Fayyad. Refining initial points for k-means clustering. In Proceedings of the 15th Int. Conference on Machine Learning, 91–99, 1998. 259 3. P. Cheeseman, J. Kelly, M. Self, J. Stutz, W. Taylor, and D. Freeman. Autoclass: A bayesian classification system. In Proceedings of the 5th International Workshop on Machine Learning, 54–64, 1988. 259 4. D. Keim and A. Hinneburg. Clustering techniques for large data sets: From the past to the future. In Tutorial Notes for ACM SIGKDD 1999 International Conference on Knowledge Discovery and Data Mining, San Diego, CA, 1999. 258 5. R. Koschke and T. Eisenbarth. A framework for experimental evaluation of clustering techniques. In Proceedings of the International Workshop on Program Comprehension (IWPC2000), Limerick, Ireland, 2000. IEEE. 263, 264, 266 6. M.-C. Ludl and G. Widmer. Relative unsupervised discretization for association rule mining. In Proceedings of the 4th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD2000), Lyon, 2000. 269 7. A. Y. Ng, M. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. In Proceedings of NIPS 14, 2002. (to appear). 258 8. J. J. Oliver, R. A. Baxter, and C. S. Wallace. Unsupervised learning using MML. In Proceedings of the 13th International Conference on Machine Learning, 364–372, San Francisco, CA, 1996. Morgan Kaufmann. 258 9. D. Pelleg and A. Moore. Accelerating exact k-means algorithms with geometric reasoning. In Proceedings of the 5th International Conference on Knowledge Discovery and Data Mining (KDD99), 277–281, 1999. 266, 267 10. B. D. Ripley. Pattern recognition and neural networks. Statistics, 33:1065–1076, 1996. 264 11. S. Z. Selim and M. A. Ismail. K-means-type algorithms: A generalized convergence theorem and characterization of local optimality. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6(1):81–87, 1984. 266 12. M. Steinbach, G. Karypis, and V. Kumar. A comparison of document clustering techniques. In KDD Workshop on Data Mining, 2000. 264 13. C. S. Wallace and D. L. Dowe. Intrinsic classification by MML – the SNOB program. In Proceedings of the 7th Australian Joint Conference on Artificial Intelligence, 37–44, Singapore, 1994. World Scientific. 259
14. Y. Weiss. Segmentation using eigenvectors: A unifying view. In Proceedings of the IEEE International Conference on Computer Vision (ICCV99), 975–982, 1999. 258
Class Probability Estimation and Cost-Sensitive Classification Decisions Dragos D. Margineantu The Boeing Company, Adaptive Systems, Mathematics & Computing Technology P.O.Box 3707, M/S 7L-66, Seattle, WA 98124-2207, USA
[email protected]
Abstract. For a variety of applications, machine learning algorithms are required to construct models that minimize the total loss associated with the decisions, rather than the number of errors. One of the most efficient approaches to building models that are sensitive to non-uniform costs of errors is to first estimate the class probabilities of the unseen instances and then to make the decision based on both the computed probabilities and the loss function. Although all classification algorithms can be converted into algorithms for learning models that compute class probabilities, in many cases the computed estimates have proven to be inaccurate. As a result, there is a large research effort to improve the accuracy of the estimates computed by different algorithms. This paper presents a novel approach to cost-sensitive learning that addresses the problem of minimizing the actual cost of the decisions rather than improving the overall quality of the probability estimates. The decision making step for our methods is based on the distribution of the individual scores computed by classifiers that are built by different types of ensembles of decision trees. The new approach relies on statistics that measure the probability that the computed estimates are on one side or the other of the decision boundary, rather than trying to improve the quality of the estimates. The experimental analysis of the new algorithms that were developed based on our approach gives new insight into cost-sensitive decision making and shows that for some tasks, the new algorithms outperform some of the best probability-based algorithms for cost-sensitive learning.
1
Introduction
The general framework for supervised learning assumes that a set of labeled examples ⟨xi, yi⟩ (called training data) is available, where xi is a vector of continuous or discrete values called attributes and yi is the label of xi. The framework further assumes that there exists an underlying, unknown function f(x) = y that maps the attribute vectors to the set of possible labels. A learned model outputs a hypothesis h(x) which is an approximation of f(x) and minimizes the expected loss on previously unseen examples. In the case of classification, the labels are elements of a discrete set of classes {1, 2, . . . , K}.
The last few years have seen supervised learning and classification methods applied to an increasing variety of applications, such as fraud and intrusion detection, medical and biological data analysis, remotely-sensed image analysis, prediction of natural disasters, time series prediction, and text analysis. While most of the research efforts in classification have studied algorithms that learn models that try to minimize the proportion of errors (mistakes) that are made (or, 0/1-loss), for most (if not all) of the practical applications mentioned above, the learned classifiers are required to minimize a non-uniform loss function that is different from the 0/1-loss. Indeed, in all practical situations different kinds of prediction errors have different costs, and a more realistic performance measure of a classification system is the total loss, calculated as the sum of the costs of all errors made by the system. The algorithms described in this paper assume that, for a K-class problem, a K-by-K loss matrix L is available at learning time. The entry L(i, j) specifies the cost incurred when an example is predicted to be in class i when in fact it belongs to class j. We will further assume that L is stationary, that is, none of the values in L changes during the learning or the decision making process. To illustrate the properties of a loss matrix, let us consider the 2 × 2 cost matrix shown in Table 1. In this example the loss matrix indicates that if an example is labeled by the classifier as class 2 when in fact it belongs to class 1, there will be a loss of 100.5, while labeling an example from class 2 as being in class 1 incurs only a cost of 3.0. Without loss of generality we can assume that the values in L represent dollar amounts (see [15] and [22] for more detailed discussions), and that there are no costs associated with correct decisions ([22] shows that a loss matrix can always be transformed into an equivalent one with zero values on the diagonal). Conceptually, given the general supervised learning framework, there are three major types of strategies for cost-sensitive learning. The most common practical approach to cost-sensitive classification is to manipulate the training data (i.e., modify its distribution) in order to make the 0/1-loss learning algorithm output a hypothesis that minimizes the costs of the decisions for future examples. For two-class problems, the simplest and most common way to do this is to present the learning algorithm with a training set in which the proportions of examples in the two classes are changed according
Table 1. Example of a loss matrix for a 2-class problem

                      Correct Class 1   Correct Class 2
Predicted Class 1          0.0               3.0
Predicted Class 2        100.5               0.0
to the ratio of the cost values [8]. This procedure is called stratification (or rebalancing), and it is usually implemented by undersampling the examples from the less expensive class, or by oversampling the examples from the more expensive class [18]. Another method that uses training data manipulation to learn a cost-sensitive classifier is Domingos' MetaCost [11]. MetaCost is an algorithm that employs the learned class probability estimates for the training instances and relabels them. Then, it trains a 0/1-loss classification algorithm on the relabeled data to output the cost-sensitive hypothesis. The second approach to the cost-sensitive learning problem is to change the internal mechanisms of the algorithm that computes the output hypothesis such that the algorithm will make use of the cost function (as an input parameter) to build the classifier [12,4,16,21,19]. Finally, the third approach uses the class probability estimates computed by the learned model on the unseen (test) instances. If the probabilities for each class given an example x, P(yi|x), are available, x should be labeled with y_opt, the class that minimizes the conditional risk of the labeling decision [13,20]:

y_opt = argmin_{y ∈ Y} R(y|x) = argmin_{y ∈ Y} Σ_{j=1}^{K} P(j|x) L(y, j).        (1)
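Equation 1 translates directly into a decision rule. The following small sketch (our own, not from the paper) picks the label with minimum conditional risk, given a vector of class probability estimates and a loss matrix such as the one in Table 1.

```python
def min_risk_label(probs, loss):
    """probs[j]   : estimated P(j|x) for class j
       loss[i][j] : cost of predicting class i when the true class is j"""
    risks = [sum(p * cost for p, cost in zip(probs, row)) for row in loss]
    return min(range(len(risks)), key=lambda i: risks[i])

# With the loss matrix of Table 1 (classes indexed 0 and 1):
# loss = [[0.0, 3.0], [100.5, 0.0]]
# min_risk_label([0.9, 0.1], loss) -> 0   (risk 0.3 vs. 90.45)
```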
For this strategy, there are two distinct steps: estimating the class probabilities and making the decision. No information about the loss function is used during the probability estimation process, and therefore there is no need to retrain the models if the loss function changes. It is important to observe that any loss matrix defines precise decision boundaries – points for which the minimum conditional risk is attained by two or more classification decisions. In general, for a K-class task the decision boundaries are defined by points x for which there exist at least two distinct class labels j and k such that (the class probabilities of x satisfy) R(j|x) = R(k|x); R(i|x) ≥ R(j|x), ∀i, 1 ≤ i ≤ K; Σ_{i=1}^{K} P(i|x) = 1. For two-class problems (with classes 0 and 1) with loss matrices L having zero values on the diagonal, the decision boundary is defined by β = P(0|x) = 1 − P(1|x) = L(0,1)/(L(0,1) + L(1,0)). If the estimate for an instance happens to be exactly on the decision boundary, the label of that instance is assigned by tossing a fair coin. This paper presents a new approach to the cost-sensitive learning problem that relies on a learned probabilistic model, but with the specific target of minimizing the cost incurred by the decisions rather than attempting to improve the overall quality of the probabilities. To achieve this target, our methods compute confidence estimates for the class probabilities, and make the decisions based on those estimates. The next section presents the challenges of using decision tree algorithms for learning probabilistic models. Section 3 describes the new methods for cost-sensitive learning and gives an overview of the random forest algorithms that are used. Section 4 presents an experimental analysis of the new methods. Section 5 summarizes the paper and draws the conclusions.
2 Decision Trees for Probability Estimation
Decision tree algorithms [8,27] are among the most popular tools for building classification models. Any decision tree D can be transformed into a class probability estimator. The probability estimate of class j for an arbitrary instance x is

\[
P(j \mid x) = \frac{N_j(D_x)}{N(D_x)} \tag{2}
\]
where D_x is the leaf of tree D that is reached by x, N(D_x) is the total number of training examples that are assigned to D_x, and N_j(D_x) is the number of training examples belonging to class j that reach leaf D_x. As noted by several researchers [29,5,26,6,25], the class probability estimates of decision trees are poor. There are three major factors that cause this deficiency. First, the greedy induction mechanism that splits the data into smaller and smaller sets leads to probability estimates that are computed based on very small samples, and this leads to inaccurate estimates. Second, most of the existing decision-tree induction algorithms focus on minimizing the number of misclassifications (through the purity-based heuristics) and on minimizing the size of the model (through the pruning procedure). This causes the learned models to compute class probabilities that are too extreme (i.e., close to 0.0 and 1.0), as in the example above, and therefore incorrect. The third factor is the shape of the decision tree hypotheses (piecewise linear decision boundaries). This kind of decision space assigns uniform probability values to points that are in the same region and will not differentiate between points that are closer to the boundary of the region and points that are farther from the boundary. Lately, several researchers have addressed the problem of improving the probability estimates computed by decision trees and other classification methods. One solution [4,30] is to apply a Laplace correction (or Dirichlet prior) as follows:

\[
P(j \mid x) = \frac{N_j(D_x) + \lambda_j}{N(D_x) + \sum_{i=1}^{K} \lambda_i} \tag{3}
\]
The Laplace correction [17,9] will smooth probability estimates that are too extreme because of the small size of the sample that reaches the leaf. This smoothing helps reduce the effects of the second cause for inaccurate estimates (extreme probabilities), described at the beginning of this section. To handle the other two sources of inaccuracy of tree-based probability estimates, one of the most effective techniques has proven to be the averaging of the probabilities computed by multiple models generated by Bagging [8]. Each of the models is trained using a bootstrap replicate [14] of the training data. Provost and Domingos [25] have developed one of the best tree-based class probability estimation algorithms by combining the Laplace correction and Bagging. They called the resulting method Bagged Probability Estimation Trees (B-PETs).
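As a concrete illustration of Equations (2) and (3), the short Python sketch below (not from the paper) computes raw and Laplace-corrected class probability estimates from the class counts at a leaf, using λ_j = 1 for every class.

    import numpy as np

    def leaf_probabilities(class_counts, lam=None):
        """Class probability estimates at a decision-tree leaf.

        class_counts : array of shape (K,), N_j(D_x) for each class j.
        lam          : optional array of Laplace/Dirichlet priors lambda_j;
                       if given, Equation (3) is used, otherwise Equation (2).
        """
        counts = np.asarray(class_counts, dtype=float)
        if lam is None:
            return counts / counts.sum()
        lam = np.asarray(lam, dtype=float)
        return (counts + lam) / (counts.sum() + lam.sum())

    print(leaf_probabilities([9, 0]))                  # raw: [1.0, 0.0] -- too extreme
    print(leaf_probabilities([9, 0], lam=[1.0, 1.0]))  # smoothed: [10/11, 1/11]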
3 Learned Probabilistic Models and Decision Making
The truth is that, while research efforts for improving the overall quality of probability estimates computed by different learning algorithms are worthwhile, making the best (cost-sensitive) classification decision does not always require highly accurate probabilities. For example, consider a (two-class) problem (for which the class labels are 0 and 1) that has a loss of fifty associated with mispredicting a positive (1) example and a loss of one associated with the opposite error. For an arbitrary instance x, any estimated probability value P(1|x) that falls in the interval (1.0/51.0, 1.0] will lead to diagnosing x as belonging to class 1. Therefore, if a system estimates that the likelihood of P(1|x) falling in the interval [0.8, 1.0] is 99%, we should be more confident classifying x into class 1 than in a situation in which the system estimates that the likelihood of P(1|x) > 1.0/51.0 is about the same as the likelihood of P(1|x) < 1.0/51.0. In other words, accurate probability estimates are sufficient but not necessary. In order to minimize the costs associated with different decisions, it is important, however, to know how much confidence we can have in the computed class probabilities, and, if possible, to use the confidence estimates to make better decisions, especially in the case of points that lie close to the decision boundary, or in the case of points that have a wide confidence interval for the probability estimates. Given that we are dealing with estimates of a variable (the class probability), these observations have led us to combine estimates of the shape of the distribution of the (probability) estimates with the loss function in a decision making procedure. Based on these observations, we propose the following decision making procedure for two-class problems. Let x be an arbitrary instance, and L the loss matrix. Let β (0 ≤ β ≤ 1) be the decision boundary defined by L. First, compute an estimate of the probability P(0|x, L) that the learner will output a class probability estimate P(0|x) that is smaller than β. Next, use the computed estimate to decide on the class of x by using Equation 1. Training a series of probability estimators provides a good means to empirically estimate the distribution of the class probabilities for an arbitrary instance. In particular we use Bagging to compute the estimates. The pseudo code of the procedure is presented in Table 2. We have called this generic procedure Confidence-based Probability Estimation (or C-PE) because it makes the classification decision based on the "confidence" in the probability estimates (given by the probability distribution of the probabilities) and their values relative to the decision boundary. One way of computing the probability from line [7] of the code is to approximate it by the proportion of models whose estimate is smaller than β. The second possibility is to compute the normal approximation of the distribution of the estimates N(p) and to assign

\[
P(0 \mid x, L) := \int_{0}^{\beta} N(p)\, dp.
\]
Table 2. Pseudo code for the proposed algorithm for making cost-sensitive decisions with Confidence-based Probability Estimation (C-PE)

Input: a set S of m labeled examples, S = <(x_i, y_i), i = 1, 2, ..., m>, with labels y_i ∈ Y = {0, 1};
       λ (a learning algorithm that computes class probability estimates);
       L (a loss matrix); x (an unlabeled example)

[1] for t = 1 to T do
[2]     S_t := (bootstrap) sample of S;
[3]     θ_t := Train λ(S_t);                           // the learned model
[4]     P_t(0|x) := N_0(θ_x) / N(θ_x);   P_t(1|x) := 1 − P_t(0|x);
[5] endfor
[6] β := DecisionBoundary(L);
[7] P(0|x, L) := Pr(P(0|x) < β);   P(1|x, L) := 1 − P(0|x, L);
[8] Output: h_CPE(x) = argmin_{y ∈ Y} Σ_{j=0}^{1} P(j|x, L) L(y, j)   // the optimal prediction with respect to L and P
As base learning algorithm λ we have used decision trees. However, given that we were not only interested in having unbiased estimates of the mean but also good estimates of the variance of the computed probabilities, we have explored adding different sources of randomness to the original tree learning algorithm: random split selection and random attribute selection. Breiman has proposed a unified view of these techniques under the name of Random Forests [7] and has analyzed them in the context of 0/1-loss classification. In the case of random splits, during the tree learning procedure, instead of selecting the best potential split, the algorithm will choose a split at random from among the N best potential splits. This procedure was introduced by Dietterich [10] and used for classification problems. For the random attribute selection procedure, at each node, a subset of size F of the attributes is selected at random and the best potential split (the one that gives the highest gain ratio) on those attributes is chosen. Amit and Geman [1] first explored this technique.
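For illustration only (this is not the author's implementation, which uses C4.5), the following Python sketch implements the counts-based variant of the C-PE decision of Table 2 on top of scikit-learn decision trees; the estimator choice and the assumption that the inputs are numpy arrays are ours.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def cpe_counts_predict(X_train, y_train, x, loss, T=100, rng=None):
        """Counts-based C-PE for a two-class problem (classes 0 and 1).

        loss[y, j] is the cost of predicting y when the true class is j;
        the diagonal of loss is assumed to be zero.
        """
        rng = np.random.default_rng(rng)
        beta = loss[0, 1] / (loss[0, 1] + loss[1, 0])   # decision boundary on P(0|x)
        below, n = 0, len(X_train)
        for _ in range(T):
            idx = rng.integers(0, n, size=n)            # bootstrap replicate
            tree = DecisionTreeClassifier().fit(X_train[idx], y_train[idx])
            p0 = tree.predict_proba([x])[0][list(tree.classes_).index(0)]
            below += (p0 < beta)                         # line [7] of Table 2
        conf = np.array([below / T, 1.0 - below / T])    # P(0|x,L), P(1|x,L)
        return int(np.argmin(loss @ conf))               # line [8] of Table 2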
4 Experimental Analysis
We have implemented the methods described in the previous section by using Quinlan’s C4.5 decision tree learning algorithm [27] as base learner. The first implementation of C-PE (denoted as Bag) grows each tree using the standard procedure. The second implementation (RS) selects randomly in each node a split from among the ten best splits. The third implementation selects at each
node a random subset of the attributes of size F. Two versions of this method were tested: RA-1 (F = 1) and RA-logN (F = log(N), where N is the number of attributes). Pruning was never used in the algorithms that were tested. We have also implemented Provost and Domingos' B-PET algorithm to compare the decisions made by the C-PE methods (relying on P(y|x, L)) with the decisions that rely on class probability estimates (P(y|x)). We have tested all algorithms on ten data sets (see Table 3). Except for the Donations-bin data set, all were drawn from the UC Irvine Repository [3]. Donations-bin is the binary version of the KDD Cup 1998 data [2] for which the goal is to determine whether a person has made a donation after a direct mail campaign. The format of the data is similar to the one used in other studies: seven attributes, 95412 instances for training and 96367 instances for testing. Unfortunately, these data sets do not have associated loss matrices L. Therefore, we generated loss matrices at random according to some loss matrix models. Table 4 describes four loss models, M1 through M4. The second column of the table describes how the misclassification costs were generated for the off-diagonal elements of L. In all cases, the costs are drawn from a uniform distribution over some interval. The diagonal values are always 0. Given that the new methods presented here were specifically designed to minimize the loss associated with the classification decisions, we have used the BDeltaCost paired test presented in [23]. Appendix A gives a more detailed description of the test. We have chosen to use the BDeltaCost test rather than ROC methods because the ROC methods give an overall measure of the quality of the rankings, whereas in our case we needed a statistical test for comparing models when the loss matrix is known. In other words, we focus on the analysis of the quality of the decisions of the different models.

Table 3. Data sets studied in this paper

Name                        Data Set Size    Evaluation Method
Donations-bin               95412/96367      test set
Breast cancer (Wis.)        699              10-fold xval
Breast cancer (Yug.)        286              10-fold xval
Hepatitis                   155              10-fold xval
Horse colic                 200              10-fold xval
King-rook vs. king-pawn     3196             10-fold xval
Labor negotiations          57               10-fold xval
Liver disease               345              10-fold xval
Sonar                       208              10-fold xval
Voting records              435              10-fold xval

Table 4. The models employed for generating the loss matrices used in the experiments. Unif[a, b] indicates a uniform distribution over the [a, b] interval. The diagonal elements of the loss matrices are always zero

Loss Model    L(i, j), i ≠ j
M1            Unif[0, 5]
M2            Unif[0, 7]
M3            Unif[0, 10]
M4            Unif[0, 20]
Performance was evaluated either by 10-fold cross validation or by using a test set (as noted in Table 3). For each cost model we generated ten loss matrices, and performed 10-fold cross validation on the Irvine ML data sets. This gives us 10 (matrices) × 4 (models) × 10 (folds) = 400 runs of the algorithms for each Irvine ML data set. In the case of the Donations-bin data, the evaluation was performed on the test set, resulting in 80 runs. For each of the runs, we performed the BDeltaCost statistical test to determine whether the learned models had statistically significantly different expected losses, based on the 95% confidence interval. Initially we set the number of bagging rounds to T = 100. We tested separately two versions of the C-PE algorithms. The first version (C-PE-counts) estimates P by using the counts of the individual computed class probabilities P on each side of the decision boundary. The results for the Donations-bin data are shown in Table 5. The results for the Irvine sets are presented in Table 6. Each cell of the tables represents the percentage of wins, ties, and losses (respectively) for the algorithms that are tested. For example, the cell in row RA-1, column B-PET of Table 6 indicates that when RA-1 and B-PET were compared, for 20.2% of the runs RA-1 outperformed B-PET, in 22.3% of the runs B-PET outperformed RA-1, and for 57.5% of the runs BDeltaCost could not reject the null hypothesis based on a 95% confidence interval. The second version of our algorithms (C-PE-normal) estimates P by computing the normal approximation N of P. The results for the Donations-bin data are presented in Table 7. The results for the Irvine sets are shown in Table 8. Next, we tested the influence of the size of the ensemble on the performance of the algorithms. We reran all experiments for T = 50 and T = 200. While the quality of all C-PE decisions was slightly worse (compared to the B-PETs) for T = 50, it improved for T = 200 only for the smaller Irvine data sets.
Table 5. Results on Donations-bin for C-PE-counts (T = 100)

           B-PET       Bag         RS          RA-1
RA-logN    20-75-5     20-75-5     20-80-0     20-80-0
RA-1       15-60-25    20-60-20    0-100-0
RS         15-60-25    20-65-15
Bag        0-80-20
Table 6. Results on the UCI data sets for C-PE-counts (T = 100)

           B-PET            Bag              RS               RA-1
RA-logN    44.8-42.2-11     17.4-52.5-30.1   17-59.8-23.2     48.2-43.4-8.4
RA-1       20.2-57.5-22.3   9.3-48.9-41.8    6.3-45.1-48.6
RS         42.1-48.7-9.2    22.6-51.9-25.5
Bag        43.2-48.6-8.3
Table 7. Results on Donations-bin for C-PE-normal (T = 100)

           B-PET       Bag         RS          RA-1
RA-logN    20-75-5     25-75-0     25-75-0     20-80-0
RA-1       15-60-25    20-60-20    5-95-0
RS         15-60-25    20-60-20
Bag        0-75-25
Table 8. Results on the UCI data sets for C-PE-normal (T = 100)

           B-PET            Bag              RS               RA-1
RA-logN    45.9-43.1-11     20.7-48.6-30.7   25-56.8-18.2     44-47-9
RA-1       18.9-58.1-23     9.7-51-39.3      8.7-45.8-45.5
RS         44.4-44.5-11.1   19.8-50.6-29.6
Bag        46.3-46.2-7.5
5 Summary and Conclusions
We have presented a new approach to cost-sensitive classification. The methods that we proposed make a decision not only based on an estimate of the mean of the probabilities computed by the models in the ensemble, but they employ the distribution of individual probability estimates of the classifiers together with the loss matrix. Instead of outputting the average of the individual estimates of the component classifiers the way B-PETs do, the C-PE algorithms compute an estimate of the distribution of class probabilities and make a decision based on that estimate and the loss function. C-PE provides a mechanism to make accurate cost-sensitive decisions even if accurate class probability estimates are hard or impossible to compute (because of inherent deficiencies of the algorithms, or because of the distribution of the data). C-PE is sensitive not only to the loss function, but also to the hypothesis learned by the base algorithm. In the case of the UCI data sets we can observe that the RA-logN, RS and Bag versions of C-PE outperform the Bagged Probability Estimation Trees (B-PET). However, in the case of the very large Donations data set, B-PET is marginally outperformed only by RA-logN and performs much better than Bag. This shows that for larger data sets, B-PET is able to compute more accurate probability estimates P(y|x), whereas in the case of smaller data sets the confidence-based estimates are better for different amounts of randomness. If we were to rank the C-PE methods based on the amount of randomness that they add to the procedure, RA-1 adds the largest amount, and the results show that this might lead to larger losses associated with the decisions. The best overall performance belongs to the RA-logN implementation of C-PE. This might be the case because it adds the right amount of randomness to the bagging procedure. It would be interesting to analyze the performance of RS for different values of the number of splits (among which the random selection is made).
The experiments also show that a larger value for T (the number of bagging rounds) helps improve the quality of the decisions on the smaller data sets.
6 Discussion
To our knowledge, the only decision making approach that has used a confidence measure for probability estimates was presented in the work of Pednault et al. [24]. Saar-Tsechansky and Provost [28] compute an estimate of the variance of the class probabilities for unlabeled examples to decide on the set of instances to be labeled next, within an active learning procedure. Preliminary experiments show that combining C-PE with uncertainty sampling in a cost-sensitive active learning procedure improves, in terms of the number of examples needed to achieve similar performance, over an active learning procedure that relies on probability estimates computed by B-PETs.
Acknowledgement

I would like to thank Roberto Altschul, Foster Provost, Claudia Perlich, Tom Dietterich, Rodney Tjoelker, and Pedro Domingos for their comments and discussions on confidence-based decision making and cost-sensitive learning. I would also like to thank Ed Pednault for earlier discussions we had on this topic.
References 1. Y. Amit and D. Geman. Shape quantization and recognition with randomized trees. Neural Computation, 9:1545–1588, 1997. 275 2. S. D. Bay. The UCI KDD archive. University of California, Irvine, Dept. of Information and Computer Sciences, 1999. [http://kdd.ics.uci.edu/]. 276 3. C. L. Blake and C. J. Merz. UCI repository of machine learning databases. University of California, Irvine, Dept. of Information and Computer Sciences, 1998. [http://www.ics.uci.edu/~mlearn/MLRepository.html]. 276 4. J. P. Bradford, C. Kunz, R. Kohavi, C. Brunk, and C. E. Brodley. Pruning decision trees with misclassification costs. In C. Nedellec and C. Rouveirol, editors, Lecture Notes in Artificial Intelligence. Machine Learning: ECML-98, Tenth European Conference on Machine Learning, Proceedings, volume 1398, pages 131–136, Berlin, New York, 1998. Springer Verlag. 272, 273 5. A. P. Bradley. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition, 30:1145–1159, 1997. 273 6. L. Breiman. Out-of-bag estimation. Technical report, Department of Statistics, University of California, Berkeley, 1998. 273 7. L. Breiman. Random forests. Technical report, Department of Statistics, University of California, Berkeley, 2001. 275 8. L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Wadsworth International Group, 1984. 272, 273
9. B. Cestnik. Estimating probabilities: A crucial task in machine learning. In L. C. Aiello, editor, Proceedings of the Ninth European Conference on Artificial Intelligence, pages 147–149, London, 1990. Pitman Publishing. 273 10. T. G. Dietterich. An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting and randomization. Machine Learning, 40(2):139–158, 2000. 275 11. P. Domingos. Metacost: A general method for making classifiers cost-sensitive. In Proceedings of the Fifth International Conference on Knowledge Discovery and Data Mining, pages 155–164, New York, 1999. ACM Press. 272 12. C. Drummond and R. C. Holte. Exploiting the cost (in)sensitivity of decision tree splitting criteria. In Machine Learning: Proceedings of the Seventeenth International Conference, pages 239–246, San Francisco, CA, 2000. Morgan Kaufmann. 272 13. R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. John Wiley and Sons, Inc. - Interscience, second edition, 2000. 272 14. B. Efron and R. J. Tibshirani. An Introduction to the Bootstrap. Chapman and Hall, New York, 1993. 273, 281 15. C. Elkan. The foundations of cost-sensitive learning. In Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence. Morgan Kaufmann Publishers, Inc., 2001. 271 16. W. Fan, S. J. Stolfo, J. Zhang, and P. K. Chan. AdaCost: Misclassification costsensitive boosting. In Machine Learning: Proceedings of the Sixteenth International Conference, pages 97–105, San Francisco, 1999. Morgan Kaufmann. 272 17. I. J. Good. The Estimation of Probabilities: An Essay on Modern Bayesian Methods. M. I. T. Press, Cambridge, Mass., 1965. 273 18. N. Japkowicz. The class imbalance problem: Significance and strategies. In Proceedings of the 2000 International Conference on Artificial Intelligence (IC-AI’2000), 2000. 272 19. M. Kukar and I. Kononenko. Cost-sensitive learning with neural networks. In Proceedings of the Thirteenth European Conference on Artificial Intelligence, Chichester, NY, 1998. Wiley. 272 20. T. Leonard and J. S. J. Hsu. Bayesian Methods, An Analysis for Statisticians and Interdisciplinary Researchers. Cambridge University Press, 1999. 272 21. D. D. Margineantu. Building ensembles of classifiers for loss minimization. In M. Pourahmadi, editor, Models, Predictions and Computing: Proceedings of the 31st Symposium on the Interface, volume 31, pages 190–194. The Interface Foundation of North America, 1999. 272 22. D. D. Margineantu. Methods for cost-sensitive learning. Technical report, Department of Computer Science, Oregon State University, Corvallis, OR, 2001. 271 23. D. D. Margineantu and T. G. Dietterich. Bootstrap methods for the cost-sensitive evaluation of classifiers. In Machine Learning: Proceedings of the Seventeenth International Conference, pages 583–590, San Francisco, CA, 2000. Morgan Kaufmann. 276, 281 24. E. P. D. Pednault, B. K. Rosen, and C. Apte. The importance of estimation errors in cost-sensitive learning. In Cost-Sensitive Learning Workshop Notes, 2000. 279 25. F. Provost and P. Domingos. Well-trained PETs: Improving probability estimation trees. Technical Report IS-00-04, Stern School of Business, New York University, 2000. 273 26. F. Provost, T. Fawcett, and R. Kohavi. The case against accuracy estimation for comparing classifiers. In Proceedings of the Fifteenth International Conference on Machine Learning, pages 445–453. Morgan Kaufmann, San Francisco, 1998. 273
27. J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Francisco, 1993. 273, 275 28. M. Saar-Tsechansky and F. Provost. Active learning for class probability estimation and ranking. In Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence, pages 911–917. AAAI Press/MIT Press, 2001. 279 29. P. Smyth, A. Gray, and U. Fayyad. Retrofitting decision tree classifiers using kernel density estimation. In Machine Learning: Proceedings of the Twelvth International Conference, pages 506–514, 1995. 273 30. B. Zadrozny and C. Elkan. Obtaining calibrated probability estimates from decision trees and naive bayesian classifiers. In Proceedings of the Eighteenth International Conference on Machine Learning, pages 609–616, San Francisco, CA, 2001. Morgan Kaufmann. 273
A BDeltaCost
BDeltaCost is a paired test that computes a confidence interval for the expected difference in cost of two classifiers. The test is based on the idea of the bootstrap [14], a computational method that is used for estimating the standard error of a parameter of an unknown distribution, based on a random sample S drawn from that distribution. The bootstrap works by drawing, with replacement, T samples from S, each consisting of a number of data values equal to the number of elements in S. The value of the parameter of interest is computed for each of these samples. The standard error is estimated by the sample standard deviation of the T replicates (also called bootstrap replicates). In a similar way, BDeltaCost tests the null hypothesis H0 that two classifiers have the same expected loss (on new test data) against the alternative hypothesis H1 that the two classifiers have different losses. The test draws repeated samples of the data and calculates the differences in loss for the two classifiers, sorts the resulting values in ascending order, and rejects the null hypothesis if 0 is not contained in the interval defined by the middle c% of the values, for a c% confidence interval (e.g., for a 95% confidence interval and T = 1000 the test will check the interval between the 26th and the 975th value). The way the test has been designed, Laplace corrections can be used to correct for zero values (that occurred because of the small size of the test set) in the confusion matrices. Margineantu and Dietterich [23] have shown that the BDeltaCost test works better and gives tighter confidence intervals than the standard tests based on the normal distribution.
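A minimal Python sketch of this bootstrap test is given below. It is an illustration of the idea rather than the authors' implementation: it resamples per-example cost differences instead of recomputing full confusion matrices, and it omits the Laplace correction mentioned above.

    import numpy as np

    def bdeltacost(costs_a, costs_b, T=1000, conf=0.95, rng=None):
        """Bootstrap test for the difference in expected cost of two classifiers.

        costs_a, costs_b : per-example costs incurred by classifiers A and B
                           on the same test set (computed from the loss matrix).
        Returns (reject_H0, lower, upper) for the central `conf` interval of
        the mean cost difference A - B.
        """
        rng = np.random.default_rng(rng)
        diffs = np.asarray(costs_a, dtype=float) - np.asarray(costs_b, dtype=float)
        n = len(diffs)
        means = np.array([diffs[rng.integers(0, n, size=n)].mean() for _ in range(T)])
        means.sort()
        lo = means[int(np.floor((1.0 - conf) / 2.0 * T))]
        hi = means[int(np.ceil((1.0 + conf) / 2.0 * T)) - 1]
        return (not (lo <= 0.0 <= hi)), lo, hi

For T = 1000 and conf = 0.95 this checks the interval between the 26th and the 975th sorted value, as described above.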
On-Line Support Vector Machine Regression

Mario Martin

Software Department, Universitat Politècnica de Catalunya
Jordi Girona 1-3, Campus Nord, C6., 08034 Barcelona, Catalonia, Spain
[email protected] Abstract. This paper describes an on-line method for building ε-insensitive support vector machines for regression as described in [12]. The method is an extension of the method developed by [1] for building incremental support vector machines for classification. Machines obtained by using this approach are equivalent to the ones obtained by applying exact methods like quadratic programming, but they are obtained more quickly and allow the incremental addition of new points, removal of existing points and update of target values for existing data. This development opens the application of SVM regression to areas such as on-line prediction of temporal series or generalization of value functions in reinforcement learning.
1 Introduction
Support Vector Machines, from now on SVM, [12] have been one of the most developed topics in Machine Learning in the last decade. Some reasons that explain this success are their good theoretical properties in generalization and convergence (see [2] for a review). Another reason is their excellent performance on some hard problems (see for instance [9,4]). Although SVMs are used mainly for classification tasks, they can also be used to approximate functions (what is called SVM regression). One problem that prevents a wider use of SVMs for function approximation is that, despite their good theoretical properties, they are not applicable on-line, that is, in cases where data is obtained sequentially and learning has to start from the first data. One paradigmatic example is the on-line prediction of temporal series. When new data arrive, learning has to begin from scratch. Neither are SVMs for regression suitable for problems where the target values of existing observations change quickly, for instance, in reinforcement learning [11]. In reinforcement learning, function approximation is needed to learn value functions, that is, functions that return for each state the future expected reward if the agent follows the current policy from that state. SVMs are not used to approximate value functions because these functions are continuously updated as the agent learns and changes its policy. At one time the estimated future reinforcement from state s is y, but later (usually very soon) a new estimate assigns another value to the same state. Using SVM regression in this case implies, again, learning from scratch.
One alternative is to adapt to the regression case one of the recent algorithms for incremental SVMs [3] and on-line learning [5,7,6], but such techniques return approximate solutions (though in some cases the error is bounded) and, more importantly, they do not allow removing data or updating target values. The only exception we know of is [1], but it is only described for classification tasks. In order to allow the application of SVMs for regression to these areas, this paper describes the first (to the best of our knowledge) exact on-line learning algorithm for SVM function approximation. The algorithm is based on three actions that allow, respectively, (1) incrementally adding new data to the SVM, (2) removing data from the SVM, and (3) updating target values for existing data in the SVM. The algorithm we propose is an extension of the work proposed in [1] for incremental SVM learning for classification tasks, but now applied to function approximation. In brief, the key idea of the algorithm consists in finding the appropriate Karush-Kuhn-Tucker (KKT) conditions for new or updated data by modifying its influence (β) in the regression function while maintaining consistency of the KKT conditions for the rest of the data used for learning. This idea is fully explained throughout the paper.
2 Reformulation
Specifically, we propose in this paper a method for the on-line building of ε-insensitive support vector machines for regression. The goal of this kind of machine is to find a function that presents at most ε deviation from the target values [12] while being as "flat" as possible. This version of SVM regression is appealing because not all vectors become support vectors, which is not the case in other approaches [10]. SVMs for regression are usually solved by resorting to a standard dualization method using Lagrange multipliers. The dual formulation for ε-insensitive support vector regression is to find values for α, α* that minimize the following quadratic objective function:

\[
W = \frac{1}{2} \sum_{ij} (\alpha_i - \alpha_i^*) Q_{ij} (\alpha_j - \alpha_j^*) - \sum_i y_i (\alpha_i - \alpha_i^*) + \varepsilon \sum_i (\alpha_i + \alpha_i^*) \tag{1}
\]

subject to the following constraints:

\[
0 \le \alpha_i, \alpha_i^* \le C \tag{2}
\]
\[
\sum_i (\alpha_i - \alpha_i^*) = 0 \tag{3}
\]

where Q is the positive-definite kernel matrix Q_{ij} = K(x_i, x_j), and ε > 0 is the maximum deviation allowed. Including in (1) a Lagrange multiplier for constraint (3), we get the following formulation:
\[
W = \frac{1}{2} \sum_{ij} (\alpha_i - \alpha_i^*) Q_{ij} (\alpha_j - \alpha_j^*) - \sum_i y_i (\alpha_i - \alpha_i^*) + \varepsilon \sum_i (\alpha_i + \alpha_i^*) + b \sum_i (\alpha_i - \alpha_i^*) \tag{4}
\]
with the first order conditions for W:

\[
g_i = \frac{\partial W}{\partial \alpha_i} = \sum_j Q_{ij} (\alpha_j - \alpha_j^*) - y_i + \varepsilon + b \tag{5}
\]
\[
g_i^* = \frac{\partial W}{\partial \alpha_i^*} = -\sum_j Q_{ij} (\alpha_j - \alpha_j^*) + y_i + \varepsilon - b = -g_i + 2\varepsilon \tag{6}
\]
\[
\frac{\partial W}{\partial b} = \sum_j (\alpha_j - \alpha_j^*) = 0 \tag{7}
\]
Renaming (α_i − α_i^*) to β_i for simplicity, we have:

\[
g_i = \frac{\partial W}{\partial \alpha_i} = \sum_j Q_{ij} \beta_j - y_i + \varepsilon + b \tag{8}
\]
\[
g_i^* = \frac{\partial W}{\partial \alpha_i^*} = -\sum_j Q_{ij} \beta_j + y_i + \varepsilon - b = -g_i + 2\varepsilon \tag{9}
\]
\[
\frac{\partial W}{\partial b} = \sum_j \beta_j = 0 \tag{10}
\]

2.1 Separation of Data
The first order conditions for W lead to the Karush-Kuhn-Tucker (KKT) conditions, which allow the reformulation of SVM regression by dividing the whole training data set D into the following sets: margin support vectors S (where gi = 0 or gi* = 0), error support vectors E (where gi < 0), error star support vectors E* (where gi* < 0), and the remaining vectors R. Specifically, centering on gi, the KKT conditions are:

- gi > 2ε       →  gi* < 0,        βi = −C,        i ∈ E*
- gi = 2ε       →  gi* = 0,        −C < βi < 0,    i ∈ S
- 0 < gi < 2ε   →  0 < gi* < 2ε,   βi = 0,         i ∈ R
- gi = 0        →  gi* = 2ε,       0 < βi < C,     i ∈ S
- gi < 0        →  gi* > 2ε,       βi = C,         i ∈ E

Figure 1 shows the geometrical interpretation of these sets in the feature space. Note that \sum_j Q_{ij} \beta_j + b − y_i is the error of the target value for vector i. Thus gi and gi* can be thought of as thresholds for the error on both sides of the ε-tube.
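For illustration only (not from the paper), the following Python sketch partitions the training vectors into S, E, E* and R from their g_i values and Equation (9), using a small tolerance for the equality tests.

    import numpy as np

    def partition_sets(g, eps, tol=1e-9):
        """Assign each vector to S, E, E* or R according to the KKT cases above.

        g   : array of g_i values (Equation 8); g_i* follows from g_i* = 2*eps - g_i.
        eps : the epsilon of the epsilon-insensitive tube.
        """
        S, E, E_star, R = [], [], [], []
        for i, gi in enumerate(g):
            gi_star = 2.0 * eps - gi
            if gi < -tol:
                E.append(i)                      # beta_i = C
            elif gi_star < -tol:
                E_star.append(i)                 # beta_i = -C
            elif abs(gi) <= tol or abs(gi_star) <= tol:
                S.append(i)                      # margin support vector
            else:
                R.append(i)                      # inside the tube, beta_i = 0
        return S, E, E_star, R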
Fig. 1. Decomposition of D following the KKT conditions into margin support vectors S, error support vectors E, error support vectors star E*, and remaining vectors R. Cross marks represent vectors in the feature space. S vectors are exactly on the margin lines, R vectors are inside the ε-tube (grey zone), and E and E* vectors are outside the ε-tube

The division of the data set into subsets and the characterization of the β values for each subset allow us to rewrite equations (8), (9) and (10), for all vectors i ∈ D, as follows:

\[
g_i = \sum_{j \in S} Q_{ij} \beta_j + C \sum_{j \in E} Q_{ij} - C \sum_{j \in E^*} Q_{ij} - y_i + \varepsilon + b \tag{11}
\]
\[
g_i^* = -g_i + 2\varepsilon \tag{12}
\]
\[
\sum_{j \in S} \beta_j + C|E| - C|E^*| = 0 \tag{13}
\]
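As a sanity check of Equations (11) and (12), here is a short Python sketch (illustrative, with assumed variable names) that evaluates g_i and g_i* for every vector given the kernel matrix, the β values of the margin support vectors, and the index sets.

    import numpy as np

    def margin_functions(Q, y, beta_S, S, E, E_star, b, eps, C):
        """g_i and g_i* for all vectors i (Equations 11 and 12).

        Q      : full kernel matrix, Q[i, j] = K(x_i, x_j)
        beta_S : beta values of the margin support vectors, aligned with S
        """
        g = Q[:, S] @ np.asarray(beta_S) \
            + C * Q[:, E].sum(axis=1) \
            - C * Q[:, E_star].sum(axis=1) \
            - y + eps + b
        return g, -g + 2.0 * eps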
3 On-Line Support Vector Regression
In order to build exact on-line support vector machines for regression, we need to define three incremental actions:

Add one new vector: One new observation xc is added to the data set D with the target value yc. This operation should include the corresponding vector in the feature space with the "exact" βc value, but without starting from scratch.

Remove one vector: One existing observation xc in D with target value yc is removed from the data set. The resulting SVM should be the same as one trained from scratch on D − {c}.
Update one vector: One existing observation xc in D with target value yc changes its target value to a new value. As in the previous cases, the resulting machine should be the same as one trained from scratch with exact methods.

In this section we will describe how these actions can be efficiently implemented. Addition and update actions consist in finding consistent KKT conditions for the vector being added or updated. Removal is based on diminishing the influence of the vector being removed on the regression tube until it vanishes.

3.1 Adding One New Vector
A new vector c is added by inspecting gc and gc*. If both values are positive, c is added as an R vector, because that means that the new vector lies inside the ε-tube (see the KKT conditions). When gc or gc* is negative, the new vector is added by setting its initial influence on the regression (βc) to 0. This value is then carefully modified (incremented when gc < 0 or decremented when gc* < 0) until its gc, gc* and βc values become consistent with the KKT conditions (that is, gc < 0 and βc = C, or gc* < 0 and βc = −C, or 0 < βc < C and gc = 0, or −C < βc < 0 and gc* = 0).

Modification of βc

Variations in the βc value of the new vector c influence the gi, gi* and βi values of the other vectors in D and thus can force the transfer of some vectors from one of the sets S, R, E or E* to another set. Such a transfer means that the gi, gi* and βi values of vector i are no longer consistent with the KKT conditions of the set to which vector i is currently assigned, but become consistent with the KKT conditions of another set. The modification of βc must take these transfers between sets into account. This section describes how the modification of βc influences the gi, gi* and βi values of the vectors in D while the sets S, E, E* and R remain constant. In the next section we describe how to deal with vector migrations between sets. From equations (11), (12), and (13) it is easy to calculate the variation in gi, gi* and βi when a new vector c with influence βc is added without migration of vectors between the sets S, E, E* and R:

\[
\Delta g_i = Q_{ic} \Delta\beta_c + \sum_{j \in S} Q_{ij} \Delta\beta_j + \Delta b \tag{14}
\]
\[
\Delta g_i^* = -\Delta g_i \tag{15}
\]
\[
\Delta\beta_c + \sum_{j \in S} \Delta\beta_j = 0 \tag{16}
\]
Note that while one vector remains in E, E ∗ or R sets, its β value does not change.
In particular, if the margin support vectors must remain in S, then Δg_i ≡ 0 for i ∈ S. Thus, if we isolate the Δβ_c terms in equations (14) and (16) for vectors i ∈ S, we get:

\[
\sum_{j \in S} Q_{ij} \Delta\beta_j + \Delta b = -Q_{ic} \Delta\beta_c \tag{17}
\]
\[
\sum_{j \in S} \Delta\beta_j = -\Delta\beta_c \tag{18}
\]

That, assuming S = {S_1, S_2, ..., S_l}, can be formulated in matrix form as follows:

\[
Q \cdot \begin{pmatrix} \Delta b \\ \Delta\beta_{S_1} \\ \vdots \\ \Delta\beta_{S_l} \end{pmatrix}
 = - \begin{pmatrix} 1 \\ Q_{S_1 c} \\ \vdots \\ Q_{S_l c} \end{pmatrix} \Delta\beta_c \tag{19}
\]

where Q is defined as:

\[
Q = \begin{pmatrix} 0 & 1 & \cdots & 1 \\ 1 & Q_{S_1 S_1} & \cdots & Q_{S_1 S_l} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & Q_{S_l S_1} & \cdots & Q_{S_l S_l} \end{pmatrix} \tag{20}
\]

From (19),

\[
\begin{pmatrix} \Delta b \\ \Delta\beta_{S_1} \\ \vdots \\ \Delta\beta_{S_l} \end{pmatrix}
 = -Q^{-1} \cdot \begin{pmatrix} 1 \\ Q_{S_1 c} \\ \vdots \\ Q_{S_l c} \end{pmatrix} \Delta\beta_c \tag{21}
\]

and thus,

\[
\Delta b = \delta \, \Delta\beta_c \tag{22}
\]
\[
\Delta\beta_j = \delta_j \, \Delta\beta_c \qquad \forall j \in S \tag{23}
\]

where

\[
\begin{pmatrix} \delta \\ \delta_{S_1} \\ \vdots \\ \delta_{S_l} \end{pmatrix}
 = -R \begin{pmatrix} 1 \\ Q_{S_1 c} \\ \vdots \\ Q_{S_l c} \end{pmatrix} \tag{24}
\]

and R = Q^{-1}. Equations (22) and (23) show how the variation in the βc value of a new vector c influences the βi values of vectors i ∈ S. The δ values are named coefficient
sensitivities from [1].¹ Note that the β values of vectors not in S do not change as long as these vectors do not migrate to another set. Thus, we can extend equation (23) to all vectors in D by setting δi ≡ 0 for i ∉ S. Now, we can obtain for vectors i ∉ S how gi and gi* change as βc changes. From equation (14), we replace Δβj and Δb by their equivalents in equations (22) and (23):

\[
\Delta g_i = Q_{ic} \Delta\beta_c + \sum_{j \in S} Q_{ij} \Delta\beta_j + \Delta b
 = Q_{ic} \Delta\beta_c + \sum_{j \in S} Q_{ij} \delta_j \Delta\beta_c + \delta \Delta\beta_c
 = \Big( Q_{ic} + \sum_{j \in S} Q_{ij} \delta_j + \delta \Big) \Delta\beta_c
 = \gamma_i \Delta\beta_c \qquad \forall i \notin S \tag{25}
\]

where

\[
\gamma_i = Q_{ic} + \sum_{j \in S} Q_{ij} \delta_j + \delta \tag{26}
\]
The γ values are named margin sensitivities and are defined only for non-margin support vectors, because for i ∈ S, Δgi = 0. As we have done with the coefficient sensitivities, if we extend equation (25) to all vectors in D, we must set γi ≡ 0 for i ∈ S. Equation (25) shows how gi changes as βc changes, but indirectly it also shows how gi* changes, because equation (15) states that Δgi* = −Δgi. Summarizing, equation (25) shows, for vectors not in S, how the gi and gi* values change as βc changes (note that their β value does not change). Equation (23) shows how βi for vectors i ∈ S changes as βc changes (note that Δgi and Δgi* are 0 for these vectors). Finally, equation (22) shows how b varies as βc changes. All these equations are valid as long as no vector migrates from one of the sets R, S, E or E* to another one. In some cases, however, in order to reach consistent KKT conditions for the new vector c, it may first be necessary to change the membership of some vectors: βc is modified in the right direction (increment or decrement) until one migration is forced, the migrating vector is moved and the sets S, E, E* and R are updated accordingly, and the variation of βc then continues.

Migration of Vectors between Sets

This section describes all possible kinds of migrations between the sets S, E, E* and R, and how they can be detected. One vector can migrate only from its current set to a neighboring set. Figure 1 shows the geometrical interpretation of each set, and from it we can infer the following possible migrations.
Note that [1] use the β symbol for representing this concept. As β is widely used in SVM regression as (α − α∗ ), we have decided to change the notation.
On-Line Support Vector Machine Regression
289
from E to S: One error support vector becomes a margin support vector. This migration can be detected when updating gi for i ∈ E following equation (25), gi (that was negative) becomes 0. The maximum variation in βc that does not imply migrations from E to S can be calculated as follows: The maximum ∆gi allowed for one vector i ∈ E is (0 − gi ), that is, from gi < 0 to gi = 0. From equation (25) we have, ∆βc = ∆gi γi−1 . Thus, the maximum variation allowed without the migration of vector i from E to S can be equated as: (0 − gi )γi−1 . Calculating this value for all vectors in E and selecting the minimum value, we obtain the maximum variation allowed in βc that does not force migration of vectors from E to S. from S to E: One margin support vector becomes an error support vector. This migration is detected when, updating βi for i ∈ S following equation (23), βi (that was 0 < βi < C) becomes C. Similarly to the previous case, from equation (23), ∆βc = ∆βi δi−1 . Thus, the maximum variation allowed without the migration of vector i from S to E can be formulated as: (C − βi )δi−1 . Calculating this value for all vectors in S and selecting the minimum value, we obtain the maximum variation allowed in βc that does not force migration of vectors from S to E. from S to R: One margin support vector becomes a remainder vector. This happens when updating βi for i ∈ S following equation (23), βi (that was 0 < βi < C or −C < βi < 0) turns into 0. The maximum variation allowed without the migration of vector i from S to R can be formulated as in the previous case as follows: (0 − βi )δi−1 . Calculating this value for all vectors in S and selecting the minimum value, we obtain the maximum variation allowed in βc that does not force migration of vectors from S to R. from R to S: One remainder vector becomes a margin support vector. This case is detected when the update of gi or gi∗ for i ∈ R (thus with gi > 0 and gi∗ > 0) causes that one value becomes 0. The maximum variation in βc that does not imply migrations from R to S is calculated by collecting (0 − gi )γi−1 and (0 − gi∗ )γi−1 for all vectors in R and selecting the minimum value. This is the maximum variation allowed in βc that does not force migration of vectors from R to S. from S to E ∗ : One margin support vector becomes an error support vector. This case is detected when, in the update of βi for i ∈ S the value changes from −C < βi < 0 to −C. The maximum variation in βc that does not imply migrations from S to E ∗ is calculated by collecting (−C − βi )δi−1 for all vectors in S and selecting the minimum value. from E ∗ to S: One error support vector becomes a margin support vector. This last case is detected when updating gi∗ for vectors i ∈ E ∗ , the value for one vector becomes gi∗ = 0. The maximum variation in βc that does not imply migrations from E ∗ to S is calculated by collecting (0 − gi∗ )γi−1 for all vectors in E ∗ and selecting the minimum value.
290
Mario Martin
The only memory resources required in order to monitorize KKT conditions fulfilled by vectors in D are: gi and gi∗ for vectors i ∈ S, and βi for vectors i ∈ S. In addition, in order to efficiently update these variables we also need to maintain Qij for i, j ∈ S –needed in equation (26)–, and R –needed in equation (24). Note that each possible migration is from S or to S and thus, after any migration, S must be updated. This implies that, in addition to the update of gi and gi∗ for vectors i ∈ S, and the update of βi for i ∈ βi , also matrixes Qij for i, j ∈ S and R, must be updated. To update matrix Q is easy because it only consists in adding/removing the row and column with the kernel values of the margin support vector added/removed. But the efficient update of matrix R is not obvious. In the following section we describe how to efficiently maintain matrix R. Updating R Matrix R is defined in (24) as the inverse of Q, which at the same time, is defined in (20). Note that we only need R for the update of β values, not Q. When one vector becomes a margin support vector (for instance due to a migration from another set) matrix Q should be updated and, thus, R should be updated too. The naive idea of maintaining Q and calculate its inverse to obtain R is expensive in memory and time resources. Instead of this, we will work on R directly. The updating procedure is an adaptation of the method proposed by [1] for classification to the regression problem. On one hand, when we are adding one margin support vector c, matrix R is updated as follows: δ 0 δS 1 . 1 .. R .. R := . · δ δ S 1 · · · δS l 1 + γ c 0 δS l 0 ··· 0 0 1
(27)
On the other hand, when margin support vector k is removed, matrix R is updated as follows: Rij := Rij − R−1 kk Rik Rkj
∀j, i = k ∈ [0..l]
(28)
where the index 0 refers to the b-term. Finally, to end the recursive definition of the R matrix updating, it remains to define the base case. When adding the first margin support vector, the matrix is initialized as follows: R := Q
−1
0 1 = 1 Qcc
−1
−Qcc 1 = 1 0
(29)
On-Line Support Vector Machine Regression
291
Procedure for Adding One New Vector Taking into account the considerations of the previous sections, the procedure for the incremental addition of one vector results as follows: 1. Set βc to 0 2. If gc > 0 and gc∗ > 0 Then add c to R and exit 3. If gc ≤ 0 Then Increment βc , updating β for i ∈ S and gi , gi∗ for i ∈ S, until one of the following conditions holds: - gc = 0: add c to S, update R and exit - βc = C: add c to E and exit - one vector migrates from/to sets E, E ∗ or R to/from S: update set memberships and update R matrix. Else {gc∗ ≤ 0} Decrement βc , updating β for i ∈ S and gi , gi∗ for i ∈ S, until one of the following conditions holds: - gc∗ = 0: add c to S, update R and exit - βc = −C: add c to E ∗ and exit - one vector migrates from/to sets E, E ∗ or R to/from S: update set memberships and update R matrix. 4. Return to 3 In this procedure, the influence on the regression of vector c to be added (βc ) is incremented until it reaches a consistent KKT condition. Increments in βc are done monitoring gi , gi∗ and βi of the whole set of vectors D. When one vector i does no longer fulfill the KKT conditions associated with the set where it was assigned, the vector is transferred to the appropriate set and variables are updated as necessary. This procedure always converges. The time cost to add one vector is linear in time with the number of vectors in D. The memory resources needed are quadratic in the number of vectors in S, because of matrix R. 3.2
Removing One Vector
The procedure for removing one vector from D uses the same principles that the procedure for adding one new vector. One vector c can be safely removed from D only when it does not have any influence on the regression tube. This only happens when the vector lies inside the ε-tube, or in other words, when βc = 0. If βc is not 0, the value must be incremented or decremented (depending on the sign of βc ) until it reaches 0. As in the case of adding one new vector, the
292
Mario Martin
modification of βc can change the membership to E, E ∗ , R and S of some other vectors in D. Thus, the modification of βc must be done carefully, keeping an eye on possible migrations of vectors between sets. The algorithm for the on-line removal of one vector is the following: 1. If gc > 0 and gc∗ > 0 Then remove c from R and exit 2. If gc ≤ 0 Then Decrement βc , updating β for i ∈ S and gi , gi∗ for i ∈ S, until one of the following conditions holds: - βc = 0: remove c from R and exit - one vector migrates from/to sets E, E ∗ or R to/from S: update set memberships and update R matrix. Else {gc∗ ≤ 0} Increment βc , updating β for i ∈ S and gi , gi∗ for i ∈ S, until one of the following conditions holds: - βc = 0: remove c from R and exit - one vector migrates from/to sets E, E ∗ or R to/from S: update set memberships and update R matrix. 3. Return to 2 As in the case of on-line addition of one vector, the procedure always converge. The time cost is linear in |D| while the memory cost is quadratic in |S|. 3.3
Updating Target Value for Existing Data
The obvious way to update the target value for one existing vector c in D consists in making good use of the previous actions. In order to update the pair < xc , yc > to < xc , y c > we can follow this procedure: 1. on-line removal of < xc , y c > 2. on-line addition of < xc , y c > Equations (8) and (9) show that the update of the target value yc changes gc and gc∗ . Thus, usually after an update, gc , gc∗ and βc values are no longer consistent with KKT conditions. An alternative and more efficient way of updating the target value consists in varying βc until it becomes KKT-consistent with gc and gc∗ like in the removal and addition cases. This procedure is described in [8].
On-Line Support Vector Machine Regression
4
293
Conclusions
In this paper, we have shown the first on-line procedure for building ε-insensitive SVMs for regression. An implementation of this method in Matlab is available at http://www.lsi.upc.es/~mmartin/svmr.html. The aim of this paper is to open the door to SVM function approximation for applications that receive training data in an incremental way, for instance on-line prediction of temporal series, and to applications where the target for the training data changes very often, for instance reinforcement learning. In addition to the on-line property, the proposed method presents some interesting features when compared with other exact methods like QP. First, the memory resources needed are quadratic in the number of margin support vectors, not quadratic on the total number of vectors. Second, empirical tests of the algorithm on several regression sets show comparable (or better) speeds in convergence, which means that the on-line learning procedure presented here is adequate even when the on-line property is not strictly required.
Acknowledgement I would thank to the people of the SVM seminar at the LSI department in the UPC for the enlightening discussions about SVM topics, specially to Cecilio Angulo for his comments and interest in this paper. I would also thank to Gert Cauwenberghs for making the incremental SVMc program available. This work has been partially supported by the Spanish CICyT project TIC 2000-1011.
References 1. G. Cauwenberghs and T. Poggio. Incremental and decremental support vector machine learning. In T. G. Dietterich T. K. Leen and V. Tresp, editors, Advances in Neural Infomation Processing Systems 13, pages 409–415. MIT Press, 2001. 282, 283, 288, 290 2. N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines. Cambridge University Press, 2000. 282 3. C. Domeniconi and D. Gunopulos. Incremental support vector machine construction. In N. Cercone, T. Lin, and X. Wu, editors, Proceedings of the 2001 IEEE Intl. Conference on Data Mining, pages 589–592. IEEE Computer Society, 2001. 283 4. S. Dumais, J. Platt, D. Heckerman, and M. Sahami. Inductive learning algorithms and representations for text categorization. In 7th International Conference on Information and Knowledge Management, ACM-CIKM98, pages 148–155, 1998. 282 5. C. Gentile. A new approximate maximal margin classification algorithm. In T. G. Dietterich T. K. Leen and V. Tresp, editors, Advances in Neural Information Processing Systems 13, pages 500–506. MIT Press, 2001. 283 6. T. Graepel, R. Herbrich, and R. Williamson. From margin to sparsity. In T. G. Dietterich T. K. Leen and V. Tresp, editors, Advances in Neural Information Processing Systems 13, pages 210–216. MIT Press, 2001. 283
294
Mario Martin
7. J. Kivinen, A. Smola, and R. Williamson. Online learning with kernels. In S. Becker T. G. Dietterich and Z. Ghahramani, editors, Advances in Neural Information Processing Systems 14. MIT Press, 2002. 283 8. M. Martin. On-line support vector machine for function approximation. Technical report, Universitat Polit`ecnica de Catalunya, Forthcomming. 292 9. E. Osuna, R. Freund, and F. Girosi. Training support vector machines: an application to face detection. In International Conference on Computer Vision and Pattern Recognition, CVPR97, pages 30–136, 1997. 282 10. A. Smola and B. Sch¨ olkopf. A tutorial on support vector regression. Technical Report NC2-TR-1998-030, NeuroCOLT2, 1998. 283 11. R. Sutton and A. Barto. Reinforcement Learning. MIT Press, 1998. 282 12. V. Vapnik. The nature of statistical learning theory. Springer Verlag, 1995. 282, 283
Q-Cut – Dynamic Discovery of Sub-goals in Reinforcement Learning Ishai Menache, Shie Mannor, and Nahum Shimkin Department of Electrical Engineering, Technion, Israel Institute of Technology Haifa 32000, Israel {imenache,shie}@tx.technion.ac.il
[email protected]
Abstract. We present the Q-Cut algorithm, a graph theoretic approach for automatic detection of sub-goals in a dynamic environment, which is used for acceleration of the Q-Learning algorithm. The learning agent creates an on-line map of the process history, and uses an efficient MaxFlow/Min-Cut algorithm for identifying bottlenecks. The policies for reaching bottlenecks are separately learned and added to the model in a form of options (macro-actions). We then extend the basic Q-Cut algorithm to the Segmented Q-Cut algorithm, which uses previously identified bottlenecks for state space partitioning, necessary for finding additional bottlenecks in complex environments. Experiments show significant performance improvements, particulary in the initial learning phase.
1
Introduction
Reinforcement Learning (RL) is a promising approach for building autonomous agents that improve their performance with experience. A fundamental problem of its standard algorithms, is that although many tasks can asymptotically be learned by adopting the Markov Decision Process (MDP) framework and using Reinforcement Learning techniques, in practice they are not solvable in reasonable time. “Difficult” tasks are usually characterized by either a very large state space, or a lack of immediate reinforcement signals. There are two principal approaches for addressing these problems: The first approach is to apply generalization techniques, which involve low order approximations of the value function (e.g., [14], [16]). The second approach is through task decomposition, using hierarchical or related structures. The main idea of hierarchical Reinforcement Learning methods (e.g., [4], [6], [18]) is to decompose the learning task into simpler subtasks, which is a natural procedure also performed by humans. By doing so, the overall task is “better understood” and learning is accelerated. A major challenge as learning progresses is to be able to automatically define the required decomposition, as in many cases the decomposition is not straightforward and cannot be obtained a-priori. One common way of defining subtasks (statically or dynamically) is in the state-space context (e.g., [7], [11], [15]): The learning agent identifies landmark states, which are worthwhile reaching, and learns sub-policies for that purpose. T. Elomaa et al. (Eds.): ECML, LNAI 2430, pp. 295–306, 2002. c Springer-Verlag Berlin Heidelberg 2002
296
Ishai Menache et al.
This approach relies on the understanding that the path towards achieving a complex goal is through intermediate stages which are represented by states. If those states are discovered, and the policy to reach them is separately learned, the overall learning procedure may become simpler and faster. The purpose of this work is to dynamically find the target states which may usefully serve as subgoals. One approach is to choose states which have a nontypical reinforcement (a high reinforcement gradient, for example, as in [7]). This approach is not applicable in domains which suffer from delayed reinforcement (for example, a maze with ten rooms and one goal). Another approach is to choose states based on their frequency of appearance (see [7] and also [11]). The rule of thumb here is that states that have been visited often should be considered as the target of subtasks, as the agent will probably repeatedly visit them in the future, and may save time having local policies for reaching those states. The latter approach is refined in [11] by adding the success condition to the frequency measure: States will serve as subgoals if they are visited frequently enough on successful but not on unsuccessful paths. These states are defined as bottlenecks in the state space, a term that we will adopt. A problem with frequency based solutions is that the agent needs excessive exploration of the environment in order to distinguish between bottlenecks and “regular” states, so that options are defined (and learned) at relatively advanced stages of the learning process. The algorithm that will be presented here is based on considering bottlenecks as the “border states” of strongly connected areas. If an agent knows how to reach the bottleneck states, and uses this ability, its search in the state space will be more efficient. The common characteristic of the methods that were presented above is that the criterion of choosing a state as a bottleneck is local, i.e., based on certain qualities of the state itself. We shall look for a global criterion that chooses bottlenecks by viewing all state transitions. The Q-Cut algorithm, which will be shortly presented, is based on saving the MDP’s history in a graph structure (where nodes represent states and arcs represent state transitions) and performing a Max-Flow/Min-Cut algorithm on that graph in order to find bottleneck states, which will eventually serve as the target of sub-goals. In order to understand the use of the Max-Flow/Min-Cut algorithm (see [1]) in the context of Reinforcement Learning, let us first briefly review the graph theoretic problem it solves. Consider a capacitated directed network G = (N, A) (N is the set of nodes and A is the set of arcs) with a non negative capacity cij associated with each arc (i, j) ∈ A. The Max-Flow problem is to determine the maximum amount of flow that can be sent from a source node s ∈ N to a sink node t ∈ N , without exceeding the capacity of any arc. An s-t cut is a set of arcs, the deletion of which disconnects the network into two parts, Ns and Nt , where s ∈ Ns and t ∈ Nt . The problem of finding the s-t cut with the minimal capacity among all s-t cuts is called the s-t Min-Cut problem. It is well known that the s-t Min-Cut problem and the Max-Flow problem are equivalent ([1]). There are quite a few algorithms for solving the Max-Flow problem. 
The running time is in general a low polynomial in the number of nodes and arcs, making the algorithms an attractive choice for solving a variety of optimization problems (see [1] for
Q-Cut – Dynamic Discovery of Sub-goals in Reinforcement Learning
297
G
G
Fig. 1. Simple two room mazes. Goal is marked as “G”. After reaching the goal, the agent is positioned somewhere in the left room
further details on the Max-Flow algorithms and associated applications), and recently also for enabling efficient use of unlabeled data in classification tasks (see [3]). The specific method which we will use in our experiments is PreflowPush (described in [8]), which has a time complexity of O(n3 ), where n is the number of nodes. We shall use the Min-Cut procedure for identifying bottlenecks. This process reflects a natural and intuitive characterization of a bottleneck: If we view an MDP as a flow problem, where nodes are states and arcs are state transitions, bottlenecks represent “accumulation” nodes, where many paths coincide. Those nodes separate different parts of the state space, and therefore should be defined as intermediate goal states to support the transition between loosely connected areas. In addition, using a global criterion enables finding the best bottlenecks considerably faster. In order to explain this claim, consider the upper maze of Figure 1. Assume the agent always starts in the left room, and Goal is located in the right room. If the agent visited the wide passage between the rooms, the MinCut algorithm will identify it as a bottleneck, even if the number of visits is low in comparison to frequently visited states of the left room. In addition, consider the lower maze of Figure 1. If, for example, the agent reached the goal a significant number of times and used one of the passages in most trials, the cut will still choose both passages as bottlenecks. In both cases, the efficient discovery of bottlenecks is used for forming new options, accelerating the learning procedure. After introducing the basic algorithm we suggest to use the cut procedure for recursive decomposition of the state space. By dividing the state space to segments the overall learning task is simplified. Each of these segments is smaller than the complete state space and may be considered separately. The paper is organized as follows: In Section 2 we describe the Reinforcement Learning setup, extended to use options. Section 3 presents the Q-Cut algorithm.
In Section 4 we extend the basic algorithm to the Segmented Q-Cut algorithm. Some concluding remarks are drawn in Section 5.
2 Reinforcement Learning with Options
We consider a discrete-time MDP with a finite set of states S and a finite set of actions A. At each time step t, the learning agent is in some state s_t ∈ S and interacts with the (unknown) environment by choosing an action a_t from the set of available actions at state s_t, A(s_t), causing a state transition to s_{t+1} ∈ S. The environment credits the agent for that transition through a scalar reward r_t. The goal of the agent is to find a mapping from states to actions, called a policy, which maximizes the expected discounted reward over time, E{ Σ_{t=0}^{∞} γ^t r_t }, where γ < 1 is the discount factor. A commonly used algorithm in RL is Q-Learning ([5]). The basic idea behind Q-Learning is to update the Q-function at every time epoch. This function maps every state-action pair to the expected reward for taking this action at that state and following an optimal strategy for all future states. It turns out that the learned Q-function directly approximates the optimal action-value function (asymptotic convergence is guaranteed under technical conditions, see [2]), without the need to explicitly learn a model of the environment. The formula for the update is:

Q(s_t, a_t) := Q(s_t, a_t) + α(n(t, s_t, a_t)) [ r_t + γ max_{a ∈ A(s_{t+1})} Q(s_{t+1}, a) − Q(s_t, a_t) ]
where α(n(t, s_t, a_t)) is the learning rate function, which depends on n(t, s_t, a_t), the number of appearances of (s_t, a_t) until time t. We now recall the extension of Q-Learning to Macro-Q-Learning (or learning with options, see [12] and [15]). Following an option means that the agent executes a sequence of (primitive) actions (governed by a "local" policy) until a termination condition is met. Formally, an option is defined by a triplet ⟨I, π, β⟩, where: I is the option's input set, i.e., all the states from which the option can be initiated; π is the option's policy, mapping states belonging to I to actions; β is the termination condition over states (β(s) denotes the termination probability of the option when reaching state s). When the agent is following an option, it must follow it until it terminates. Otherwise it can choose either a primitive action or initiate an option, if available (we shall use the notation A′(s_t) to denote all choices, i.e., the collection of primitives and options available at state s_t). Macro-Q-Learning [12] supplies a value for every combination of state and choice. The update rule for an option o_t, initiated at state s_t, becomes:

Q(s_t, o_t) := Q(s_t, o_t) + α(n(t, s_t, o_t)) [ r_t + γ r_{t+1} + ... + γ^{k−1} r_{t+k−1} + γ^k max_{a′ ∈ A′(s_{t+k})} Q(s_{t+k}, a′) − Q(s_t, o_t) ]

where k is the actual duration of o_t. The update rule for a primitive action remains the same as in standard Q-Learning.
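A minimal Python sketch of this update rule may make it concrete; it is our own illustration, not code from the paper, and the tabular Q representation, variable names, and parameter values are assumptions.

```python
from collections import defaultdict

GAMMA = 0.9

def macro_q_update(Q, s_t, o_t, rewards, s_next, choices_at_next, alpha):
    """One Macro-Q-Learning update for an option o_t taken in state s_t.

    rewards        : [r_t, r_{t+1}, ..., r_{t+k-1}] collected while the option ran
    s_next         : state s_{t+k} reached when the option terminated
    choices_at_next: primitive actions and options available at s_next (A'(s_{t+k}))
    alpha          : learning-rate value alpha(n(t, s_t, o_t))
    """
    k = len(rewards)
    discounted_return = sum((GAMMA ** i) * r for i, r in enumerate(rewards))
    best_next = max(Q[(s_next, a)] for a in choices_at_next)
    target = discounted_return + (GAMMA ** k) * best_next
    Q[(s_t, o_t)] += alpha * (target - Q[(s_t, o_t)])

# usage sketch: an option that ran for k = 3 steps
Q = defaultdict(float)
macro_q_update(Q, s_t=(1, 1), o_t="reach_doorway",
               rewards=[0.0, 0.0, 1.0], s_next=(3, 4),
               choices_at_next=["up", "down", "left", "right"], alpha=0.1)
```

With k = 1 and a primitive action in place of the option, the same code reduces to the standard Q-Learning update.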
3 The Q-Cut Algorithm
The basic idea of the Q-Cut algorithm is to choose two states, s and t, which will serve as source and target nodes for the Max-Flow/Min-Cut algorithm, and perform the cut. If the cut is "good" (we shall define a criterion for its quality), the agent establishes new options for reaching the discovered bottlenecks. The whole procedure is outlined in Figures 2 and 3. We add the details for the steps of the algorithm below. Choosing s and t: The procedure for choosing s and t is task dependent. Generally, it is based on some distance metric between states (e.g., states that are separated in time or in some state space metric), or on the identification of states with special significance (such as the start state or the goal state). In some cases, the choice of s and t is more apparent. Consider, for example, the mazes of Figure 1, under the following experiment: the agent tries to reach the goal in the right room, and when the goal is reached, the agent is transferred back to somewhere in the left room. A natural selection of s and t in this case is to choose s as one of the states in the "returning area" and t as the goal. The reason for this choice is that the agent is interested in the bottlenecks along its path from start to goal. Activating cut conditions: The agent may decide to perform a cut procedure at a constant rate, which is significantly lower than the actual experience frequency (in order to allow a meaningful change of the map of process history between sequential cuts), and might depend on the available computational resources. Another alternative is to perform a cut when good source and target candidates are found according to the procedure for choosing s and t.
Repeat:
– Interact with environment and learn using Macro-Q-Learning
– Save state transition history
– If activating cut conditions are met, choose s, t ∈ S and perform Cut Procedure(s, t)

Fig. 2. Outline of the Q-Cut Algorithm

Cut Procedure(s, t)
– Translate state transition history to a graph representation
– Find a Minimum Cut partition [Ns, Nt] between nodes s and t
– If the cut's quality is "good", learn the option for reaching the newly derived bottlenecks from every state in Ns, using Experience Replay

Fig. 3. The Cut Procedure
Building the graph from history: Each visited state becomes a node in the graph. Each observed transition i → j (i, j ∈ S) is translated into an arc (i, j) in the graph. We still need to determine the capacity of the arc. A few alternatives are possible. First, the capacity may be frequency based, which means setting the capacity of (i, j) to n(i → j), where n(i → j) stands for the number of transitions from i to j. Second, the capacity may be fixed, i.e., assigning a constant capacity (say of 1) to every transition, no matter how many times it occurred. The problem with the frequency-based definition is that we strengthen the capacity of frequently visited areas (e.g., early transitions near the source state, where the policy is actually random) over rarely visited areas (e.g., states that are visited just before performing the cut), thus making it more difficult to find the true bottlenecks. Fixed capacity is lacking in the sense that the same significance is attached to all transitions from some state i ∈ S, a deviation from the actual dynamics the agent faces. Our choice is a compromise between the two alternatives. The capacity is based on the relative frequency, i.e., the capacity of an arc (i, j) is set to the ratio n(i → j)/n(i), where n(i) is the number of visits at state i. Experiments show that capacity based on relative frequency achieves the best performance in terms of bottleneck identification.

Determining the cut's quality: The idea behind the design of the quality factor is that we are interested only in "significant" s-t cuts, meaning those with a small number of arcs (forming a small number of bottlenecks) on the one hand, and enough states in both Ns and Nt (s ∈ Ns and t ∈ Nt) on the other hand. Let |Ns| and |Nt| be the number of states in Ns and Nt, respectively. If |Ns| is too small, we need not bother defining an option from such a small set. On the other hand, if |Nt| is small, the area of states to which we wish to enable easy access will not contribute much to the overall exploration effort. In summary, we look for a small number of bottleneck states, separating significant, balanced areas of the state space. Based on the above analysis, the quality factor of a cut is the ratio-cut bipartitioning metric (see [9] and [17]). We define Q[Ns, Nt] = |Ns||Nt| / A(Ns, Nt), where A(Ns, Nt) is the number of arcs connecting the two sets, and consider cuts whose quality factor is above a predetermined threshold. The threshold may be determined beforehand based on appropriate analysis of the problem domain. It is also possible to change it in the course of learning (e.g., lower it if no "significant" cuts were found).

Learning an option: If the cut's quality is "good", then the minimal cut (i.e., a set of arcs) is translated into a set of bottleneck states by picking state j for each min-cut arc (i, j), with j ∈ Nt. After bottlenecks have been identified, the local policy for reaching each bottleneck is learned by an Experience Replay [10] procedure. Dynamic programming iterations are performed on all states belonging to Ns, using the recorded interaction with the environment. The bottleneck itself is given an artificial positive reward for the sake of policy learning.
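A small self-contained sketch (our own illustration, assuming the networkx library; the threshold value and the toy history are made up) shows how these choices fit together: relative-frequency capacities, the Min-Cut partition, the ratio-cut quality factor Q[Ns, Nt] = |Ns||Nt|/A(Ns, Nt), and the extraction of bottleneck states.

```python
from collections import Counter
import networkx as nx

def cut_procedure(history, s, t, quality_threshold=4.0):
    """history: list of observed transitions (i, j); s, t: source and target states."""
    visits = Counter(i for i, _ in history)
    counts = Counter(history)

    G = nx.DiGraph()
    for (i, j), n_ij in counts.items():
        G.add_edge(i, j, capacity=n_ij / visits[i])   # relative-frequency capacity

    _, (N_s, N_t) = nx.minimum_cut(G, s, t)
    cut_arcs = [(i, j) for i, j in G.edges if i in N_s and j in N_t]

    quality = len(N_s) * len(N_t) / max(len(cut_arcs), 1)   # ratio-cut quality factor
    if quality < quality_threshold:
        return None                                          # cut rejected
    return {j for _, j in cut_arcs}                          # bottleneck states

# usage on a toy history: a corridor 0-1-2-3-4 traversed repeatedly
history = [(0, 1), (1, 2), (2, 3), (3, 4)] * 5 + [(1, 0), (0, 1)] * 3
print(cut_procedure(history, s=0, t=4))
```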
3.1 Experimental Results
We illustrate the Q-Cut algorithm on the two simple grids of Figure 1. The experimental conditions are the same as in [11]: the agent starts each trial at a random location in the left room. It succeeds in moving in the chosen direction with probability 0.9. It receives a reward of 1 at the goal, and zero otherwise. The agent uses an ε-greedy policy, where ε = 0.1. The learning rate was also set to 0.1, and the discount factor γ to 0.9. A Cut Procedure was executed every 1000 steps, choosing t as the goal state and s as a random state in the left room. If the cut's quality was good according to the above-mentioned criterion, a new option was learned and added to the agent's set of choices. The performance of the Q-Cut algorithm is depicted in Figure 4, which presents a 50-run average of the number of steps to goal as a function of the episode. Comparing Q-Cut to standard Q-Learning (using the same learning parameters) emphasizes the strength of our algorithm: options, thanks to bottleneck discovery, are defined extremely fast, leading to a noticeable performance improvement within 2 to 3 episodes. In comparison, the frequency-based solution of [11] that was applied to the upper maze of Figure 1 yielded significant improvement only within about 25 episodes. As a consequence, the goal is found a lot faster than by other algorithms, with near-optimal performance reached within 20 to 30 episodes. In order to clarify the inner workings of Q-Cut, we added state frequency maps for both mazes, under Q-Cut and also under Q-Learning. Figure 5 presents "snapshots" taken after 25 episodes. Bright areas represent states which were visited often during the course of learning, while darker areas stand for less frequently visited states.
Fig. 4. Performance curves for Q-Cut compared to standard Q-Learning. The left graph presents simulation results for the upper maze of Fig. 1, the right graph presents simulation results for the lower maze of the same figure. The graphs depict the number of steps to goal vs. episode number (averaged over 50 runs)
Fig. 5. State frequency maps for both mazes of Fig. 1 (upper maps describe the upper maze of Fig. 1). All measurements are averaged over 50 runs, and were taken after 25 episodes. Brighter areas describe more frequently visited states. In both mazes, the Q-Learning agent suffers from exhaustive exploration of the left room, whereas the Q-Cut agent learns the right path towards the bottlenecks, and therefore the bottlenecks themselves are the most visited states of the environment

We conclude from the Q-Learning maps that the Q-Learning agent spent major effort in exploring the left room. Having discovered appropriate options, on the other hand, the Q-Cut agent wandered less in the left room and used shorter paths to the passages of the maze (which have the brightest color in the Q-Cut frequency maps). Being able to reach the right room efficiently, the agent learns the global policy for reaching the goal in less time, significantly improving performance.
4 The Segmented Q-Cut Algorithm
The Q-Cut algorithm works well when one bottleneck sequentially leads to the other (for illustration, imagine a wide hallway of sequential rooms, where adjacent rooms are separated by one or more doors). In general, if cuts are always performed on the entire set of visited states (which grows with time), the chances of finding good bottlenecks decrease. Consider the more complex maze of Fig. 6. To solve the above-mentioned problem, we may divide the state space into different segments, using bottlenecks that were already found. If, for example, the agent has found Bottlenecks 2 and 3, it may use them to divide the state space into two segments, where the first contains states from the two upper left rooms and the second contains all other states. In that way, cuts may be performed separately on each segment, improving the chances of locating other bottlenecks (Bottleneck 1, for example). The above idea is the basis for the Segmented Q-Cut algorithm. The agent uses the discovered bottlenecks (each of which may consist of a collection of states) as a segmentation tool. We use here a "divide and conquer" approach: work on smaller segments of states in order to find additional bottlenecks and define corresponding new options. The pseudo-code for the algorithm is presented in Figure 7. Instead of working with one set of states, Segmented Q-Cut performs cuts on the segments that were created, based on previously found bottlenecks. When a good quality cut is found (using the same criterion as in Section 3), the segment is partitioned into two new segments. New cuts will be performed in each of these segments separately. Before performing cuts, each segment is extended to include newly visited states belonging to the segment. The extension is achieved by a graph connectivity test (a simple O(nm) search in the graph, where n is the number of states and m is the number of arcs representing state transitions), where arcs that belong to a certain valid cut are removed for the connectivity testing procedure. Performing a cut procedure on a segment N means activating the Min-Cut algorithm on several (s, t) pairs, where the sources s ∈ S(N) are the bottleneck states leading to the segment. The targets are chosen as in the Q-Cut algorithm, based on some distance metric from their matching s.
Fig. 6. A six-room maze. In each episode the agent starts at a random location in the upper left room. Bottleneck states are numbered for illustration purposes
Initialize:
– Create an empty segment N0
– Include starting state s0 in segment N0
– Include starting state s0 in S(N0)
Repeat:
– Interact with environment and learn using Macro-Q-Learning
– Save state transition history
– For each segment N, if activating cut conditions are met: Cut Procedure(N)

Fig. 7. The Segmented Q-Cut algorithm

Cut Procedure(N)
– Extend segment N by connectivity testing
– Translate the state transition history of segment N to a graph representation
– For each s ∈ S(N), perform Min-Cut on the extended segment (s as source; the choice of t is task dependent)
– If the cut's quality is good (bottlenecks are found):
  • Separate the extended N into two segments Ns and Nt
  • Learn the option for reaching the bottlenecks from every state in Ns, using Experience Replay
  • Save the new bottlenecks in S(Nt)
Fig. 8. The Cut Procedure for a segment
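A possible rendering of this per-segment procedure is sketched below. It is our own illustration under simplifying assumptions: fixed capacities instead of relative frequencies, the networkx library for the graph operations, and a hypothetical choose_target function standing in for the task-dependent target selection.

```python
import networkx as nx

def segment_cut_procedure(history, segment, sources, choose_target, quality_threshold=4.0):
    """Illustrative cut procedure for one segment.

    history      : observed transitions (i, j)
    segment      : set of states currently assigned to the segment
    sources      : S(N), the bottleneck states leading into the segment
    choose_target: function mapping a source state to a target state (task dependent)
    """
    G = nx.DiGraph()
    for i, j in history:
        G.add_edge(i, j, capacity=1.0)   # fixed capacities for brevity; the paper
                                         # uses relative transition frequencies

    # Extend the segment by a connectivity test: collect states reachable from it.
    # (The paper additionally removes arcs of already valid cuts before this test.)
    extended = set(segment)
    for state in segment:
        if state in G:
            extended |= nx.descendants(G, state)

    sub = G.subgraph(extended).copy()
    for s in sources:
        t = choose_target(s)
        if s not in sub or t not in sub:
            continue
        _, (N_s, N_t) = nx.minimum_cut(sub, s, t)
        cut_arcs = [(i, j) for i, j in sub.edges if i in N_s and j in N_t]
        quality = len(N_s) * len(N_t) / max(len(cut_arcs), 1)
        if quality >= quality_threshold:
            bottlenecks = {j for _, j in cut_arcs}
            return N_s, N_t, bottlenecks   # the segment is split into Ns and Nt
    return None
```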
4.1 Experimental Results
The Segmented Q-Cut algorithm was tested on the six-room maze of Figure 6. The agent always started at a random location in the upper left room. Learning parameters were the same as in the experiments on the simple maze examples. Results in comparison with Q-Learning are summarized in Figure 9. Segmented Q-Cut has a clear advantage over Q-Learning. It is interesting to note when in the course of learning the agent found the real bottlenecks of the environment. On average, a first bottleneck was discovered in the middle of the first episode, the second at the beginning of the second episode, and the third in the middle of the same episode. This indicates a fast discovery and definition of subgoals (even before the goal location is known), which accelerates the learning procedure from its early stages.
5 Conclusion
The Q-Cut algorithm (and its extension to the Segmented Q-Cut algorithm) is a novel approach for solving complex Markov Decision Processes that are characterized by the lack of immediate reinforcement. Through very fast discovery of bottlenecks, the agent immediately sets its own sub-goals on-line.
Fig. 9. Performance curves for Segmented Q-Cut compared to standard Q-Learning for the six-room maze simulations. The graphs depict the number of steps to goal vs. episode number; results are averaged over 50 runs

By setting sub-goals in this way, exploration of different areas of the state space, which are weakly connected, becomes easier, and as a by-product learning is enhanced. The main strength of the algorithm is the use of global information: viewing the Markov Decision Process as a map of nodes and arcs is a natural perspective for determining the strategic states which may be worth reaching. The Min-Cut algorithm is used to efficiently find bottleneck states, which divide the observed state connectivity graph into two disjoint segments. Experiments on grid-world problems indicate the potential of the Q-Cut algorithm. The algorithm significantly outperforms standard Q-Learning on different maze problems. An underlying assumption of this work is that off-line computational power is at hand, while actual experience might be expensive. Also note that the cut procedure is computationally efficient and is required only once in a while. The distinctive empirical results motivate the application of the Q-Cut algorithm to a variety of problems where bottlenecks may arise. A car parking problem, a robot learning to stand up (see [13]), and some scheduling problems are all characterized by the existence of bottlenecks that must be reached in order to complete the overall task. Performance of the algorithm in different learning problems, specifically those with a large state space, is under current study. Additional algorithmic enhancements, such as alternative quality factors and a region merging mechanism, should also be considered.
Acknowledgements
This research was supported by the fund for the promotion of research at the Technion. The authors would like to thank Yaakov Engel and Omer Ziv for helpful discussions.
References

1. R. K. Ahuja, T. L. Magnati, and J. B. Orlin. Network Flows: Theory, Algorithms, and Applications. Prentice Hall Press, 1993.
2. D. P. Bertsekas and J. N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, 1995.
3. A. Blum and S. Chawla. Learning from labeled and unlabeled data using graph mincuts. In Proceedings of the 18th International Conference on Machine Learning, pages 19–26. Morgan Kaufmann, 2001.
4. P. Dayan and G. E. Hinton. Feudal reinforcement learning. In Advances in Neural Information Processing Systems 5. Morgan Kaufmann, 1993.
5. P. Dayan and C. Watkins. Q-learning. Machine Learning, 8:279–292, 1992.
6. T. G. Dietterich. Hierarchical reinforcement learning with the MAXQ value function decomposition. Journal of Artificial Intelligence Research, 13:227–303, 2000.
7. B. Digney. Learning hierarchical control structure for multiple tasks and changing environments. In Proceedings of the Fifth Conference on the Simulation of Adaptive Behavior: SAB 98, 1998.
8. A. V. Goldberg and R. E. Tarjan. A new approach to the maximum-flow problem. Journal of the ACM, 35(4):921–940, October 1988.
9. D. J. Huang and A. B. Kahng. When clusters meet partitions: A new density-based method for circuit decomposition. In Proceedings of the European Design and Test Conference, pages 60–64, 1995.
10. L. G. Lin. Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning, 8(3):293–321, 1992.
11. A. McGovern and A. G. Barto. Automatic discovery of subgoals in reinforcement learning using diverse density. In Proceedings of the 18th International Conference on Machine Learning, pages 361–368. Morgan Kaufmann, 2001.
12. A. McGovern, R. S. Sutton, and A. H. Fagg. Roles of macro-actions in accelerating reinforcement learning. In Proceedings of the 1997 Grace Hopper Celebration of Women in Computing, pages 13–18, 1997.
13. J. Morimoto and K. Doya. Acquisition of stand-up behavior by a real robot using hierarchical reinforcement learning. In Proceedings of the 17th International Conference on Machine Learning, pages 623–630. Morgan Kaufmann, 2000.
14. S. P. Singh, T. Jaakkola, and M. I. Jordan. Reinforcement learning with soft state aggregation. In Advances in Neural Information Processing Systems, volume 7, pages 361–368. The MIT Press, 1995.
15. R. S. Sutton, D. Precup, and S. Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112:181–211, 1999.
16. J. N. Tsitsiklis and B. Van Roy. An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42(5):674–690, 1997.
17. Y. C. Wei and C. K. Cheng. Ratio cut partitioning for hierarchical designs. IEEE/ACM Transactions on Networking, 10(7):911–921, 1991.
18. M. Wiering and J. Schmidhuber. HQ-learning. Adaptive Behavior, 6(2):219–246, 1997.
A Multistrategy Approach to the Classification of Phases in Business Cycles

Katharina Morik and Stefan Rüping

Univ. Dortmund, Computer Science Department, LS VIII
{morik,rueping}@ls8.informatik.uni-dortmund.de
http://www-ai.cs.uni-dortmund.de

Abstract. The classification of business cycles is a hard and important problem. Government as well as business decisions rely on the assessment of the current business cycle. In this paper, we investigate how economists can be better supported by a combination of machine learning techniques. We have successfully applied Inductive Logic Programming (ILP). For establishing time and value intervals, different discretization procedures are discussed. The rule sets learned from different experiments were analyzed with respect to correlations in order to find a concept drift or shift.
1 Introduction
The ups and downs of business activities have been observed for a long time. It is, however, hard to capture the phenomenon by a clear definition. The National Bureau of Economic Research (NBER) defines business cycles as "recurrent sequences of alternating phases of expansion and contraction in the levels of a large number of economic and financial time series." This definition points at the multi-variate nature of business cycles. It does not specify many of the modeling decisions to be made. There is still room for a variety of concepts.
– What are the indices that form a phase of the cycle? Production, employment, sales, personal income, and transfer payments are valuable indicators for cyclic economic behavior. Are there others that should be included?
– What is the appropriate number of phases in a cycle? The number of phases in a cycle varies in the various economic models from two to nine. The NBER model indicates two alternating phases. The transition from one phase to the next is given by the turning points trough and peak. In the model of the Rheinisch-Westfälisches Institut für Wirtschaftsforschung (RWI), a cycle consists of a lower turning point, an upswing, an upper turning point, and a downswing. Here, the turning points are phases that cover several months.
– Do all cycles follow the same underlying rules, or has there been a drift of the rules?
There are two tasks investigated by economic theory, the prediction and the dating problem. While the prediction of values of economic indicators is quite successfully handled by macro-economic equations [6], the dating problem remains a challenge. In this paper, we tackle the dating problem:
Dating: Given current (and past) business measurements, in which phase is the economy currently? In other words, the current measurements are to be classified as phases of a business cycle. The dating problem is solved in the United States of America by a board of experts, the NBER. The data on German business cycles are classified by experts as well. The aim is now to learn, from these classified business data, rules that state in which phase of the cycle a country is. This task is less clearly defined than the task of predicting business activities, because business cycles themselves are basically a theoretical model to explain the variation in business data. Linear discriminant analysis has been proposed as the baseline of empirical models.¹ Univariate rules were learned that used threshold values for separating phases. The accuracy of the 18 learned rules was 54% in cross validation. It has been investigated how the classification can be enhanced by the use of monthly data [8]. More sophisticated statistical models have been developed and achieved 63% accuracy [15]. The use of Hidden Markov Models led to developing two signals for an increase in the probability of a turning point [3]. The results cannot be transformed into classification accuracy.² Also, extensive experiments with other learning techniques (linear and quadratic discriminant analysis, neural networks, support vector machines) in [13] did not deliver a better accuracy. In summary, the results of statistical economics show how hard it is to classify business phases correctly. In this paper, we investigate the applicability of inductive logic programming to the problem of dating phases of a business cycle. ILP was chosen because the results can easily be interpreted by the experts, experts are allowed to enter additional economic knowledge into the rule set, and ILP automatically selects the relevant features. We were given quarterly data for 13 indicators concerning the German business cycle from 1955 to 1994 (see Figure 1), where each quarter had been classified as being a member of one of four phases [7]. The indicators are:
IE real investment in equipment (growth rate)
C real private consumption (growth rate)
Y real gross national product (growth rate)
PC consumer price index (growth rate)
PYD real gross national product deflator (growth rate)
IC real investment in construction (growth rate)
LC unit labour cost (growth rate)
L wage and salary earners (growth rate)
Mon1 money supply M1
RLD real long term interest rate
RS nominal short term interest rate
GD government deficit
X net exports
¹ Claus Weihs at a workshop on business cycles at the Rheinisch-Westfälisches Institut für Wirtschaftsforschung in January 2002.
² The signals precede or follow a turning point by 5 to 7 quarters of a year [3].
Fig. 1. Plot of the indicators Y, LC and L in two successive business cycles, starting with an upswing in quarter 82 (1976/1) and ending with a lower turning point in quarter 156 (1994/3)

We experimented with different discretizations of the indicator values (see Section 2.1). The discretization into ranges (levels) of values was also used in order to form time intervals. A sequence of measurements within the same range is summarized into a time interval. Relations between the different time intervals express precedence or domination of one indicator's level over another one's level. We also compared the two-phase with the four-phase business cycle. In summary, the following three models were inspected:
– business cycle with four phases, without time intervals (Section 2.2),
– business cycle with four phases, with time intervals (Section 2.3),
– business cycle with two phases, without time intervals (Section 2.4).
Particular attention was directed towards the appropriate sample size for the dating problem. The homogeneity of the data set of business cycles with two phases was investigated (Section 2.5).
2 Experiments on German Business Cycle Data
Our leading question was whether ILP can support economists in developing models for dating phases of the business cycle. Given the quarterly data for 13 indicators concerning the German business cycles from 1955 to 1994, where each quarter is classified as a member of one of four phases, we used all but one cycle for learning rules and tested the rules on the left-out cycle. The leave-one-cycle-out test assesses the accuracy (how many of the predicted classifications of quarters corresponded to the given classification) and the coverage (how many of the quarters received a classification by the learned rules). For ILP learning, we applied RDT [11] with the following rule schemata:

m1(Index1, Value, Phase): Index1(T, V), Value(V) → Phase(T)
m2(Index1, Value, Index2, Phase): Index1(T, V), Value(V), Index2(T, V) → Phase(T)
m3(Index1, Value1, Index2, Value2, Phase): Index1(T, V1), Value1(V1), Index2(T, V2), Value2(V2), opposite(V1, V2) → Phase(T)

The predicates that can instantiate the predicate variable Index are the 13 indicators of the economy (see above). The predicates that can instantiate the predicate variable Value express the discretization of the real values of the indicators. The phase variable can be instantiated by down, ltp, up, utp for the phases of downswing, lower turning point, upswing and upper turning point of a four-phase business cycle model, or by down, up for a business cycle model with two phases.
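Returning to the leave-one-cycle-out protocol described above, the evaluation can be summarized in a few lines. The sketch below is our own illustration: learn_rules and classify are hypothetical placeholders for RDT and for rule application, and accuracy is read here as being measured over the covered quarters only, which is one plausible reading of the text.

```python
def evaluate_leave_one_cycle_out(cycles, learn_rules, classify):
    """cycles: list of cycles, each a list of (quarter_facts, phase) pairs.
    learn_rules(train) -> rule set; classify(rules, facts) -> phase or None."""
    results = []
    for held_out in range(len(cycles)):
        train = [q for c, cycle in enumerate(cycles) if c != held_out for q in cycle]
        rules = learn_rules(train)
        predictions = [(classify(rules, facts), phase)
                       for facts, phase in cycles[held_out]]
        covered = [(p, y) for p, y in predictions if p is not None]
        coverage = len(covered) / len(predictions)
        accuracy = (sum(p == y for p, y in covered) / len(covered)) if covered else 0.0
        results.append((accuracy, coverage))
    return results
```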
2.1 Discretization
The goal of discretization is to provide the learning algorithm with data in a representation from which it can generalize maximally. Actually, two discretization tasks have to be solved:
– Discretization of values: split the continuous range of possible values into finitely many discrete values. For example, a gross national product of 9.21 in the third quarter could be expressed as the fact y(3, high) (note that the economic indicator Y is expressed as a predicate y and not as a variable in ILP).
– Interval segmentation: for a given time series, find a segmentation of the time points into maximal sub-intervals, such that the values of the series in each interval share a common pattern. For example, the time series of gross national products Y = (10.53, 10.10, 9.21, 5.17, 4.93) could be described as the temporal facts y(1, 3, high), y(4, 5, medium), but could also be described as y(1, 5, decreasing).
Interval segmentation can be viewed as discretization of the temporal values; therefore, in this section we will use the name discretization as a generic term for both discretization of values and interval segmentation.
The key in interval segmentation is to find a representation that is adequate for the learner. There are many representations for time series [12], e.g. as piecewise constant or piecewise linear functions [10], using template patterns [4], or as temporal rules [5, 9]. In our case, the data is already extensively pre-processed using economic knowledge (e.g. the gross national product was originally developed as a single indicator for the state of a national economy). Also, the data is given free of trends (as growth rates). It can be assumed that the relevant information lies in the value of the indicator alone. Hence, a representation of the given time series as piecewise constant functions seems appropriate. This has the additional advantage that the interval segmentation can easily be found by discretizing the attribute values and joining consecutive time points with identical discretization. To find a high-quality value discretization, we can use the information that is given by the class of the examples in addition to the distribution of the numerical values [17]. Our goal is to find a discretization of the indicators that already contains as much information about the cycle phase as possible. This directly leads to the use of information gain as the discretization criterion. In contrast to the usual approaches, we did not use an artificial criterion to determine the optimal number of discrete values, but used the number of interval segments induced by the discretization as our quality criterion. Using four discrete values usually led to a representation with a suitable number of facts. Note that this also deals with information gain's tendency to over-discretize the data, which was reported in [17]. A closer look at the resulting discretization showed that in certain cases the indicators had a very high variation, which leads to many intervals that contain only one time point. In this case, the relevant observation may not be the value of the indicator, but the fact that this indicator was highly varying, i.e. that no definite value can be assigned to it. This can be expressed by a new fact indicator(T1, T2, unsteady), which replaces the facts indicator(T1, T1+1, value_1), indicator(T1+1, T1+2, value_2), ..., indicator(T2−1, T2, value_n).
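The following sketch is our own illustration of the two preprocessing steps; the cut points, the four levels, and the simple single-quarter test for "unsteady" intervals are placeholders, not the information-gain procedure actually used in the paper.

```python
def discretize(values, cut_points):
    """Map each real value to one of the discrete levels induced by the cut points."""
    levels = ["low", "medium", "high", "very_high"]
    def level(v):
        for k, c in enumerate(cut_points):
            if v < c:
                return levels[k]
        return levels[len(cut_points)]
    return [level(v) for v in values]

def segment(indicator, discrete, min_len=2):
    """Join consecutive quarters with the same discrete value into interval facts."""
    facts, start = [], 0
    for t in range(1, len(discrete) + 1):
        if t == len(discrete) or discrete[t] != discrete[start]:
            facts.append((indicator, start, t, discrete[start]))
            start = t
    # Single-quarter intervals are relabelled "unsteady" here; the paper merges
    # consecutive runs of such intervals into one unsteady fact.
    return [(ind, s, e, val if e - s >= min_len else "unsteady")
            for ind, s, e, val in facts]

Y = [10.53, 10.10, 9.21, 5.17, 4.93, 1.2]
print(segment("y", discretize(Y, cut_points=[3.0, 6.0, 10.0])))
```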
2.2 Modeling Four Phases without Time Intervals
The data correspond to six complete business cycles, each with four phases. We tested our model by a kind of leave-one-out test, where in each turn a full cycle was left out (LOO1 to LOO6). For the upper and lower turning point phases, no rule could be learned. Only for the upswing did each learning run deliver rules. For the downswing, only two learning runs, namely leaving out cycle 3 and leaving out cycle 5, delivered rules. Misclassifications at the turning points are strikingly more frequent than in the other phases. Figure 2 shows the results. On average, the results miss even the baseline of 54%. Leaving out the fifth cycle (from 1974 until 1982) delivers the best result, where both accuracy and coverage approach 70%. This might be due to its length (32 quarters), since also in the other experiment dealing with four phases the prediction of the upper turning point and upswing is best when leaving out the fifth cycle. Since the sixth cycle is even longer (45 quarters), we would expect the best results in LOO6, which is true for the accuracy in this experiment. In the other experiment with four phases, the accuracy is best for the upswing in LOO6 and second best for it in LOO5.
Cycle     Accuracy   Coverage   No. of learned rules
LOO1      0.125      0.25       13 upswing
LOO2      0.5        1.0        12 upswing
LOO3      0.462      0.462      10 upswing, 2 downswing
LOO4      0.375      1.0        11 upswing
LOO5      0.696      0.696      10 upswing, 1 downswing
LOO6      1.0        0.36       1 upswing
Average   0.526      0.628      total: 60

Fig. 2. Results in the four phase model using time points
Fig. 3. The temporal relations contains and overlaps
2.3 Modeling Four Phases with Time Intervals
Let us now see whether time intervals can improve the results. We have used the discretization of the indicator values for the construction of time intervals (see Section 2.1). We end up with facts of the form Index(I, Range) and, for each time point within the time interval I, a fact stating that this time point T (i.e. quarter) lies in the time interval I: covers(I, T). We then described the relations between different time intervals by means of Allen's temporal logic [2]. From the 13 possible relationships between time intervals, we chose contains and overlaps. The relation contains(I1, I2) denotes a larger interval I1 in which somewhere the interval I2 starts and ends. contains(I1, I2) is true for each time point within the larger interval I1. overlaps(I1, I2) is true for each time point of the interval I1, which starts before I2 starts (see Figure 3). We left out the other possible relations because they were either too general or too specific to be used in a classification rule, or would violate the constraint that only information about past events can be used in the classification (a relation requiring that the end point of one interval be identical to the starting point of another interval would be too specific; a relation only requiring that one interval happen before another, regardless of the amount of time in between, would be too general). The time intervals were calculated before the training started. The rule schemata were defined such that they link two indicators with their corresponding time intervals. One rule schema is more specialized in that it requires the time intervals of the two indicators to either overlap or include each other. This more specific rule schema was intended to find rules for the turning phases, for which no rules were learned in the previous experiment. In fact, rules for the upper turning point, upswing, and downswing were learned, but no rules could be learned for the lower turning point. Another intention behind the time interval modeling was to increase the accuracy of the learned rules. Indeed, rules for the upper turning point could be learned with an average accuracy of 75% in the leave-one-cycle-out runs. However, the accuracy for the upswing decreased to 34% on average. Hence, overall the time interval model did not enhance the results of the time point model as much as we expected (see Figure 4).
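The two temporal relations amount to a few comparisons of interval endpoints. The sketch below is our own rendering, not code from the study; intervals are taken as (start, end) pairs of quarters.

```python
def contains(i1, i2):
    """I1 contains I2: I2 starts and ends strictly inside I1."""
    return i1[0] < i2[0] and i2[1] < i1[1]

def overlaps(i1, i2):
    """I1 overlaps I2: I1 starts before I2 and ends inside I2."""
    return i1[0] < i2[0] < i1[1] < i2[1]

# usage: interval facts for two indicators, e.g. ie over (80, 95) and y over (85, 92)
i1, i2 = (80, 95), (85, 92)
print(contains(i1, i2), overlaps(i1, i2))   # True False
```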
Cycle LOO1
Phase upswing downswing utp ltp LOO2 upswing downswing utp ltp LOO3 upswing downswing utp ltp LOO4 upswing downswing utp ltp LOO5 upswing downswing utp ltp LOO6 upswing downswing utp ltp Average upswing downswing utp ltp
Accuracy Coverage 0.167 1 0 0 0 0 0 0 0 0.461 1 1 0.200 0 0 0 0.167 1 0.333 1 0 0 0.481 1 0 0 0 0.75 0.857 0.667 0.296 0.243 1 0 0 0.388 0.716 0.104 0.500 0 0 0.75 0.143
No. learned rules 73 1 0 2 103 3 2 0 87 2 2 2 59 7 0 4 88 3 0 4 6 2 0 0 69.3 3 0.667 2
Fig. 4. Results in the four phase model using time intervals
2.4 Modeling Two Phases
Theis and Weihs [14] have shown that in clustering analyses of German macroeconomic data at most three clusters can be identified. The first two clusters correspond to the cycle phases of upswing and downswing, and the eventual third cluster corresponds to a time period around 1971. This suggests that two phases instead of four may be better suited for the description of business data. It also points at a concept drift (see Section 2.5). In our third experiment we mapped all time points classified as upper turning point to upswing and all quarters of a year classified as lower turning point to downswing. We then applied the rule schemata of the first experiment. An example of the learned rules is:
ie(T, V1), low(V1), c(T, V2), high(V2) → down(T)
stating that a low investment in equipment together with high private consumption indicates a downswing. Again, leaving out the fifth or the sixth cycle gives the best results in the leave-one-cycle-out test. Accuracy and coverage are quite well balanced (see Figure 5). These learning results are promising. They support the hypothesis that a two-phase model is more appropriate for the dating task. Concerning the selection of indicators, the learning results show that all indicators contribute to the dating of the phase. However, the short term interest rate does not occur in three of the rule sets. Consumption (both the real value and the index), net exports, money supply, government deficit, and the long term interest rate are missing in at least one of the learned rule sets. For the last four cycles, i.e. leaving out cycle 1 or cycle 2, some indicators predict the upswing without further conditions: a high or medium number of salary earners (l), high or medium investment in equipment (ie), high or medium investment in construction (ic), medium consumption (c), and the real gross national product (y). It is interesting to note that a medium or high real gross national product alone classifies data into the upswing phase only when leaving out cycle 1, 2, or 4. Since RDT performs a complete search, we can conclude that in the data of cycles 1 to 4 the gross national product alone does not determine the upswing phase. Further indicators are necessary there, for instance money supply (mon1) or the consumer price index (pc).
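To illustrate how such a learned rule would be applied to date a quarter, consider the following sketch. It is our own illustration: the dictionary-based fact representation is an assumption of ours, and the two rules shown paraphrase rules discussed in this paper.

```python
# Discretized facts for one quarter T: indicator -> level
quarter = {"ie": "low", "c": "high", "y": "medium", "rld": "low", "l": "low"}

# Learned rules: (antecedent, phase); an antecedent is a set of (indicator, level) pairs
rules = [
    ({("ie", "low"), ("c", "high")}, "down"),   # ie(T,V1), low(V1), c(T,V2), high(V2) -> down(T)
    ({("rld", "low"), ("l", "low")}, "down"),   # rld(T,V), l(T,V), low(V) -> down(T)
]

def classify(facts, rules):
    for antecedent, phase in rules:
        if all(facts.get(ind) == level for ind, level in antecedent):
            return phase
    return None   # the quarter is not covered by any rule

print(classify(quarter, rules))   # down
```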
Cycle     Accuracy   Coverage   No. of learned rules
LOO1      0.8125     0.795      9 upswing, 69 downswing
LOO2      0.588      1.0        17 upswing, 35 downswing
LOO3      0.823      0.571      2 upswing, 15 downswing
LOO4      0.8        0.35       6 upswing, 8 downswing
LOO5      0.869      0.8        10 upswing, 39 downswing
LOO6      1.0        0.701      6 upswing, 41 downswing
Average   0.815      0.703      total: 50 upswing, 207 downswing

Fig. 5. Results in the two phase model using time points
2.5 Concept Shift
Starting from the two-phase model, we analyzed the homogeneity of the business cycle data. The learning results from the different leave-one-cycle-out experiments were inspected with respect to their correlation. If the same rule is learned in all experiments, this means that the underlying principle did not change over time. If, however, rules co-occur only in the first cycles or in the last cycle, we hypothesize a concept drift in business cycles. We used the correlation analysis of the APRIORI algorithm [1, 16]. We want to know whether some rules are learned in all training sets, or, at least, whether there are rules that are more frequently learned than others. Enumerating all learned rules, we get a vector for each training set (corresponding to a transaction in APRIORI) where each learned rule is marked by 1 and the others are set to 0. The frequency of learned rules and their co-occurrence is identified. There is no rule which was learned in all training sets. Eight rules were learned from three training sets. No co-occurrence of learned rules could be found. There is one rule which was learned in four training sets, namely leaving out cycle 1, cycle 4, cycle 5, or cycle 6:
rld(T, V), l(T, V), low(V) → down(T)
stating that the real long term interest rate and the number of wage and salary earners being low indicates a downswing. We now turn the question around and ask: which training sets share rules? To answer this question, a vector for each learned rule is formed where those training sets are marked by 1 which delivered the rule.
– Eighteen rules were shared by the training sets leaving out cycle 5 and leaving out cycle 6. Four of the rules predict an upswing, fourteen rules predict a downswing. This means that cycles 1 to 4 have the most rules in common. The data from the last quarter of 1958 until the third quarter of 1974 are more homogeneous than all the data from 1958 until 1994.
– When leaving out cycle 1 or cycle 2, eleven rules occur in both learning results. This means that cycles 3 to 6 have the second most rules in common. The data from the second quarter of 1967 until the end of 1994 are more homogeneous than all data together.
The rule set analysis shows that cycles 1 to 4 (1958–1974) and cycles 3 to 6 (1967–1994) are more homogeneous than the overall data set. We wonder what happened in cycles 3 and 4. The first oil crisis happened at the end of cycle 4 (November 1973 – March 1974). This explains the first finding well. It shows that our rule set analysis can indeed detect concept drift where we know that a drift occurred. However, the oil crisis cannot explain why cycles 3 to 6 share so many rules. The second oil crisis occurred within cycle 5 (1979 – 1980). We assume that the actual underlying rules of business cycles may have changed over time. The concept drift seems to start in cycle 3. The periods of cycles 1 and 2 (1958 – 1967) are characterized by the reconstruction after the world war. Investment in construction (ic) and in equipment (ie) is not indicative in this period, since it is rather high anyway. A low number of earners (l) together with
a medium range of the gross national product deflator (pyd) best characterizes the downswing in cycles 1 to 3 – this rule has been found when leaving out cycles 4 or 5 or 6. Since the unemployment rate was low after the war, it is particularly expressive for dating a phase in that period. This explains the second finding of our rule set analysis.
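The correlation analysis itself reduces to counting, for each rule, the training sets in which it was learned and, for each pair of training sets, the rules they share. The sketch below is our own rendering of that bookkeeping; the rule identifiers and sets are invented for illustration, and it is not the APRIORI implementation actually used.

```python
from itertools import combinations

# rules learned in each leave-one-cycle-out run (hypothetical identifiers)
learned = {
    "LOO1": {"r1", "r2", "r7"},
    "LOO2": {"r2", "r3"},
    "LOO3": {"r2", "r4"},
    "LOO4": {"r1", "r5"},
    "LOO5": {"r1", "r5", "r6"},
    "LOO6": {"r1", "r5", "r6"},
}

# how often each rule was learned across the training sets
all_rules = set().union(*learned.values())
frequency = {r: sum(r in s for s in learned.values()) for r in all_rules}

# which pairs of training sets share the most rules
shared = {(a, b): len(learned[a] & learned[b])
          for a, b in combinations(sorted(learned), 2)}

print(max(frequency.items(), key=lambda kv: kv[1]))   # most frequently learned rule
print(max(shared.items(), key=lambda kv: kv[1]))      # pair of runs sharing most rules
```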
3 Conclusion and Further Work
ILP can be applied to the problem of classifying the phases of a business cycle with a performance that is comparable to state-of-the-art statistical methods like linear discriminant analysis, quadratic discriminant analysis, support vector machines, or neural nets [13]. There is evidence that the high error rate, compared to other classification problems, is a result of the four-phase model of business cycles. The two-phase model seems to fit the data much better. Machine learning techniques in concert have answered the questions that were our starting point (see Section 1).
– ILP offers opportunities for the analysis of business cycle data. It is easy to interpret the results, so that the learned rules can be easily inspected by economists. The multi-variate nature of ILP and the automatic selection of the most relevant indicators fit the needs of the dating problem.
– The two-phase model of the business cycle clearly outperformed the four-phase model. Where the best average accuracy in the four-phase model was 53%, the average accuracy of the two-phase model was 82%.
– Rule set analysis in terms of correlations between training set results shows that cycles 1–4 (1958–1974), i.e. leaving out cycle five or cycle six, had more rules in common than other cycles. The second most common rules were found when leaving out the first or the second cycle, i.e. with training on cycles 3–6 (1967–1994). Both findings can be explained in economic terms.
The results could well be further enhanced. We used discretization in a straightforward manner by creating the interval segmentation based on the discretization of values. This can be extended by using some of the work of [10, 4, 5, 9]. However, in many of these approaches it is unclear how the resulting discretization can be interpreted. For our application, understandability is a main goal. The partitioning into two phases was very simple. A more sophisticated split within the upper and the lower turning phase, respectively, should lead to enhanced accuracy. Concept drift could be the reason for not reaching the level of accuracy that is often achieved in other domains. Training separately on cycles 4 to 6 and restricting the leave-one-cycle-out testing to these cycles could enhance the learning results. Finally, ILP allows a close cooperation with economists, who can easily inspect the learned rules, inspect contradictions of the model to the data, and add further background knowledge to the model. This makes ILP a very suitable tool for working on the validation/falsification of economic theories.
Acknowledgments
This work has been partially sponsored by the Deutsche Forschungsgemeinschaft (DFG) collaborative research center 475 "Reduction of Complexity for Multivariate Data Structures". The authors thank Ullrich Heilemann, vice president of the Rheinisch-Westfälisches Institut für Wirtschaftsforschung, for data of high quality and many valuable suggestions. We also thank Claus Weihs and Ursula Sondhauss for raising our interest in the task and providing insight into its statistical nature.
References

[1] R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large data bases. In Proceedings of the 20th International Conference on Very Large Data Bases (VLDB '94), pages 478–499, Santiago, Chile, September 1994.
[2] J. F. Allen. Towards a general theory of action and time. Artificial Intelligence, 23:123–154, 1984.
[3] Marlene Amstad. Konjunkturelle Wendepunkte: Datierung und Prognose. St. Gallen, 2000.
[4] Donald J. Berndt and James Clifford. Finding patterns in time series: A dynamic programming approach. In Usama M. Fayyad, Gregory Piatetsky-Shapiro, Padhraic Smyth, and Ramasamy Uthurusamy, editors, Advances in Knowledge Discovery and Data Mining, chapter 3, pages 229–248. AAAI Press/The MIT Press, Menlo Park, California, 1996.
[5] Gautam Das, King-Ip Lin, Heikki Mannila, Gopal Renganathan, and Padhraic Smyth. Rule discovery from time series. In Rakesh Agrawal, Paul E. Stolorz, and Gregory Piatetsky-Shapiro, editors, Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining (KDD-98), pages 16–22, New York City, 1998. AAAI Press.
[6] Rheinisch-Westfälisches Institut für Wirtschaftsforschung. Arbeitsbericht 2000. Rheinisch-Westfälisches Institut für Wirtschaftsforschung, Essen, Germany, 2000.
[7] U. Heilemann and H. J. Münch. West German business cycles 1963–1994: A multivariate discriminant analysis. CIRET-Conference in Singapore, CIRET-Studien 50, 1996.
[8] U. Heilemann and H. J. Münch. Classification of German business cycles using monthly data. SFB-475 Technical Report 8/2001, Universität Dortmund, 2001.
[9] Frank Höppner. Learning temporal rules from state sequences. In Miroslav Kubat and Katharina Morik, editors, Workshop notes of the IJCAI-01 Workshop on Learning from Temporal and Spatial Data, pages 25–31, Menlo Park, CA, USA, 2001. IJCAI, AAAI Press. Held in conjunction with the International Joint Conference on Artificial Intelligence (IJCAI).
[10] Eamonn Keogh, Selina Chu, David Hart, and Michael Pazzani. An online algorithm for segmenting time series. In Nick Cercone, T. Y. Lin, and Xindong Wu, editors, Proceedings of the 2001 IEEE International Conference on Data Mining, pages 289–296, San Jose, California, 2001. IEEE Computer Society.
[11] Jörg-Uwe Kietz and Stefan Wrobel. Controlling the complexity of learning in logic through syntactic and task-oriented models. Arbeitspapiere der GMD 503, GMD, March 1991.
[12] Katharina Morik. The representation race – preprocessing for handling time phenomena. In Ramon López de Mántaras and Enric Plaza, editors, Proceedings of the European Conference on Machine Learning 2000 (ECML 2000), volume 1810 of Lecture Notes in Artificial Intelligence, Berlin, Heidelberg, New York, 2000. Springer Verlag.
[13] Ursula Sondhauss and Claus Weihs. Incorporating background knowledge for better prediction of cycle phases. Technical Report 24, Universität Dortmund, 2001.
[14] Winfried Theis and Claus Weihs. Clustering techniques for the detection of business cycles. SFB475 Technical Report 40, Universität Dortmund, 1999.
[15] Claus Weihs and Ursula Sondhauß. Using labeled and unlabeled data to learn drifting concepts. In Miroslav Kubat and Katharina Morik, editors, Workshop notes of the IJCAI-01 Workshop on Learning from Temporal and Spatial Data, pages 38–44, Menlo Park, CA, USA, 2001. IJCAI, AAAI Press. Held in conjunction with the International Joint Conference on Artificial Intelligence (IJCAI).
[16] Ian Witten and Eibe Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, 2000.
[17] D. A. Zighed, S. Rabaseda, R. Rakotomalala, and F. Feschet. Discretization methods in supervised learning. In Encyclopedia of Computer Science and Technology, volume 40, pages 35–50. Marcel Dekker Inc., 1999.
A Robust Boosting Algorithm

Richard Nock and Patrice Lefaucheur

Université des Antilles-Guyane
Grimaag-Dépt Scientifique Interfacultaire, Campus de Schoelcher
97233 Schoelcher, Martinique, France
{Richard.Nock,Patrice.Lefaucheur}@martinique.univ-ag.fr
Abstract. We describe a new Boosting algorithm which combines the base hypotheses with symmetric functions. Among its properties of practical relevance, the algorithm has significant resistance to noise, and is efficient even in an agnostic learning setting. This last property is ruled out for voting-based Boosting algorithms like AdaBoost. Experiments carried out on thirty domains, most of which are readily available, tend to display the reliability of the classifiers built.
1 Introduction and Motivations
Recent advances in Machine Learning (ML) have shown experimentally and theoretically the power of ensemble methods, that is, algorithms combining the predictions of multiple classifiers to make a single classifier [BK99]. Some of the most popular and widely used techniques are Arcing [Bre96b], Bagging [Bre96a], and Boosting [Sch90], in increasing order of the quantity of dedicated works. Arcing and Bagging are voting methods; they differ essentially by the scheme which iteratively modifies the training sample to build the voters. In Bagging [Bre96a], a new sample is generated at each step by bootstrap sampling the initial learning sample. Arcing [Bre96b], on the other side, keeps the initial examples but modifies their weights according to a rule which reweights higher the examples that have been difficult to classify by the voters built so far. Finally, Boosting is related to a general methodology in which an algorithm, called the "strong" learner, requests and combines the output of so-called "weak" learners. The weak and strong adjectives are used advisedly, since the weak hypotheses are only required to perform a little bit better than the unbiased coin (but for any distribution over the learning sample). The strong learner combines them and outputs a hypothesis of arbitrary accuracy, provided a sufficiently large number of weak hypotheses have been combined. Boosting draws its roots from the weak and strong learning frameworks [KV89, KV94], and further from the PAC model of Valiant [Val84]. One of the very first arguments in favor of Boosting is due to [Kea88]. Historically, this paper is most interesting because it proposes, without proofs though, three potential Boosting algorithms, all of which are voting procedures. The first evidence that Boosting is theoretically viable does not exactly use this combination scheme, but a recursive, decision-tree-type majority vote combination [Sch90]. Beyond theory, the first evidence
that the practical importance of Boosting is much more than "possible" (quote from [Kea88]) and that it can actually be of great help in solving challenging problems, culminates in the paper of [FS97] and its algorithm, AdaBoost, and more recently in refined analyses of AdaBoost [FHT00, SS98]. Interestingly, this approach follows the voting approach advocated by [Kea88], but with a powerful reweighting scheme, which Arcing further studied [Bre96b]. This scheme is a stepwise multiplicative update of the training examples' weights, so as to bring higher importance to those that have been hard to classify for the last hypothesis. Most approaches derived from Boosting are voting procedures (see e.g. the papers [Bre96b, FS97, SS98]), and more generally many ensemble methods are also voting procedures [Bre96a]. A set of voters is grown, which is a way to cast the initial examples onto a new representation space of different dimension, a space into which each hypothesis built defines a new variable. Afterwards, a linear separator on this new set of variables is used to classify observations. Linear separators have certain desirable properties. They have a good VC-dimension (n + 1 for an n-dimensional space), that is, they bring discriminative power at affordable sample complexity for learning [Vap98]. Furthermore, provided the learning sample is separable, i.e. provided there exists a theoretical hyperplane separating the examples, they are efficiently learnable [Val84, NG95], which means the existence of polynomial-time, accurate induction algorithms. However, a major drawback is that whenever the dimension is fixed, if no assumption can be made about the separability of the data, then achieving the error of the optimal hyperplane is hard; even constant-factor approximations of this error are intractable [HS92]. This is a drawback that Support Vector Machines avoid by projecting the data into a very high dimensional space in which they are separable [Vap98]. Boosting, however, cannot guarantee such a separability property. Thus, it may face (in)approximability results depending on the nature of the target concept. Because real domains for which assumptions can be made about this target concept are seldom encountered, one should consider that learning occurs in a relaxed, sort of "agnostic" learning setting. Fortunately, a model of agnostic (or robust) learning, making no assumption on the target concept or on the unknown (but fixed) distribution used to generate the examples, has been receiving much attention and an extensive theoretical cover [KSS94]. Obviously, such an ideally relaxed learning setting is also inherently hard, and many results obtained are actually negative, precluding in one sense the use of most interesting classes of concept representations (even simple rules are not agnostically learnable [HS92]). Fortunately, most, but not all. This paper exhibits the Boosting abilities of a class of concept representations among the computationally easiest to manipulate, one which allows agnostic learning and handles record noise levels (another crucial issue in Boosting [BK99]): symmetric functions. Symmetric functions are Boolean functions whose outputs are invariant under permutation of the input bits [KL88]. Their discriminative power is the same as that of linear separators: they have the same VC-dimension. Computationally speaking,
most of their learning algorithms require record times or space when compared to many other classes of concept representations [Bou92]. Efficient algorithms are known to learn them in the PAC or agnostic model, even under malicious noise [Bou92, KL88]. These algorithms have two points in common: first, they are purely theoretical, and studied with absolutely no experiment in the aforementioned papers or books. Second, they follow a similar, simple induction scheme which, informally, proceeds by giving to each of the n + 1 possible summations the most frequent class in the corresponding subset of the learning sample. Boosting consists, in our case, of an algorithm creating a symmetric function of potentially high dimension, by stepwise additions of so-called weak hypotheses whose input is the set of initial variables, and whose output is the set {0, 1}. The set of weak hypotheses defines the new binary observation of each example in this new representation space. To be more precise, our contribution is twofold. First, we provide the Boosting algorithm and a theoretical study of its capabilities. Among other things, we show an interesting relationship with previous works on symmetric functions: Boosting also suggests to build each symmetric function with the robust schemata of [Bou92, KL88]. Armed with an original margin definition adapted to symmetric functions, we prove various results on the algorithm. One is used to establish the theoretical Boosting ability of our algorithm, and makes it, among the quantity of works and algorithms around "Boosting", one of the few proven to have the Boosting property in its original acceptation [Sch90, FS97]. Our algorithm is also the first to be, for any possible set of weak hypotheses, a theoretically efficient robust learner. Such a property is definitely ruled out by [HS92] for the approaches of [SFBL98, SS98] and related ones. Second, we provide numerous experiments on our algorithm, and compare it with other approaches on thirty domains, most of which are readily available. Dedicated experiments are also devoted to studying noise handling, and a criterion to stop Boosting. The following section presents our Boosting algorithm, SFboost. The two next sections study and discuss SFboost respectively from a theoretical and an experimental point of view.
2 From Boosting to SFboost
Due to space constraints, we shall assume basic knowledge of Boosting algorithms, and in particular of their main representative: AdaBoost. All this basic knowledge is included in the clever paper of [SS98], for example. Let us give some general definitions. We let LS = {(x1, y1), (x2, y2), ..., (xm, ym)} denote a set of |LS| = m training examples, where |.| denotes the cardinality. Here, each instance xi belongs to a domain X, and is described using n variables. Each yi is a class, belonging to a set {−1, +1}, where −1 (resp. +1) is called the negative (resp. positive) class. Sometimes, the classes shall also be noted "−" and "+" respectively. This paper is mainly focused on binary classification problems, but multiclass classification problems can be handled with the AdaBoost technique, by making from a c-class classification problem c binary
problems, each discriminating between one class and all the others. They can also be handled with symmetric functions, as these classifiers are naturally multi-class, which is not the case for linear separators. Note that we do not require that examples be initially described using binary variables. Only the weak hypotheses shall be required to have their output in {0, 1}. Such binary output hypotheses can be decision trees, decision lists, monomials (simple rules), etc.
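For concreteness, here is a minimal sketch of the one-vs-rest reduction just mentioned (a c-class problem turned into c binary problems). The helper name `learn_binary` and the real-valued classifiers it is assumed to return are illustration only, not part of the paper.

```python
# Hypothetical names for illustration: `learn_binary` is any binary learner
# (e.g. a boosting algorithm) returning a real-valued scoring classifier.

def one_vs_rest(examples, classes, learn_binary):
    """examples: list of (x, y); classes: the c possible labels."""
    classifiers = {}
    for c in classes:
        # Relabel: +1 for the current class, -1 for all the others.
        relabeled = [(x, +1 if y == c else -1) for (x, y) in examples]
        classifiers[c] = learn_binary(relabeled)
    return classifiers

def predict_one_vs_rest(classifiers, x):
    # Each binary classifier is assumed to return a real-valued score;
    # the class with the largest score wins.
    return max(classifiers, key=lambda c: classifiers[c](x))
```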
Algorithm 1: SFboost(LS)

Input: a learning sample LS = {(x1, y1), (x2, y2), ..., (xm, ym)}

for i = 1 to m do D0(i) = 1/m;
v0[0] = (1/2) ln( D+_{0,0} / D−_{0,0} );
Z0 = Σ_{i=1}^{m} D0(i) exp(−yi H0(xi));
for i = 1 to m do D1(i) = D0(i) exp(−yi H0(xi)) / Z0;
for t = 1 to T do
    ht = Weak_learn(LS, Dt);
    for j = 0 to t do
        vt[j] = vt−1[0] + (1/2) ln( D+_{(t−1,0)0} / D−_{(t−1,0)0} )                            if j = 0,
        vt[j] = vt−1[t−1] + (1/2) ln( D+_{(t−1,t−1)1} / D−_{(t−1,t−1)1} )                      if j = t,
        vt[j] = (1/2) ln( (D+_{(t−1,j−1)1} exp(vt−1[j−1]) + D+_{(t−1,j)0} exp(vt−1[j])) /
                          (D−_{(t−1,j−1)1} exp(−vt−1[j−1]) + D−_{(t−1,j)0} exp(−vt−1[j])) )      otherwise    (1)
    Zt = Σ_{i=1}^{m} Dt(i) exp(−yi (Ht(xi) − Ht−1(xi)));
    for i = 1 to m do Dt+1(i) = Dt(i) exp(−yi (Ht(xi) − Ht−1(xi))) / Zt;

Output: HT(x) = vT[ Σ_{t=1}^{T} ht(x) ]
We consider that a symmetric function is a function H : {0, 1}^n → IR which is invariant under permutation of its input. Note that in a two-class framework, symmetric functions generally have their output restricted to {0, 1} [KL88]. We prefer to adopt our slightly more general definition which casts the output in IR, so as to make the output give both a label (its sign) and a confidence (its absolute value), thereby rejoining the convention of [SS98]. Suppose we have T weak hypotheses, h1, h2, ..., hT. Building a symmetric function HT using this intermediate set of hypotheses actually amounts to building a symmetric function over the transformed set of examples {(x'1, y1), (x'2, y2), ..., (x'm, ym)}, with x'i = ∧_{t=1}^{T} ht(xi). HT makes a partition of X into what we call buckets, the j-th bucket (j = 0, 1, ..., T) receiving the examples (xi, yi) for which Σ_{t=1}^{T} ht(xi) = j. The output of HT can be represented by a (T + 1)-dimensional "bucket vector" vT, such that HT(xi) = vT[ Σ_{t=1}^{T} ht(xi) ].
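A minimal sketch of how such a classifier is evaluated, assuming weak hypotheses with outputs in {0, 1} and a bucket vector indexed by their sum; all names are illustrative.

```python
def predict_symmetric(weak_hypotheses, bucket_vector, x):
    # The bucket of x is the number of weak hypotheses that output 1 on it;
    # the bucket vector gives a real output: sign = label, magnitude = confidence.
    bucket = sum(h(x) for h in weak_hypotheses)      # in {0, ..., T}
    return bucket_vector[bucket]

# Tiny example with T = 2 weak hypotheses over 3 binary variables.
h1 = lambda x: x[0]
h2 = lambda x: 1 if x[1] == x[2] else 0
v = [-0.7, 0.1, 1.2]                                 # one entry per bucket 0..2
print(predict_symmetric([h1, h2], v, (1, 0, 1)))     # bucket 1 -> 0.1 (class +)
```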
Suppose HT is built stepwise for t = 0, 1, ..., T, by adding weak hypotheses one at a time, so that at the very beginning, when no such hypothesis exists (t = 0), all examples share the same (empty) description and the symmetric function consists of a single bucket. At t = 1, we dispose of h1, and of two buckets, receiving the examples (xi, yi) for which h1(xi) = 0 and h1(xi) = 1 respectively, and so on for the next steps (t = 2, ..., T). Each weak hypothesis is built by a so-called weak learner, Weak_learn, which takes as input LS and a distribution Dt. Note that the distribution takes t as index, which means that the weak learner is trained on Dt to output hypothesis ht. We consider that D0 is the uniform distribution, where D0(i), the weight of example (xi, yi) ∈ LS, is 1/m for all i = 1, 2, ..., m. We also adopt the notation D^b_{t,j} (b ∈ {+, −}) to denote the sum of the weights of the examples at time t (i.e. computed with Dt) falling in bucket j and belonging to class b (for 0 ≤ j ≤ t ≤ T). Finally, D^b_{(t,j)c} (c ∈ {0, 1}) is the sum of the weights of the examples at time t falling in bucket j, belonging to class b, and for which ht+1(.) = c (therefore, we have D^b_{t,j} = D^b_{(t,j)0} + D^b_{(t,j)1}). Algorithm 1 presents our approach to Boosting with symmetric functions, called SFboost. Note that it does not specify how Weak_learn works: though any learning algorithm can be used in place of Weak_learn as long as the output of its hypotheses is in {0, 1}, the next section presents in particular a criterion that Weak_learn should optimize in its synergy with SFboost.
3 Analysis of SFboost
SFboost Repeatedly Levels Out the Distributions. SFboost proceeds by repeatedly leveling out the weights of the classes. Indeed, it is easy to show that after the computation of distribution Dt+1, i.e. the distribution onto which ht+1 shall be built, we have ∀j = 0, 1, ..., t, D+_{t+1,j} = D−_{t+1,j}.
SFboost Is a Boosting Algorithm. For t = 1, 2, ..., T, we define a new distribution D't such that ∀(xi, yi) ∈ LS, we have D'0 = D0, D'1 = D1, and D't>1(i) = Dt(i) exp(−yi Ht(xi)) / Zt (with Zt its normalization coefficient). We call D' the "AdaBoost distribution", as in AdaBoost we would have Dt+1 = D't. With D't, the algorithm SFboost can be much simplified. For example, it is a simple matter to show that vt admits the much simpler expression vt[j] = (1/2) ln( D'^+_{t,j} / D'^−_{t,j} ). For all b ∈ {+, −} and j = 0, 1, ..., t, fix D'^b_{|t,j} = D'^b_{t,j} / D'_{t,j}, where D'_{t,j} = D'^+_{t,j} + D'^−_{t,j}, and D'^b_t = Σ_{j=0}^{t} D'^b_{t,j}. Then Zt (for t = 0, 1, ..., T) can be computed and upper bounded as follows:

Zt = 2 Σ_{j=0}^{t} sqrt( D'^+_{t,j} D'^−_{t,j} ) = 2 Σ_{j=0}^{t} D'_{t,j} sqrt( D'^+_{|t,j} (1 − D'^+_{|t,j}) ) ≤ 2 sqrt( D'^+_t (1 − D'^+_t) ) .    (2)
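To make the simplified view concrete, the following is a rough Python sketch in the spirit of SFboost under the AdaBoost-distribution view: weights proportional to exp(−yi H_{t−1}(xi)) and bucket entries set to half the log-ratio of positive to negative weight per bucket. The exact updates of Algorithm 1 (eq. 1) differ in detail; `weak_learn`, the example format, and the small eps guard are assumptions for illustration only.

```python
import math

def sfboost_sketch(examples, weak_learn, T, eps=1e-6):
    """examples: list of (x, y) with y in {-1, +1}; weak_learn(examples, w) -> h
    with h(x) in {0, 1}.  Returns the weak hypotheses and the bucket vector."""
    hs, v = [], [0.0]                                   # t = 0: a single bucket
    for t in range(1, T + 1):
        # Weights w.r.t. the current scores H_{t-1}(x_i), then normalize.
        scores = [v[sum(h(x) for h in hs)] for (x, _) in examples]
        w = [math.exp(-y * s) for (_, y), s in zip(examples, scores)]
        z = sum(w)
        w = [wi / z for wi in w]
        hs.append(weak_learn(examples, w))              # new weak hypothesis h_t
        # Recompute the bucket vector over the t + 1 buckets.
        pos, neg = [eps] * (t + 1), [eps] * (t + 1)
        for (x, y), wi in zip(examples, w):
            j = sum(h(x) for h in hs)
            (pos if y > 0 else neg)[j] += wi
        v = [0.5 * math.log(pos[j] / neg[j]) for j in range(t + 1)]
    return hs, v
```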
We now show a theorem which displays the ability of HT to separate the classes in the training sample. In the case of AdaBoost, [SFBL98] show that not only does the error decrease exponentially as t increases, but so does the fraction of "risky" examples, close to the frontier. More precisely, [SFBL98] define, in the case of a linear separator, the notion of margin for an example (xi, yi) as follows: µ(xi) = yi Σ_{t=1}^{T} αt ht(xi) / Σ_{t=1}^{T} αt. If this margin is positive, the classifier assigns the right label to the example, and the larger its magnitude, the more confident is the classification given. [SFBL98] have shown that the accuracy of a linear separator depends on the margins over the training sample LS, and one should already strive to maximize them to optimize the quality of the classifier over the whole domain. If the training error rate εt of each weak hypothesis does not exceed 1/2 − γ (for any possible Dt), then we have [SFBL98]:

PrLS[µ(xi) ≤ θ] ≤ ( sqrt( (1 − 2γ)^{1−θ} (1 + 2γ)^{1+θ} ) )^{T} .    (3)

The subscript LS in Pr denotes the probability w.r.t. random uniform choice in LS. Fix S^b = Σ_{t=0}^{T} D'^b_t (b ∈ {+, −}). Fix VT = (1/2) |ln(S^+ / S^−)|. VT quantifies a deviation between the average distributions generated throughout the growth of HT. This is a separation parameter to which LS contributes: indeed, when there is no weak hypothesis in HT, a symmetric function can already be constructed, and its accuracy only depends on the balance between the classes in LS. For any example (xi, yi), its margin µSF(xi) equals

µSF(xi) = yi vT[ Σ_{t=1}^{T} ht(xi) ] / |VT| = yi HT(xi) / |VT| .    (4)

Like µ(.), if µSF(xi) is positive, then the classifier gives the right class to the example. Furthermore, its magnitude quantifies a relative confidence of bucket Σ_t ht(xi) in the last AdaBoost distribution. The larger it is, the more useful is the bucket partitioning generated by HT (w.r.t. xi). Armed with our margin definition, we are able to prove the following theorem:

Theorem 1. Fix bt = arg max_{b ∈ {+,−}} D'^b_t. We have:

PrLS[µSF(xi) ≤ θ] ≤ 2^{T+1} Π_{t=0}^{T} (D'^{bt}_t)^{1 + θ/(T+1)} (1 − D'^{bt}_t)^{1 − θ/(T+1)} .    (5)
Proof sketch: If µSF(xi) ≤ θ, then

yi HT(xi) ≤ (θ/2) ln( max{S^+, S^−} / min{S^+, S^−} )
          ≤ (θ/2) ln( ( (1/(T+1)) Σ_{t=0}^{T} D'^{bt}_t ) / ( 1 − (1/(T+1)) Σ_{t=0}^{T} D'^{bt}_t ) )
          ≤ (θ / (2(T+1))) ln( Π_{t=0}^{T} D'^{bt}_t / (1 − D'^{bt}_t) )    (6)
(the last inequality follows from Jensen's inequality). Fix K as the right-hand side of ineq. 6. Then we have exp(−yi HT(xi) + K) ≥ 1, and PrLS[µSF(xi) ≤ θ] ≤ ELS[exp(−yi HT(xi) + K)] (E denotes the expectation). Remarking that HT(xi) = H0(xi) + Σ_{t=1}^{T} (Ht(xi) − Ht−1(xi)), we easily obtain by unraveling DT+1:

PrLS[µSF(xi) ≤ θ] ≤ exp(K) Π_{t=0}^{T} Zt Σ_{i=1}^{m} DT+1(i) .    (7)
Plugging the upper bound of each Zt from ineq. 2 and the expression of K into ineq. 7, we obtain the statement of the theorem.
Theorem 1 has two consequences. First, ineq. 7 shows that at each round, one should strive to select the weak hypothesis ht which minimizes the corresponding Zt. Second, let us consider that the training error rate εt of each ht on its associated AdaBoost distribution is no more than 1/2 − γ for some constant γ > 0, i.e. ht performs only slightly better than random. Then, theorem 1 says that the fraction of examples having margin upper bounded by θ decreases as:

PrLS[µSF(xi) ≤ θ] ≤ ( (1 + 2γ)^{1 + θ/(T+1)} (1 − 2γ)^{1 − θ/(T+1)} )^{T+1} .    (8)
If θ < γ(T + 1), then the right-hand side of ineq. 8 decreases exponentially with T. This result is stronger than the one we actually need to establish the Boosting ability of SFboost. Indeed, the Occam's razor argument of [Fre95] (section 3.2) can be used to show that if the weak learner is such that it returns with probability 1 − δ0 a hypothesis whose error is no more than ε0 = 1/2 − γ < 1/2, then SFboost returns with high probability 1 − δ (∀δ > 0) a symmetric function whose error is no more than ε (∀ε > 0), after a reasonable number of rounds (T), and provided |LS| is large enough. The sample size is lower bounded by a quantity almost linear in (1/ε)(ln(1/δ) + γ^{−2} ln Q), where Q is the number of weak hypotheses available to the weak learner at each call. For classes used in practice with a reasonable exponential cardinality (depth-bounded decision trees, monomials, etc.), our lower bound can be very small, and makes SFboost an efficient Boosting algorithm in the sense of [Sch90]. For infinite-cardinality classes, a more complicated argument is needed which integrates the VC dimension. The emphasis on the Boosting ability of SFboost is important, as throughout the literature, a rapidly increasing number of so-called "boosting" algorithms have been developed. However, with respect to the original theory of [KV89, Sch90], only a few of them (such as [Fre95, FS97, Sch90, SS98]) are really Boosting algorithms.
SFboost Is an Agnostic/Robust Learning Algorithm. The agnostic learning model of [KSS94] (cast, in approximation complexity terms, in the form of a robust learning model [HS92]) is a relaxed variant of the PAC learning model [Val84], which makes virtually no assumption about the target concept used to label the
examples. In this model, the learner draws examples according to a fixed but unknown distribution D, and is given two parameters ε, δ > 0. In time polynomial in 1/ε, 1/δ, n, the learner has to return a hypothesis from its class of concept representations (e.g. bounded-depth decision trees), such that, with probability > 1 − δ, the error of this hypothesis is no more than the best achievable in the class plus ε. If we look further into the formula computing vT, it admits another expression, which is actually

vT[j] = (1/2) ln( |{(xi, +) ∈ LS : Σ_{t=1}^{T} ht(xi) = j}| / |{(xi, −) ∈ LS : Σ_{t=1}^{T} ht(xi) = j}| ) .

Our way to compute the class associated to the buckets is the same as that of a well known agnostic learning algorithm for symmetric functions [Bou92] with record complexity. Therefore, for each possible t, SFboost agnostically learns the target concept. To our knowledge, SFboost is the first Boosting algorithm which is also an agnostic/robust learning algorithm. As linear separators are not robustly learnable [HS92] (modulo adequate complexity hypotheses), such a property is definitely out of reach for AdaBoost and all its related algorithms.
SFboost Has Optimal Malicious-Noise Tolerance. [KL88] have studied the learnability of concepts when data can be corrupted by errors about which absolutely no assumption can be made. Their "malicious noise" model takes place in the same setting as the PAC learning model, but with an adversary which manipulates any requested example with probability β, to return something about which nothing can be assumed. This adversary has unbounded computational resources, knows everything about the task (the target concept to be learned, the distribution D), and knows the internal state of the learner. [KL88] show that the maximal amount of such malicious noise that can be tolerated is Ω(ε), where ε is the error parameter of the PAC-learning model (see before, or [Val84]). They also show that the class of symmetric functions admits an algorithm which not only tolerates this optimal bound, but does so with a minimal sample complexity. It turns out that at each time t, the symmetric function SFboost builds is the same as the one which would be chosen in theorem 11 of [KL88]. To the best of our knowledge, no other Boosting algorithm is known to bring such a noise resistance. Note that noise handling is one of the main problems of Boosting algorithms [BK99].
4 Experiments
Numerical Problems. As SFboost proceeds by repeatedly splitting the training sample into subsamples, there may be some problems in computing the components of vT (eq. 1) whenever the weight of one class approaches zero, or equals zero, in a bucket, which in turn would severely bias the update of the distributions Dt and D't. To avoid such situations, we have chosen to follow experimentally the setup advocated by [SS98], which boils down to replacing eq. 1 by what follows:

vt[0] = (1/2) ln( (D+_{(t−1,0)0} exp(vt−1[0]) + ε) / (D−_{(t−1,0)0} exp(−vt−1[0]) + ε) ) ,
vt[t] = (1/2) ln( (D+_{(t−1,t−1)1} exp(vt−1[t−1]) + ε) / (D−_{(t−1,t−1)1} exp(−vt−1[t−1]) + ε) ) , and
vt[j] = (1/2) ln( (D+_{(t−1,j−1)1} exp(vt−1[j−1]) + D+_{(t−1,j)0} exp(vt−1[j]) + ε) / (D−_{(t−1,j−1)1} exp(−vt−1[j−1]) + D−_{(t−1,j)0} exp(−vt−1[j]) + ε) )    otherwise (0 < j < t).

We also fix ε = 1/m, as proposed by [SS98].

Fig. 1. Scatterplots of the errors of SFboost (x) vs AdaBoost (y). Points above the y = x line indicate datasets for which SFboost performs better. See text for details

Fig. 2. Cumulative distributions of the margin µSF(.) as in eq. 4, for three problems. The "T = x" values show the respective margin distribution curves when the symmetric function contains T = x rules
SFboost vs. AdaBoost. In these experiments, we have chosen to test the behavior of SFboost against its principal opponent: AdaBoost. Each algorithm was tested on a set of 30 problems, most of which come from the UCI repository of ML databases [BKM98]. The error is evaluated by averaging over a ten-fold stratified cross validation procedure [Qui96]. Finally, on each couple (training set, test set) generated, both algorithms SFboost and AdaBoost are run. For the sake of comparison, we have chosen for the weak learners a simple class of concept representations: monomials (rules) with a number of literals at most l, for some l > 0. Note that whenever l = 1 we induce decision stumps [SFBL98]. Each algorithm is run with a fixed value for l, and requests a number of rules equal to T, for some T > 0. As suggested by theory, the weak learners are de-
signed to optimize respectively Zt (ineq. 2) for SFboost and the Z of AdaBoost (section 3 in [SS98]) for AdaBoost. The weak learners are also stepwise greedy optimization procedures for their corresponding Z criterion, building each monomial from scratch. Figure 1 summarizes the results obtained over each of the 30 datasets, for couples of values (l, r) ∈ {(1, 10), (2, 10), (2, 20), (2, 50)}. They clearly depict the ability of SFboost to beat AdaBoost on many of the datasets. We have also observed that, as r increases for fixed l, the gap between SFboost and AdaBoost tends to increase, but with the same best algorithm: domains for which SFboost performs better than AdaBoost at fixed r, l tend to be domains for which SFboost shall perform even better when increasing r, and reciprocally for AdaBoost. We emphasize the fact that our choice to use monomials was made simply for implementation purposes. Only theory, and the weak learning assumptions of AdaBoost or SFboost, could reliably guide the choice of a more or less complicated class of concept representations to address a domain. Unfortunately, nothing can state a priori, on an arbitrary domain, that some algorithm satisfies the weak learning hypothesis better than another one. So far, only one induction scheme has seemingly brought an experimentally accurate answer to the building of weak hypotheses, and has been supported by theoretical comparison studies [KM96]. This scheme has previously been successful in building formulas such as decision trees [Qui94], decision lists [NJ98], and, of course, the simple rules used in our experiments with SFboost and AdaBoost. Figure 2 presents the margin distributions for SFboost over one run, for three problems of the UCI repository (Monks 1, 2, 3) [BKM98] over which we ran SFboost for a maximum of r = 800 iterations (with l = 3). They display nicely the decrease of the training error. They also display the decrease of the maximal margin with r, but the fraction of examples whose margin is no more than a reasonable positive threshold also decreases, which accounts for a concentration of the examples near positive, reasonably large margins.
Noise Handling. Usual boosting algorithms are well known to be sensitive to noise [BK99]. In the case of SFboost, theory suggests that the algorithm should handle reasonable noise, and be at least as good as AdaBoost, if not better. On 28 out of the 30 problems (for legibility purposes), we have run SFboost and AdaBoost again with (l = 4, r = 20), either on the original datasets, or after adding 10% class noise to the examples. Figure 3 (left table) shows the results obtained. The plot for each dataset gives three indications: the error comparison before noise, that after, and which algorithm is the most resistant to noise addition (if the slope is > 1, it is SFboost). There are two parts on the plot: datasets plotted before noise with an (x, y) such that approximately x, y ≤ .3, and the others. The second set contains problems that were so "hard" to handle without noise that noise addition sometimes even reduces the errors. A more reliable study can be carried out with the first set of problems. In that set, out of 17 problems, only 3 are problems for which the segment slope is < 1. In other words, there are 14 problems on which SFboost is more resistant to noise addition. A simple sign test reveals a p = 0.00636 threshold probability to reject the
hypothesis of an identical behavior against noise. Therefore, SFboost seems to handle noise in our experiments in a better way than AdaBoost does.

Fig. 3. Left table: scatterplots for the errors of SFboost (solid line) vs AdaBoost (dashed line) on 28 out of the 30 datasets, with and without 10% class noise (l = 4, r = 20). The squares depict the errors without noise; dashed lines link them with the errors on their corresponding noisy dataset. Right table: error scatterplots of SFboost* (x) vs SFboost (y) for the 30 datasets, with (l, r) = (1, 10), (2, 10), (2, 20) and (2, 50) for SFboost (see text for details)

Stopping Boosting. There is a lack of criteria to choose the T parameter for Boosting. In the case of SFboost, we have tried a very simple alternative, suggested by ineq. 7. When putting θ = K = 0, ineq. 7 shows that the training error is upper bounded by P = Π_{t=0}^{T} Zt. But each Zt can sometimes be > 1 on hard enough domains. This suggests that, out of a classifier containing T weak hypotheses h1, h2, ..., hT, one could choose the one containing h1, h2, ..., hT* (with T* ≤ T) which minimizes P, out of the T + 1 possible subclassifiers (including the empty one). This is a simple, yet reasonable test to carry out. Figure 3 (right table) reports the results of SFboost against this variant, called SFboost*, on the 30 datasets, where HT* is built after T = 50 iterations. SFboost* beats SFboost on most of the datasets, even when the points gather around the y = x line as r increases in SFboost: for r = 50, SFboost* still beats SFboost on 21 datasets, and is beaten only on 3 of them.
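A minimal sketch of this prefix selection, assuming the normalizers Z0, ..., ZT were recorded during boosting; names are illustrative only.

```python
def best_prefix(z_values):
    """z_values[t] = Z_t for t = 0..T; returns (T*, P) with P = prod_{t<=T*} Z_t minimal."""
    best_t, best_p, p = 0, float("inf"), 1.0
    for t, z in enumerate(z_values):
        p *= z                        # P for the subclassifier h_1..h_t
        if p < best_p:
            best_t, best_p = t, p
    return best_t, best_p
```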
References

[BK99] E. Bauer and R. Kohavi. An empirical comparison of voting classification algorithms: Bagging, boosting, and variants. Machine Learning Journal, 36:105–139, 1999.
[BKM98] C. L. Blake, E. Keogh, and C. J. Merz. UCI repository of machine learning databases, 1998. http://www.ics.uci.edu/~mlearn/MLRepository.html.
[Bou92] S. Boucheron. Théorie de l'apprentissage, de l'approche formelle aux enjeux cognitifs. Hermes, 1992.
[Bre96a] L. Breiman. Bagging predictors. Machine Learning Journal, 24:123–140, 1996.
[Bre96b] L. Breiman. Bias, Variance and Arcing classifiers. Technical Report 460, UC Berkeley, 1996.
[FHT00] J. Friedman, T. Hastie, and R. Tibshirani. Additive Logistic Regression: A Statistical View of Boosting. Annals of Statistics, 28:337–374, 2000.
[Fre95] Y. Freund. Boosting a weak learning algorithm by majority. Information and Computation, 121:256–285, 1995.
[FS97] Y. Freund and R. E. Schapire. A Decision-Theoretic generalization of on-line learning and an application to Boosting. Journal of Computer and System Sciences, 55:119–139, 1997.
[HS92] K.-U. Höffgen and H. U. Simon. Robust trainability of single neurons. In Proceedings of the 5th International Conference on Computational Learning Theory, 1992.
[Kea88] M. J. Kearns. Thoughts on Hypothesis Boosting, 1988. ML class project.
[KL88] M. J. Kearns and M. Li. Learning in the presence of malicious errors. In Proceedings of the 20th ACM Symposium on the Theory of Computing, pages 267–280, 1988.
[KM96] M. J. Kearns and Y. Mansour. On the boosting ability of top-down decision tree learning algorithms. In Proceedings of the 28th Annual ACM Symposium on the Theory of Computing, pages 459–468, 1996.
[KSS94] M. J. Kearns, R. E. Schapire, and L. M. Sellie. Toward efficient agnostic learning. Machine Learning Journal, 17:115–141, 1994.
[KV89] M. J. Kearns and L. Valiant. Cryptographic limitations on learning boolean formulae and finite automata. In Proceedings of the 21st ACM Symposium on the Theory of Computing, pages 433–444, 1989.
[KV94] M. J. Kearns and U. V. Vazirani. An Introduction to Computational Learning Theory. M.I.T. Press, 1994.
[NG95] R. Nock and O. Gascuel. On learning decision committees. In Proceedings of the 12th International Conference on Machine Learning, pages 413–420, 1995.
[NJ98] R. Nock and P. Jappy. On the power of decision lists. In Proceedings of the 15th International Conference on Machine Learning, pages 413–420, 1998.
[Qui94] J. R. Quinlan. C4.5: programs for machine learning. Morgan Kaufmann, 1994.
[Qui96] J. R. Quinlan. Bagging, Boosting and C4.5. In Proceedings of AAAI'96, pages 725–730, 1996.
[Sch90] R. E. Schapire. The strength of weak learnability. Machine Learning Journal, pages 197–227, 1990.
[SFBL98] R. E. Schapire, Y. Freund, P. Bartlett, and W. S. Lee. Boosting the Margin: A new explanation for the effectiveness of Voting methods. Annals of Statistics, 26:1651–1686, 1998.
[SS98] R. E. Schapire and Y. Singer. Improved boosting algorithms using confidence-rated predictions. In Proceedings of the 11th International Conference on Computational Learning Theory, pages 80–91, 1998.
[Val84] L. G. Valiant. A theory of the learnable. Communications of the ACM, 27:1134–1142, 1984.
[Vap98] V. Vapnik. Statistical Learning Theory. John Wiley, 1998.
Case Exchange Strategies in Multiagent Learning
Santiago Ontañón and Enric Plaza
IIIA, Artificial Intelligence Research Institute, CSIC, Spanish Council for Scientific Research, Campus UAB, 08193 Bellaterra, Catalonia (Spain)
{santi,enric}@iiia.csic.es, http://www.iiia.csic.es
Abstract. Multiagent systems offer a new paradigm for organizing AI applications. We focus on the application of Case-Based Reasoning to multiagent systems. CBR offers the individual agents the capability of learning autonomously from experience. In this paper we present a framework for collaboration among agents that use CBR. We present explicit strategies for case retention in which the agents take into consideration that they are not learning in isolation but in a multiagent system. We also present case bartering as an effective strategy when the agents have a biased view of the data. The outcome of both case retention and bartering is an improvement of individual agent performance and overall multiagent system performance. We also present empirical results comparing all the strategies proposed.
Keywords: Cooperative CBR, Multiagent CBR, Collaboration Policies, Bartering, Multiagent Learning.
1 Introduction
Multiagent systems offer a new paradigm for organizing AI applications. Our goal is to develop techniques to integrate CBR into applications that are developed as multiagent systems. CBR offers the multiagent system paradigm the capability of learning autonomously from experience. The individual case bases of the CBR agents are the main issue here. If they are not properly maintained, the overall system behavior will be suboptimal. These case bases must be maintained keeping in mind that the agents are not isolated, but inside a multiagent system. This enables an agent to learn not only from its own experience, but also by collaborating with the other agents in the system. The gaps in the case bases of some agents can be compensated by the experience of other agents in the system. In a real system, there will be agents that can very often obtain certain kinds of cases, and that will very seldom obtain other types of cases. It will be beneficial for two agents if they reach an agreement to trade cases. This is a very well known strategy in human history, called bartering. Using case bartering, agents that have a lot of cases of some kind will
give them to other agents in return for more interesting cases, and both will profit by improving their performance. Our research focuses on the scenario of separate case bases that we want to use in a decentralized fashion by means of a multiagent system, that is to say, a collection of CBR agents that manage individual case bases and can communicate (and collaborate) with other CBR agents. Separate case bases make sense for different reasons, like privacy or efficiency. If the case bases are owned by some organizations, perhaps they are not willing to donate the contents of their case bases to a centralized one where CBR can be applied. Moreover, even if the case bases were not private, more problems can arise from having all the cases in a single one, such as efficiency, storage or maintenance problems [5]. All these problems suggest difficulties that may be avoided by having separate case bases. In this paper we focus on multiagent systems where individual agents learn from their own experience using CBR. We show how the agents can improve their learning efficiency by collaborating with other agents, and show results comparing several strategies. The structure of the paper is as follows. Section 2 presents the collaboration scheme that the agents use. Section 3 explains the strategies used by the agents to retain cases in their individual case bases. Then, section 4 presents a brief description of the bartering process. Finally, the experiments are explained in section 5, and the paper closes with related work and conclusion sections.
2 Multiagent Learning
A multiagent CBR (MAC) system M = {(Ai, Ci)}_{i=1...n} is composed of n agents, where each agent Ai has a case base Ci. In this framework we restrict ourselves to analytical tasks, i.e. tasks (like classification) where the solution is achieved by selecting from an enumerated set of solutions K = {S1, ..., SK}. A case base Ci = {(Pj, Sk)}_{j=1...N} is a collection of problem/solution pairs. When an agent Ai asks another agent Aj for help to solve a problem, the interaction protocol is as follows. First, Ai sends a problem description P to Aj. Second, after Aj has tried to solve P using its case base Cj, it sends back a message that is either :sorry (if it cannot solve P) or a solution endorsement record (SER). A SER has the form ⟨Sk, P, Aj⟩, meaning that the agent Aj has found Sk as the most plausible solution for the problem P.
Voting Scheme. The voting scheme defines the mechanism by which an agent reaches an aggregate solution from a collection of SERs coming from other agents. Each SER is seen as a vote. Aggregating the votes from the different agents for each class, we obtain the winning class as the class with the maximum number of votes. We will now show the Committee collaboration policy that uses this voting scheme (see [6] for a detailed explanation and comparison of several collaboration policies, and a generalized version of the voting scheme that allows more complex CBR methods).
Committee Policy. In this collaboration policy the agent members of a MAC system M are viewed as a committee. An agent Ai that has to solve a problem P sends it to all the other agents in M. Each agent Aj that has received P sends a solution endorsement record ⟨Sk, P, Aj⟩ to Ai. The initiating agent Ai uses the voting scheme above upon all SERs, i.e. its own SER and the SERs of all the other agents in the multiagent system. The problem's solution is the class with the maximum number of votes.
In a single agent scenario, when an agent has the opportunity to learn a new case, the agent only has to decide whether the new case will improve its case base or not. Several retain policies exist to take this decision [2, 9]. But when we are in a multiagent scenario, new factors must be considered. Imagine the following situation: an agent Ai has the opportunity to learn a new case C, but decides that C is not interesting to it; however, there is another agent Aj in the system that could obtain a great benefit from learning the case C. It would be beneficial for both agents if the agent Ai retained the case, and then gave or sold it to Aj. Two different scenarios may be considered: when there are ownership rights over the cases, and when the agents are free to make copies of the cases to send them to other agents. We will call the first scenario the non-copy scenario, and the second one the copy scenario. Several strategies for retaining cases and bargaining with the retained cases can be defined for each scenario.
The learning process in our agents has been divided in two subprocesses: the case retain process and the case bartering process. During the case retain process, the agent that receives the new case decides whether or not to retain the new case, and whether or not to offer the new case to the other agents. An alternative to offering the cases to other agents for free is to offer them in exchange for more interesting cases. This is exactly what the case bartering process consists of. Thus, when an agent has accumulated some cases that are not interesting to it, it can exchange them for more interesting cases during a bartering process. This bartering process does not have to be engaged each time an agent learns a new case, but just when the agents decide that they have enough cases to trade with. In the following sections we will first describe all the strategies that we have experimented with, and then we will give a brief description of the bartering process.
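A minimal sketch of the Committee policy and voting scheme just described; the agent interface (a solve method returning a class label, or None for :sorry) is an assumption for illustration, not part of the paper.

```python
from collections import Counter

def committee_solve(all_agents, problem):
    """all_agents includes the initiating agent; returns the majority class."""
    sers = []
    for agent in all_agents:
        answer = agent.solve(problem)              # class label, or None (":sorry")
        if answer is not None:
            sers.append((answer, problem, agent))  # a SER <S_k, P, A_j>
    votes = Counter(s for (s, _, _) in sers)       # one vote per SER
    return votes.most_common(1)[0][0] if votes else None
```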
3 Case Retain
In this section we explain in detail all the strategies for the case retain process used in the experiments. These case retain strategies are used when an agent has the opportunity to learn a new case and has to decide whether to retain it or not. Each case retain strategy is composed of two different policies: the individual retain policy and the offering policy. For the individual retain policy, we have experimented with three options: Never retain (NR), where the agent that has
the opportunity to learn a new case never retains the case; Always retain (AR), where the agent always retains the case; and When-Interesting retain (WIR), where the agent only retains cases found interesting. Notice that we can define the interestingness of a case in several ways. In our experiments, the criterion for a case being interesting for an agent is that the case is incorrectly solved by the agent. For the offering policy, we have only two options: Never offer (NO), where the agent that has the opportunity to learn a new case never offers it to any other agent in the system, and Always offer (AO), where the agent always asks if any of the other agents is interested in each case the agent has the opportunity to learn. Now, combining all these options, we can define all the possible case retain strategies for both scenarios: the copy scenario and the non-copy scenario. In the following subsections, all the combinations that make sense for each scenario are explained.
Non-copy Scenario Strategies. The following combinations make sense for the non-copy scenario:
– Never retain - Never offer strategy (NR-NO): The agents never retain the cases nor offer them to any other agent. Therefore, this is equivalent to a system where the agents do not learn from their experience.
– Always retain - Never offer strategy (AR-NO): The agent that has the opportunity to learn a new case always retains it, but never offers it to the other agents. In this case, every agent works as if learning in isolation, and all the collaborative work is delegated to the case bartering process.
– When-Interesting retain - Never offer strategy (WIR-NO): Equivalent to the previous one, but the agent only retains the case if it is interesting to it.
– When-Interesting retain - Always offer strategy (WIR-AO-non-copy): In this strategy, the agent Ai that has the opportunity to learn a new case retains the case only if deemed interesting. If the case is not retained, it is offered to the other agents. Then, as we are in the non-copy scenario, the agent has to choose just one of the agents that have answered requesting the case, and send it the only copy. Several strategies can be used to make this selection, but in the experiments it is made randomly.
Copy Scenario Strategies. The NR-NO, AR-NO and WIR-NO strategies are the same as in the non-copy scenario. Thus, the only new strategy that can be applied in the copy scenario is the When-Interesting retain - Always offer strategy (WIR-AO-copy), where the agent that has the opportunity to learn a new case retains the case only if deemed interesting. The case is then offered to the other agents, and a copy of the case is sent to each agent that answers requesting a copy. Notice that this is now possible because we are in the copy scenario. There is another combination of policies that generates a new strategy: the Always retain - Always offer strategy, where the cases are always retained by the
agent, and then offered to the other agents. This strategy is not interesting, though, because all the agents in the system will have access to exactly the same cases and will retain all of them. Therefore, as all the agents will have exactly the same case bases, there is no reason to use a multiagent system instead of a single agent that centralizes all the cases.
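A minimal sketch of the WIR-AO strategy in the non-copy scenario described above; the agent methods (solve, retain, wants) are placeholders for illustration only.

```python
import random

def wir_ao_non_copy(agent, other_agents, case):
    problem, solution = case
    if agent.solve(problem) != solution:            # "interesting" = wrongly solved
        agent.retain(case)
        return
    interested = [a for a in other_agents if a.wants(case)]
    if interested:
        random.choice(interested).retain(case)      # only one copy is sent
```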
4 Case Bartering
In the previous section, we explained the case retain strategies used by the agents. In this section, we give a brief description of the case bartering process; see [7] for a complete description. Previous results [6] have shown that agents can obtain better results using the Committee collaboration policy than working alone. However, those experiments assumed that every agent had a representative (with respect to the whole collection of cases) sample of cases in its individual case base. When one agent's case base is not representative we say it is biased, and the Committee accuracy starts decreasing. Case bartering addresses this problem: each agent exchanges cases with other agents in order to improve the representativeness (i.e. diminish the bias) of its case base.
4.1 Case Base Bias
The first thing we have to define is the way the agents measure their case base bias. Let d_i = {d_i^1, ..., d_i^K} be the individual distribution of cases for an agent Ai, where d_i^j is the number of cases with solution Sj ∈ K in the case base of Ai. Now, we can estimate the overall distribution of cases D = {D^1, ..., D^K}, where D^j = (Σ_{i=1}^{n} d_i^j) / (Σ_{i=1}^{n} Σ_{l=1}^{K} d_i^l) is the estimated probability of the class Sj. To measure how far the case base Ci of a given agent Ai is from being a representative sample of the overall distribution, we define the Individual Case Base (ICB) bias as the square distance between the distribution of cases D and the (normalized) individual distribution of cases obtained from d_i:

ICB(Ci) = Σ_{k=1}^{K} ( D^k − d_i^k / Σ_{j=1}^{K} d_i^j )^2
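A minimal sketch of these two quantities, where all_counts is the list of per-class count vectors d_i of all agents; the function names are illustrative.

```python
def overall_distribution(all_counts):
    """all_counts[i][j] = number of cases of class S_j owned by agent A_i."""
    total = sum(sum(d) for d in all_counts)
    K = len(all_counts[0])
    return [sum(d[j] for d in all_counts) / total for j in range(K)]

def icb_bias(d_i, D):
    """Square distance between D and the normalized class distribution of A_i."""
    n_i = sum(d_i)
    return sum((D[k] - d_i[k] / n_i) ** 2 for k in range(len(D)))
```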
It has been empirically shown [7] that when the ICB bias is high (and thus the individual case base representativeness is low), the agents using the Committee policy obtain lower accuracies.
4.2 Bartering Offers
The way bartering reduces the ICB bias of a case base is through case exchange. In order to exchange cases, two agents must reach a bartering agreement. Therefore, there must be an offering agent Ai that sends
an offer to another agent Aj. Then Aj has to evaluate whether the offer of exchanging cases with Ai is interesting or not, and accept or reject the offer. If the offer is accepted, we say that Ai and Aj have reached a bartering agreement, and they will exchange the cases in the offer. Formally, an offer is a tuple o = ⟨Ai, Aj, Sk1, Sk2⟩, where Ai is the offering agent, Aj is the receiver of the offer, and Sk1 and Sk2 are two solution classes, meaning that the agent Ai will send one of its cases with solution Sk2 and Aj will send one of its cases with solution Sk1. The Case Bartering protocols do not force the use of some concrete strategy to accept or to send offers, so each agent can have its own strategy. However, in our experiments every agent follows the same strategy. Let us start with the simpler one. When an agent receives a set of offers, it has to choose which of these offers to accept and which not. In our experiments the agents use the simple rule of accepting every offer that reduces their own ICB bias. Thus, we define the set of interesting offers Interesting(O, Ai) of a set of offers O for an agent Ai as those offers that will reduce the ICB bias of Ai. The strategy to make offers in our experiments is slightly more complicated. In [7] the agents used a deterministic strategy to make offers, but for the experiments reported here, we have chosen a probabilistic strategy which obtains better results. Each agent Ai decides which offers to make in the following way: from the set of possible solution classes K, each agent chooses the set K' ⊆ K of those solution classes it is interested in (i.e. those classes for which incrementing the number of cases with that solution class will diminish the ICB bias measure). For each class Sk1 ∈ K', the agent will send one bartering offer to an agent Aj ∈ A. This agent Aj is chosen probabilistically, and the probability of an agent being chosen as Aj is a function of the number of cases that the agent has with solution class Sk1 (the more cases, the higher the probability). Now, the agent Ai has to decide which solution class Sk2 ∈ K it will offer to Aj in exchange for the class Sk1. The solution class Sk2 ∈ K'' (where K'' ⊆ K is the subset of solution classes for which decreasing the number of cases with that solution class will diminish the ICB bias measure) is also chosen probabilistically, and the probability of each solution class being chosen is a function of the number of cases that Ai has of that solution class (the more cases, the higher the probability).
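A minimal sketch of the acceptance rule just described, reusing icb_bias from the earlier sketch; the class indices k1 and k2 are illustrative.

```python
def accepts_offer(d_j, D, k1, k2):
    """Receiver A_j gives one case of class S_k1 and gets one of class S_k2;
    accept only if this reduces its own ICB bias."""
    new_counts = list(d_j)
    new_counts[k1] -= 1
    new_counts[k2] += 1
    return icb_bias(new_counts, D) < icb_bias(d_j, D)
```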
4.3 Case Bartering Protocol
Using the previous strategies, two different protocols for Case Bartering have been tried: the first one is called the Simultaneous Case Bartering Protocol, and the second one the Token-Passing Case Bartering Protocol. However, since the experiments presented in this paper use only the second one, only the Token-Passing protocol is explained here. When an agent member of the MAC wants to enter the bartering process, it sends an initiating message to all the other agents in the MAC. Then all the other agents answer whether or not they enter the bartering process. This initiating message contains the parameters for bartering: a parameter tO,
corresponding to the time period that the agents have to make offers; a parameter tA, corresponding to the time period that the agents have to send the accept messages; the number n of agents taking part in the bartering; and Rmax, the maximum number of bartering rounds. Once the agents have answered this initial message, the bartering starts. The main characteristic of this protocol is the Token-Passing mechanism, so that only the agent who has the token can make offers to the others.
1. The initiating agent sends a start message containing the protocol parameters (tO, tA, and Rmax).
2. Each agent broadcasts its local statistics di.
3. When all agents have sent di, they are able to compute the overall distribution estimation D.
4. Each agent computes the ICB bias of all the agents taking part in the bartering (including itself), and sorts them in decreasing order. This defines the order in which the token will be passed.
5. The agent with the highest ICB bias is the first to have the token, so the initiating agent gives the token to it.
6. The agent who has the token sends its bartering offers.
7. When the time tO is reached, each agent chooses the subset of accepted offers from the set of offers received from the owner of the token, and sends accept messages.
8. When the maximum time tA is over, all the unaccepted offers are considered rejected.
9. Each agent broadcasts its new individual distribution di.
10. When all agents have sent di, three different situations may arise:
(a) If there are agents that still have not owned the token in the current round, the owner of the token gives it to the next agent and the protocol moves to state 6.
(b) If every agent has owned the token once in this round, some cases have been exchanged, and the maximum number of iterations Rmax has not been reached, the protocol moves to state 4.
(c) If every agent has owned the token once in this round, but no cases have been exchanged or the maximum number of iterations Rmax has been reached, the protocol moves to state 11.
11. If no cases have been exchanged, the Case Bartering Protocol ends; otherwise the protocol moves to state 4.
Notice that the protocol does not specify when the agents have to barter the cases. It only defines a way to reach bartering agreements. It is up to the agents to decide when they actually exchange the cases.
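A highly simplified sketch of one round of this protocol, ignoring the timing parameters (tO, tA) and message passing; the agent methods, the offer object, and icb_bias (from the earlier sketch) are placeholders for illustration only.

```python
def bartering_round(agents, D):
    exchanged = 0
    # Steps 4-5: token order = decreasing ICB bias.
    order = sorted(agents, key=lambda a: icb_bias(a.counts(), D), reverse=True)
    for holder in order:                            # steps 6-9 for each token owner
        for offer in holder.make_offers(agents, D):
            if offer.receiver.accepts(offer, D):    # only bias-reducing offers accepted
                holder.record_agreement(offer)
                offer.receiver.record_agreement(offer)
                exchanged += 1
    return exchanged                                # 0 exchanges => the protocol can stop
```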
5 Experimental Results
In this section we want to compare the classification accuracy of the Committee collaboration policy using all the strategies presented in this paper. We also present results concerning case base sizes.
We use the marine sponge identification (classification) problem as our test bed. Sponge classification is interesting because the difficulties arise from the morphological plasticity of the species, and from the incomplete knowledge of many of their biological and cytological features. Moreover, benthology specialists are distributed around the world and they have experience in different benthos that spawn species with different characteristics due to the local habitat conditions. We have designed an experimental suite with a case base of 280 marine sponges pertaining to three different orders of the Demospongiae class (Astrophorida, Hadromerida and Axinellida). In each experimental run the whole collection of cases is divided in two sets, a training set (that contains 10% of the cases) and a test set (that contains 90% of the cases). The training set is distributed among the agents, and then incremental learning is performed with the test set. Each problem in the test set arrives randomly at one agent in the MAC. The goal of the agent receiving a problem is to identify the correct biological order given the description of a new sponge. Once an agent has received a problem, the MAC uses the Committee collaboration policy to obtain the prediction. Since our experiments are supervised learning ones, after the committee has solved the problem, a supervisor tells the agent that received the problem which was the correct solution. After that, the retain policy is applied. In order to test the generality of the strategies, we have tested them using systems with 3, 5 and 8 agents. Each agent applies the nearest neighbor rule to solve the problems. The results presented here are the average of 50 experimental runs. For experimentation purposes, the agents do not receive the problems randomly. We force biased case bases in every agent by increasing the probability of each agent to receive cases of some classes and decreasing the probability to receive cases of some other classes. This is done both in the training phase and in the test phase. Therefore, each agent will have a biased view of the data.
Figure 1 shows the learning curves for several multiagent systems using several retain strategies and without using bartering. The three charts shown in Figure 1 correspond to multiagent systems composed of 3, 5 and 8 agents respectively. For each multiagent system, 5 strategies have been tested: NR-NO, AR-NO, WIR-NO, WIR-AO-copy and WIR-AO-non-copy. The figure shows the learning curve for each strategy. The horizontal axis of Figure 1 represents the number of problems of the test set that the agents have received. The baseline for the comparison is the NR-NO strategy, where the agents do not retain any cases, and therefore (as we can see in the figure) the agents do not learn, resulting in a horizontal learning curve around an accuracy of 50% in all the multiagent systems. This is because the training set is extremely small, 28 cases. The Committee collaboration policy has been shown to obtain results above 88% when the agents have a reasonable number of cases [6]. Considering the other four strategies we can see that, in all the multiagent systems, there are two pairs of strategies with similar learning curves. Specifically, AR-NO and WIR-NO have nearly the same learning curve, and therefore we cannot distinguish them.
[Figure 1: three panels ("3 Agent", "5 Agent" and "8 Agent Accuracy comparison"), accuracy (%) vs. number of test problems received; one curve per retain strategy (NR-NO, AR-NO, WIR-NO, WIR-AO-copy, WIR-AO-non-copy).]
Fig. 1. Accuracy comparison for several configurations without using bartering
They both start from an accuracy of 50% and end with an accuracy around 81%. Therefore, they are significantly better than the NR-NO strategy. The WIR-AO-copy and WIR-AO-non-copy strategies also have very similar learning curves, both starting at around 50% in all the scenarios, and arriving at 90% in the case of WIR-AO-non-copy and 88% for WIR-AO-copy, also in all the scenarios. Summarizing, we can say that (when the agents do not use case bartering) the strategies that use the When-Interesting retain and Always retain policies are not distinguishable in terms of accuracy. The strategies that use the Always offer policy (WIR-AO-copy and WIR-AO-non-copy) obtain higher accuracy than the strategies that use the Never offer policy (AR-NO and WIR-NO). Thus, it is always better for the Committee collaboration policy that the agents that receive cases offer them to the other agents; the reason is that these cases are perhaps not interesting to the agent receiving the problem, but there may be another agent in the system that finds some of those cases interesting.
Table 1. Average case base size of each agent at the end of the learning process, for agents using and not using bartering

                        without bartering              with bartering
                   3 Agents  5 Agents  8 Agents   3 Agents  5 Agents  8 Agents
NR-NO                 9.33      5.60      3.50       9.33      5.60      3.50
AR-NO                93.33     56.00     35.00      93.33     56.00     35.00
WIR-NO               23.80     14.32     10.70      29.13     19.50     13.64
WIR-AO-copy          58.66     57.42     56.60      59.43     57.42     57.09
WIR-AO-non-copy      45.00     34.42     25.90      44.33     35.14     26.55
We can also compare the case base sizes reached after the learning process. The left part of Table 1 shows the average size of each individual case base at the end of the learning process (i.e. when all the 252 cases of the test set have been sent to the agents) when the agents do not use bartering. In all the experiments just 28 cases (the training set) are owned by the agents at the beginning. When the agents use the NR-NO strategy, since they do not retain any new cases, they just keep the initial cases. For instance, in the 3 agent scenario the agents have on average a case base of 9.33 cases, and 3 times 9.33 is exactly 28, the number of cases initially given to them. Comparing the AR-NO strategy with the WIR-NO strategy (which achieved indistinguishable accuracies), we can see that the case base sizes obtained with WIR-NO are four times smaller than the case base sizes obtained with AR-NO for the 3 and 5 agent scenarios, and about 3.2 times smaller for the 8 agent scenario. Thus, we can conclude that the WIR-NO strategy is better than the AR-NO strategy because it achieves the same accuracy but with a smaller case base size. A similar comparison can be made between WIR-AO-copy and WIR-AO-non-copy. Remember that WIR-AO-non-copy and WIR-AO-copy have a similar learning curve, but WIR-AO-non-copy obtains slightly better results (90% vs 88% at the end of the test phase). The case base sizes reached are smaller for the WIR-AO-non-copy than for the WIR-AO-copy strategy. Thus, WIR-AO-non-copy achieves higher accuracy with a smaller case base size. The explanation is that, when allowing multiple copies of a case in the system (in WIR-AO-copy), we are increasing the correlation between the case bases of the agents. Moreover, it is known that the combination of uncorrelated classifiers gives better results than the combination of correlated ones [4]; this increased correlation is the cause of WIR-AO-copy achieving a lower accuracy than WIR-AO-non-copy.
Figure 2 shows exactly the same experiments as Figure 1, but with agents using bartering. The agents in our experiments perform bartering every 20 cases of the test phase. Figure 2 shows that, with the use of bartering, the accuracy of all the different strategies is increased. The NR-NO strategy gets boosted from an accuracy of 50% to 70% in the 3 agent scenario, and to 67% and 60% in the 5 and 8 agent scenarios respectively.
[Figure 2: three panels ("3 Agent", "5 Agent" and "8 Agent Accuracy comparison"), accuracy (%) vs. number of test problems received; one curve per retain strategy, with bartering.]
Fig. 2. Accuracy comparison for several configurations using bartering
Notice that when the agents use NR-NO no cases are retained, and thus their case base sizes are very small; just reducing the bias of the individual case bases already gives the agents a great improvement. This shows the benefits that bartering can provide to the multiagent system. Figure 2 also shows that the WIR-AO-copy and WIR-AO-non-copy strategies still achieve the highest accuracies, and that their accuracies are not distinguishable. The accuracies of the WIR-NO and AR-NO strategies also improved and are now closer to WIR-AO-copy and WIR-AO-non-copy than without bartering. Moreover, the AR-NO strategy now achieves higher accuracy than WIR-NO. The agents retain more cases in the AR-NO strategy than in WIR-NO, and thus they have more cases to trade with in the bartering process. Therefore, when the agents use bartering, they have an incentive to retain cases because they can later negotiate with them in the bartering process.
The right part of Table 1 shows the average size of each individual case base at the end of the learning process when the agents use bartering. We can see that the case base sizes reached are very similar to the case base sizes reached without bartering. Therefore, with bartering we can no longer say that WIR-NO is better than AR-NO (as we could without bartering): AR-NO achieves higher accuracy but greater case base sizes, while WIR-NO has smaller case base sizes but a slightly lower accuracy. Summarizing all the experiments presented (with and without bartering), we can say that using bartering the system always obtains an increased accuracy. We have also seen that the strategies where the agents use the Always offer policy obtain higher accuracies, and that if we let each agent decide when a case is interesting enough to be retained (When-Interesting retain) instead of retaining every case (Always retain), we can reduce the case bases significantly with practically no effect on the accuracy. Finally, we can conclude that the When-Interesting retain - Always offer strategy (with no copy) outperforms all the other strategies, since it obtains the highest accuracies with rather small case bases, and that the use of bartering is always beneficial.
6 Related Work
Related work can be divided in two areas: multiple model learning (where the final solution for a problem is obtained through the aggregation of solutions of individual predictors) and case base competence assessment. A general result on multiple model learning [3] demonstrated that if uncorrelated classifiers with error rate lower than 0.5 are combined then the resulting error rate must be lower than the one made by the individual classifiers. However, these methods do not deal with the issue of “partitioned examples” among different classifiers as we do—they rely on aggregating results from multiple classifiers that have access to all data. The meta-learning approach in [1] is applied to partitioned data. They experiment with a collection of classifiers which have only a subset of the whole case base and they learn new meta-classifiers whose training data are based on predictions of the collection of (base) classifiers. Learning from biased data sets is a well known problem, and many solutions have been proposed. Vucetic and Obradovic [10] propose a method based on a bootstrap algorithm to estimate class probabilities in order to improve the classification accuracy. However, their method does not fit our needs, because it requires the availability of the entire test set. Related work is that of case base competence assessment. We use a very simple measure comparing individual with global distribution of cases; we do not try to assess the areas of competence of (individual) case bases - as proposed by Smyth and McKenna [8]. This work focuses on finding groups of cases that are competent.
7 Conclusions and Future Work
We have presented a framework for cooperative Case-Based Reasoning in multiagent systems, where agents can cooperate in order to improve their performance. We have also shown that a market mechanism (bartering) can help the agents and improve the overall performance as well as the individual performance of the agents. Agent autonomy is maintained because all the agents remain free: if an agent does not want to take part in the bartering, it simply has to reject all the offers and make none. We have also shown the problem arising when data is distributed over a collection of agents that can have a skewed view of the world (the individual bias). Case bartering shows that the problems derived from data distributed over a collection of agents can be solved using a market-oriented approach. We have presented explicit strategies for the agents to accumulate experience (retaining cases) and to share this experience with the other agents in the system. The outcome of this experience sharing is an improvement of the overall performance of the system (i.e. higher accuracy). However, further research is needed in order to find better strategies that allow the agents to obtain the highest accuracy with the smallest case base size.
Acknowledgements The authors thank Josep-Lluís Arcos of the IIIA-CSIC for his support and for the development of the Noos agent platform. Support for this work came from the CIRIT FI/FAP 2001 grant and the projects TIC2000-1414 "eInstitutor" and IST-1999-19005 "IBROW".
References
[1] Philip K. Chan and Salvatore J. Stolfo. A comparative evaluation of voting and meta-learning on partitioned data. In Proc. 12th International Conference on Machine Learning, pages 90-98. Morgan Kaufmann, 1995. 342
[2] G. W. Gates. The reduced nearest neighbor rule. IEEE Transactions on Information Theory, 18:431-433, 1972. 333
[3] L. K. Hansen and P. Salamon. Neural network ensembles. IEEE Transactions on Pattern Analysis and Machine Intelligence, (12):993-1001, 1990. 342
[4] Anders Krogh and Jesper Vedelsby. Neural network ensembles, cross validation, and active learning. In G. Tesauro, D. Touretzky, and T. Leen, editors, Advances in Neural Information Processing Systems, volume 7, pages 231-238. The MIT Press, 1995. 340
[5] David B. Leake and Raja Sooriamurthi. When two case bases are better than one: Exploiting multiple case bases. In ICCBR, pages 321-335, 2001. 332
[6] S. Ontañón and E. Plaza. Learning when to collaborate among learning agents. In 12th European Conference on Machine Learning, 2001. 332, 335, 338
[7] S. Ontañón and E. Plaza. A bartering approach to improve multiagent learning. In 1st International Joint Conference on Autonomous Agents and Multiagent Systems, 2002. 335, 336
[8] B. Smyth and E. McKenna. Modelling the competence of case-bases. In EWCBR, pages 208-220, 1998. 342
[9] Barry Smyth and Mark T. Keane. Remembering to forget: A competence-preserving case deletion policy for case-based reasoning systems. In IJCAI, pages 377-383, 1995. 333
[10] S. Vucetic and Z. Obradovic. Classification on data with biased class distribution. In 12th European Conference on Machine Learning, 2001. 342
Inductive Confidence Machines for Regression Harris Papadopoulos, Kostas Proedrou, Volodya Vovk, and Alex Gammerman Department of Computer Science, Royal Holloway, University of London Egham, Surrey TW20 0EX, England {harris,konstant,vovk,alex}@cs.rhul.ac.uk
Abstract. The existing methods of predicting with confidence give good accuracy and confidence values, but quite often are computationally inefficient. Some partial solutions have been suggested in the past. Both the original method and these solutions were based on transductive inference. In this paper we make a radical step of replacing transductive inference with inductive inference and define what we call the Inductive Confidence Machine (ICM); our main concern in this paper is the use of ICM in regression problems. The algorithm proposed in this paper is based on the Ridge Regression procedure (which is usually used for outputting bare predictions) and is much faster than the existing transductive techniques. The inductive approach described in this paper may be the only option available when dealing with large data sets.
1 Introduction
When presented with a test example, traditional machine learning algorithms only output a bare prediction, without any associated confidence values. For example, the Support Vector Machine (Vapnik, 1998, Part II) outputs just one number (a bare prediction, as we will say), and one has to rely on previous experience or relatively loose theoretical upper bounds on the probability of error to gauge the quality of the given prediction. This is also true for the more traditional Ridge Regression (RR) procedure as it is used in machine learning (see, e.g., Saunders, Gammerman, & Vovk, 1998). Gammerman, Vapnik, and Vovk (1998) proposed what we call in this paper "Transductive Confidence Machine" (TCM), which complements the bare predictions with measures of confidence in those predictions. Both Transductive (see, e.g., Proedrou et al., 2001) and Inductive (proposed in this paper) Confidence Machines are currently built on top of the standard machine learning algorithms for outputting bare predictions; we will call the latter the underlying algorithms. The TCM suggested in Gammerman et al. (1998) was greatly improved in (Saunders, Gammerman, & Vovk, 1999). Vovk, Gammerman, and Saunders (1999) introduced the universal confidence values: the best confidence values one can hope to obtain. The universal confidence values are defined using the algorithmic theory of randomness (or, in the simplest situations, Kolmogorov complexity; see Li and Vitányi, 1997) and are computable only in a very weak sense ("computable in the limit"). There are reasons to believe that the version
of TCM defined in Saunders et al. (1999), when coupled with a good underlying algorithm, can give confidence values as good as the universal values provided by the algorithmic theory of randomness (Nouretdinov et al., 2001). The main disadvantage of the existing variants of TCM is their relative computational inefficiency. An original motivation behind the idea of transductive inference (Vapnik, 1998) was to obtain more computationally efficient versions of learning algorithms. Whereas this remains an interesting long-term goal, so far in the theory of confident predictions a side-effect of using transduction has been computational inefficiency; for every test example, all computations need to be started from scratch. It was not clear, however, how prediction with confidence could be implemented without resorting to transduction. Saunders, Gammerman, and Vovk (2001) proposed a much more efficient version of TCM; other efficient versions are described in Vovk and Gammerman (2001). This paper makes a much more radical step introducing Inductive Confidence Machine, ICM. The computational efficiency of ICM is almost as good as that of the underlying algorithm. There is some loss in the quality of the confidence values output by the algorithm, but we show that this loss is not too serious. On the other hand, the improvement in the computational efficiency is massive. ICM will be defined in Section 2. In the following section we will prove the validity of the predictive regions it outputs. Finally, in the last section we give some experimental results that measure the efficiency of our algorithm based on those criteria. In the rest of this introductory section we will briefly describe the relevant literature. Computing confidence values is, of course, an established area of statistics. In the non-parametric situations typically considered in machine learning the most relevant notion is that of tolerance regions (Fraser, 1957; Guttman, 1970). What we do in this paper is essentially finding tolerance regions without parametric assumptions, only assuming that the data is generated by some completely unknown i.i.d. distribution (we will call this the i.i.d. assumption). Traditional statistics, however, did not consider, in this context, the high-dimensional problems typical of machine learning, and no methods have been developed in statistics which could compete with TCM and ICM. The two main areas in the mainstream machine learning which come close to providing confidence values similar to those output by TCM and ICM are the Bayesian methods and PAC theory. For detailed discussion, see (Melluish et al., 2001); here our discussion will be very brief. Quite often Bayesian methods make it possible to complement bare predictions with probabilistic measures of their quality (theoretically this is always possible, but in practice there can be great computational difficulties); e.g., Ridge Regression can be obtained as a Bayesian prediction under specific assumptions and then it can be complemented by a measure of its accuracy (such as the variance of the a posteriori distribution). They require, however, strong extra assumptions, which the theory of TCM and ICM avoids. In fact, Bayesian
methods are only applicable if the stochastic mechanism generating the data is known in every detail; in practice, we will rarely be in such a happy situation. (Melluish et al., 2001) show how misleading Bayesian methods can become when their assumptions are violated and how robust TCM results are (ICM results are as robust). PAC theory, in contrast, only makes the general i.i.d. assumption. There are some results, first of all those by Littlestone and Warmuth (1986; see also Cristianini & Shawe-Taylor, Theorem 4.25 and 6.8), which are capable of giving non-trivial confidence values for data sets that might be interesting in practice. However, in order for the PAC methods to give non-trivial results the data set should be particularly clean; they will fail in the vast majority of cases where TCM and ICM produce informative results (see Melluish et al., 2001). The majority of relevant results in the PAC theory are even less satisfactory in this respect: they either involve large explicit constants or do not specify the relevant constants at all (see, e.g., Cristianini and Shawe-Taylor, 2000, Section 4.5).
2 Inductive Confidence Machine
In this paper we are only interested in the problem of regression, with Ridge Regression as the underlying algorithm. In contrast to the original Ridge Regression method, every prediction output by ICM is not a single real value, but a set of possible values, called a predictive region. We are given a training set {(x1, y1), . . . , (xl, yl)} of l examples, where xi ∈ IRn are the attributes and yi ∈ IR are the labels, i = 1, . . . , l, and the attributes of a new example xl+1 ∈ IRn. When fed with a confidence level, such as 99%, ICM is required to find a predictive region such that one can be 99% confident that the label yl+1 of the new example will be covered by that predictive region. The idea of ICM is as follows. We split the training set into two subsets:
– the proper training set {(x1, y1), . . . , (xm, ym)} with m < l elements, and
– the calibration set {(xm+1, ym+1), . . . , (xl, yl)} with k := l − m elements;
m and k are parameters of the algorithm. We apply the Ridge Regression method to the proper training set, and using the derived rule we associate a strangeness measure with every pair (xi, yi) in the calibration set. This measure can be defined as

αi := |ym+i − ŷm+i|, i = 1, . . . , k,    (1)

where ŷm+i are the predictions given by the derived rule; later we will also consider other definitions. For every potential label y of the new unlabelled example xl+1 we can analogously define αk+1 := |y − ŷl+1|, where ŷl+1 is the prediction for the new example given by the derived rule. Let us define the p-value associated with the potential label y as

p(y) := #{i = 1, . . . , k + 1 : αi ≥ αk+1} / (k + 1),
where #A stands for the number of elements in the set A; to emphasize the dependence on the training set and xl+1 , we will also write p(x1 , y1 , . . . , xl , yl , xl+1 , y) in place of p(y). In Section 3 we will prove that p(y) are indeed valid p-values. Suppose we are given a priori some confidence level 1 − δ, where δ > 0 is a small constant (typically one takes 1% or 5%); sometimes we will say that δ is the significance level. Given the significance level δ, the predictive region output by ICM is {y : p(y) > δ} . (2) In Section 4 we will see that this can be done efficiently.
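To make the definitions above concrete, the following is a minimal sketch (not taken from the paper; the function and variable names are ours) of how the p-value and the membership test for the predictive region (2) could be computed, given the calibration strangeness values α1, . . . , αk and the Ridge Regression prediction ŷl+1 for the new example:

```python
def p_value(candidate_y, y_hat_new, calib_alphas):
    """p(y) = #{i = 1..k+1 : alpha_i >= alpha_{k+1}} / (k+1), with alpha_{k+1} = |y - y_hat|."""
    alpha_new = abs(candidate_y - y_hat_new)
    alphas = list(calib_alphas) + [alpha_new]
    return sum(a >= alpha_new for a in alphas) / len(alphas)

def in_predictive_region(candidate_y, y_hat_new, calib_alphas, delta):
    """Membership test for the region {y : p(y) > delta} of equation (2)."""
    return p_value(candidate_y, y_hat_new, calib_alphas) > delta
```

In practice one never needs to enumerate candidate labels: Section 4 shows that the region is simply an interval around ŷl+1.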
3 Validity of the Predictive Regions
Recall that valid p-values p(y) should satisfy, for any i.i.d. distribution P and for every significance level δ,

P{p(y) ≤ δ} ≤ δ.    (3)

The next proposition shows that (2) defines valid p-values under the general i.i.d. assumption when the randomization is done over the training set as well as over the new example (xl+1, yl+1).

Proposition 1. For every probability distribution P in IRn × IR and every significance level δ > 0,

P^{l+1}{(x1, y1, . . . , xl, yl, xl+1, yl+1) : p(x1, y1, . . . , xl, yl, xl+1, yl+1) ≤ δ} ≤ δ.

Proof. We will actually prove the stronger assertion that (3) is true if the randomization is done only over the calibration set and the new example. Let us fix the proper training set x1, y1, . . . , xm, ym; our goal is to prove

P^{k+1}{(xm+1, ym+1, . . . , xl+1, yl+1) : p(xm+1, ym+1, . . . , xl+1, yl+1) ≤ δ} ≤ δ.    (4)

We can imagine that the sequence (xm+1, ym+1), . . . , (xl+1, yl+1) is generated in two stages:
– first the unordered set

{xm+1, ym+1, . . . , xl+1, yl+1}    (5)

is generated;
– one of the (k + 1)! possible orderings {xπ(m+1) , yπ(m+1) , . . . , xπ(l+1) , yπ(l+1) } (where π : {m + 1, . . . , l + 1} → {m + 1, . . . , l + 1} is a permutation) of (5) is chosen (some of these orderings may lead to the same sequence if some example occurs twice in (5)). Already the second stage will ensure (4): indeed, p(yl+1 ) ≤ δ if and only if αl+1 is among the δ(k + 1) largest αi ; since all permutations π are equiprobable, the probability of this event will not exceed δ.
This proof shows that the method of computing α1 , . . . , αk+1 should only satisfy the following condition in order for the computed p-values to be valid: every αi , i = 1, . . . , k + 1, should be computed only from (xm+i , ym+i ), the proper training set, and the unordered set {xm+1 , ym+1 , . . . , xl+1 , yl+1 }, where yl+1 is understood to be the postulated label y of xl+1 . Definition (1) and definition (7) (see section 4) obviously satisfy this requirement. Fix some significance level δ (small positive constant). Proposition 1 shows that ICM is valid in the following sense. Either the ICM prediction is correct (i.e., the prediction region contains the true label yl+1 ) or an event of small (at most δ) probability occurred. If δ is chosen so that we are prepared to ignore events of probability δ, we can rely on the predictive region covering the true label.
4 Explicit ICM
In this section we will give a slightly more explicit representation of ICM. Let us denote by α(1), . . . , α(k∗) the sequence of all αi corresponding to the calibration set sorted in descending order, with all repetitions deleted; let js := #{αi : αi ≥ α(s)}, s = 1, . . . , k∗, be the number of αs at least as large as α(s) (if all αi are different, j1 = 1, j2 = 2, . . .). Fix the confidence level 1 − δ. The "attainable" significance levels will be of the form js/(k + 1); decrease δ, if necessary, so that it is of this form: δ = js/(k + 1) for some s = 1, . . . , k∗. It can be easily checked that the predictive region output by ICM can be represented as

(ŷl+1 − α(s), ŷl+1 + α(s)),    (6)

provided the αs are computed according to (1). Notice that the computational overhead of ICM is light; it is almost as efficient as the underlying algorithm. The decision rule is computed from the proper training set only once, and it is applied to the calibration set also only once. The value of s corresponding to the given significance level δ and the value α(s) can also be computed in advance. For every test example we only need to apply the decision rule to it to find its ŷl+1; once this is done, computing the predictive region from (6) is trivial.
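As an illustration of this explicit representation, here is a hedged sketch of the whole ICM procedure with strangeness measure (1). It is not the authors' code; the use of scikit-learn's Ridge as the underlying RR implementation, the parameter names, and the handling of ties are our own choices:

```python
import numpy as np
from sklearn.linear_model import Ridge  # assumed underlying Ridge Regression

def icm_intervals(X_train, y_train, X_test, m, delta, ridge_a=1.0):
    """ICM predictive intervals y_hat +/- alpha_(s), cf. equation (6)."""
    # split the training set into proper training set and calibration set
    X_prop, y_prop = X_train[:m], y_train[:m]
    X_cal, y_cal = X_train[m:], y_train[m:]
    k = len(y_cal)

    rr = Ridge(alpha=ridge_a).fit(X_prop, y_prop)

    # strangeness measure (1): absolute calibration residuals, sorted descending
    alphas = np.sort(np.abs(y_cal - rr.predict(X_cal)))[::-1]

    # attainable significance level: with distinct alphas, delta = s/(k+1)
    s = int(np.floor(delta * (k + 1)))
    if s < 1:
        raise ValueError("delta is smaller than 1/(k+1); enlarge the calibration set")
    half_width = alphas[s - 1]          # alpha_(s)

    y_hat = rr.predict(X_test)
    return y_hat - half_width, y_hat + half_width
```

With the Boston Housing split used later in the paper (382 proper training and 99 calibration examples), δ = 0.05 corresponds to s = 5, i.e. under this sketch the interval half-width is the fifth largest calibration residual.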
Another Way of Computing αi

Definition (1) defines the strangeness of the new example as the error of the decision rule on it. A natural way to make this strangeness measure more precise is to take into account the predicted accuracy of the decision rule f found from the proper training set on a given unlabelled example from {xm+1, . . . , xl+1}. Hopefully this should lead to smaller prediction regions. Instead of using the strangeness measure αi = |yi − ŷi|, we can use

αi := |yi − ŷi| / σi,    (7)

where σi is an estimate of the accuracy of the decision rule f on xi. More specifically, we take σi := e^µi, where µi is the RR prediction of the value ln(|yi − f(xi)|) for the example xi. The use of the logarithmic scale instead of the direct one ensures that the estimate is always positive; besides, relatively more weight is given to examples with classifications close to f's predictions. It is easy to see that when using αi computed from (1) ICM will output predictive intervals of the same length for all test examples. This is no longer the case when (7) is used; the length of the predictive interval will be proportional to the predicted accuracy of f on the new example. What we are actually accomplishing by using (7) is that the predictive regions obtained will be smaller for points where the RR prediction is good and larger for points where it is bad.

Fixed Prediction Interval

There are two possible modes of using the p-values computed from (2):
1. For a given significance level δ, find a predictive region such that we can be 1 − δ confident that it covers the true label.
2. Given a fixed predictive region, find the maximum level at which we can be confident that the true label will be covered.
The first mode corresponds to the regression ICM considered so far. The second mode is essentially what is usually done in classification problems, where a fixed predictive region may represent one of the possible classifications. It is clear that the maximum confidence level at which a given predictive interval [a, b] is valid will be 1 − js/(k + 1), where s is the maximum number such that α(s) ≥ max(|ŷi − a|, |ŷi − b|).
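A sketch of this normalized variant is given below, again as an illustration rather than the authors' implementation: a second Ridge Regression model g is fitted on the proper training set to predict ln|y − f(x)|, so that σ = exp(g(x)); the small constant added before taking logarithms is our own safeguard against zero residuals. Only the scaling differs from the previous sketch, and the interval for a test example becomes ŷ ± α(s)·σ, so its width now varies with the predicted accuracy of f:

```python
import numpy as np
from sklearn.linear_model import Ridge

def icm_intervals_normalized(X_train, y_train, X_test, m, delta, ridge_a=1.0):
    """ICM with strangeness (7): alpha_i = |y_i - y_hat_i| / sigma_i, sigma_i = exp(mu_i)."""
    X_prop, y_prop = X_train[:m], y_train[:m]
    X_cal, y_cal = X_train[m:], y_train[m:]
    k = len(y_cal)

    f = Ridge(alpha=ridge_a).fit(X_prop, y_prop)

    # second RR model predicting ln|y - f(x)| on the proper training set
    log_res = np.log(np.abs(y_prop - f.predict(X_prop)) + 1e-8)  # 1e-8: our addition
    g = Ridge(alpha=ridge_a).fit(X_prop, log_res)

    sigma_cal = np.exp(g.predict(X_cal))
    alphas = np.sort(np.abs(y_cal - f.predict(X_cal)) / sigma_cal)[::-1]

    s = int(np.floor(delta * (k + 1)))
    a_s = alphas[s - 1]                         # alpha_(s)

    y_hat = f.predict(X_test)
    sigma_test = np.exp(g.predict(X_test))      # per-example interval widths
    return y_hat - a_s * sigma_test, y_hat + a_s * sigma_test
```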
5 Experimental Results
The first set of experiments checks how reliable the obtained predictive regions are. We count the percentage of wrong predictive intervals; in other words, how often the algorithm fails to give a predictive region that contains the real label of a test example. In effect this checks empirically the validity of our
Table 1. The average success of the predictions made, for different confidence levels using (1) as strangeness measure

Kernel Type      Empirical reliability
                 90%     95%     99%
Polynomial       93.6%   97.4%   99.3%
RBF              97.5%   98.6%   99.6%
ANOVA Splines    97.7%   97.2%   98.8%
Table 2. The average success of the predictions made, for different confidence levels using (7) as strangeness measure

Kernel Type      Empirical reliability
                 90%     95%     99%
Polynomial       95.2%   97.8%   99.1%
RBF              97.3%   98.8%   99.6%
ANOVA Splines    95%     97.6%   99.2%
algorithm, which was proven theoretically in Section 3. We expect that for a large number of examples the percentage of wrong predictions will not exceed (and perhaps will be close to) the specified significance level. A second set of experiments checks the tightness of our predictive regions by calculating the median value of the lengths of all predictive regions obtained for a specific significance level. This gives us a measure of how efficient our algorithm is. We prefer using the median value instead of the mean, because it is more robust: if a few of the predictions are extreme (either very large or very small) due to noise or due to over-fitting, the average will be affected, while the median will remain unchanged. The proposed algorithm has been tested on the Boston Housing data set, which gives the values of houses, ranging from 5K to 50K, depending on 13 attributes. In the experiments 100 splits of this data set have been used, with different examples for the proper training, calibration, and test sets each time. In every split the calibration set consisted of 99 examples, the test set of 25 examples, and the rest of 382 examples was used as the proper training set. In Tables 1 to 4 we give the widths of the predictive regions and the empirical reliability (i.e., the percentage of cases when the true label turned out to be inside the predictive region) of these bounds for specific significance levels (1%, 5%, and 10%) and for specific kernels (Polynomial, RBF, and ANOVA) used in conjunction with RR. The results in Tables 1 and 2 confirm the validity of our algorithm: the rate of successful predictions is at least equal to the desired accuracy. In Tables 3 and 4 we present results about the tightness of our predictive regions for both variations of our algorithm. As we can see, in both cases the best results
Table 3. The median width of the predictive regions, for different confidence levels using (1) as strangeness measure

Kernel Type      Median width
                 90%    95%    99%
Polynomial       9.6    12.6   16.1
RBF              9.9    13.5   29.4
ANOVA Splines    9.5    12.2   15.4
Table 4. The median width of the predictions made, for different accuracy levels using (7) as strangeness measure

Kernel Type      Median width
                 90%    95%    99%
Polynomial       9.5    11.8   15.6
RBF              10     12.7   23.5
ANOVA Splines    9.7    11.7   15
Table 5. Comparison of the mean width of the predictive regions, for ICM and TCM Variation1

Algorithm      Mean width
               90%    95%    99%
ICM            10.8   12.7   17.5
TCM Variant1   12.4   16.7   28.8
Table 6. Comparison of the median width of the predictive regions, for ICM and TCM Variation2

Algorithm      Median width
               90%    95%    99%
ICM            9.5    11.8   15.6
TCM Variant2   7.5    9.3    18.8
were obtained when we used the ANOVA splines as our kernel function. By comparing the results for the two variations we notice that the method which uses (7) as strangeness value gives, on average, slightly better results. The difference is becoming relatively larger as we move toward higher confidence levels. Figures 1 and 2 complement the information given in Tables 3 and 4 for ANOVA splines by also giving other characteristics of the distribution of the predictive interval widths. These figures show that the distribution of the method which uses (7) as strangeness measure is more spread out, as we would expect.
Fig. 1. Medians, upper and lower quartiles, and 10th and 90th percentile of the distributions of the predictive interval widths for the method using (1) as strangeness value (constant interval sizes; ANOVA splines of order 4 with a=10; x-axis: accuracy level 90%, 95%, 99%; y-axis: region width)
Finally, in Tables 5 and 6 we compare our algorithm with two variations of TCM which are described in (Melluish, Vovk, Gammerman, 1999) and (Nouretdinov, Melluish, Vovk, 2001), using the polynomial kernel1. It is obvious that ICM outperforms the first variation of TCM in all hypothesis tests, while compared with the second variation the difference is small. Though the set of α values in TCM is richer than the one in ICM and the Ridge Regression rule is derived using fewer examples in the case of induction, this does not seem to worsen the performance of the latter significantly. We also tested the algorithm on the Bank Rejection and the CPU Activity data sets, both of which consist of 8192 examples split into 4096 training and 4096 test examples. This was done in order to demonstrate the algorithm's ability to handle large data sets. The Bank Rejection data set was generated from a simplistic simulator, which simulated the queues in a series of banks. Our task is to predict the rate of rejections (i.e., the fraction of customers that are turned away from the bank because all the open tellers have full queues) depending on 32 attributes. The CPU Activity data set is a collection of computer system activity measures collected from a Sun Sparcstation 20/712 with 128 Mbytes of memory running in a multi-user university department. Users would typically be doing a large variety of tasks ranging from accessing the internet, editing files or running
1 In Table 5 we compare the mean widths instead of the median since in (Melluish, Vovk, Gammerman, 1999) only mean widths are reported.
Fig. 2. Medians, upper and lower quartiles, and 10th and 90th percentiles of the distributions of the predictive interval widths for the method using (7) as strangeness value (variable interval sizes; ANOVA splines of order 4 with a=10; x-axis: accuracy level 90%, 95%, 99%; y-axis: region width)

Table 7. The median width of the predictive regions, for the Bank Rejection and the CPU Activity data sets

Data Set  Strangeness measure   Median width
                                90%    95%    99%
Bank      (1)                   0.29   0.39   0.51
          (7)                   0.24   0.28   0.40
CPU       (1)                   8.89   11.81  16.79
          (7)                   8.77   10.71  15.89
very cpu-bound programs. Our task is to predict the portion of time that the cpus run in user mode depending on 12 attributes. The median widths of the predictive regions obtained for the Bank Rejection and the CPU Activity data sets are listed in Table 7. The rate of rejections in the Bank Rejection data set ranges from 0 to 0.7 and the portion of time that the cpus run in user mode in the CPU Activity data set ranges from 0 to 99. So even for a 99% confidence level the second variation of our algorithm gives a predictive region which covers only 57% and 17% of the whole range of labels for each set respectively.
6 Conclusions
We have defined ICM, a computationally efficient confidence machine for the regression problem based on inductive inference. In addition to the bare prediction ICM outputs a measure of its accuracy which has a clear probabilistic interpretation. The experimental results obtained give good empirical reliability that is consistently above the specified confidence level. This confirms that the algorithm can be used for obtaining reliable predictions. Furthermore, the width of our predictive regions is almost as tight as that of the transductive version. The tightness of our predictive regions can be seen from the fact that our best result for the Boston Housing data set, which is given by the second variation of the algorithm (using (7) as strangeness measure), predicts a region that is only 33% of the whole range of house prices at the 99% confidence level.
Acknowledgements We are grateful to David Surkov for useful discussions. This work was partially supported by EPSRC through grants GR/L35812 ("Support Vector and Bayesian learning algorithms"), GR/M14937 ("Predictive complexity: recursion-theoretic variants"), and GR/M16856 ("Comparison of Support Vector Machine and Minimum Message Length methods for induction and prediction").
References 1. Cristianini, N., & Shawe-Taylor, J. (2000). Support Vector Machines and Other Kernel-based Learning Methods. Cambridge: Cambridge University Press. 2. Fraser, D. A. S. (1957). Non-parametric Methods in Statistics. New York: Wiley. 3. Gammerman, A., Vapnik, V., & Vovk, V. (1998). Learning by transduction. Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence (pp. 148–156). San Francisco: Morgan Kaufmann. 4. Li, M., & Vit´ anyi, P. (1997). An Introduction to Kolmogorov Complexity and Its Applications. Second edition. New York: Springer. 5. Melluish, T., Saunders, C., Nouretdinov, I., & Vovk, V. (2001). Comparing the Bayes and typicalness frameworks. ECML’01. 6. Melluish, T., Vovk, V., & Gammerman, A. (1999). Transduction for Regression Estimation with Confidence. NIPS’99. 7. Nouretdinov, I., Melluish, T., & Vovk, V. (1999). Ridge Regression Confidence Machine. Proceedings of the 18th International Conference on Machine Learning. 8. Nouretdinov, I., Vovk, V., V’yugin, V., & Gammerman, A. (2001). Transductive Confidence Machine is universal. Work in progress. 9. Proedrou, K., Nouretdinov, I., Vovk, V., & Gammerman, A. (2001). Transductive Confidence Machines for Pattern Recognition. Proceedings of the 13th European Conference on Machine Learning. 10. Saunders, C., Gammerman, A., & Vovk, V. (1999). Transduction with confidence and credibility. Proceedings of the 16th International Joint Conference on Artificial Intelligence (pp. 722–726).
11. Saunders, C., Gammerman, A., & Vovk, V. (2000). Computationally efficient transductive machines. ALT’00 Proceedings. 12. Vapnik, V. (1998). Statistical Learning Theory. New York: Wiley. 13. Vovk, V., Gammerman, A., & Saunders, C. (1999). Machine-learning applications of algorithmic randomness. Proceedings of the 16th International Conference on Machine Learning (pp. 444–453). 14. Vovk, V., & Gammerman, A. (2001). Algorithmic Theory of Randomness and its Computer Applications. Manuscript. 15. Vovk, V., and Gammerman, A. (1999). Statistical applications of algorithmic randomness. Bulletin of the International Statistical Institute. The 52nd Session. Contributed Papers. Tome LVIII. Book 3 (pp. 469–470).
Macro-Operators in Multirelational Learning: A Search-Space Reduction Technique Lourdes Peña Castillo and Stefan Wrobel Otto-von-Guericke-University Magdeburg {pena,wrobel}@iws.cs.uni-magdeburg.de
Abstract. Refinement operators are frequently used in the area of multirelational learning (Inductive Logic Programming, ILP) in order to search systematically through a generality order on clauses for a correct theory. Only the clauses reachable by a finite number of applications of a refinement operator are considered by a learning system using this refinement operator; ie. the refinement operator determines the search space of the system. For efficiency reasons, we would like a refinement operator to compute the smallest set of clauses necessary to find a correct theory. In this paper we present a formal method based on macro-operators to reduce the search space defined by a downward refinement operator (ρ) while finding the same theory as the original operator. Basically we define a refinement operator which adds to a clause not only single-literals but also automatically created sequences of literals (macro-operators). This in turn allows us to discard clauses which do not belong to a correct theory. Experimental results show that this technique significantly reduces the search-space and thus accelerates the learning process.
1 Introduction
Typically, a multirelational learning system takes as input background knowledge B, positive examples E+ and negative examples E−, and has to find a correct theory T. A correct theory is a set of Horn clauses which implies E+ and is consistent1 with respect to E−. This theory is then used to classify unseen examples E? as positive or negative. To find T the system has to search among permitted clauses (hypothesis space) for a set of clauses with the required properties. For instance, if an e ∈ E+ is not implied by T the system should search for a more general theory; on the other hand, if T is not consistent with E− the system should look for a more specific one. Refinement operators are commonly used in multirelational learning systems as a way to systematically search for T. A refinement operator is a function which computes either a set of specializations (downward operator) or generalizations (upward operator) of a clause. Thus, refinement operators allow us to search
Note, however, that the condition of consistency is often relaxed and the systems actually try to minimize the classification error over the training data.
step-by-step through a generality order2 on clauses for a correct theory. The search space H is then restricted to clauses that can be obtained by successively applying the refinement operator. This approach works if and only if there is a number of refinement steps to every clause in at least one correct theory. [8] For efficiency reasons, we would like a refinement operator ρ to compute the smallest set of clauses necessary to find T . Many downward operators currently used work by adding to a clause C one literal available from B. In this paper we present a method based on automatically created macro-operators [5] to reduce the hypothesis space defined by one of these refinement operators. In the search and planning literature a macro-operator (or macro for short) is a sequence of operators chosen from the primitive “operators” of a problem. We consider our primitive operators the literals available to be added to C, so a macro is a sequence of literals. By refining C using only “valid” literals and macros, we discard a significant number of clauses (hypotheses) which cannot belong to T . Macros are based on the fact that literals exist which if added alone to a clause C affect neither the coverage nor the consistency of C. For example, consider we want a system to learn the concept divisibleBy(x, y) (x is divisible by y) and as part of B we give the literals isN ull(z) and remainder(x, y, z) where x, y are input arguments and z is computed by remainder and used by isN ull. The literal remainder is true for every combination of numeric values for x and y and adding it by itself to C does not help to distinguish between E + and E − . However, the macro remainder(x, y, z), isN ull(z) is certainly more useful; thus our ρ adds to C this macro instead of adding either remainder alone or remainder together with another literal that does not use z. In this paper, we precisely develop the macro-based approach for the class of refinement operators that are lower bounded by a bottom clause, such as the one used in Progol [7]. Specifically, we adapt the refinement operator used in Mio [9] (henceforth referred to as literal-based ρ). The literal-based ρ is then used in our experiments as our comparison point. Experiments on four application domains show that the reduction of the search space obtained using the macro-based ρ produces a significant speed-up of the learning process. Using the macro-based ρ implies to the user almost no extra effort since: 1) a formal characterization of literals which helps to decide whether a literal can be included in a macro without risk of missing a solution is provided, and 2) the mode declaration language is enhanced to allow the user to easily declare these literals. In addition, the macros are domain-independent and constructed automatically by the system based on the mode declarations given by the user. The remaining of this paper is organized as follows. The next section describes the learning system used and the literal-based ρ. Section 3 explains how the search space reduction is achieved. Section 4 defines a macro, the macro-based ρ and the macro generation algorithms. Section 5 discusses our experiments. Related work is briefly surveyed in Section 6 and Section 7 concludes. 2
A generality order determines what constitutes a “specialization” or “generalization” of a clause. Subsumption and logical implication are two of the most widely used generality orders.
2 Mio and the Literal-Based ρ
2.1 The Learning System
Mio is an example-driven learning system introduced in [9] which uses a Progollike declaration language and, the same as Progol [7], lower bounds the search space with a most specific clause ⊥. This ⊥ is a minimal generalization sufficient to cover an example e ∈ E + . Mio performs a general-to-specific IDA* search of the hypothesis space to find a clause C to add to T . IDA* is guided by the number of literals needed to obtain an I/O-complete clause. An I/O-complete clause has neither unbound output variables in the head nor unbound input variables in the body. In addition, Mio (contrary to Progol) selects stochastically the examples from which it learns, performs parallel search and enforces type strictness. To construct ⊥, the user has to give a set of mode declarations. These mode declarations define the literals from B which can be used as the head and in the body of the clauses in H. In the mode declarations the arguments of a literal are defined either as a constant (#cte), as an output variable (−var), or as an input variable (+var). A literal provides a value if it has a −var and consumes a value if it has a +var. We can say that a literal p is a consumer of literal q if p has at least one +var bound to an output argument value of q (q is then a provider of p). A literal p defined by the mode declarations can appear in ⊥ if and only if p has at least one provider placed before p in ⊥ for each +var ∈ p. The most specific clause is then defined as follows. Definition 1 (Most Specific Clause ⊥ [7]). Let ⊥ be the most specific definite clause (bottom clause) constructed with the literals defined by the mode declarations, background knowledge B and example e ∈ E + such that: (a) B ∧ ⊥ h e (ie. e can be derived in h resolution steps). (b) ⊥∞ ⊥ where ⊥∞ is the (potentially infinite) conjunction of ground literals which are true in all models of B ∧ ¬E (ie. B ∧ ¬E |= ⊥∞ ). The most specific clause can be seen as a sequence of literals where every literal is uniquely mapped to its position in ⊥ starting with the first body literal after the head of ⊥ (ie. h ← 1, . . . , i − 1, i, . . . , n.). We refer henceforth to every literal i (1 ≤ i ≤ n) in ⊥ as the integer corresponding to its position. The literals in ⊥ are placed according to their consumer-provider relationships. For both the literal-based and the macro-based method, we assume that the clauses are evaluated by a heuristic function eval(C) considering the length of C (| C |= number of body literals in C) and the number of positive p and negative n examples covered by C, and that eval(C)3 favours shorter clauses over longer ones with the same coverage and consistency. 3
In Mio, eval(C) = count_p_covered_by(C)/|C| − count_n_covered_by(C).
2.2 Literal-Based ρ
The literal-based refinement operator consists in adding one literal from ⊥ to a clause C. The search space H defined by this operator allows the system to search from the empty clause (✷) to ⊥. Notice that by the definition of ρ given below the hypotheses' literals keep the topological order of ⊥.

Definition 2 (Literal-Based ρ). Let i and j be literals in ⊥ and C be a clause whose last literal is i, then: ρ(C) = {C ∪ {j} | i < j and all +var ∈ j are either bound to −var of literals already in C or to +var in the head of C}. Assume C′ ∈ ρ(C); then C′ is heuristically evaluated (eval(C′)) iff C′ is I/O-complete.

Then, the search space is the following.

Definition 3 (Hypothesis Space H). The hypothesis space H consists of I/O-complete hypotheses whose literals appear in ⊥: H := {C ⊆ ⊥ | C is I/O-complete}.

Mio considers a clause C ∈ H as a solution (ie. as a candidate to be included in T) iff C maximizes eval(C) and satisfies several user-defined parameters such as maximum length allowed, minimum required coverage per clause, etc.

Lemma 1. Let C′ ∈ ρj(Cj). C′ can be a solution iff there is no C ∈ ρi(Ci), i < j, such that eval(C) ≥ eval(C′).
3 Reducing the Search Space
As explained in the introduction, sometimes there are literals defined in the mode declarations that, when queried, succeed for each combination of input argument values. These literals are usually needed to introduce a variable. The ρ described in Def. 2 produces clauses C′ ∈ ρ(C) which might vary from C only in having one of these "non-gain"4 literals added. However, since these literals succeed for every combination of input argument values, they modify neither the coverage nor the consistency of C but they do increase its length (ie. eval(C) ≥ eval(C′)). Thus C′ cannot be a solution and could be safely removed from H. One way to avoid that C′ differs from C only in having one of these "non-gain" literals added is to identify such literals in the mode declarations and to use a ρ which does not add "non-gain" literals by themselves to C but only as part of a macro. Since these "non-gain" literals usually contain output variables, we can add them to C together with at least one of their consumers. To declare the "non-gain" literals, we enhance the mode declaration language with the ∗var notation. A ∗var is an output variable argument of a "non-gain" literal. To decide whether a literal should not be added by itself to C, we divide the literals into the following categories based on their consumption/providing properties.
The information gain of using this literal is null.
Definition 4. Let i be a literal in the body of ⊥. Then:
1. i is called an independent provider iff i has output variables but does not succeed for every combination of input argument values (it does not contain ∗var). (e.g. see literals b and c in Table 1)
2. i is called a dependent provider iff i has output variables and succeeds for every combination of input argument values (it contains at least one ∗var). (e.g. see literal a in Table 1)
3. i is called a head provider iff a −var in the head of ⊥ is bound to an output variable (−var or ∗var) in i. (e.g. see literals b and e in Table 1)
4. i is called an independent consumer iff each of its +var is bound to a +var in the head of ⊥ or to a −var in a provider. (e.g. see literal e in Table 1)
5. i is called a dependent consumer iff at least one of its +var is only provided by ∗var (it is not bound to a +var in the head of ⊥ or to a −var in a provider). (e.g. see literal d in Table 1)
To illustrate the literal characterization given above, assume we want to learn the target predicate h(+x, +y, −u) and we are given Table 1. Since a is satisfied by all the examples and has an output variable, a is a dependent provider. On the other hand, a is the only provider of d, thus d is a dependent consumer. In the mode declarations a has to be declared as a(+x, ∗w). All the other mode declarations are identical to the original literal definitions. Table 2 shows how a literal has to be used based on its classification. If a literal is not a provider one should ignore the first column.

Theorem 1. Let i be a dependent provider in C′ ∈ ρ(C). C′ is not a solution if there is not at least one consumer of i in C′ too.

Proof (by contradiction). Assume that C′ = C ∪ {i} is a solution, that i is a dependent provider, and that there is no consumer of i in C′. But, since i succeeds for every combination of input argument values, eval(C) ≥ eval(C′), and since C′ = C ∪ {i}, |C| < |C′|. Hence C′ cannot be a solution (Lemma 1), which contradicts the assumption and completes the proof.

By Definition 4 and Theorem 1 we can then conclude that the search space can be further restricted to H := {C ⊆ ⊥ | C is I/O-complete and each dependent
Table 1. Toy example to illustrate the classification of literals

Literal Definition     Instance   h(Ii)   Satisfied Literals
h(+x, +y, −u)          I1         +       a, b, d, e
a(+x, −w)              I2         +       a, c, d, e
b(+y, −u, −z)          I3         +       a, b, d
c(+x, +y, −z)          I4                 a, c, e
d(+w)
e(+z, −u)
Table 2. Literal use according to its classification

Provider Type          Consumer Type           Literal Use
Independent-Provider   Independent-Consumer    As a single literal
Independent-Provider   Dependent-Consumer      In macros
Dependent-Provider                             In macros
Head-Provider                                  As a single literal
provider in C has at least one consumer for one of its ∗var}. In the next section we propose a refinement operator to accomplish this reduction.
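The classification of Definition 4 is mechanical enough to be computed directly from the mode declarations. The following is a rough sketch (ours, not part of Mio; the dictionary encoding of the mode declarations is an assumption) applied to the toy example of Table 1; note that a literal may fall into several categories at once, e.g. a provider is usually also a consumer of the head's input variables:

```python
# Hypothetical encoding of the mode declarations of Table 1:
# '+' = input, '-' = output, '*' = non-gain output (marks a dependent provider).
modes = {
    'h': ['+x', '+y', '-u'],   # head
    'a': ['+x', '*w'],
    'b': ['+y', '-u', '-z'],
    'c': ['+x', '+y', '-z'],
    'd': ['+w'],
    'e': ['+z', '-u'],
}

head_in = {a[1:] for a in modes['h'] if a[0] == '+'}
head_out = {a[1:] for a in modes['h'] if a[0] == '-'}
body = {k: v for k, v in modes.items() if k != 'h'}

def args(lit, kinds):
    """Argument variables of lit whose mode symbol is in 'kinds'."""
    return {a[1:] for a in body[lit] if a[0] in kinds}

# variables obtainable from the head or from a -var of some body literal
safe_vars = head_in | set().union(*(args(l, '-') for l in body))
# variables obtainable only from a *var of a dependent provider
star_only = set().union(*(args(l, '*') for l in body)) - safe_vars

for lit in body:
    kinds = []
    if args(lit, '-') and not args(lit, '*'):
        kinds.append('independent provider')
    if args(lit, '*'):
        kinds.append('dependent provider')
    if args(lit, '-*') & head_out:
        kinds.append('head provider')
    if args(lit, '+'):
        kinds.append('dependent consumer' if args(lit, '+') & star_only
                     else 'independent consumer')
    print(lit, '->', ', '.join(kinds))
```

On this toy example the script reports a as a dependent provider, b and c as independent providers, b and e as head providers, d as a dependent consumer, and e (among others) as an independent consumer, in line with the examples given in Definition 4.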
4 Macro-Based Method
So far we have intuitively used the term macro in relation to a refinement operator, but now we are ready to formally define it. For that, we first define what a legal subsequence of literals is.

Definition 5 (Subsequence of Literals). Let ⊥ be a sequence of literals ⊥ = 1 . . . n; then j1 . . . ji (i ≤ n) is a subsequence of literals of ⊥ when jk < jk+1, ∀k ∈ {1, . . . , i − 1}, and jk ∈ [1, n], ∀k ∈ {1, . . . , i}.

Definition 6 (Legal Subsequence of Literals Given Another Subsequence). Let j1 . . . ji−1 be a legal subsequence of literals according to Def. 7. Then ji . . . jm (m ≤ n) is a legal subsequence of literals given j1 . . . ji−1 iff: (a) every ∗var of dependent-provider ji is used by a consumer ∈ {ji+1, . . . , jm} or ji is also a head-provider; and, (b) every +var of consumer js, ji < js ≤ jm, is bound to an output variable in a provider ∈ {j1, . . . , js−1} or to a +var in the head of ⊥.

Definition 7 (Legal Subsequence of Literals). A subsequence of literals j1 . . . ji−1 is a legal subsequence iff j1 . . . ji−1 is a legal subsequence given ∅. ∅ is also a legal subsequence.

Definition 8 (Macro-Operator). A macro-operator is the shortest legal subsequence of literals ji . . . jm for which there exists a subsequence j1 . . . ji−1 of literals in ⊥, where jg < ji, ∀g ∈ {1, . . . , i − 1}, so that ji . . . jm is legal given j1 . . . ji−1.
Macros’ Ordering
In the literal-based method the literals are ordered by their position in ⊥. For the macro-based method we introduce a new ordering based on the maximum provider of the literals in a macro.
Definition 9 (Maximum Provider of a Literal). Let j be a consumer. The maximum provider of j is the provider i of j (i < j) with the greatest position in ⊥. The position of the head of ⊥ is 0.

Definition 10 (Maximum Provider of a Subsequence of Literals). The maximum provider of a subsequence of literals a = ji . . . jm is: max prov(a) = max over jk ∈ {ji, . . . , jm} of (maximum provider of literal jk).

Definition 11 (Comparison between Subsequences of Literals). Let a = ji . . . jm and b = jk . . . jn be legal subsequences of literals, then a < b iff: (a) max prov(a) < max prov(b); or, (b) max prov(a) = max prov(b) and a is lexicographically < b (ie. ji < jk ∧ ji+1 < jk+1 ∧ . . . ∧ jm < jk+m).
4.2 Macro-Based Refinement Operator
Now everything is ready to define the macro-based refinement operator. Note that single literals can also fulfill the macro's definition and be used in the refinement operator defined below.

Definition 12 (Macro-Based Refinement Operator ρ′). Let a and b be macros obtained from ⊥ and C be a clause (and a legal subsequence of literals) whose last added macro is a, then: ρ′(C) = {C ∪ {b} | b ≥ a and b is a legal subsequence given C}. Assume C′ ∈ ρ′(C); then C′ is heuristically evaluated iff C′ is I/O-complete.

We now prove the crucial property of the macro-based approach, namely that the macro-based ρ′ finds the same solutions as the literal-based one.

Theorem 2. Let Ci be a solution. If Ci ∈ ρm(Ci−1) then there exists an n such that Ci ∈ ρ′n(Ci−1), (m, n ∈ N0).

Proof (by induction). Let jm be the last literal in Ci. By Theorem 1 jm cannot be a dependent provider (Ci is a legal subsequence) and by Def. 3 Ci is I/O-complete. There are two cases to consider:
1. jm is not a dependent consumer and there exists a macro b s.t. b = jm.
2. jm is a dependent consumer and there exists a macro b s.t. b = ji . . . jm.
Basis: Let n = 1. Consider the first case b = jm. Given that Ci ∈ ρ1(C0), all the input argument values of jm are +var in the head of ⊥ (ie. max prov(jm) = 0) and b is a legal subsequence given ∅. Therefore Ci ∈ ρ′1(C0). Now consider the second case b = ji . . . jm. Since Ci ∈ ρm(C0) where m = |Ci|, max prov(ji) = 0 and b is a legal subsequence given ∅. Thus Ci ∈ ρ′1(C0).
Induction Step: Consider any n > 1. Assume again the case when b = jm. For b to be a legal subsequence given Ci−1 it is only required that at least one provider for every input argument of jm is already in Ci−1. All the providers of jm can be added before b because, ∀ providers k of jm, max prov(jm) > max prov(k). Then a ρ′-chain (ρ′1 . . . ρ′n−1) can be found so that all the required providers of jm are in Ci−1 and then b is a legal subsequence given Ci−1. Thus Ci ∈ ρ′n(Ci−1), as claimed. Assume the second case when b = ji . . . jm. All providers k (k < ji) of jl, i ≤ l ≤ m, are required to be in Ci−1 so that b is a legal subsequence given Ci−1, and since max prov(b) > max prov(k), ∀k, a ρ′-chain can be found so that Ci−1 contains all the required providers of b and b is a legal subsequence given Ci−1. Hence Ci ∈ ρ′n(Ci−1) as claimed. This completes the proof of the induction step and thus of the theorem.

Using the macro-based ρ′ the hypothesis space H is reduced to I/O-complete clauses which are legal subsequences of literals, ie. H := {C ⊆ ⊥ | C is I/O-complete and a legal subsequence}.
4.3 Algorithms to Construct the Macros
The procedure to obtain the ordered set D of all macros from a given most specific clause and mode declarations can be seen in Figure 1. The algorithm we use to compute the set A of all the macros starting with a dependent provider i is shown in Figure 2. In the second algorithm, × means a special case of Cartesian product where the resulting set’s elements are numerically ordered; A[j] represents the element j of set A; and, disaggregate(A[j]) is a function that separates the compound element A[j] into its component parts. Example 1. In this example we illustrate how the algorithm in Fig. 2 works. Suppose ⊥ = h(+x) ← p(+x, ∗y, ∗z), t(+x, −u), o(∗w), q(+x, −w), r(+w, +z), s(+u, +y), m(+z), then the algorithm has to compute the set of macros starting with the dependent provider p(+x, ∗y, ∗z). Notice that the macros are always ordered according to their position in ⊥. In this example the ordered set D is D = {t, q, [p, m], [p, t, s], [p, o, r], [p, q, r], [o, r]} 1. i = p(+x, ∗y, ∗z), Zy = {s}, Zz = {r, m} then A = {p} × {s} ∪ {p} × {r, m} = {[p, s], [p, r], [p, m]} for A [1] (2.1) Ys,+u = {t} (2.2.2) if {t} ∩ {p, s} = ∅ then T1 = {} ∪ {[p, s]} × {t} = {[p, t, s]} (2.2.3) T1 = {[p, t, s]} for A [2] (2.1) Yr,+w = {o, q} (2.2.2) if {o, q} ∩ {p, r} = ∅ then T2 = {[p, o, r], [p, q, r]} (2.2.3) T2 = {[p, o, r], [p, q, r]} for A [3], T3 = {[p, m]} 2. A = {[p, t, s], [p, o, r], [p, q, r], [p, m]}
– for every literal i ∈ ⊥ = h ← 1, . . . , n do:
  1. if i has to be a single-literal macro according to Table 2, add i to D.
  2. if i is a dependent provider, construct the set A of all possible macros starting with i (as shown in Fig. 2) and add A to D.
– sort D according to Def. 11.

Fig. 1. Algorithm to obtain the macros

1. let Z1, . . . , Zn be the sets of consumers of i for ∗var1, . . . , ∗varn ∈ i (ie. Zj = {k ∈ ⊥ | k consumes ∗varj ∈ i}) then: A = {i} × Z1 ∪ {i} × Z2 ∪ . . . ∪ {i} × Zn.
2. for every element A[j] in A do:
  2.1 obtain the providers' set Yk,+var for each +var not provided by i in consumer k of i such that Yk,+var = {g ∈ ⊥ | g is a provider of k for +var and k ∈ A[j] and g > i}.
  2.2 let Tj = {A[j]}; then for every Yk,+var do:
    2.2.1 T′j = ∅.
    2.2.2 ∀Tj[l], if Yk,+var ∩ disaggregate(Tj[l]) = ∅ then T′j = T′j ∪ {Tj[l] × Yk,+var}.
    2.2.3 Tj = T′j.
3. A = ⋃j Tj
Fig. 2. Algorithm to obtain the macros starting with dependent provider i
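For illustration, below is a simplified sketch (ours, not the system's implementation) of the idea behind the algorithm of Fig. 2, hard-coded for the bottom clause of Example 1 above; it only handles one missing +var per consumer and omits the intersection bookkeeping of steps 2.2.1-2.2.3, which suffices for this example:

```python
from itertools import product

# Hypothetical encoding of the body of the bottom clause from Example 1:
# position in the bottom clause, input vars, output vars, *-output vars.
literals = {
    'p': dict(pos=1, ins={'x'}, outs=set(), stars={'y', 'z'}),
    't': dict(pos=2, ins={'x'}, outs={'u'}, stars=set()),
    'o': dict(pos=3, ins=set(), outs=set(), stars={'w'}),
    'q': dict(pos=4, ins={'x'}, outs={'w'}, stars=set()),
    'r': dict(pos=5, ins={'w', 'z'}, outs=set(), stars=set()),
    's': dict(pos=6, ins={'u', 'y'}, outs=set(), stars=set()),
    'm': dict(pos=7, ins={'z'}, outs=set(), stars=set()),
}
head_inputs = {'x'}

def providers_of(var, after_pos):
    """Literals placed after position after_pos that output var (as -var or *var)."""
    return [n for n, l in literals.items()
            if var in l['outs'] | l['stars'] and l['pos'] > after_pos]

def macros_for(i):
    """Macros starting with dependent provider i (simplified version of Fig. 2)."""
    li = literals[i]
    # step 1: pair i with one consumer for each of its *vars
    seqs = [[i, c] for v in li['stars']
            for c, l in literals.items() if v in l['ins']]
    result = []
    for seq in seqs:
        # step 2: add providers for +vars of the consumer not supplied by i or the head
        consumer = literals[seq[-1]]
        missing = consumer['ins'] - li['stars'] - li['outs'] - head_inputs
        choices = [providers_of(v, li['pos']) for v in missing]
        for combo in product(*choices):            # empty 'missing' yields one empty combo
            macro = sorted(set(seq) | set(combo), key=lambda n: literals[n]['pos'])
            result.append(macro)
    return result

print(macros_for('p'))
# expected (in some order): ['p','t','s'], ['p','o','r'], ['p','q','r'], ['p','m']
```

This reproduces the macros derived for the dependent provider p in Example 1.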
5 Performance Analysis
Let us illustrate the search tree reduction obtained with the macros. Assume that ⊥ = h ← 1, . . . , 16, that the literals 1 and 4 are dependent providers of the literals 13 and 16, and that 2 and 3 are dependent providers of 14 and 15. Let D be D = { 5, 6, 7, 8, 9, 10, 11, 12, [2, 14], [2, 15], [3, 14], [3, 15], [1, 13], [1, 16], [4, 13], [4, 16]}. Suppose that we are looking for a two-literal clause. The search-trees expanded by IDA* using both methods are shown in Figure 3. The main advantage of using the macro-based refinement operator is the reduction of the search space; however, there is a cost for obtaining the macro’s set D. To analyze the performance of the macro-based method, we perform experiments on four datasets. The first dataset contains 180 positive and 17 negative examples of valid chess moves for five pieces5 ; the second consists of 3340 positive and 1498 negative examples of “safe”6 minesweeper moves; the third one is the dataset used in [2] with 256 positive and 512 negative examples of road sections where a traffic problem has occurred; and the last one is the ILP benchmark dataset mutagenesis [12] with 125 positive and 63 negative examples. Mio was run twice with the same parameters on every dataset: once using the literal-based and once the macro-based method. For the first two datasets we created 5 folds and for the last two 10 folds. Both methods were compared 5 6
5 E+ of this dataset are contained in the distribution package of CProgol 4.4.
6 Tiles without a mine.
Fig. 3. Search trees expanded by IDA* looking for two-literal clauses. The literal-based method considers 74 leaf-nodes (top) and the macro-based only 36 (bottom). The nodes surrounded by a rectangle are the solutions

using the average number of nodes expanded per search (Avg. N/S) and the average run-time (Avg. RT). For these experiments Mio performs parallel search as described in [9]. The results can be seen in Table 3. Mio obtained exactly the same theory for chess, minesweeper and mutagenesis using both methods. In the traffic problem, the theory obtained with the macro-based method has an extra literal in one clause. The reason for this is that the literal defined as dependent provider in the traffic problem does not succeed for every input argument value (it is satisfied by 75.22% of the examples). However, this does not affect the accuracy of the theory. The macro-based method is on average 8 times faster than the literal-based method. However, the actual speedup on every dataset depends on problem features such as the size of the search space, how long it takes to compute eval(C), and the number of dependent providers. The macros are suitable for any domain where there exists a dependent provider; if a dependent provider is falsely declared, the macro approach obtains clauses with unnecessary literals.
6 Related Work
Table 3. Comparison between the literal-based and the macro-based method

Dataset     Chess                Minesweeper          Traffic              Mutagenesis
Method      Literal-   Macro-    Literal-   Macro-    Literal-   Macro-    Literal-   Macro-
            based      based     based      based     based      based     based      based
Avg. N/S    41.56      21.55     1141.90    69.90     3179.47    543.41    58905.4    14299.5
Avg. RT     3.33s      2.43s     16h33m     51m       3h17m      29m       7h45m      1h46m

Mio uses mode declarations to define the hypothesis language and computes ⊥ in the same way as Progol [7] does. However, Mio differs from Progol in the search
strategy, the heuristics, and the type strictness. Additionally, the literal-based ρ differs from Progol's refinement operator in that it does not perform splitting. Macros were introduced by Korf [5] as a domain-independent weak method for learning. He catalogued macro-operators as a non-optimal problem-solving technique. Junghanns and Schaeffer [3] used macros to reduce the search tree for Sokoban while preserving optimality. McGovern and Sutton [6] proposed the use of macros as a method to accelerate reinforcement learning. The macro-based method presented in this work preserves optimality and reduces the search space of a multirelational learning system. In multirelational learning the macros can be considered as a general framework for syntactic bias specification such as MOBAL's schemata [4] or relational clichés [11]. Relational clichés add sequences of literals to a clause as a heuristic way to solve the search myopia of a greedy algorithm. Although macros could also remedy the shortsightedness of a greedy search, they are not a heuristic but a formal method to reduce the search space that guarantees that the same solutions as with the base method are found. Blockeel and De Raedt [1] propose a lookahead technique to improve the evaluation of a clause when a dependent provider is added. In their technique the refinement operator is redefined to incorporate two-step refinements; the macro-based ρ can be seen as a refinement operator which incorporates n-step refinements. In contrast to macros, the lookahead increases the run time and requires the user to provide a template for every provider-consumer match. In contrast to all these methods, in our approach macros are computed automatically and the user only needs to mark the output variables of every dependent provider with the ∗var notation. Quinlan included determinate literals in Foil to overcome greedy search myopia by adding all the determinate literals to a clause at once [10]. The determinate literals are found automatically by Foil; however, since all the determinate literals are added, a clause refining (pruning) step is later needed to remove the unnecessary literals. Determinate literals must be uniquely satisfied by all the positive examples, while a dependent provider must be satisfied by all the examples but can be multiply satisfied.
7
Conclusions and Future Work
In this paper we propose a downward refinement operator (ρ) which adds macro-operators to a clause. By using this macro-based ρ, a reduction in the search space is obtained which results in shorter run-times. We have proved that this refinement operator finds the same solutions as the literal-based one. In addition, we present two algorithms to automatically compute a set of macros given a most specific clause and a set of mode declarations. A classification of literals based on their consumer/provider properties is proposed to assist the user in determining whether a literal can be marked as a dependent provider. For this work we have used as base case a refinement operator similar to the one used in Progol [7]; however, we believe that macro-operators are also suitable for other refinement operators and as a lookahead technique for greedy
systems. In the current implementation the user has to indicate in the mode declarations which literals are dependent providers; however, it should be possible to perform an automatic analysis of the body literals and determine which ones are dependent providers. It is part of the future work to explore these ideas.
Acknowledgments This work was partially supported by a scholarship of the federal state Sachsen-Anhalt, Germany. We would like to thank Sašo Džeroski for providing the traffic dataset, and O. Meruvia and S. Hoche for helping to improve this paper.
References
1. Hendrik Blockeel and Luc De Raedt. Lookahead and discretization in ILP. In Sašo Džeroski and Nada Lavrač, editors, Proc. of the 7th Int. Workshop on ILP, volume 1297 of Lecture Notes in AI, pages 77–84. Springer-Verlag, 1997.
2. Sašo Džeroski, Nico Jacobs, Martin Molina, Carlos Moure, Stephen Muggleton, and Wim Van Laer. Detecting traffic problems with ILP. In D. Page, editor, Proc. of the 8th Int. Conference on ILP, volume 1446 of Lecture Notes in AI, pages 281–290. Springer-Verlag, 1998.
3. Andreas Junghanns and Jonathan Schaeffer. Sokoban: A challenging single-agent search problem. In IJCAI Workshop "Using Games as an Experimental Testbed for AI Research", pages 27–36, 1997.
4. Jörg-Uwe Kietz and Stefan Wrobel. Controlling the complexity of learning in logic through syntactic and task-oriented models. In S. Muggleton, editor, Inductive Logic Programming, pages 335–359. Academic Press, 1992.
5. Richard E. Korf. Macro-operators: A weak method for learning. Artificial Intelligence, 26(1):35–77, 1985.
6. Amy McGovern, Richard S. Sutton, and Andrew H. Fagg. Roles of macro-actions in accelerating reinforcement learning. In Proc. of the Grace Hopper Celebration of Women in Computing, 1997.
7. Stephen Muggleton. Inverse entailment and Progol. New Generation Computing Journal, 13:245–286, 1995.
8. Shan-Hwei Nienhuys-Cheng and Ronald de Wolf. Foundations of Inductive Logic Programming, volume 1228 of Lecture Notes in AI. Springer-Verlag, 1997.
9. Lourdes Peña Castillo and Stefan Wrobel. On the stability of example-driven learning systems: a case study in multirelational learning. In C. A. Coello Coello, A. de Albornoz, E. Sucar, and O. Cairo, editors, Proc. of MICAI'2002, volume 2313 of Lecture Notes in AI. Springer-Verlag, 2002.
10. J. Ross Quinlan. Determinate literals in inductive logic programming. In John Mylopoulos and Raymond Reiter, editors, Proc. of the 12th IJCAI, volume 2, pages 746–750. Morgan Kaufmann, 1991.
11. Glenn Silverstein and Michael J. Pazzani. Relational clichés: constraining constructive induction during relational learning. In L. Birnbaum and G. Collins, editors, Proc. of the 8th Int. Workshop on Machine Learning, pages 203–207, 1991.
12. Ashwin Srinivasan, Stephen Muggleton, Ross D. King, and Michael J. E. Sternberg. Theories for mutagenicity: a study of first-order and feature based induction. Artificial Intelligence, 85(1–2):277–299, 1996.
Propagation of Q-values in Tabular TD(λ)
Philippe Preux
Laboratoire d'Informatique du Littoral, UPRES-EA 2335, Université du Littoral Côte d'Opale, BP 719, 62228 Calais Cedex, France
[email protected]
Abstract. In this paper, we propose a new idea for the tabular TD(λ) algorithm. In TD learning, rewards are propagated along the sequence of state/action pairs that have been visited recently. In complement to this, we propose to propagate rewards towards neighboring state/action pairs along this sequence, even though they have not been visited. This leads to a great decrease in the number of iterations required for TD(λ) to be able to generalize, since it is no longer necessary that a state/action pair be visited for its Q-value to be updated. The use of this propagation process brings tabular TD(λ) closer to neural-net-based TD(λ) with regard to its ability to generalize, while keeping the other properties of tabular TD(λ) unchanged.
1
Introduction
Temporal difference (TD) algorithms [9] are important reinforcement learning methods. Assuming discrete time, at each time step t ∈ N, being in a certain state st ∈ S, a reinforcement learning algorithm learns which action at ∈ A(st) to emit in order to optimize the total amount of rewards R_T = Σ_{t=0}^{T} γ^t r_t it will receive, where rt is the reward received at time step t, γ is the discount factor (γ ∈ [0, 1]), and T can be ∞. In the sequel, we assume that the conditions to apply dynamic programming techniques are not met. A key point in the design of a TD(λ) algorithm lies in the choice of a structure to store estimates of qualities (or values). One possibility is to use a look-up table in which each state/action pair is associated with one element. The access as well as the update of a quality costs a single array element access. An update only concerns one state/action pair, and to obtain an estimate for all state/action pairs, all pairs should be visited once at the very least. However, the size of the table is O(|S|), which may be considerable. The other possibility is to use some sort of approximation architecture which represents the information in a much more compact form. Different architectures have been used [2]. Among them, neural networks are rather popular and well known for their ability to generalize from their training. They have been used to tackle problems of large size, such as TD-Gammon, which learnt to play Backgammon at a grand master level [13]. In this case, states are encoded in some way to be fed into the network. The output of the network provides the current estimate of the state value. This estimate is a function of the network weights. The number of weights to be learnt is very small
with regards to the number of possible states. In TD-Gammon, |S| is estimated to be 10²⁰, while the number of neurons varies around 300–400 depending on the version of the program, resulting in O(10⁴) weights to learn. These weights are typically learnt using a backpropagation process. So, the access to a state value costs the computation of the output of the network, while its update costs a backprop; clearly, these computational costs are much larger than in the case of tabular TD(λ). However, when weights are updated for a given input state, the estimated values of all states are updated at the same time. There is thus some sort of implicit parallel update of the estimation of all values. This confers on neural-based TD(λ) a much greater ability for generalization: it is no longer required that a state be visited to have an estimate of its value. Tabular TD(λ) is appealing when the number of states is small, whereas the ability for generalization of neural networks is very attractive. So, in this paper, we propose a variation of tabular TD to enhance its ability to generalize. This variation can be embedded in Q-learning, Q(λ) [14], Sarsa, Sarsa(λ) [8, 7] and their derivatives, and each of these algorithms can benefit from it. Basically, the idea is to add a propagation process of Q-values (and we call the resulting algorithm the Propagation-TD, or PTD, as well as PQ(λ) and PSarsa(λ)). This idea relies on the observation that two neighboring states s and s′ are such that, if a is an action that can lead from s to s′, the quality of (s, a) (denoted Q(s, a)) is likely to be closely related to the value of s′ (denoted V(s′)): if V(s′) is high, then the quality of state/action pairs that lead to s′ is likely to be high too, and conversely, unless the return when transiting from s to s′ via action a is very large. Actually, neural network based TD(λ) faces exactly the same problem. This propagation process transforms tabular TD(λ) into something coming close to neural TD with regards to its generalization ability. Indeed, it is no longer required that a state/action pair be visited for its quality to be estimated. Except with regards to the compactness of the neural net representation of Q-values, we end up with an algorithm that combines the advantages of both approaches to TD learning. In the sequel of the paper, we detail the propagation process in Sec. 2. Then, we evaluate its interest using an experimental approach in Sec. 3. In this evaluation, we are mainly interested in PTD as an agent behavior control learning algorithm, rather than as a value learning algorithm, although both issues are closely related. We are also interested in combining supervised learning with reinforcement learning in order to show the agent how to achieve its goal and make learning faster. After that, we discuss and conclude in Sec. 4.
2
The Propagation Algorithm
In this section, we detail the key point of PTD, that is, the propagation process. Before going into details, we wish to make it clear that this is a generic idea from which different strategies can be drawn for general cases, and some specific strategies can be drawn for specific cases, such as episodic tasks. So, we will not detail all these specificities. Some of them will be discussed in the experimental discussion, in Sec. 3. We use the notation s′ →ᵇ s to mean that the action b
emitted in state s′ has a non-null probability of leading to state s in a single transition; s′ is a predecessor of s. The propagation process acts as follows. Let us consider a TD(λ) process, and let us suppose that it has visited the state/action pairs (st, at), t ∈ [0, n], since it began. Then, the generic idea of PTD is as follows:
– build S0 = {st, t ∈ [0, n]}. Then, Q(s′, b) should be updated for any (s′, b) such that s′ →ᵇ s with s ∈ S0; then
– build S1 = S0 ∪ {s′ such that Q(s′, b) has just been updated}. Then, Q(s′, b) should be updated for any (s′, b) such that s′ →ᵇ s with s ∈ S1, and consequently,
– build S2 = S1 ∪ {s′ such that Q(s′, b) has just been updated},
– iterate, building Sd from Sd−1, until some criterion is fulfilled.
Fig. 1 sketches the propagation algorithm embedded in a naive¹ Q(λ) using accumulating eligibility traces. The propagation process is handled in the "Propagate" procedure. We could equally have given the propagation algorithm embedded into Sarsa(λ) by simply adding a call to "Propagate" at the end of it. The Propagate procedure relies on the assumption that we know whether a transition from one state to another state via a certain action is possible or not; however, we do not need to know the probability of this transition. Let us now discuss some details. The propagation can be iterated as long as new state/action pairs are reached, by way of a saturation process. However, the discount factor involves a fast decrease of the amount by which qualities are updated. So, after some iterations, this amount can be neglected and the propagation can be stopped: this avoids a huge number of updates and thus may save a huge amount of computation time. Hence, propagation can be stopped after a certain number of iterations, either by limiting d or by cutting off the propagation as soon as the update becomes smaller than a given threshold. The update of Q-values is discounted by the factor γ at each iteration of the propagation (the repeat loop of the "Propagate" procedure in Fig. 1). This tends to create a gradient of qualities in the state/action space. Visually, in this space, the standard Q(λ) digs some sort of very narrow furrow while PTD creates a whole gradient field. The update of Q-values is performed as follows, using the update equation of Q-learning:
Q(s′, b) ← Q(s′, b) + α [ γ^d max_{c∈A(s)} Q(s, c) − Q(s′, b) ]
There are two differences from the Q-learning update rule. First, γ appears with exponent d, which is the distance from s′ to the closest state in the eligibility trace. Second, the current reward term is absent, since the emission of b in state s′ has not actually been performed. This choice might be discussed. If this can
¹ The word "naive" is used according to [12]. In the case of Watkins's Q(λ), a naive update of eligibility traces means that the traces are not reset after exploratory steps.
seem troublesome, it should be pointed out that this is precisely what happens in neural TD(λ). Indeed, in neural TD, the quality of a state/action pair being stored in the weights of the network, changing the weights after a state/action pair has been visited involves an alteration of the quality of many state/action pairs. However, the consequences associated with the emission of the action are not observed for these pairs. In some cases of application of PTD, a certain term could be used to approximate a systematic consequence of an action, such as an action cost. More generally, one might use a model of expected rewards. In episodic tasks, a possible variation on this basic algorithm is to propagate only at the end of the episode and not at each step. For episodic tasks where a reward is always null except when the end of the episode is reached, propagating
Procedure PQ(λ)
  Initialize Q(s, a) arbitrarily, for all s, a
  Repeat (for each episode):
    Initialize s, a
    Initialize e(s, a) = 0 for all s, a
    Repeat (for each step in the episode):
      Take action a, observe r, s′
      Choose a′ from s′ (e.g., ε-greedy strategy)
      a* ← arg max_b Q(s′, b)
      δ ← r + γ Q(s′, a*) − Q(s, a)
      e(s, a) ← e(s, a) + 1
      For all s, a:
        Q(s, a) ← Q(s, a) + α δ e(s, a)
        e(s, a) ← γ λ e(s, a)
      Propagate
      s ← s′; a ← a′
    Until s is terminal

Procedure Propagate
  d ← 0
  Sd ← {(s, a) such that e(s, a) ≠ 0}
  Repeat
    d ← d + 1
    Sd ← Sd−1
    For all s ∈ Sd−1:
      For all (s′, b) ∉ Sd−1 such that s′ →ᵇ s:
        update Q(s′, b)
        add (s′, b) to Sd
  Until stopping criterion is fulfilled
Fig. 1. Generic outline of Propagation-Q(λ). The basis is the tabular Q(λ) as expressed in [12, p. 184], using a naive strategy, and accumulating eligibility traces. We assume that the “For all” construction has a SIMD semantic, that is, all iterations of a “For all” loop are executed at the same time
the qualities only at that time does not alter the way the algorithm works and saves a lot of computation time.
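As a complement to the pseudocode of Fig. 1, here is a minimal Python sketch of the propagation step (our illustration, not the author's implementation). predecessors(s), which returns the (s′, b) pairs with a possible one-step transition s′ →ᵇ s, and actions(s) are assumed helpers; only the possibility of a transition is needed, not its probability.

def propagate(Q, e, predecessors, actions, alpha, gamma, threshold=1e-10):
    # S0: state/action pairs visited along the episode (non-zero eligibility trace)
    frontier = {(s, a) for (s, a), trace in e.items() if trace != 0}
    reached = set(frontier)
    d = 0
    while frontier:
        d += 1
        next_frontier = set()
        for (s, _) in frontier:
            target = max(Q[(s, c)] for c in actions(s))
            for (s_prev, b) in predecessors(s):
                if (s_prev, b) in reached:
                    continue
                delta = gamma ** d * target - Q[(s_prev, b)]
                if abs(delta) < threshold:      # cut off negligible updates
                    continue
                # no reward term: action b has not actually been emitted
                Q[(s_prev, b)] += alpha * delta
                next_frontier.add((s_prev, b))
                reached.add((s_prev, b))
        frontier = next_frontier

As noted above, for episodic tasks where the only reward is obtained at the end of the episode, it is enough to call this once per episode, at the end.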
3
Experimental Assessment
In this section, we provide an experimental assessment of propagation Q(λ) by comparing its performance with that of Q(λ) on a test problem. The test problem consists of finding an outlet or a goal state in a 2D labyrinth. This problem can be made more complex by making it more or less random, stationary or not, etc. So, it is actually a whole family of problems rather than a single problem that is used.
3.1 Experimental Setup
The labyrinth is defined in a gridworld where each cell can be either empty or filled with a piece of wall. The state space is the set of cells of the labyrinth. Each cell is numbered and the algorithm knows the cell in which it lies. Only states corresponding to empty cells can be occupied by the agent. In each state, the agent has the choice to stay in its current state, or move upward, downward, leftward, or rightward if it does not hit a piece of wall; only non-wall-hitting moves are allowed in any state. The agent receives a positive reward when it reaches a goal state of the labyrinth; otherwise it does not receive any reward. Clearly, this problem is Markovian, and it is also fully deterministic. This is an episodic task in which a reward is given only at the end of the episode. It is not necessary to propagate at each iteration of the inner repeat/until loop of procedure PQ(λ); it is only necessary to propagate at the end of the episode. We are mainly interested in controlling an agent, so we focus on algorithms that learn to behave correctly rather than on algorithms that learn to predict state values or state/action pair qualities accurately. So, to evaluate our approach, we are interested in the number of iterations required to learn to achieve the task, in the number of correct decisions that the algorithm makes regarding actions emitted to reach a goal state, and in the number of correct decisions it has learnt regarding actions that would have been emitted had it followed other trajectories. The latter somehow measures the generalization the algorithm has made from its experience along the trajectories it followed to reach goal states. With the word "iteration", we mean here one iteration of the inner repeat/until loop in procedure PQ(λ) of Fig. 1. We compare the performance of Q(λ) and PQ(λ). Both are run on the two leftmost mazes shown in Fig. 2. The first one (called the "Pacman maze" later on) is composed of 206 states (wall cells are not counted), while the second one is composed of 2310 states. This latter maze is drawn from the partigame paper [4] and thus called the "Partigame maze". Apart from the difference in their size, the two mazes differ greatly in that in the pacman maze there are lots of walls and in most states the number of possible actions is reduced to 2 (or 3 if immobility is possible). In the partigame maze, there are 4
(resp. 5) possible actions in the large majority of states. Finally, in the pacman maze the goal states are the two outlets, while in the partigame maze the goal state is the one used as such in the partigame paper. For each run of the algorithm, we set it into an initial cell and let it find a goal state. Then, we reset its position and run it again, performing 100 reaches of the goal (without resetting Q-values along these 100 runs). To obtain correct statistics, we average the performance over 10 runs. In the pacman maze, initial states are drawn at random at each new run, while in the partigame maze, initial and goal states are set to those used in the partigame paper [4]. To avoid certain biases due to the pseudo-random generator and to be able to discuss experimental results more thoroughly, the algorithms can be run so that the initial states are the same for the different algorithms. Propagation is stopped whenever the amount being propagated becomes smaller than a certain threshold (10⁻¹⁰). γ is set to 0.9, α to 0.5, and λ to 0.9. Q-values are initialized to 0. The selection of action is ε-greedy, with ε set to 0.1, that is, 10% exploratory moves (exploratory moves are random moves).
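For concreteness, a minimal Python sketch of such a maze environment (a hypothetical re-implementation, not the author's code): a deterministic grid in which only non-wall-hitting moves are allowed and the only reward is obtained on reaching a goal cell.

class Maze:
    ACTIONS = {'stay': (0, 0), 'up': (-1, 0), 'down': (1, 0),
               'left': (0, -1), 'right': (0, 1)}

    def __init__(self, grid, goals, goal_reward=1.0):
        self.grid = grid              # list of strings; '#' marks a wall cell
        self.goals = set(goals)       # goal cells, as (row, col) pairs
        self.goal_reward = goal_reward

    def valid_actions(self, state):
        # the grid is assumed to be fully surrounded by wall cells
        r, c = state
        return [a for a, (dr, dc) in self.ACTIONS.items()
                if self.grid[r + dr][c + dc] != '#']   # non-wall-hitting moves only

    def step(self, state, action):
        dr, dc = self.ACTIONS[action]
        nxt = (state[0] + dr, state[1] + dc)
        reward = self.goal_reward if nxt in self.goals else 0.0
        return nxt, reward, nxt in self.goals          # next state, reward, episode over?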
3.2 Results
As expected, the average number of iterations is significantly smaller for PQ(λ). More precisely, for both Q(λ) and PQ(λ), the naive strategy provides much better results than the non-naive version. Consequently, we now use the naive versions of the algorithms unless explicitly mentioned otherwise. At the very least, Q(λ) needs 20% more iterations than PQ(λ) in the pacman maze, and 4 times more in the partigame maze: this is clearly an effect of the size of the state/action space. Both algorithms perform approximately the same number of backups (2×10⁶ for the pacman maze along 100 episodes); PQ(λ) performs many more backups during the first episode than during the next ones. However, it should be said that the
Fig. 2. Three mazes used in this paper. The leftmost maze has two goal states (the outlets), as indicated by the G's (leftmost and rightmost cells at mid-height). In the middle maze, the algorithm has to find its way from an initial cell (S, located in the bottom line) to a goal state (G, located in the upper right corner). The rightmost maze is used in Section 3.4. It has one initial state (S) and two goal states (stars), one goal being better than the other
Table 1. This table summarizes some results obtained on the labyrinth problem for the pacman maze and the partigame maze. Figures are averaged over 10 runs, each made of 100 episodes. The second and third lines give the size of the problems, either as the number of states or as the number of possible state/action pairs (this number is given considering that immobility is forbidden; when immobility is a valid action, the number of state/action pairs is the sum of the second and third lines). The "Length" column gives the average number of iterations performed to reach the goal (this number takes into account the distance between the initial state and the closest goal state), the "Backups" column gives the average number of backups per run, while the "Greedy actions" column gives the average percentage of states for which the learnt greedy action is correct

                         Pacman maze                        Partigame maze
States                   206                                2310
State/action pairs       488                                8750

Algorithm    Length    Backups    Greedy actions    Length    Backups    Greedy actions
Q(λ)         46.7      2.3×10⁶    48%               14.8      6×10⁶      5%
PQ(λ)        38.3      2.7×10⁶    77%               3.5       3×10⁶      60%
Capacity of Generalization
It is interesting to try to measure the capacity of generalization of the algorithms. The capacity of generalization can be assessed as follows: having learnt a good trajectory from an initial state to a final state, for what fraction of the state space have correct actions also been learnt? Clearly, after one run of Q-Learning, a one step trajectory has been learnt; for Q(λ), a several step trajectory has been learnt, according to the length of the eligibility trace when the goal state is reached. In PQ(λ), much more correct actions have been learnt. For the two
376
Philippe Preux
Table 2. Proportion of states for which the correct action has been learnt Algorithm
after 1 episode after 1000 episodes Pacman maze Partigame maze Pacman maze Partigame maze Q-Learning 0.4% 0.04% 16.9% 6.0% Watkins’ Q(λ) 0.4% 0.04% 18.7% 7.3% naive Q(λ) 1.8% 0.2% 16.9% 8.0% PQ(λ) 11.1% 7.6% 72.4% 59.0%
mazes that are used here, we obtain the results of table 2 after 1 and 100 episodes. As expected, PQ(λ) obtains the highest measures. Of course, the first episode of PQ(λ) requires larger run times than for the other algorithms (approximately 10 times with our non optimized version). However, the next runs of PQ(λ) are very efficient and very fast: as far as a gradient is already available, the algorithm has just to follow it greedily to reach the goal. It is also interesting to discuss the proportion of correct behaviors that are learnt after a certain amount of episodes, or after having used a certain amount of CPU time. After 100 episodes on the partigame maze, PQ(λ) has learnt 59% correct greedy actions (that is, in 59% of the state space, PQ(λ) greedy selection selects the correct action to perform – indeed running 100 or 1000 episodes does not increase significantly this figure); after 1 500 episodes, naive Q(λ) has only learnt 9%. Using the same CPU duration (1 minute on a Pentium III, 500 MHz running Linux), PQ(λ) performs 100 episodes, while Q(λ) performs 104 episodes. In this case, Q(λ) has only learnt 15% correct actions in the whole state space, that is one quarter of what PQ(λ) does in the same amount of time. Regarding the number of backups, PQ(λ) perfoms approximately 3 106 backups, while Q(λ) performs 5 107 within this amount of time. When plotted against the number of episodes, this proportion of correctly learnt greedy actions levels; to get closer to 100%, one has to perform more episodes so that certain yet unexplored regions of the state space get explored. From what has been reported, it is clear that, as expected, PQ(λ) is able to generalize much better than classical Q(λ). 3.4
3.4 Dealing with Local Optima
For the moment, reaching either one of the two outlets of the pacman maze provides the same positive return. We now consider a closely related problem in which one outlet is sub-optimal: reaching one of the two outlets (say, the leftmost) provides a return equal to +5.0, while reaching the other outlet provides a return of +10.0. We expect that PQ(λ) (as well as regular Q(λ)) will be able to learn to reach both outlets, though favoring the outlet associated with the largest return. Results are displayed in Table 3 for two mazes, the pacman maze where one outlet is made sub-optimal, and a misleading maze drawn from [3]. In the pacman maze, the two goals are equally easy to find, while in the misleading
maze, the sub-optimal goal is much easier to find than the real optimum. PQ(λ) and Q(λ) reach the best outlet much more often than the sub-optimal one.

Table 3. This table displays the proportion of executions that reach either the sub-optimal or the optimal goal in the pacman maze and in the misleading maze, where the two goals are distinguished with regards to the return they provide when reached. Results are obtained over 10³ runs of each algorithm

                    PQ(λ)     Q-learning
pacman maze         91%       84%
misleading maze     24.5%     3%
3.5 Combining Reinforcement Learning with Training
One can hope to greatly improve the performance of Q(λ) by showing it a trajectory (the so-called "training" technique), for instance by way of a graphical interface. However, this "obvious" improvement is not so successful or, at the least, it is less successful than one would expect. Indeed, after having been shown a trajectory and subsequently having been reset to its initial position, Q(λ) begins by following the demonstrated trajectory but, after some steps and depending on the value of ε, it escapes from this trajectory because it has performed an exploratory move. Once Q(λ) has left the demonstrated trajectory, it is completely lost and is generally unable to return to it. Then, Q(λ) has to reach the target by its own means, the demonstrated trajectory being totally unused from then on. PTD solves this problem. When being shown the trajectory, PTD creates a gradient towards the taught trajectory. Then, when behaving autonomously, this trajectory is followed, and exploratory moves simply lead out of the trajectory, to which the agent is attracted back by the gradient field when performing a greedy move. Furthermore, exploratory moves naturally lead to the optimization of the taught trajectory. Indeed, when shown via the graphical interface, the trajectory is generally not perfect, that is, it is seldom the best possible trajectory. Starting from an already good approximation of the best trajectory, PTD optimizes it little by little. To illustrate this point, on the pacman maze, during the first episode, we train the algorithm: instead of selecting the action to emit itself, the algorithm follows a training trajectory. This trajectory is voluntarily sub-optimal, being 1.94 times longer than the shortest trajectory from the initial state to the closest goal (the training trajectory is the dashed line on the Pacman maze in Fig. 2). During the next episodes, Q(λ) trajectories generally get a little bit longer, while those of PQ(λ) get shorter. Averaged over 100 runs, after 1000 episodes, the length of the trajectories followed by PQ(λ) has shrunk down to an average of 1.27 times the length of the shortest one, while the average length of those followed by Q(λ) is 1.83 times longer. The shortest trajectory followed by PQ(λ) is 1.11 times longer than the shortest, the 10% extra length being explained by
ε = 0.1, which leads to 10% exploratory moves; the longest trajectory is 1.74. In the case of Q(λ), the shortest trajectory is 1.42 and the longest is 4.0. Another point worth examining is whether, if the training trajectory leads towards a sub-optimal goal, the algorithm is still able to reach the optimal goal. To check this, in the pacman maze, we make the rightmost outlet sub-optimal, while the leftmost outlet is the best-rewarding goal, and we train the algorithm with the same trajectory as before, leading to the sub-optimal goal. Performing 10⁵ episodes after having been trained during the first one, PQ(λ) finds the best optimum for the first time after 10³ episodes. On average, the best optimum is found 66% of the time along these 10⁵ episodes. More generally, we think that the use of PTD (instead of tabular TD), in combination with training, can be applied in many cases and can bring important speed-ups despite its initial overcost, which is largely compensated by its ability to generalize from its own experience and from training.
4
Discussion and Conclusion
In this paper, we have proposed the propagation TD algorithm. Based on tabular TD methods, PTD tends to bridge the gap between tabular TD and neural TD with regards to its generalization capabilities. While tabular TD updates only the Q-values of state/action pairs that are visited, PTD propagates the updates to state/action pairs that lead to states that have been visited. Propagation is grounded on the idea that if a state s has a high value, then state/action pairs that lead to s are likely to have a high quality. This idea is general and can be applied to all tabular TD algorithms. There are a number of nice features of PTD. First, PTD does not involve any extra parameter. Second, as a natural extension of tabular TD(λ) algorithms, propagation can be used instead of these algorithms in many places and applications. For example, it can take advantage of techniques to speed up Q(λ), or be used in hierarchical Q(λ) to solve POMDPs [15], etc. Third, though not studied here, existing convergence proofs should be adaptable from other algorithms to PTD². Fourth, PTD is worthy when combining reinforcement learning with training (or supervised learning), since it avoids tabular TD being unable to come back to the taught trajectory after an exploratory move. In some sense, the rather blind tabular TD method becomes far-sighted. Fifth, the experimental assessment has shown that, even though it has been implemented very crudely and we have not spent any time optimizing either the propagation of Q-values or the number of backups, the run time of PTD is very reasonable and the trade-off between the run time and the number of learning iterations is not bad for PTD. These last three points make us think that PTD is worthy for applications based on Q-Learning that require generalization abilities. The fact that PTD only needs to know which transitions are possible is also a nice point with regards to real-time dynamic programming [1], for which a complete model is necessary.
² As a matter of fact, the idea of PTD has been proposed very recently and independently by other authors, accompanied by such a convergence proof [16].
With regards to existing work, PTD shares some similarities with Dyna, prioritized sweeping, and queue-Dyna. All three algorithms build a model (T̂, R̂), where T̂(s, a, s′) is the estimated probability that taking action a in state s leads to state s′, and R̂(s, a) is the estimated return of taking action a in state s. Dyna was introduced by [10, 11]. In complement to regular Q-learning, Dyna maintains and updates the model (T̂, R̂) at each iteration and uses it to update its estimates of the quality of k other state/action pairs drawn at random among those that have been visited. Prioritized sweeping and queue-Dyna are two similar techniques that have been proposed independently, respectively by [3] and [5]. They are both derived from Dyna, from which they differ in that state values are estimated instead of state/action pair qualities, and updated estimates are not drawn at random. Each state is characterized by its predecessors, as well as a priority. Value estimates are updated according to the priority of states: the value of the states having the highest priority is updated. For each updated value V(s), the priority of s is reset to 0, while the priority of the predecessors s′ of s is updated proportionally to the change in V(s) and to T̂(s′, a, s). Thus, PTD is yet another strategy. First, PTD does not make use of such a thorough model: it solely relies on whether a transition between two states is possible or not. Second, the updated state/action pairs are all those that have been visited since the beginning of the episode, as PTD uses eligibility traces, as well as the neighboring state/action pairs of updated state/action pairs. In the near future, we wish to optimize the propagation process. More fundamentally, we wish to compare more precisely the ability of PTD to generalize with that of neural TD. We also wish to evaluate the generalization abilities of PTD with regards to the size of the state/action space, and the performance of PTD in a non-deterministic environment. We are also currently evaluating the usefulness of using PTD instead of Q-learning to control the animat MAABAC [6]: this is a multi-segmented artefact in which multiple reinforcement agents learn to collectively solve a task. In this application, the environment is no longer Markovian, nor stationary.
Acknowledgements The author would like to thank the anonymous reviewers for their constructive remarks, as well as for pointing towards the recently published and very closely related paper [16].
References
[1] A. G. Barto, S. J. Bradtke, and S. P. Singh. Learning to act using real-time dynamic programming. Artificial Intelligence, 72:81–138, 1995.
[2] D. P. Bertsekas and J. N. Tsitsiklis. Neuro-dynamic programming. Athena Scientific, 1996.
[3] A. W. Moore and C. G. Atkeson. Prioritized sweeping: reinforcement learning with less data and less real time. Machine Learning, 13:103–130, 1993.
[4] A. W. Moore and C. G. Atkeson. The parti-game algorithm for variable resolution reinforcement learning in multidimensional state-spaces. Machine Learning, 21, 1995.
[5] J. Peng and R. J. Williams. Efficient learning and planning within the Dyna framework. Adaptive Behavior, 1(4):437–454, 1993.
[6] Ph. Preux, Ch. Cassagnabère, S. Delepoulle, and J-Cl. Darcheville. A non supervised multi-reinforcement agents architecture to model the development of behavior of living organisms. In Proc. European Workshop on Reinforcement Learning, October 2001.
[7] G. A. Rummery. Problem Solving with Reinforcement Learning. PhD thesis, Cambridge University, 1995.
[8] G. A. Rummery and M. Niranjan. On-line Q-learning using connectionist systems. Technical Report TR 166, Cambridge University, Engineering Department, September 1994.
[9] R. S. Sutton. Learning to predict by the method of temporal difference. Machine Learning, 3:9–44, 1988.
[10] R. S. Sutton. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Proc. Seventh Int'l Conf. on Machine Learning, pages 216–224. Morgan Kaufmann, 1990.
[11] R. S. Sutton. Planning by incremental dynamic programming. In Proc. Eighth Int'l Conf. on Machine Learning, pages 353–357. Morgan Kaufmann, 1991.
[12] R. S. Sutton and A. G. Barto. Reinforcement learning: an introduction. MIT Press, 1998.
[13] G. Tesauro. Temporal difference learning and TD-Gammon. Communications of the ACM, 38:58–68, 1995.
[14] C. J. C. H. Watkins. Learning from delayed rewards. PhD thesis, King's College, Cambridge, UK, 1989.
[15] M. Wiering and J. Schmidhuber. HQ-Learning. Adaptive Behavior, 6(2):219–246, 1997.
[16] W. Zhu and S. Levinson. PQ-learning: an efficient robot learning method for intelligent behavior acquisition. In Proc. 7th Int'l Conf. on Intelligent Autonomous Systems, March 2002.
Transductive Confidence Machines for Pattern Recognition Kostas Proedrou, Ilia Nouretdinov, Volodya Vovk, and Alex Gammerman Department of Computer Science, Royal Holloway, University of London Egham, Surrey TW20 0EX, England {konstant,ilia,vovk,alex}@cs.rhul.ac.uk
Abstract. We propose a new algorithm for pattern recognition that outputs some measures of “reliability” for every prediction made, in contrast to the current algorithms that output “bare” predictions only. Our method uses a rule similar to that of nearest neighbours to infer predictions; thus its predictive performance is close to that of nearest neighbours, while the measures of confidence it outputs provide practically useful information for individual predictions.
1
Introduction
Current machine learning algorithms usually lack measures that can give an indication of how "good" the predictions are. Even when such measures are present they have certain disadvantages, such as:
– They cannot be applied to individual test examples.
– They often are not very useful in practice (PAC theory).
– They often rely on strong underlying assumptions (Bayesian methods).
In our case none of these disadvantages is present. Our only assumption is that data items are produced independently by the same stochastic mechanism (iid assumption), our measures of confidence are applicable to individual examples, and experimental results show that they produce good results for benchmark data sets (and so potentially are useful in practice). The iid assumption that we make is a very natural one for most applications of pattern recognition, as it only implies that
– all our examples are produced by the same underlying probability distribution and
– they are produced independently of each other; so the order in which they appear is not relevant.
Many algorithms have been proposed in the past, both in the Bayesian and in the PAC settings, that provide additional information on the "quality" of the predictions. Bayesian algorithms usually provide useful confidence values, but when the underlying distribution is not known these values are "misleading". Experiments
382
Kostas Proedrou et al.
in (Melluish et al., 2001) have shown that in Bayesian algorithms, when the underlying probability distribution of the examples is not known, the deviation from the expected percentage of misclassified examples is too large to give any practical meaning to the confidence values. For example, we expect that among all examples with a confidence value of 90% the percentage of those wrongly classified will be close to 10%. Bayesian algorithms, instead, can produce a much higher percentage of error at the above confidence level; in the experiments in (Melluish et al., 2001) this error is between 20% and 40%. PAC theory doesn't make any assumptions about the underlying probability distribution, but its results are often not useful in practice. To demonstrate the crudeness of the usual PAC bounds, we reproduce an example from (Nouretdinov, Vovk et al., 2001). Littlestone and Warmuth's theorem stated in (Cristianini et al., 2000) is one of the tightest results of PAC theory, but still usually does not give practically meaningful results. The theorem states that for a two-class Support Vector classifier f the probability of mistakes is
err(f) ≤ 1/(l − d) ( d ln(el/d) + ln(l/δ) )
with probability 1 − δ, where l is the training size and d is the number of Support Vectors. For the US Postal Service (USPS) database (described below and in Vapnik, 1998, Section 12.2), the error bound given by that theorem for one out of ten classifiers is close to
1/(7291 − 274) · 274 ln(7291e/274) ≈ 0.17,
even if we ignore the term ln(l/δ) (274 is the average number of support vectors for polynomials of degree 3, which give the best predictive performance; see Table 12.2 in Vapnik, 1998). Since there are ten classifiers, the upper bound on the total probability of mistakes becomes 1.7, which is not helpful at all. Our prediction method is based on the so-called algorithmic theory of randomness. A description of this theory is the subject of Section 2. Then, in Section 3, we describe our algorithm, and in the next section we give some experimental and comparison results for our algorithm as applied to the USPS and other data sets. Our algorithm follows the transductive approach, as for the classification of every new example it uses the whole training set to infer a rule for that particular example only. In contrast, in the inductive approach a general rule is derived from the training set and then applied to each test example. For this reason we shall call our algorithm the transductive confidence machine for nearest neighbours (TCM-NN). It is also possible to use the inductive approach to obtain confidence measures for our predictions (see e.g. (Papadopoulos et al., 2002) for an example of how to obtain confident predictions in the case of regression using the inductive approach).
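As a quick check of the figure quoted above, a couple of lines of Python (ours) reproduce the bound:

from math import log, e

l, d = 7291, 274                       # training size and number of support vectors
bound = d * log(e * l / d) / (l - d)   # Littlestone-Warmuth bound, ln(l/delta) term ignored
print(round(bound, 2))                 # 0.17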
2
Algorithmic Theory of Randomness
According to classical probability theory, if we toss a fair coin n times, all sequences in {0, 1}ⁿ will have the same probability 1/2ⁿ of occurring. We would be much more surprised, however, to see a sequence like 111111111 . . . 1 than a sequence like 011010100 . . . 1. The classical approach to probability theory can only give probabilities of different outcomes, but cannot say anything about the typicalness of sequences. Intuitively, sequences that don't seem to have any specific pattern in their elements would be more typical than sequences in which one can easily find regularities. An important result of the theory of algorithmic randomness is that there exists a universal method of finding regularities in data sequences. This result is due to Martin-Löf, who was the first to introduce the notion of a randomness test. A slightly modified definition of Martin-Löf's test¹ states that a function t : Z* → [0, 1] is a test for randomness with respect to a class of probability distributions Q in Z if
– for all n ∈ N, for all s ∈ [0, 1], and for all probability distributions P in Q,
Pⁿ{z ∈ Zⁿ : t(z) ≤ s} ≤ s,    (1)
– t is semi-computable from above.
Here Z is a space that possesses some computability properties; in our application, Z is the set of all possible examples. Every randomness test creates a series of nested subsets. Each subset is associated with a number s that bounds the value t(z) that the test takes. We can expect that every randomness test will detect only some of the non-random patterns occurring in each sequence. Martin-Löf proved, however, that we can merge all such tests to obtain a universal test for randomness². Such a test would be able to find all non-random patterns in a sequence of elements. Unfortunately, universal tests for randomness are not computable. Thus we have to approximate them using valid (in the sense of satisfying (1)) non-universal tests. In the next section, we will give a valid randomness test for finite sequences of real numbers produced under the iid assumption, and we shall use what we call a strangeness measure to map each example into a single real value in order to utilize that test for obtaining confident predictions using the nearest neighbours algorithm.
¹ The definition stated here is equivalent to Martin-Löf's original definition; the only difference is the use of the 'direct scale' (randomness values from 0 to 1), instead of the 'logarithmic scale' (randomness values from 0 to +∞).
² A proof of the existence of universal randomness tests can be found in (Li & Vitányi, 1997), Chapter 2.4.
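As an aside (ours, not part of the paper), the validity requirement (1) for the particular p-value test introduced in the next section — p = #{i : αᵢ ≥ α_new}/(m + 1), counting the new example itself — can be checked empirically with a short Monte Carlo simulation; under the iid assumption, P{p ≤ s} should never exceed s.

import random

def p_value(alphas, alpha_new):
    # one common reading of the p-value used in Section 3
    return sum(a >= alpha_new for a in alphas + [alpha_new]) / (len(alphas) + 1)

m, trials, s = 20, 100_000, 0.05
hits = sum(
    p_value([random.random() for _ in range(m)], random.random()) <= s
    for _ in range(trials)
)
print(hits / trials)   # stays at or below 0.05 (about 1/21 here)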
3
Nearest Neighbours and Randomness
3.1 Formal Setting of the Problem
We have a training set {(x_1, y_1), . . . , (x_m, y_m)} of m elements, where x_i = (x_i^1, . . . , x_i^n) is the set of feature values for example i and y_i is the classification of example i, taking values from a finite set of possible classifications, which we identify as {1, 2, . . . , c}. We also have a test set of r examples similar to the ones in the training set, only this time the actual classifications are withheld from us. Our goal is to assign to every test example one of the possible classifications. For every classification we also want to give some confidence measures, valid in the sense of (1), that will enable us to gain more insight into the predictions that we make.
3.2 Nearest Neighbours Transductive Confidence Machine
Let us denote by D_i^y the sorted sequence (in ascending order) of the distances of example i from the other examples with the same classification y. Also, D_ij^y will stand for the jth shortest distance in this sequence, and D_i^{-y} for the sorted sequence of distances to the examples with classification different from y. We assign to every example a measure called the individual strangeness measure. This measure defines the strangeness of the example in relation to the rest of the examples. In our case the strangeness measure for an example i with label y is defined as
α_i = ( Σ_{j=1}^{k} D_ij^y ) / ( Σ_{j=1}^{k} D_ij^{-y} ),    (2)
where k is the number of neighbours used. Thus, our measure for strangeness is the ratio of the sum of the k nearest distances from the same class to the sum of the k nearest distances from all other classes. This is a natural measure to use, as the strangeness of an example increases when the distance from the examples of the same class becomes bigger or when the distance from the other classes becomes smaller. Now let us return to algorithmic randomness theory. In (Melluish et al., 2001) it is proved that the function
p(α_new) = #{i : α_i ≥ α_new} / (m + 1),    (3)
where α_new is the strangeness value for the test example (assuming there is only one test example, or that the test examples are processed one at a time), is a valid randomness test in the iid case. The proof takes advantage of the fact that since our distribution is iid, all permutations of a sequence have the same probability of occurring. If we have a sequence α_1, . . . , α_m and a new element α_new is introduced, then α_new can take any place in the new (sorted) sequence with the same probability, as all permutations of the new sequence are equiprobable.
Thus, the probability that α_new is among the j largest is at most j/(m + 1). The values taken by the above randomness test will be called p-values. The p-value for the sequence {α_1, . . . , α_m, α_new}, where {α_1, . . . , α_m} are the strangeness measures of the training examples and α_new is the strangeness measure of a new test example with a possible classification assigned to it, is the value p(α_new). We can now give our algorithm.
TCM-NN Algorithm
  Choose k, the number of nearest neighbours to be used
  for i = 1 to m do
    Find and store D_i^y and D_i^{-y}
  end for
  Calculate alpha values for all training examples
  for i = 1 to r do
    Calculate the dist vector as the distances of the new example from all training examples
    for j = 1 to c do
      for every training example t classified as j do
        recalculate the alpha value of example t if D_tk^j > dist(t)
      end for
      for every training example t classified as non-j do
        recalculate the alpha value of example t if D_tk^{-j} > dist(t)
      end for
      Calculate the alpha value for the new example classified as j
      Calculate the p-value for the new example classified as j
    end for
    Predict the class with the largest p-value
    Output as confidence one minus the 2nd largest p-value
    Output as credibility the largest p-value
  end for
For each possible classification of a test example we construct the sequence of strangeness values of the training set augmented by the strangeness value of the new test example³. The prediction for each example is the classification that gives the most typical completion of the sequence of strangeness measures of the training set under the iid assumption. Each prediction is accompanied by two other measures. The most important of them is the confidence measure. Since, by equation (1), the second largest
³ Note that some of the strangeness values of the training set may be different for different test examples or different possible classifications assigned to a test example. In this sense our algorithm is transductive, as the training set is being reused for each test example.
p-value is an upper bound on the probability that the excluded classifications will be correct, the confidence measure indicates how likely the predicted classification is to be the correct one. The credibility measure gives the typicalness of the predicted classification. This value indicates how well suited the training set is for the classification of a particular test example. Low credibility would mean that the test example is strange with respect to the training examples, e.g. trying to classify a letter using a training set that consists of digits. In principle, we would want, for each prediction, all p-values to be close to 0, apart from the one that gives the correct classification, which we would want to be close to 1.
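A compact Python sketch of this prediction step (our illustration under the stated assumptions; it recomputes all strangeness values from scratch rather than updating them incrementally as in the algorithm above, and counts the new example itself in the p-value):

import numpy as np

def strangeness(d_same, d_other, k):
    # Eq. (2): k nearest same-class distances over k nearest other-class distances
    return np.sort(d_same)[:k].sum() / np.sort(d_other)[:k].sum()

def tcm_nn_predict(X_train, y_train, x_new, k=1):
    X_train, y_train = np.asarray(X_train, float), np.asarray(y_train)
    m = len(y_train)
    d_new = np.linalg.norm(X_train - x_new, axis=1)   # distances to the new example
    p_values = {}
    for label in np.unique(y_train):
        alphas = []
        for i in range(m):
            d_i = np.linalg.norm(X_train - X_train[i], axis=1)
            d_i[i] = np.inf                           # ignore the self-distance
            same = y_train == y_train[i]
            d_same, d_other = d_i[same], d_i[~same]
            # the new example, tentatively labelled `label`, is added on one side
            if y_train[i] == label:
                d_same = np.append(d_same, d_new[i])
            else:
                d_other = np.append(d_other, d_new[i])
            alphas.append(strangeness(d_same, d_other, k))
        a_new = strangeness(d_new[y_train == label], d_new[y_train != label], k)
        p_values[label] = (np.sum(np.array(alphas) >= a_new) + 1) / (m + 1)
    ranked = sorted(p_values, key=p_values.get, reverse=True)
    prediction, runner_up = ranked[0], ranked[1]
    return prediction, 1 - p_values[runner_up], p_values[prediction]

Following the text above, the second returned value is the confidence (one minus the second largest p-value) and the third is the credibility (the largest p-value).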
4
Experimental Results
The standard comparison criterion in classification problems is the percentage of incorrectly classified examples. Here we shall also use a second one. We fix a specific significance level δ, say 1%, and we accept as possible classifications the ones whose p-value is above that level. In this way we can determine how many test examples can be classified with a confidence of at least 1 − δ. We have tested our algorithm on the following datasets:
– USPS. It consists of handwritten digits from 0 to 9. The training set consists of 7291 examples and the test set of 2007 examples. Each example has 256 attributes (pixels) that describe the given digit. All data were pre-processed as follows. As any image from the USPS data set was represented as 256 numbers (x_1, . . . , x_256), we replaced it by (y_1, . . . , y_256), where
y_i = (x_i − S)/D,   S = ( Σ_{i=1}^{256} x_i ) / 256,   D = sqrt( Σ_{i=1}^{256} (x_i − S)² / 256 ).
The aim of this preprocessing is to normalise the level of brightness. After the preprocessing, the mean value of each image becomes 0 and the standard deviation becomes 1.
– Satellite. These are 6435 satellite images (4435 for training and 2000 for testing). The classification task is to identify between 6 different soil conditions that are represented in the images.
– Shuttle. The classes of this dataset are the appropriate actions that should be taken under certain conditions (described by 9 attributes) in a space shuttle. There are 43500 training examples, 14500 test examples, and 7 different classes.
– Segment. 2310 outdoor images described by 9 attributes each. The classifications are: brick-face, sky, foliage, cement, window, path, grass.
The last three datasets are used in the Statlog project (King et al., 1995). For comparison purposes we followed the same testing procedure. For the satellite
Table 1. Comparison of the error rate of TCM-NN with other learning algorithms

Dataset      C4.5   CART   NB     k-nn   CASTLE   Discrim   Neural   TCM
Satellite    15.1   13.8   30.7   9.4    19.4     17.1      13.9     10.6
Shuttle      0.04          4.55   0.44   3.77     4.83      4.9      0.11
Segment      4      4      26     7.7    11.2     11.6               3.68
Table 2. Comparison of the error percentage of TCM-NN with other algorithms on the USPS dataset

Learning Algorithms   Nearest Neighbours   TCM-NN   Support Vector Machine   Five-layer Neural Network
% of error            4.29%                4.29%    4.04%                    5.1%
and shuttle datasets we used the same training and test set, while for the segment one we used 10-fold cross-validation. In Table 1 we compare the performance of our algorithm⁴ with 7 others, all taken from the Statlog project, on the satellite, shuttle, and segment datasets. The algorithms are two decision tree algorithms, C4.5 and CART, the Naive Bayes classifier (NB), the k-nearest neighbours algorithm, a Bayesian network algorithm (CASTLE), a linear discriminant algorithm (Discrim), and a backpropagation neural network. Two of the values from Table 1 are missing as these results are not mentioned in (King et al., 1995). Table 2 contains experimental and comparison results on the USPS dataset. The error percentage for the Five Layer Neural Network was obtained from (Vapnik, 1998), while for the other three algorithms the results were produced by the authors on the same training and test set. It is clear from both tables that TCM-NN's performance is almost as good as the performance of the best algorithms for all datasets used. Next, in Figure 1 we compare the error rate of TCM with that of the original nearest neighbours algorithm on the USPS dataset using a different number of neighbours each time. Though the performance of both algorithms decreases as the number of neighbours increases, it seems that TCM is more robust, as its error rate increases much more slowly. When the second comparison criterion is used, our algorithm makes 'region' predictions (outputs a set of classifications) instead of point predictions. For a specified significance level δ the correct classification will be in the predicted set of classifications with a probability of at least 1 − δ, since the set of rejected classifications can occur with probability of at most δ. In Figure 2 we demonstrate this relationship between classification error and confidence level using 50 random instances of the USPS dataset.
⁴ We normally use one nearest neighbour for testing TCM-NN. When this is not the case, the number of neighbours used will be stated explicitly.
Fig. 1. Error percentage of TCM-NN and NN on the USPS dataset using 1–20 nearest neighbours

In Table 3 we detail the results of ‘region’ predictions for significance levels of 1% and 5%, giving the percentage of examples for which the predicted set contains one label, more than one label, and no labels. For the shuttle and USPS datasets we predict a set containing a single classification for 99.17% and 94.77% of the examples respectively with great certainty (confidence of 99% or more). We can also note that as the overall error rate increases, the number of examples that can be given a single classification decreases. Since greater error percentages mean more difficult classification problems, it is natural that more examples will be assigned more than one possible classification. Finally, the last column in Table 3 gives the percentage of examples of the ‘one class’ column that were correctly classified. These percentages are very close to, and in most cases higher than, the corresponding confidence levels, thus indicating the practical usefulness of TCM’s confidence measure (note that choosing a smaller significance level does not necessarily guarantee a greater rate of success here, as we only consider the examples that are assigned one classification; the guarantee holds only when we consider all test examples, see Figure 2).
5 Conclusion
The TCM-NN algorithm presented here has the advantage of giving probabilistic measures for each individual prediction that we make. In this way we gain more insight into how likely a correct classification is for an example, given a specific training set. Furthermore, the percentage of errors of TCM-NN seems to be as good as that of other learning algorithms.
Fig. 2. Percentage of correct ‘region’ predictions for different confidence levels using 50 random instances of the USPS dataset
Table 3. TCM-NN Performance. The column "one class" gives the number of examples for which a confident prediction is made, the column "≥ 2 classes" gives the number of examples for which two or more possible classifications were not excluded at the given significance level, and the column "no class" gives the number of examples for which all possible classifications were excluded at the given significance level. The last column shows the percentage of correct predictions for the examples we could confidently predict at each significance level.

The scheme we have proposed can be used on top of any classification algorithm, not only nearest neighbours, by defining the individual strangeness measure (2) in a different way. For example, the method can be applied to the Support Vector Machine algorithm using as a strangeness measure the distance of each example from the hyperplane that separates the different classes. Finally, as an approximation to the universal test defined in Section 2 we have used the statistical p-test (3). It remains an open problem, though, whether one can find valid tests for randomness (under the general iid assumption) that are better approximations to the universal tests for randomness than the one used here.
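To illustrate how the transductive scheme can be wrapped around a pluggable strangeness measure, the following sketch (ours, not the authors' implementation) computes a p-value for each candidate label of a test example and returns the 'region' prediction at significance level δ. The strangeness function shown is the usual nearest-neighbour ratio; for an SVM it could instead be a function of the distance to the separating hyperplane, as suggested above.

```python
import numpy as np

def knn_strangeness(X, y, i, k=1):
    """Ratio of the distance to the k nearest same-class examples to the
    distance to the k nearest other-class examples (a common choice)."""
    d = np.linalg.norm(X - X[i], axis=1)
    d[i] = np.inf
    same = np.sort(d[y == y[i]])[:k].sum()
    other = np.sort(d[y != y[i]])[:k].sum()
    return same / other if other > 0 else np.inf

def tcm_region_prediction(X_train, y_train, x_test, labels, delta=0.01,
                          strangeness=knn_strangeness):
    """Return (p_values, region): the labels whose p-value exceeds delta."""
    p_values = {}
    for lab in labels:
        X = np.vstack([X_train, x_test])
        y = np.append(y_train, lab)
        alphas = np.array([strangeness(X, y, i) for i in range(len(y))])
        # fraction of examples at least as strange as the test example
        p_values[lab] = np.mean(alphas >= alphas[-1])
    region = [lab for lab in labels if p_values[lab] > delta]
    return p_values, region
```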
Acknowledgements This work was partially supported by EPSRC through grants GR/L35812 (“Support Vector and Bayesian learning algorithms”), GR/M14937 (“Predictive complexity: recursion-theoretic variants”), and GR/M16856 (“Comparison of Support Vector Machine and Minimum Message Length methods for induction and prediction”). We are grateful to the Program Committee for useful comments.
References
1. Cristianini, N., & Shawe-Taylor, J. (2000). An Introduction to Support Vector Machines and Other Kernel-based Methods. Cambridge: Cambridge University Press.
2. Fraser, D. A. (1976). Non-parametric methods in statistics. New York: Wiley.
3. King, R. D., Feng, C., & Sutherland, A. (1995). Statlog: Comparison of classification algorithms on large real-world problems. Applied Artificial Intelligence, 9(3), pp. 259–287.
4. Li, M., & Vitányi, P. (1997). An introduction to Kolmogorov complexity and its applications (2nd edn.). New York: Springer.
5. Melluish, T., Saunders, C., Nouretdinov, I., & Vovk, V. (2001). Comparing the Bayes and typicalness frameworks. In Proceedings of ECML'2001.
6. Nouretdinov, I., Melluish, T., & Vovk, V. (2001). Ridge Regression Confidence Machine. In Proceedings of the 18th International Conference on Machine Learning.
7. Nouretdinov, I., Vovk, V., Vyugin, M., & Gammerman, A. (2001). Pattern recognition and density estimation under the general iid assumption. In Proceedings of the Fourteenth Annual Conference on Computational Learning Theory and Fifth European Conference on Computational Learning Theory.
8. Papadopoulos, H., Proedrou, K., Vovk, V., & Gammerman, A. (2002). Inductive Confidence Machines for Regression. In Proceedings of ECML'2002.
9. Vapnik, V. (1998). Statistical Learning Theory. New York: Wiley.
10. Vovk, V., & Gammerman, A. (2002). Algorithmic Theory of Randomness and its Computer Applications. Manuscript.
Characterizing Markov Decision Processes
Bohdana Ratitch and Doina Precup
McGill University, Montreal, Canada
{bohdana,dprecup}@cs.mcgill.ca
http://www.cs.mcgill.ca/{~sonce,~dprecup}

Abstract. Problem characteristics often have a significant influence on the difficulty of solving optimization problems. In this paper, we propose attributes for characterizing Markov Decision Processes (MDPs), and discuss how they affect the performance of reinforcement learning algorithms that use function approximation. The attributes measure mainly the amount of randomness in the environment. Their values can be calculated from the MDP model or estimated on-line. We show empirically that two of the proposed attributes have a statistically significant effect on the quality of learning. We discuss how measurements of the proposed MDP attributes can be used to facilitate the design of reinforcement learning systems.
1 Introduction
Reinforcement learning (RL) [17] is a general approach for learning from interaction with a stochastic, unknown environment. RL has proven quite successful in handling large, realistic domains, by using function approximation techniques. However, the properties of RL algorithms using function approximation (FA) are still not fully understood. While convergence theorems exist for some value-based RL algorithms using state aggregation or linear function approximation (e.g., [1, 16]), examples of divergence of some RL methods combined with certain function approximation architectures also exist [1]. It is not known in general which combinations of RL and FA methods are guaranteed to produce stable or unstable behavior. Moreover, when unstable behavior occurs, it is not clear if it is a rare event, pertinent mostly to maliciously engineered problems, or if instability is a real impediment to most practical applications. Most efforts for analyzing RL with FA assume that the problem to be solved is a general stochastic Markov Decision Process (MDP), while very little research has been devoted to defining or studying sub-classes of MDPs. This generality of the RL approach makes it very appealing. This is in contrast with prior research in combinatorial optimization (e.g., [13, 6]), which showed that the performance of approximate optimization algorithms can be drastically affected by characteristics of the problem at hand. For instance, the performance of local search algorithms is affected by characteristics of the search space for a given problem instance, such as the number of local optima, the sizes of the regions of attraction, and the diameter of the search space. Recent research (e.g., [7, 10]) has shown that such problem characteristics can be used to predict
the behavior of local search algorithms, and improve algorithm selection. In this paper, we show that a similar effect is present in algorithms for learning to control MDPs: the performance of RL algorithms with function approximation is heavily influenced by characteristics of the MDP. Our focus is to identify relevant characteristics of MDPs, propose ways of measuring them, and determine their influence on the quality of the solution found by value-based RL algorithms. Prior theoretical [14] and empirical results [3, 9] suggest that the amount of stochasticity in an MDP can influence the complexity of finding an optimal policy. We propose quantitative attributes for measuring the amount of randomness in an MDP (e.g., entropy of state transitions, variance of immediate rewards and controllability), and for characterizing the structure of the MDP. These attributes can be computed exactly for MDPs with small discrete state spaces, and they can be approximated for MDPs with large or continuous state spaces, using samples from the environment. Our research builds on the work of Kirman [9], who studied the influence of stochasticity on dynamic programming algorithms. In this paper, we redefine some of his attributes and propose new ones. We treat both discrete and continuous state spaces, while Kirman focused on discrete problems. We also focus on on-line, incremental RL algorithms with FA, rather than off-line dynamic programming. We discuss the potential for using MDP attributes for choosing RL-FA algorithms suited for the task at hand, and for automatically setting user-tunable parameters of the algorithms (e.g., exploration rate, learning rate, and eligibility traces). At present, both the choice of an algorithm and the parameter setting are usually done by a very time-consuming trial-and-error process. We believe that measuring MDP attributes can help automate this process. We present an empirical study focused on the effect of two attributes, state transition entropy and controllability, on the quality of the behavior learned using RL with FA. The results show that these MDP characteristics have a statistically significant effect on the quality of the learned policies. The experiments were performed on randomly generated MDPs with continuous state spaces, as well as randomized versions of the Mountain Car task [17]. The paper is organized as follows. In Sect. 2, we introduce basic MDP notation and relevant reinforcement learning issues. In Sect. 3, we introduce several domain-independent attributes by which we propose to characterize MDPs, and give intuitions regarding their potential influence on learning algorithms. Sect. 4 contains the details of our empirical study. In Sect. 5 we summarize the contributions of this paper and discuss future work.
2 Markov Decision Processes
Markov Decision Processes (MDPs) are a standard, general formalism for modeling stochastic, sequential decision problems [15]. At every discrete time step t, the environment is in some state s_t ∈ S, where the state space S may be finite or infinite. The agent perceives s_t and performs an action a_t from a discrete, finite action set A. One time step later, the agent receives a real-valued numerical
reward r_{t+1} and the environment transitions to a new state, s_{t+1}. In general, both the rewards and the state transitions are stochastic. The Markov property means that the next state s_{t+1} and the immediate reward r_{t+1} depend only on the current state and action, s_t, a_t. The model of the MDP consists of the transition probabilities P^a_{ss'} and the expected values of the immediate rewards R^a_{ss'}, for all s, a, s'. The goal of the agent is to find a policy, a way of behaving, that maximizes the cumulative reward over time. A policy is a mapping π : S × A → [0, 1], where π(s, a) denotes the probability that the agent takes action a when the environment is in state s. The long-term reward received by the agent is called the return, and is defined as an additive function of the reward sequence. For instance, the discounted return is defined as $\sum_{t=0}^{\infty} \gamma^t r_{t+1}$, where γ ∈ [0, 1). Many RL algorithms estimate value functions, which are defined with respect to policies and reflect the expected value of the return. The action-value function of policy π represents the expected discounted return obtained when starting from state s, taking a, and henceforth following π: $Q^{\pi}(s, a) = E_{\pi}\{\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \mid s_t = s, a_t = a\}$, for all s ∈ S, a ∈ A. The optimal action-value function is $Q^*(s, a) = \max_{\pi} Q^{\pi}(s, a)$, for all s ∈ S, a ∈ A. An optimal policy is one for which this maximum is attained. If the optimal action-value function is learned, then an optimal policy can be implicitly derived as a greedy one with respect to that value function. A policy is called greedy with respect to some action-value function Q(s, a) if in each state it selects one of the actions that have the maximum value: π(s, a) > 0 iff $a \in \arg\max_{a' \in A} Q(s, a')$. Most RL algorithms iteratively improve estimates of value functions based on samples of transitions obtained on-line. For example, at each time step t, the tabular Sarsa learning algorithm [17] updates the value of the current state-action pair (s_t, a_t) based on the observed reward r_{t+1} and the next state-action pair (s_{t+1}, a_{t+1}), as:

$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha_t [r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)], \quad \alpha_t \in (0, 1)$   (1)

where (s_t, a_t) plays the role of the input and r_{t+1} + γQ(s_{t+1}, a_{t+1}) the role of the target in the function approximation view.
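A minimal tabular sketch of update (1) (ours; variable names are illustrative, and the learning rate and discount values are arbitrary):

```python
from collections import defaultdict

Q = defaultdict(float)  # tabular action-value estimates Q[(s, a)], initialised to 0

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.95):
    """One Sarsa step (equation (1)): move Q(s, a) towards the target
    r + gamma * Q(s', a'); with function approximation, (s, a) is the input
    and the bracketed quantity is the regression target."""
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```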
Unlike supervised learning, RL is a trial-and-error approach. The learning agent has to find which actions are the best without the help of a teacher, by trying them out. This process is called exploration. The quality and the speed of learning with finite data can depend dramatically on the agent's exploration strategy. In MDPs with large or continuous state spaces, value functions can be represented by function approximators (e.g., CMACs or neural networks [17]). In that case, RL methods sample training data for the approximator, which consist of inputs (e.g., state-action pairs) and targets (e.g., estimates of the action-value function). Equation (1) shows an example of inputs and targets for the SARSA algorithm. The approximator generalizes the value estimates gathered for a subset of the state-action pairs to the entire S × A space. The interplay between the RL algorithm and the function approximator proceeds in an iterative, interleaved manner, as shown in Fig. 1. Function approximation in the context of RL is harder than in the classical, supervised learning setting.
Fig. 1. Interaction of RL and FA learning: the RL algorithm supplies training data (inputs and targets, i.e. state-action values for a subset of S × A) to the FA, which returns a generalization of the value function Q(s, a) to the entire S × A space.

Table 1. Factors contributing to the noise in the target of the FA training data

Factor 1: Stochastic immediate rewards                     ⟹  r_{t1+1} ≠ r_{t2+1}
Factor 2: Stochastic transitions: s_{t1+1} ≠ s_{t2+1}      ⟹  r_{t1+1} ≠ r_{t2+1}
Factor 3: Stochastic transitions: s_{t1+1} ≠ s_{t2+1}      ⟹  Q(s_{t1+1}, a_{t1+1}) ≠ Q(s_{t2+1}, a_{t2+1})
Factor 4: Different action choices: a_{t1+1} ≠ a_{t2+1}    ⟹  Q(s_{t1+1}, a_{t1+1}) ≠ Q(s_{t2+1}, a_{t2+1})
In supervised learning, many techniques assume a static training set. In RL, on the other hand, the estimates of the value function (which are the targets for the FA) evolve and improve gradually. Hence, the FA's target function appears to be non-stationary. Moreover, the stochasticity in the environment and the exploration process may introduce variability into the training data (i.e., variability in the targets for a fixed input, which we will call "noise" from now on). To identify the potential sources of this noise, let us examine (1) again. Suppose that the same state-action pair (ŝ, â) is encountered at time steps t1 and t2 during learning, and the FA is presented with the corresponding targets [r_{t1+1} + γQ(s_{t1+1}, a_{t1+1})] and [r_{t2+1} + γQ(s_{t2+1}, a_{t2+1})]. Table 1 shows four factors that can contribute to the noise. Note that these factors arise from the particular structure of the estimated value functions, from the randomized nature of the RL algorithm and, most of all, from the inherent randomness in the MDP. We will now introduce several attributes that help differentiate these sources of randomness, and quantify their effect.
3 MDP Attributes
We present six domain-independent attributes that can be used to quantitatively describe an MDP. For simplicity, we define them assuming discrete state and action spaces and availability of the MDP model. Later, we discuss how these assumptions can be lifted. State transition entropy (STE) measures the amount of stochasticity due to the environment's state dynamics. Let O_{s,a} denote a random variable representing the outcome (next state) of the transition from state s when the agent performs action a. This variable takes values in S. We use the standard information-theoretic definition of entropy to measure the STE for a state-action pair (s, a) (as defined in [9]):

$STE(s, a) = H(O_{s,a}) = -\sum_{s' \in S} P^a_{ss'} \log P^a_{ss'}$   (2)
A high value of STE(s, a) means that there are many possible next states s' (with P^a_{ss'} ≠ 0) which have about the same transition probabilities. In an MDP with high STE values, the agent is more likely to encounter many different states, even by performing the same action a in some state s. In this case, state space exploration happens naturally to some extent, regardless of the exploration strategy used by the agent. Since extensive exploration is essential for RL algorithms, a high STE may be conducive to good performance of an RL agent. At the same time, though, a high STE will increase the variability of the state transitions. This suggests that the noise due to Factors 2 and 3 in Table 1 may increase, which can be detrimental for learning. The controllability (C) of a state s is a normalized measure of the information gain when predicting the next state based on knowledge of the action taken, as opposed to making the prediction before an action is chosen (note that a similar, but not identical, attribute is used by Kirman [9]). Let O_s denote a random variable (with values from S) representing the outcome of a uniformly random action in state s. Let A_s denote a random variable representing the action taken in state s. We consider A_s to be chosen from a uniform distribution. Now, given the value of A_s, the information gain is the reduction in the entropy of O_s: H(O_s) − H(O_s | A_s), where

$H(O_s) = -\sum_{s' \in S} P_{ss'} \log P_{ss'} = -\sum_{s' \in S} \left(\frac{1}{|A|}\sum_{a \in A} P^a_{ss'}\right) \log\left(\frac{1}{|A|}\sum_{a \in A} P^a_{ss'}\right)$

$H(O_s \mid A_s) = -\frac{1}{|A|}\sum_{a \in A}\sum_{s' \in S} P^a_{ss'} \log P^a_{ss'}$

The controllability in state s is defined as:

$C(s) = \frac{H(O_s) - H(O_s \mid A_s)}{H(O_s)}$   (3)

If H(O_s) = 0 (deterministic transitions for all actions), then C(s) is defined to be 1. It may also be useful (see Sect. 5) to measure the forward controllability (FC) of a state-action pair, which is the expected controllability of the next state:

$FC(s, a) = \sum_{s' \in S} P^a_{ss'} C(s')$   (4)
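When the transition probabilities P^a_{ss'} are available, equations (2)-(4) can be computed directly. A small sketch (ours), assuming the model is stored as an array P[a, s, s']:

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def ste(P, s, a):
    """State transition entropy STE(s, a), equation (2)."""
    return entropy(P[a, s])

def controllability(P, s):
    """C(s), equation (3), assuming a uniform distribution over actions."""
    h_os = entropy(P[:, s].mean(axis=0))                                    # H(O_s)
    h_os_given_as = np.mean([entropy(P[a, s]) for a in range(P.shape[0])])  # H(O_s | A_s)
    return 1.0 if h_os == 0 else (h_os - h_os_given_as) / h_os

def forward_controllability(P, s, a):
    """FC(s, a), equation (4): expected controllability of the next state."""
    return sum(P[a, s, s2] * controllability(P, s2) for s2 in range(P.shape[1]))
```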
High controllability means that the agent can exercise a lot of control over which trajectories (sequences of states) it goes through, by choosing appropriate actions. Having such control enables the agent to reap higher returns in environments where some trajectories are more profitable than others. Similar to the STE, the level of controllability in an MDP also influences the potential exploration of the state space. Because in a highly controllable state s the outcomes of
different actions are quite different, the agent can choose what areas to explore. This can be advantageous for the RL algorithm, but may be detrimental for function approximation, because of the noise due to Factor 4 in Table 1. The variance of immediate rewards, VIR(s, a), characterizes the amount of stochasticity in the immediate reward signal. High VIR causes an increase in the noise due to Factor 1 in Table 1, thus making learning potentially more difficult. The risk factor (RF) measures the likelihood of getting a low reward after the agent performs a uniformly random action. This measure is important if the agent has to perform as well as possible during learning. Let r_s^{a_u} denote the reward observed on a transition from a state s after performing a uniformly random action, a_u. The risk factor in state s is defined as:

$RF(s) = Pr[\, r_s^{a_u} < E\{r_s^{a_u}\} - \epsilon(s) \,]$   (5)
where ε(s) is a positive number, possibly dependent on the state, which quantifies the tolerance to lower-than-usual rewards. Note that low immediate rewards do not necessarily mean low long-term returns. Nevertheless, knowledge of RF may help minimize losses, especially during the early stages of learning. The final two attributes are meant to capture the structure in the state transitions. The transition distance, TD(s, a), measures the expected distance between state s and its successor states, according to some distance metric on S. We are currently investigating what distance metric would be appropriate. One candidate is a (weighted) Euclidean distance, but this is not adequate for all environments. The transition distance may affect RL when using global function approximators, such as neural networks. In the case of incremental learning with global approximators, training on two consecutive inputs that are very different may create mutual interference of the parameter updates and impede learning. Hence, such MDPs may benefit from using local approximators (e.g. Radial Basis Networks). The transition variability, TV(s, a), measures the average distance between possible next states. With a good TV metric, a high value of TV(s, a) would indicate that the next states can have very different values, and hence introduce noise due to Factor 3 in Table 1. In continuous state spaces, the attributes are defined by using integrals instead of sums. If the model of the process is not available, these attributes can be estimated from the observed transitions, both for discrete and continuous state spaces. The attributes can be measured locally (for each state or state-action pair) or globally, as an average over the entire MDP. Local measures are most useful for tuning the parameters of the RL algorithm. For example, in Sect. 5, we suggest how they can be used to adapt the exploration strategy. In the experiments presented in the next section, we use global measures: sample averages of the attribute values, computed under the assumption that all states and actions are equally probable. This choice of the sampling distribution is motivated by our intention to characterize the MDP before any learning takes place. Note, however, that under some circumstances, weighted averages might be of more interest (e.g., if we want to compute these attributes based on behavior generated by a restricted class of policies). If the sample averages were estimated
on-line, during learning, they would naturally reflect the state distribution that the agent is actually encountering.
4 Empirical Study
In this section we focus on studying the effect of two of the proposed attributes, state-transition entropy (STE) and controllability (C), on the quality of the policies learned by an RL algorithm using linear function approximation. The work of Kirman suggests that these attributes influence the performance of off-line dynamic programming (DP) algorithms, to which RL approaches are related. Hence, it seems natural to start by studying these two attributes. Our experiments with the other attributes are still in their preliminary stages, and the results will be reported in future work. In Sect. 4.1 we present the application domains used in the experiments. The experimental details and the results are described in Sect. 4.2.
4.1 Tasks
In order to study empirically the effect of MDP characteristics on learning, it is desirable to consider a wide range of STE and C values, and to vary these two attributes independently. Unfortunately, the main collection of currently popular RL tasks (the reinforcement learning repository at the University of Massachusetts, Amherst, www-anw.cs.umass.edu/rlr) contains only a handful of domains, and the continuous tasks are mostly deterministic. So our experiments were performed on artificial random MDPs, as well as randomized versions of the well-known Mountain-Car task. Random discrete MDPs (RMDPs) have already been used for experimental studies with tabular RL algorithms. In this paper, we use as a starting point a design suggested by Sutton and Kautz for discrete, enumerated state spaces (www.cs.umass.edu/~rich/RandomMDPs.html), but we extend it in order to allow feature-vector representations of the states. Fig. 2 shows how transitions are performed in an RMDP. The left panel shows the case of a discrete, enumerated state space. For each state-action pair (s, a) the next state s' is selected from a set of b possible next states s'_j, according to the probability distribution P^a_{s,s'_j}, j = 1, ..., b. The reward is then sampled from a normal distribution with mean R^a_{s,s'} and variance V^a_{s,s'}. Such MDPs are easy to generate automatically. Our RMDPs are a straightforward extension of this design. A state is described by a feature vector v_1, ..., v_n, with v_i ∈ [0, 1]. State transitions are governed by a mixture of b multivariate normal distributions (Gaussians) N(µ_j, σ_j), with means µ_j = (µ^1_j, ..., µ^n_j) and variances σ_j = (σ^1_j, ..., σ^n_j). The means µ^i_j = M^i_j(s, a) and variances σ^i_j = V^i_j(s, a) are functions of the current state-action pair (s, a). Sampling from this mixture is performed hierarchically: first one of the b Gaussian components is selected according to the probabilities P_j(s, a), j = 1, ..., b, then the next state s' is sampled from the selected component N(µ_j, σ_j). (By setting the variances σ^i_j to zero and using discrete feature values, one can obtain RMDPs with discrete state spaces.)
Fig. 2. Random MDPs

Once the next state s' is determined, the reward for the transition is sampled from a normal distribution with mean R(s, a, s') and variance V(s, a, s'). The process may terminate at any time step according to a probability distribution P(s'). Mixtures of Gaussians are a natural and non-restrictive choice for modeling multi-variate distributions. Of course, one can use other basis distributions as well. We designed a generator for RMDPs of this form (the C++ implementation of the RMDP generator and the MDPs used in these experiments will be available from www.cs.mcgill.ca/~sonce/), which uses as input a textual specification of the number of state variables, actions, branching factor, and also some constraints on the functions mentioned above. In these experiments we used piecewise constant functions to represent P_j(s, a), M^i_j(s, a), V^i_j(s, a), R(s, a, s') and V(s, a, s'), but this choice could be made more sophisticated. Mountain-Car [17] is a very well-studied minimum-time-to-goal task. The agent has to drive a car up a steep hill by using three actions: full throttle forward, full throttle reverse, or no throttle. The engine is not sufficiently strong to drive up the hill directly, so the agent has to build up sufficient energy first, by accelerating away from the goal. The state is described by two continuous state variables, the current position and velocity of the car. The rewards are -1 for every time step, until the goal is reached. If the goal has not been reached after 1000 time steps, the episode is terminated. We introduced noise into the classical version of the Mountain Car task by perturbing either the acceleration or the car position. In the first case, the action that corresponds to no throttle remained unaffected, while the other two actions were perturbed by zero-mean Gaussian noise. This is done by adding a random number to the acceleration of +1 or -1. The new value of the acceleration is then applied for one time step. In the second case, the car position was perturbed on every time step by zero-mean Gaussian noise.
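Returning to the RMDPs, the following sketch (ours) illustrates the hierarchical sampling procedure described above; the parameter functions Pj, M, V, R and RV are placeholders standing in for the piecewise constant functions produced by the generator, and the clipping step is an assumption, since the paper does not say how boundary effects are handled.

```python
import numpy as np

rng = np.random.default_rng(0)

def rmdp_step(s, a, Pj, M, V, R, RV):
    """Sample (s', r) for a continuous RMDP with b Gaussian components.

    Pj(s, a) -> length-b array of component probabilities; M(s, a), V(s, a) ->
    (b, n) arrays of per-component means and variances; R(s, a, s2), RV(s, a, s2)
    -> mean and variance of the reward distribution.
    """
    probs = Pj(s, a)
    j = rng.choice(len(probs), p=probs)                    # pick component N_j
    s_next = rng.normal(M(s, a)[j], np.sqrt(V(s, a)[j]))   # one draw per feature
    s_next = np.clip(s_next, 0.0, 1.0)                     # keep features in [0, 1]
    r = rng.normal(R(s, a, s_next), np.sqrt(RV(s, a, s_next)))
    return s_next, r
```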
4.2 Effect of Entropy and Controllability on Learning
We performed a set of experiments to test the hypothesis that STE and C have a statistically significant effect on the quality of the policies learned with finite 3 4
amounts of data. We used a benchmark suite consisting of 50 RMDPs and 10 randomized versions of the Mountain-Car task (RMC). All the RMDPs had two state variables and two actions. In order to estimate the average STE and C values for these tasks, each state variable was discretized into 10 intervals; then, 100 states were chosen uniformly (independently of the discretization) and 150 samples of state transitions were collected for each of these states (note that by sampling states uniformly, we may get more than one state in one bin of the discretization, or no state in another bin). The STE and C values were estimated for each of these states (and each action, in the case of STE) using counts on these samples. Then the average value for each MDP was computed assuming a uniform state distribution (for the purpose of estimating STE and C beforehand, the uniform state distribution and the distribution generated by a uniformly random policy are the only natural choices; preliminary experiments indicate that the results under these two distributions are very similar). The RMDPs formed 10 groups with different combinations of average STE and C values, as shown in the left panel of Fig. 3. Note that it is not possible to obtain a complete factorial experimental design (where all fixed levels of one attribute are completely crossed with all fixed levels of the other attribute), because the upper limit on C is dependent on STE. However, the RMDP generator allows us to generate any STE and C combination in the lower left part of the graph, up to a limiting curve. For the purpose of this experiment, we chose attribute values distributed such that we can still study the effect of one attribute while keeping the other attribute fixed. Note that each group of RMDPs contains environments that have similar STE and C values, but which are obtained with different parameter settings for the RMDP generator. The RMDPs within each group are in fact quite different in terms of state transition structure and rewards. The average values of STE and C for the RMC tasks are shown in the right panel of Fig. 3. We used two tasks with acceleration noise (with variances 0.08 and 0.35 respectively) and eight tasks with position noise (with variances 5·10^{-5}, 9·10^{-5}, 17·10^{-5}, 38·10^{-5}, 8·10^{-4}, 3·10^{-3}, 9·10^{-3} and 15·10^{-3}). These tasks were chosen in order to give a good spread of the STE and C values. Note that for the RMC tasks, STE and C are anti-correlated. We used SARSA as the RL algorithm and CMACs as function approximators [17] to represent the action-value functions. The agent followed an ε-greedy exploration strategy with ε = 0.01. For all tasks, the CMAC had five 9×9 tilings, each offset by a random fraction of a tile width. For the RMDPs, each parameter w of the CMAC architecture had an associated learning rate which followed a decreasing schedule $\alpha_t = \frac{1.25}{0.5 + n_t}$, where n_t is the number of updates to w performed by time step t. For the RMCs, we used a constant learning rate of α = 0.0625. These choices were made for each set of tasks (RMDPs and RMCs) based on preliminary experiments. We chose settings that seemed acceptable for all tasks in each set, without careful tuning for each MDP.
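A sketch (ours) of the sample-based estimation protocol just described: transitions are sampled for a set of probed states, next states are discretized, and STE and C are computed from the resulting counts. The function names and data layout are assumptions.

```python
import numpy as np

def estimate_ste_and_c(sample_next_state, states, actions, n_samples=150, n_bins=10):
    """Estimate STE(s, a) and C(s) from sampled transitions, discretizing each
    next-state feature (assumed to lie in [0, 1]) into n_bins intervals."""
    def entropy(counts):
        p = counts[counts > 0] / counts.sum()
        return -np.sum(p * np.log(p))

    ste, ctrl = {}, {}
    for s in map(tuple, states):
        pooled = {}
        for a in actions:
            counts = {}
            for _ in range(n_samples):
                s2 = np.asarray(sample_next_state(np.array(s), a))
                cell = tuple(np.minimum((s2 * n_bins).astype(int), n_bins - 1))
                counts[cell] = counts.get(cell, 0) + 1
                pooled[cell] = pooled.get(cell, 0) + 1
            ste[(s, a)] = entropy(np.array(list(counts.values()), dtype=float))
        h_os = entropy(np.array(list(pooled.values()), dtype=float))   # H(O_s)
        h_os_a = np.mean([ste[(s, a)] for a in actions])               # H(O_s | A_s)
        ctrl[s] = 1.0 if h_os == 0 else (h_os - h_os_a) / h_os
    return ste, ctrl
```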
Fig. 3. Values of the Attributes for MDPs in the Benchmark Suite (STE plotted against controllability C, for the RMDPs in the left panel and the RMC tasks with position and acceleration noise in the right panel)
For each task in the benchmark suite, we performed 30 learning runs. Each run consisted of 10000 episodes, where each episode started in a uniformly chosen state. For the RMDPs, the termination probability was set to 0.01 for all states. Every 100 trials, the current policy was evaluated on a fixed set of 50 test states, uniformly distributed across the state space. The best policy on a particular run r is the one with the maximum average return (MAR_r) over the states in the test set. The learning quality is measured as the average return of the best policies found over the 30 runs: $LQ = \frac{1}{30}\sum_{r=1}^{30} MAR_r$. Note that it is not always possible to compare the LQ measure directly for different MDPs, because their optimal policies may have different returns (thus the upper limits on MAR_r and LQ are different). So we need to normalize this measure across different MDPs. Ideally, we would like to normalize with respect to the expected return of the optimal policy. Since the optimal policy is not known, we normalize instead by the average return of the uniformly random policy over the same test states (RURP). The normalized learning quality (NLQ) measure used in the experiments is NLQ = LQ/RURP if rewards are positive (as is the case for the RMDPs), and NLQ = RURP/LQ otherwise (for the RMCs). We conducted some experiments with RMDPs for which we knew the optimal policy, and the results for the optimally and RURP-normalized LQ measures were very similar. Note that the learning quality of RL algorithms is most often measured by the return of the final policy (rather than the best policy). In our experiments, the results using the return of the final policy are very similar to those based on the best policy (reported below), only less statistically significant. The within-group variance of the returns is much larger for the final policies, due to two factors. First, final policies are more affected by the learning rate: if the learning rate is too high, the agent may deviate from a good policy. Since the learning rate is not a factor in our analysis, it introduces unexplained variance. Secondly, SARSA with FA can exhibit an oscillatory behavior [5], which also increases the variance of the returns if they are measured after a fixed number of trials. We plan to study the effect of the learning rate more in the future.
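A small sketch (ours) of the learning-quality measure just defined; the argument names are illustrative:

```python
import numpy as np

def normalized_learning_quality(best_run_returns, rurp, positive_rewards=True):
    """best_run_returns: the MAR_r values, one per learning run; rurp: the
    average return of the uniformly random policy over the same test states."""
    lq = np.mean(best_run_returns)                       # LQ, averaged over runs
    return lq / rurp if positive_rewards else rurp / lq  # NLQ
```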
To determine if there is a statistically significant effect of STE and C on NLQ, we performed three kinds of statistical tests. First, we used analysis of variance [2] to test the null hypothesis that the mean NLQ for all 10 groups of RMDPs is the same. We performed the same test for the 10 RMC tasks. For both domains, this hypothesis can be rejected at a significant confidence level (p < 0.01). This means that at least one of the two attributes has a statistically significant effect on NLQ. We also computed the predictive power (Hay's statistic [2]) of the group factor, combining STE and C, on NLQ. The values of this statistic are 0.41 for the RMDPs and 0.57 for the RMCs. These values indicate that the effect of STE and C is not only statistically significant but also practically usable: the mean squared error in the prediction of the NLQ is reduced by 41% and 57% respectively for the RMDPs and RMC tasks, as a result of knowing the value of these attributes for the MDP. This result is very important because our long-term goal is to use knowledge about the attributes for making practical decisions, such as the choice of the algorithm or parameter settings for the task at hand. For the RMDP domains, the combination of STE and C values has the most predictive power (41%), whereas STE alone has only 4% predictive power and C alone has none. This suggests that both attributes have an effect on NLQ and should be considered together. Figure 4 shows the learning quality as a function of STE and C for the RMDPs (left panel) and for the RMC tasks (middle and right panels). Note that for the RMC tasks, we cannot study the effects of STE and C independently, because their values are anti-correlated. For ease of comparing the results to those obtained for the RMDPs, we include two graphs for the RMC tasks, reflecting the dependency of NLQ on STE and on C (middle and right panels of Figure 4). We emphasize that both graphs reflect one trend: as STE increases (and C decreases correspondingly), NLQ decreases. The reader should not conclude that STE and C exhibit independent effects in the case of the RMC tasks. As can be seen in the figure, for both domains (RMDPs and RMCs) the quality decreases as the entropy increases. We also conducted Least Significant Difference (LSD) tests [2] to compare the mean NLQ of the different pairs of RMDP groups and of the different pairs of RMC tasks. These tests (conducted at a conventional 0.05 confidence level) show that there is a statistically significant difference in the mean NLQ for all groups of RMDPs with different STE values, but the effect of STE becomes less significant as the value of the STE increases (potentially due to a floor effect). The trend is the same for the RMC tasks. As discussed in Sect. 3, high entropy is associated with larger amounts of noise in the training data due to Factors 2 and 3 in Table 1, which makes learning more difficult. As we discussed in Sect. 2, the amount of noise also depends on the shape of the action-value functions. For example, if the action-value function is constant across the state-action space, then there will be no noise due to Factor 3 (see Table 1). Additional experiments with RMDPs that have relatively smooth and flat action-value functions (in those RMDPs, one action has higher rewards than the other in all states and the functions R(s, a, s') have a small range) showed that in this case, the learning quality increased as the STE increased. This is due to the positive effect of extensive state-space exploration in high-entropy MDPs. Thus, the effect of STE on learning quality is a tradeoff between the negative effect of noise and the positive effect of natural state-space exploration.
Fig. 4. Learning Quality

The LSD tests also show differences in NLQ for the groups of RMDPs with different C values. The differences are significant between some of the groups at the STE ≈ 0.5 and STE ≈ 1.5 levels. They appear when C changes by about 0.4. As can be seen from the left panel of Fig. 4, the learning quality increases as controllability increases. As discussed in Sect. 3, high controllability means that the agent can better exploit the environment, and has more control over the exploration process as well.
5 Conclusions and Future Work
In this paper, we proposed attributes to quantitatively characterize MDPs, in particular in terms of the amount of stochasticity. The proposed attributes can be either computed given the model of the process or estimated from samples collected as the agent interacts with its environment. We presented the results of an empirical study confirming that two attributes, state transition entropy and controllability, have a statistically significant effect on the quality of the policies learned by a reinforcement learning agent using linear function approximation. The experiments showed that better policies are learned in highly controllable environments. The effect of entropy shows a trade-off between the amount of noise due to environment stochasticity, and the natural exploration of the state space. The fact that the attributes have predictive power suggests that they can be used in the design of practical RL systems. Our experiments showed that these attributes also affect learning speed. However, statistically studying this aspect of learning performance is difficult, since there is no generally accepted way to measure and compare learning speed across different tasks, especially when convergence is not always guaranteed. We are currently trying to find a good measure of speed that would allow a statistically meaningful study. We are also currently investigating whether the effect of these attributes depends on the RL algorithm. This may provide useful information
in order to make good algorithmic choices. We are also in the process of studying the effect of the other attributes presented in Sect. 3. The empirical results we presented suggest that entropy and controllability can be used to guide the exploration strategy of the RL agent. A significant amount of research has been devoted to sophisticated exploration schemes (e.g., [11], [4], [12]). Most of this work is concerned with action exploration, i.e. trying out different actions in the states encountered by the agent. Comparatively little effort has been devoted to investigating state-space exploration (i.e. explicitly reasoning about which parts of the state space are worth exploring). The E^3 algorithm [8] uses state-space exploration in order to find near-optimal policies in polynomial time, in finite state spaces. We are currently working on an algorithm for achieving good state-space exploration, guided by local measures of the attributes presented in Sect. 3. The agent uses a Gibbs (softmax) exploration policy [17]. The probabilities of the actions are based on a linear combination of the action values, local measures of the MDP attributes and the empirical variance of the FA targets. The weights in this combination are time-dependent, in order to ensure more exploration in the beginning of learning, and more exploitation later.
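The exploration scheme outlined in the preceding paragraph is not specified in detail here; the following sketch is only one possible reading of it (ours): action preferences combine the action values with local attribute measures and the empirical variance of the FA targets, the exploration weights are annealed over time, and actions are drawn from the resulting Gibbs (softmax) distribution. The weight values and annealing schedule are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax_exploration(q_values, attribute_scores, target_variances, t,
                        tau=1.0, w_attr=1.0, w_var=1.0, decay=1e-3):
    """Draw an action index from a Gibbs (softmax) distribution over preferences
    that combine action values, local attribute measures (e.g. FC or STE of the
    candidate transitions) and the empirical variance of the FA targets; the
    exploration weights decay with the time step t."""
    anneal = 1.0 / (1.0 + decay * t)
    prefs = (np.asarray(q_values)
             + anneal * w_attr * np.asarray(attribute_scores)
             + anneal * w_var * np.asarray(target_variances)) / tau
    probs = np.exp(prefs - prefs.max())
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)
```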
Acknowledgments. This research was supported by grants from NSERC and FCAR. We thank Ricard Gavaldà, Ted Perkins, and two anonymous reviewers for valuable comments.
References
[1] Bertsekas, D. P., Tsitsiklis, J. N.: Neuro-Dynamic Programming. Belmont, MA: Athena Scientific (1996)
[2] Cohen, P. R.: Empirical Methods for Artificial Intelligence. Cambridge, MA: The MIT Press (1995)
[3] Dean, T., Kaelbling, L., Kirman, J., Nicholson, A.: Planning under Time Constraints in Stochastic Domains. Artificial Intelligence 76(1-2) (1995) 35-74
[4] Dearden, R., Friedman, N., Andre, D.: Model-Based Bayesian Exploration. In Uncertainty in Artificial Intelligence: Proceedings of the Fifteenth Conference (UAI 1999) 150-159
[5] Gordon, J. G.: Reinforcement Learning with Function Approximation Converges to a Region. Advances in Neural Information Processing Systems 13 (2001) 1040-1046
[6] Hogg, T., Huberman, B. A., Williams, C. P.: Phase Transitions and the Search Problem (Editorial). Artificial Intelligence 81 (1996) 1-16
[7] Hoos, H. H., Stutzle, T.: Local Search Algorithms for SAT: An Empirical Evaluation. Journal of Automated Reasoning 24 (2000) 421-481
[8] Kearns, M., Singh, S.: Near-Optimal Reinforcement Learning in Polynomial Time. In Proceedings of the 15th International Conference on Machine Learning (1998) 260-268
[9] Kirman, J.: Predicting Real-Time Planner Performance by Domain Characterization. Ph.D. Thesis, Brown University (1995)
[10] Lagoudakis, M., Littman, M. L.: Algorithm Selection using Reinforcement Learning. In Proceedings of the 17th International Conference on Machine Learning (2000) 511-518
[11] Meuleau, N., Bourgine, P.: Exploration of Multi-State Environments: Local Measures and Back-Propagation of Uncertainty. Machine Learning 35(2) (1999) 117-154
[12] Moore, A. W., Atkeson, C. G.: Prioritized Sweeping: Reinforcement Learning with Less Data and Less Time. Machine Learning 13 (1993) 103-130
[13] Papadimitriou, C. H., Steiglitz, K.: Combinatorial Optimization: Algorithms and Complexity. Prentice Hall (1982)
[14] Papadimitriou, C. H., Tsitsiklis, J. N.: The Complexity of Markov Chain Decision Processes. Mathematics of Operations Research 12(3) (1987) 441-450
[15] Puterman, M. L.: Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley (1994)
[16] Singh, S. P., Jaakkola, T., Jordan, M. I.: Reinforcement Learning with Soft State Aggregation. Advances in Neural Information Processing Systems 7 (1995) 361-368
[17] Sutton, R. S., Barto, A. G.: Reinforcement Learning: An Introduction. Cambridge, MA: The MIT Press (1998)
Phase Transitions and Stochastic Local Search in k-Term DNF Learning
Ulrich Rückert, Stefan Kramer, and Luc De Raedt
Machine Learning and Natural Language Lab, Institute of Computer Science, University of Freiburg
Georges-Köhler-Allee, Gebäude 079, D-79110 Freiburg i. Br., Germany
{rueckert,skramer,deraedt}@informatik.uni-freiburg.de
Abstract. In the past decade, there has been a lot of interest in phase transitions within artificial intelligence, and more recently, in machine learning and inductive logic programming. We investigate phase transitions in learning k-term DNF boolean formulae, a practically relevant class of concepts. We not only show that phase transitions exist, but also characterize and locate these phase transitions using the parameters k, the number of positive and negative examples, and the number of boolean variables. Subsequently, we investigate stochastic local search (SLS) for k-term DNF learning. We compare several variants that first reduce k-term DNF to SAT and then apply well-known SLS algorithms, such as GSAT and WalkSAT. Our experiments indicate that WalkSAT is able to solve the largest fraction of hard problem instances.
1 Introduction
The study of phase transitions of NP-complete problems [15, 3, 6, 7] has become quite popular within many subfields of artificial intelligence in the past decade. However, phase transitions have not yet received a lot of attention within the field of machine learning. Indeed, so far, there are only a few results that concern inductive logic programming [11, 8, 9]. The existence of phase transitions in inductive logic programming is not surprising because inductive logic programming is known to be computationally expensive, especially due to the use of θ-subsumption tests [21], which are NP-complete. In this paper, we study an important class of boolean formulae, i.e. k-term DNF, and show that phase transitions also occur in this propositional framework. The task in k-term DNF learning is to induce a DNF formula with at most k disjuncts (or terms) that covers all positive and none of the negative examples (this is the consistency requirement). Examples are boolean variable assignments. Learning k-term DNF is of practical relevance because one is often interested in finding the smallest set of rules that explains all the examples. This criterion is motivated by the principle of William of Ockham. Various practical machine learning systems employ the covering algorithm in the hope of finding small rule-sets. Moreover, k-term DNF has some interesting computational properties. It has polynomial sample complexity (in the PAC-learning sense) but
the consistency problem is hard [14]. The polynomial sample complexity implies that only a polynomial number of examples is needed in order to converge with high probability to a good approximation of the concept. On the other hand, the computation of complete and consistent concepts (the consistency problem) cannot be done in polynomial time (unless RP = N P ). The combination of these properties makes k-term DNF the ideal class of formulae to start an investigation of phase transitions in boolean learning. The contributions of this paper are as follows. First, we show that phase transitions exist for learning k-term DNF. This result is not surprising because of the hardness of the consistency problem. Secondly, we locate the phase transitions that arise in k-term DNF. Thirdly, we introduce the use of stochastic local search methods for learning hard k-term DNF problems. Stochastic local search algorithms approximate the optimal solution at much lower computational costs. Well-known examples of stochastic local search algorithms for SAT include GSAT and WalkSAT. Finally, our experiments demonstrate that these stochastic local search algorithms are effective. This paper is organized as follows: Section 2 introduces k-term DNF learning, and Section 3 identifies and localizes the phase transition in k-term DNF learning. Subsequently, Section 4 presents stochastic local search that is based on the reduction of k-term DNF learning to the satisfiability problem (SAT) and compares variants thereof on test sets of hard problem instances. Finally, Section 5 discusses further work, related work and concludes.
2 K-term DNF Learning
A k-term DNF formula is a disjunction of k terms, where each term is a conjunction of literals. E.g. (a_1 ∧ ¬a_2 ∧ a_3) ∨ (a_1 ∧ a_4 ∧ a_5) is a 2-term DNF with the terms (a_1 ∧ ¬a_2 ∧ a_3) and (a_1 ∧ a_4 ∧ a_5). The k-term DNF learning problem can now be formalized as follows [14]:
Given
– a set of Boolean variables Var,
– a set Pos of truth value assignments p_i : Var → {0, 1},
– a set Neg of truth value assignments n_i : Var → {0, 1}, and
– a natural number k,
Find a k-term DNF formula that is consistent with Pos and Neg, i.e. that evaluates to 1 (true) for all variable assignments in Pos and to 0 (false) for all variable assignments in Neg.
We can make a few observations about this problem. First, for k = |Pos|, we have a trivial solution F, where each term in F covers exactly one positive example. Obviously, we are only interested in problem instances with 1 ≤ k < |Pos|. Second, if we know a solution F for a given k_min, we can easily derive solutions for any k > k_min. That means we can safely weaken the condition
“formula needs to have exactly k terms” to “formula needs to have at most k terms”. Finally, assume that we have discovered a solution F for a given problem instance. Upon closer inspection we might discover that some literals are redundant in F, i.e. removing or adding these literals from or to F would still yield a solution. To examine which literals might be added to F, we can compute the least general specialization of F. The least general specialization (lgs) of F is a formula that covers the same positive examples as F, but as few other instances as possible. To construct the lgs, we determine which positive examples are covered by the individual terms in the solution: Cov_i =_def {p ∈ Pos | the i-th term of F is satisfied by p}. We can then compute the lgs using the least general generalization (lgg) of those positive examples. The least general generalization of a set of examples e (over the variables Var_i, 1 ≤ i ≤ n) can be efficiently computed by merging the literals:

$merge(i, e) =_{def} \begin{cases} Var_i & \text{if all examples in } e \text{ set } Var_i \text{ to } 1 \\ \neg Var_i & \text{if all examples in } e \text{ set } Var_i \text{ to } 0 \\ 1 & \text{otherwise} \end{cases}$

$lgg(e) =_{def} \bigwedge_{1 \le i \le n} merge(i, e)$

The least general specialization is then:

$lgs(F) =_{def} \bigvee_{1 \le i \le k} lgg(Cov_i(F))$
One can show that lgs(F) is a solution if F is a solution. As a consequence, a problem instance P has a solution if and only if it has a solution that is a least general specialization (proof omitted). We can leverage these considerations to construct a complete algorithm for solving the k-term DNF learning problem. Instead of searching through the space of all possible formulae, we only search for least general solutions. More precisely:
1. Recursively enumerate all possible partitionings of Pos into k pairwise disjoint subsets P_i. This can be done by starting with an empty partitioning and adding one positive example per recursion step.
2. In every recursion step, build the formula $F =_{def} \bigvee_{1 \le i \le k} lgg(P_i)$, which corresponds to the current (incomplete) partitioning.
3. Whenever F is satisfied by a negative example, backtrack, otherwise continue the recursion.
4. When all positive examples have been added and the resulting F does not cover any negative examples, F is a solution.
Here is a short example: consider the learning problem with three variables Var = {V1, V2, V3}, three positive examples Pos = {001, 011, 100}, two negative examples Neg = {101, 111} and k = 2. The algorithm will start with the empty
partitioning {∅, ∅} and recursively add the positive examples, thereby calculating the formula F = lgg(P_1) ∨ lgg(P_2). As soon as the algorithm reaches the partitioning {{001, 100}, {011}}, F will be ¬V2 ∨ (¬V1 ∧ V2 ∧ V3) and will cover the first negative example. Thus, the algorithm backtracks. However, when generating the partitioning {{001, 011}, {100}}, F is (¬V1 ∧ V3) ∨ (V1 ∧ ¬V2 ∧ ¬V3), which is consistent with Neg. The algorithm outputs F as a solution. Note that the terms of F are always satisfied by the positive examples in the corresponding subsets of the partitioning. The size of this search space for a k-term DNF learning problem with n positive examples is the number of possible partitionings of Pos into k pairwise disjoint nonempty subsets. This is the Stirling number of the second kind $S(n, k) = \frac{1}{k!}\sum_{i=0}^{k-1} (-1)^i \binom{k}{i} (k - i)^n$ [28]. For large n, S(n, k) grows approximately exponentially to the base k. For most practical settings this is considerably lower than $3^{|Var| \cdot k}$, the size of the space of all k-term formulae. Additionally, one can prune the search whenever a negative example is covered during formula construction. Note, however, that searching the partitioning space is redundant: two or more partitionings of Pos might lead to the same formula. For that reason it might be more efficient to search the k-term formula space in some settings, especially for low k and high |Pos|.
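A compact sketch (ours, not the authors' implementation) of the lgg computation and the recursive search over partitionings described above, with examples encoded as 0/1 tuples and a term represented as one value per variable (1, 0, or None for a dropped literal):

```python
def lgg(examples):
    """Least general generalization of a set of 0/1 example tuples:
    keep a literal only if all examples agree on that variable."""
    term = list(examples[0])
    for e in examples[1:]:
        term = [v if v == w else None for v, w in zip(term, e)]
    return term

def satisfies(term, example):
    return all(v is None or v == w for v, w in zip(term, example))

def k_term_dnf(pos, neg, k):
    """Search for a consistent k-term DNF by partitioning pos into at most
    k blocks; returns a list of terms (the lgg of each block) or None."""
    def recurse(i, blocks):
        if i == len(pos):
            return [lgg(b) for b in blocks]
        for b in blocks:                      # add pos[i] to an existing block
            candidate = lgg(b + [pos[i]])
            if not any(satisfies(candidate, n) for n in neg):
                b.append(pos[i])
                if (sol := recurse(i + 1, blocks)) is not None:
                    return sol
                b.pop()                       # backtrack
        if len(blocks) < k:                   # or open a new block
            if not any(satisfies(lgg([pos[i]]), n) for n in neg):
                blocks.append([pos[i]])
                if (sol := recurse(i + 1, blocks)) is not None:
                    return sol
                blocks.pop()                  # backtrack
        return None
    return recurse(0, [])

# The example from the text: pos = [(0,0,1), (0,1,1), (1,0,0)], neg = [(1,0,1), (1,1,1)];
# k_term_dnf(pos, neg, 2) finds a solution equivalent to (¬V1 ∧ V3) ∨ (V1 ∧ ¬V2 ∧ ¬V3).
```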
3 The Phase Transition
To identify the location and size of the phase transition for k-term DNF learning, we examined the solubility and search costs for randomly generated problem instances. K-term DNF learning problem instances can be classified by four parameters: the number of Boolean variables n, the size of the set of positive examples |Pos|, the size of the set of negative examples |Neg|, and k, the maximal number of terms in the desired formula. We generated the positive and negative examples of the problem instances by choosing either Var_i = 1 or Var_i = 0 with the same probability for each variable i, 1 ≤ i ≤ n. The search costs were measured by counting the number of partitionings generated by the complete algorithm sketched in Section 2. The search costs for finding a solution using the complete algorithm for such a randomly generated problem instance obviously depend on all of the four parameters. For instance, when keeping n, |Pos|, and k fixed and varying |Neg|, one would expect to have – on average – low search costs for very low or very high |Neg|. With only a few negative examples, almost any formula covering Pos should be a solution, hence the search should terminate soon. For very large |Neg|, we can rarely generate formulae covering even a small subset of Pos without also covering one of the many negative examples. Consequently, we can prune the search early and search costs should be low, too. Only in the region between obviously soluble and obviously insoluble problem instances should the average search costs be high. Similar considerations can be made about n and |Pos|, but it is not obvious to which degree each parameter affects solubility and average search costs.
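A sketch (ours) of the random instance generation just described; note that whether the original generator filters duplicate or contradictory examples is not stated, so no filtering is done here:

```python
import random

def random_instance(n_vars, n_pos, n_neg, seed=None):
    """Pos and Neg as lists of 0/1 tuples, with each variable independently
    set to 0 or 1 with equal probability."""
    rng = random.Random(seed)
    draw = lambda: tuple(rng.randint(0, 1) for _ in range(n_vars))
    return [draw() for _ in range(n_pos)], [draw() for _ in range(n_neg)]
```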
Fig. 1. P_Sol (above) and search costs (below) plotted as 3D graph and contour plot for the problem settings with k = 2, |Pos| = 10, 1 ≤ |Neg| ≤ 128, and varying n
To examine this further, we calculated the probability P_Sol of a problem instance being soluble and the search costs for a broad range of problem settings. For instance, Figure 1 shows the plots of these two quantities for fixed |Pos| = 10 and k = 2, and varying |Neg| and n. Each data point represents the average over 100 problem instances. As expected, search costs are especially high in the region of P_Sol = 0.5. If the methods from statistical mechanics can be applied to k-term DNF learning, we should be able to identify a "control parameter" a describing the location of the phase transition [6, 15]. If finite-size scaling methods hold, we should be able to express P_Sol as a function of this control parameter around some critical point a_c [15, 2]:

$P_{Sol} = f\!\left(\frac{a - a_c}{a_c} \cdot N^{1/\nu}\right)$   (1)

The term (a − a_c)/a_c mimics the "reduced temperature" (T − T_c)/T_c in physical systems, while the term N^{1/ν} provides the change of scale. As with many other NP-complete problems [7], we expect f to be the cumulative distribution function of a normal distribution. Figure 2 shows that P_Sol increases rapidly with the number of variables n. Thus, choosing n as the control parameter seems to be a reasonable idea. We (arbitrarily) choose the critical point n_c so that P_Sol(n_c) = 0.5.
Fig. 2. PSol and search costs for problem settings with k = 2 and (|P os|, |N eg|) being – from left to right – (10, 100), (12, 60), (16, 20), (16, 60), and (16, 100)
some other NP-complete problems, in k-term DNF learning n_c is not simply a constant. Instead, its value depends on |Pos|, |Neg|, and k. We will now try to express n_c as a function of |Pos|, |Neg|, and k so that PSol(n_c) = 0.5. First of all, note that there is an inherent asymmetry in the constraints imposed upon a term by the positive and negative examples: assume we have a term c containing l literals. Assume further that c is consistent with some positive examples Pos_c ⊆ Pos and all negative examples in Neg. If we require the term to cover a new (random) positive example p, we have to replace c with lgg(c, p). On average, we would expect the number of literals in lgg(c, p) to be half the number of literals in c. Since a formula contains more than one term, we will expect that c needs to cover only "suitable" examples, so we expect the number of literals in c to decrease slightly more slowly than by factor 0.5. Still, the number of literals decreases exponentially with the number of covered positive examples. On the other hand, if we add a new negative example e, the term has to differ in only one literal in order to be consistent with the new negative example. If l ≥ 1, c is already consistent with e in most cases. Only with probability 0.5^l do we have to add one new literal to c. Thus, the number of literals in c will increase considerably more slowly than the number of negative examples consistent with c. This leads to two observations about n_c:
– Observation 1: n_c grows exponentially with the number of positive examples |Pos|. Assume we found parameters n, |Pos|, |Neg|, and k so that PSol(n, |Pos|, |Neg|, k) = 0.5. If we add a new positive example e, a formula F has to additionally cover e in order to remain consistent with all positive examples. That means we have to replace at least one term c of F with lgg(c, e), effectively reducing the number of literals in F by some unknown factor. Then F is more likely to cover a negative example, which in turn decreases PSol. In order to
Fig. 3. The location of nc for k = 2 depending on |N eg| for |P os| being – from bottom to top – 8, 9, 10, 12, 14, and 16
keep PSol constant we have to increase n by a factor β, thus restoring the previous level of literals in c. Since formulae have more than one term, the size of the exponent is an (unknown) function γ depending on |Pos| and k. This yields:

n_c ≈ β^{γ(|Pos|, k)}   (2)

– Observation 2: In fact, the value of β depends on the number of negative examples. Adding a new variable increases PSol only if it increases the number of literals in F. The more negative examples are present, the more variables we have to add on average until we can add a new literal to F without making it inconsistent. As indicated by Figure 1, n_c grows with log |Neg|. This seems reasonable given the fact that – on average – we need 2^l negative examples to increase the number of literals in term c by one. We would therefore expect that n_c ∝ a · log_2(|Neg|), with the factor a depending on k. Assuming β = a · log_2(|Neg|) as described above, we obtain:

n_c ≈ (a log_2 |Neg|)^{γ(|Pos|, k)}   (3)
a is the growth rate for fixed |Pos| and variable |Neg|, while γ describes the growth rate of n_c with increasing |Pos|. To identify a and γ, we calculated n_c for a set of problem settings in the range of 2 ≤ k ≤ 5, 1 ≤ |Neg| ≤ 120, and 7 ≤ |Pos| ≤ 25. From the resulting graphs, we estimated a and γ(|Pos|, k) using non-linear least-squares regression (Nelder-Mead). We found that γ can be approximated very well by a linear function of |Pos|: γ(x) =def b · x + c. Figure 3 shows the computed and the approximated value of n_c for k = 3 and |Pos| ∈ {9, 12, 15, 18}. Table 1 shows the values of a, b, and c for 2 ≤ k ≤ 5. Finally, these considerations lead us to our hypothesis about n_c.
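The regression step can be reproduced along the following lines. This is a rough sketch under our own assumptions — pos, neg, and nc_obs are hypothetical arrays of measured critical points for one fixed k, and the starting point is arbitrary — not the authors' fitting code.

import numpy as np
from scipy.optimize import minimize

def fit_nc(pos, neg, nc_obs):
    # least-squares fit of n_c ~ (a*log2|Neg|)^(b*|Pos| + c) with Nelder-Mead
    pos, neg, nc_obs = map(np.asarray, (pos, neg, nc_obs))
    def sse(params):
        a, b, c = params
        pred = (a * np.log2(neg)) ** (b * pos + c)
        return np.sum((pred - nc_obs) ** 2)
    # a rough starting point keeps the base a*log2|Neg| positive
    return minimize(sse, x0=[2.0, 0.05, 0.7], method='Nelder-Mead').x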
Table 1. The values of a, b, and c for determining n_c depending on the number of terms k

  Number of terms k      a         b         c
  2                   3.6995   0.080602   0.49471
  3                   1.8072   0.056234   0.68868
  4                   1.4542   0.041363   0.76301
  5                   1.3927   0.026572   0.85334
Hypothesis: n_c ≈ (a · log_2 |Neg|)^{(b·|Pos|+c)}

a seems to converge for larger values of k. Unfortunately, k-term DNF learning requires huge computational resources for k > 5, so we could not examine this further. To verify the correctness of the approximation, we predicted n_c at |Pos| = 30, |Neg| = 30, and k = 3 to be about 178. We then computed 100 random problem instances and indeed found that PSol(30, 30, 3, 178) = 0.51, with an average search cost of 50 million recursions. In order to put our hypothesized n_c to the test, we now check whether equation 1 adequately describes the phase transition. We computed PSol for k = 3, |Pos| ∈ {10, 12, 14, 16, 18}, and |Neg| ∈ {20, 40, 60, 80, 100, 120}. We varied n between 1 and 384 and solved 1000 randomly generated problem instances per parameter setting to determine PSol. Figure 4 shows PSol for some selected problem settings, plotted against n and α(n) =def (n − n_c)/n_c. As can be seen, the selected problem settings can be adequately described by α, even though we did not introduce a "change of scale" parameter N^{1/ν}. Further investigations showed that problem settings with the same |Neg| are virtually indistinguishable when plotted against α. Only for small |Neg| is the slope of PSol(α) slightly smaller than predicted. However, similar anomalies for small control parameter values are known for other NP-complete problems as well [7, 18].
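The verification experiment mentioned above follows directly from Table 1; a two-line check of our own:

from math import log2

a, b, c = 1.8072, 0.056234, 0.68868      # Table 1, k = 3
print((a * log2(30)) ** (b * 30 + c))    # ~178.5, the predicted n_c for |Pos| = |Neg| = 30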
4
Stochastic Local Search
Most NP-complete problems can be formulated as a search through the space of possible solutions. For the hard problem instances the size of the search space is extremely large, and, as a consequence, complete algorithms require huge computational resources. Thus, for most practical problems, we are gladly willing to sacrifice completeness (i.e. the certainty of finding a solution, if there is one) for adequate runtime behavior. Though being incomplete, stochastic local search (SLS) algorithms have been shown to find solutions for many hard NP-complete problems in a fraction of the time required by the best complete algorithms. Since the introduction of GSAT [23], there has been a lot of research on SLS
Fig. 4. PSol for k = 2 and (|P os|, |N eg|) being – from left to right – (10, 80), (10, 100), (10, 120), (12, 80), (12, 100), (12, 120), (16, 80), (16, 100), and (16, 120), plotted against n and α
algorithms, and a number of different variants have been proposed and evaluated [12, 24, 25, 17]. There are two main properties of SLS algorithms: first, instead of doing a systematic search through the whole instance space, an SLS algorithm starts a local search at a randomly selected location and restarts at different random locations if it does not find a solution within a given time frame. Second, the search is local in the sense that it steps only to neighboring instances. During its search an SLS algorithm usually favors those instances that minimize some predefined global evaluation function. To be successful, an SLS algorithm needs some way of escaping or avoiding local optima. Often, this is achieved by performing randomized steps from time to time (the so-called "noise"^1). An easy way to apply SLS algorithms to k-term DNF learning is to reduce a given k-term DNF learning problem to a satisfiability (SAT) problem and apply one of the many published SLS algorithms to the resulting SAT problem. In [13] Kamath et al. introduced a reduction of k-term DNF learning to SAT. They generated a test set of 47 k-term DNF learning problem instances ranging from problems with eight variables up to 32 variables. The reduction of this test set to SAT has been widely used as a benchmark for SAT SLS algorithms [26, 24, 23, 25]. Unfortunately, the test set seems to be very easy to solve. Kamath et al. used a target concept driven approach for constructing the problem instances. For each problem instance they built a random target formula. Then they uniformly generated random examples and labeled them according to the target formula. We reproduced the largest problem instances from the test set and found that even our complete algorithm solved all of them within a few
^1 Note that this is different from the "noise" we use in machine learning!
^2 For GSAT+Tabu we state the size of the tabu table instead of a noise level.
Table 2. The success rates (i.e. fraction of tries that found a solution) for various SLS algorithms running on the reduced test sets

  Algorithm          Noise Level^2   Success Rate   Success Rate   Success Rate
                                     Test Set 1     Test Set 2     Test Set 3
  GSAT               n/a             78.5%          0%             0%
  GSAT+RandomWalk    0.25            87.2%          0%             0%
                     0.5             89.3%          1.7%           0%
                     0.75            56.8%          0.8%           0%
  GSAT+Tabu          5               92.9%          0%             0%
                     10              93.5%          0%             0%
                     15              84.4%          0%             0%
  WalkSAT            0.25            100%           97.5%          76.0%
                     0.5             100%           98.2%          62.6%
                     0.75            100%           90.5%          19.4%
  Novelty            0.25            93.1%          2.8%           0%
                     0.5             97.7%          4.2%           0%
                     0.75            98.1%          6.7%           0%
seconds. Even worse, a propositional version of FOIL [19] was able to solve them in less than a second. Obviously, the information gain heuristic works especially well for problem instances that were generated by sampling over the uniform distribution. Clearly, this test set is too easy to be used as a hard benchmark for k-term DNF learning. In order to evaluate SLS algorithms on harder problem instances, we generated three test sets, all taken from the phase transition region of the problem setting space. Each test set contains one hundred soluble (for k = 3) problem instances. The first test set was generated with |Pos| = 10, |Neg| = 10, n = 10, the second one with |Pos| = 20, |Neg| = 20, n = 42, and the third one with |Pos| = 30, |Neg| = 30, n = 180. We reduced the test sets to SAT using the reduction from [13]. The resulting SAT problems describe the desired solution F using 2 · |Var| · k variables. They use another |Pos| · k auxiliary variables to express which positive example is covered by which term. The constraints put on those variables by the positive and negative examples are encoded in k · (|Var| · (|Pos| + 1) + |Neg|) + |Pos| clauses. We tested a range of known SLS algorithms on the SAT-encoded problems of the test sets (see [12] for a description of the algorithms). We ran ten tries per problem instance and counted the number of successful tries (i.e. tries that found a solution) for each algorithm. For WalkSAT and Novelty each try was cut off after 100000 flips; for the GSAT-based algorithms we chose a cutoff value of 20 times the number of variables in the corresponding SAT problem. Table 2 shows the fraction of successful tries for each algorithm. On the hardest test set only WalkSAT yielded reasonable results. GSAT and its derivatives failed even on the second test set. Though WalkSAT and GSAT+RandomWalk are conceptually very similar, their results
differ strongly (similar results have been found in [26]). It seems that k-term DNF learning especially benefits from WalkSAT's bias towards steps that do not break any currently satisfied clause. This behavior ensures that the structural dependencies between the already satisfied clauses remain intact once they are found.
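For readers unfamiliar with WalkSAT's move selection, the following sketch illustrates the bias just discussed. It is our own simplified rendering of one common WalkSAT formulation (clauses as lists of signed integers, assignments as dicts), not the exact procedure evaluated in Table 2.

import random

def walksat_step(assign, clauses, noise=0.5, rng=random):
    # one move: inside a random unsatisfied clause, prefer a flip that breaks
    # no currently satisfied clause; otherwise apply the usual noise rule
    def sat(cl, a):
        return any((lit > 0) == a[abs(lit)] for lit in cl)
    def breaks(var):
        flipped = dict(assign)
        flipped[var] = not assign[var]
        return sum(1 for cl in clauses if sat(cl, assign) and not sat(cl, flipped))
    unsat = [cl for cl in clauses if not sat(cl, assign)]
    if not unsat:
        return assign                               # already a model
    clause = rng.choice(unsat)
    cost = {abs(lit): breaks(abs(lit)) for lit in clause}
    free = [v for v, c in cost.items() if c == 0]
    if free:
        v = rng.choice(free)                        # keeps satisfied clauses intact
    elif rng.random() < noise:
        v = abs(rng.choice(clause))                 # random-walk move
    else:
        v = min(cost, key=cost.get)                 # greedy: fewest broken clauses
    assign[v] = not assign[v]
    return assign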
5
Conclusion
In the preceding sections we examined the NP-complete problem of k-term DNF learning. In Machine Learning we are not so much interested in the decision problem, but much more in the corresponding optimization problem: DNF minimization. As with many other NP-complete problems, an algorithm for the decision problem can easily be generalized to solve the optimization problem (usually by adding branch and bound techniques) and vice versa [7]. In fact, we found that the presented SLS algorithm was able to quickly find a solution for all k > kmin. This is also supported by the results in Section 3, where we identified the location of the phase transition: if a problem instance is located in the phase transition for a given k, the corresponding decision problem instances for all k' > k are in the "obviously soluble" region. The DNF minimization problem seems to be at the core of most propositional concept learning settings, in the sense that:
1. Learning problems with discrete (and even with continuous-valued) attribute sets can easily be reformulated into a form that uses only two-valued (i.e. Boolean) attributes. This form corresponds exactly to our problem description.
2. Most propositional concept learners use representations that are subsets of or equivalent to DNF formulae, e.g. decision trees or disjunctive sets of rules.
3. Most concept learning algorithms include a bias towards a short representation of the hypotheses. While this might not necessarily increase the predictive accuracy, it is commonly considered a desirable property [27, 4].
We showed that SLS algorithms can be successfully applied to hard randomly generated problem instances. Problem instances that are sampled from a uniform (or near-uniform) distribution can usually be solved in less than a few seconds by the presented SLS algorithm. The examples in the test cases were generated randomly; they do not follow a particular distribution. We would therefore expect that "real world" problems, which are obtained according to an (unknown) distribution, are much easier to solve than the presented hard problem instances. However, it is not yet clear whether or not SLS algorithms can efficiently deal with more structured problems. We are currently evaluating SLS algorithms for problem sets in the domain of chess endgames, such as the problem of predicting the minimum number of moves before a win on the Rook's side in the KRK (King-Rook-King) endgame. The domain of chess endgames provides an ideal testbed for k-term DNF learning, since here we deal with noise-free datasets with discrete attributes only, and we are more interested in compression than in
predictivity. Finding a minimum theory for such endgame data was also a goal of previous research in this area [1, 22, 20] and is of continuing interest [5], but has not been tackled since then, partly due to the complexity of the task. Another field of interest is the application of SLS algorithms in different learning settings. SLS algorithms can easily be adapted to tolerate noise by introducing some noise threshold for the score. Whenever the score of a formula falls below this threshold, the remaining uncovered examples are considered as noise. Finally, we have to emphasize that stochastic search in general has been used before in Machine Learning (see, e.g., [10, 16]). However, to the best of our knowledge, this is the first attempt to introduce algorithms from stochastic local search (SLS) into propositional Machine Learning. By reducing the problem of k-term DNF learning to SAT, we can draw on a huge body of results from the area of satisfiability algorithms. In this sense, we hope that this work can stimulate further work along these lines in Machine Learning.
References
[1] Bain, M. E. (1994) Learning logical exceptions in chess. PhD thesis. Department of Statistics and Modelling Science, University of Strathclyde, Scotland
[2] Barber, M. N. (1983) Finite-size scaling. Phase Transitions and Critical Phenomena, Vol. 8, 145-266, Academic Press
[3] Cheeseman, P., Kanefsky, B., and Taylor, W. M. (1991) Where the really hard problems are. Proceedings of the 12th IJCAI, 331-337
[4] Domingos, P. (1999) The role of Occam's razor in knowledge discovery. Data Mining and Knowledge Discovery, Vol. 3, Nr. 4, 409-425
[5] Fürnkranz, J. (2002) Personal communication
[6] Gent, I. P., and Walsh, T. (1995) The number partition phase transition. Research report 95-185, Department of Computer Science, University of Strathclyde
[7] Gent, I. P., and Walsh, T. (1996) The TSP phase transition. Artificial Intelligence, 88, 1-2, 349-358
[8] Giordana, A., Saitta, L., Sebag, M., and Botta, M. (2000) Analyzing Relational Learning in the Phase Transition Framework. Proc. 17th International Conf. on Machine Learning, 311-318
[9] Giordana, A., Saitta, L. (2000) Phase Transitions in Relational Learning. Machine Learning, 41(2), 217-25
[10] Giordana, A., Saitta, L., and Zini, F. (1994) Learning disjunctive concepts by means of genetic algorithms. Proc. 11th International Conf. on Machine Learning, 96-104
[11] Giordana, A., Botta, M., and Saitta, L. (1999) An experimental study of phase transitions in matching. IJCAI 1999, 1198-1203
[12] Hoos, H. H. (1998) Stochastic local search - methods, models, applications. PhD thesis, Technische Universität Darmstadt
[13] Kamath, A. P., Karmarkar, N. K., Ramakrishnan, K. G., and Resende, M. G. C. (1991) A continuous approach to inductive inference. Mathematical Programming, 57, 1992, 215-238
[14] Kearns, M. J., and Vazirani, U. V. (1994) An introduction to computational learning theory. Cambridge, MA: MIT Press
[15] Kirkpatrick, S., and Selman, B. (1994) Critical behavior in the satisfiability of random boolean expressions. Science, 264, 1297-1301
[16] Kovacic, M. (1994) MILP – a stochastic approach to Inductive Logic Programming. Proc. 4th International Workshop on Inductive Logic Programming, 123-138
[17] McAllester, D., Selman, B., and Kautz, H. (1997) Evidence for invariants in local search. Proceedings of the 14th National Conference on Artificial Intelligence, 321-326
[18] Mitchell, D., Selman, B., and Levesque, H. (1992) Hard and easy distributions of SAT problems. Proceedings of the 10th National Conference on Artificial Intelligence, AAAI Press/MIT Press, San Jose, CA, 459-465
[19] Mooney, R. J. (1995) Encouraging experimental results on learning CNF. Machine Learning, Vol. 19, 1, 79-92
[20] Nalimov, E. V., Haworth, G. McC., and Heinz, E. A. (2000) Space-efficient indexing of chess endgame tables. ICGA Journal, Vol. 23, Nr. 3, 148-162
[21] Plotkin, G. D. (1970) A note on inductive generalization. Machine Intelligence 5, Edinburgh University Press, 153-163
[22] Quinlan, J. R., and Cameron-Jones, R. M. (1995) Induction of logic programs: FOIL and related systems. New Generation Computing, Vol. 13, 287-312
[23] Selman, B., and Kautz, H. A. (1992) A new method for solving hard satisfiability problems. Proceedings of the 10th National Conference on Artificial Intelligence, San Jose, CA, 440-446
[24] Selman, B., Kautz, H. A., and Cohen, B. (1993) Local search strategies for satisfiability testing. Proceedings of the 10th National Conference on Artificial Intelligence, San Jose, CA
[25] Selman, B., and Kautz, H. A. (1993) Domain-independent extensions to GSAT: solving large structured satisfiability problems. Proceedings of IJCAI 93, 290-295
[26] Selman, B., Kautz, H. A., and Cohen, B. (1994) Noise strategies for improving local search. Proceedings of the 14th National Conference on Artificial Intelligence, 337-343
[27] Webb, G. J. (1996) Further evidence against the utility of Occam's razor. Journal of Artificial Intelligence Research, Vol. 4, 397-417
[28] Weisstein, E. W. (2002) Stirling number of the second kind, http://mathworld.wolfram.com/StirlingNumberoftheSecondKind.html
Discriminative Clustering: Optimal Contingency Tables by Learning Metrics

Janne Sinkkonen, Samuel Kaski, and Janne Nikkilä

Helsinki University of Technology, Neural Networks Research Centre
P.O. Box 9800, FIN-02015 HUT, Finland
{Janne.Sinkkonen,Samuel.Kaski,Janne.Nikkila}@hut.fi
http://www.cis.hut.fi/projects/mi
Abstract. The learning metrics principle describes a way to derive metrics to the data space from paired data. Variation of the primary data is assumed relevant only to the extent it causes changes in the auxiliary data. Discriminative clustering finds clusters of primary data that are homogeneous in the auxiliary data. In this paper, discriminative clustering using a mutual information criterion is shown to be asymptotically equivalent to vector quantization in learning metrics. We also present a new, finite-data variant of discriminative clustering and show that it builds contingency tables that detect optimally statistical dependency between the clusters and the auxiliary data. A finite-data algorithm is demonstrated to outperform the older mutual information maximizing variant.
1
Introduction
The metric of the data space determines the goodness of the results of unsupervised learning: clustering, nonlinear projection methods, and density estimation. The metric, in turn, is determined by feature extraction, variable selection, transformation, and preprocessing of the data. The principle of learning metrics aims at automating part of the process of metric selection, by learning the metric from data. It is assumed that the data comes in pairs (x, c): during learning, the primary data vectors x ∈ R^n are paired with auxiliary data c which in this paper are discrete classes. Important variation in x is supposed to be revealed by variation in the conditional density p(c|x). The distance d between two close-by data points x and x + dx is defined to be the difference between the corresponding distributions of c, measured by the Kullback-Leibler divergence D_KL. It is well known (see e.g. [3]) that the divergence is locally equal to the quadratic form with the Fisher information matrix J, i.e.

d²_L(x, x + dx) ≡ D_KL(p(c|x) ‖ p(c|x + dx)) = dx^T J(x) dx .
(1)
The Fisher information matrix has classically appeared in the context of constructing metrics for probabilistic model families. A novelty here is that the data
vector x is considered as the parameters of the Fisher information matrix, the aim being to construct a new metric into the data space. The Kullback-Leibler divergence defines a metric locally, and the metric can in principle be extended to an information metric or Fisher metric in the whole data space. We call the idea of measuring distances in the data space by approximations of (1) the learning metrics principle [1, 2]. The principle is presumably useful for tasks in which there is suitable auxiliary data available, but instead of merely predicting the values of auxiliary data the goal is to analyze, explore, or mine the primary data. Charting companies based on financial indicators is one example; there the bankruptcy risk (whether the company goes bankrupt or not) is natural auxiliary data. Learning metrics is similar to supervised learning in that the user has to choose proper auxiliary data. The difference is that in supervised learning the sole purpose is to predict the auxiliary data, whereas in learning metrics the metric is supervised while the rest of the analysis can be unsupervised, given the metric. In this paper we analyze clustering in learning metrics, or discriminative clustering (earlier also called semisupervised clustering) [2]. In general, a goal of clustering is to minimize within-cluster distortion or variation, and to maximize between-cluster variation. We apply the learning metrics by measuring distortions within each cluster by a kind of within-cluster Kullback-Leibler divergence. This causes the clusters to be internally as homogeneous as possible in the conditional distributions p(c|x) of the auxiliary variable. The mutual differences between the distributions p(c|x) of the clusters are then automatically maximized, giving a reason to call the method discriminative. We have earlier derived and analyzed discriminative clustering with information-theoretic methods, assuming an infinite amount of data. In this paper we will derive a finite-data variant and theoretical context for it, in the limit of "hard" clusters (vector quantization). It is not possible to use gradient-based algorithms for hard clusters, and hence we derive optimization algorithms for a smooth variant for which standard fast optimization procedures are then applicable.
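As a quick illustration of the local behaviour expressed by (1), the following toy snippet (ours; the two-class logistic model and its weights are arbitrary) shows the divergence between the conditional distributions at x and x + dx shrinking roughly quadratically as dx shrinks.

import numpy as np

def p_c_given_x(x, w=np.array([2.0, -1.0])):
    # toy two-class conditional model p(c|x), logistic in x
    p1 = 1.0 / (1.0 + np.exp(-x @ w))
    return np.array([p1, 1.0 - p1])

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

x = np.array([0.3, 0.7])
for eps in (1e-1, 1e-2, 1e-3):
    dx = eps * np.array([1.0, 1.0])
    print(eps, kl(p_c_given_x(x), p_c_given_x(x + dx)))
# each factor of 10 in eps reduces the divergence by roughly a factor of 100,
# the quadratic local behaviour that the Fisher matrix J(x) captures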
2 Discriminative Clustering Is Asymptotically Vector Quantization in Fisher Metrics

2.1 Discriminative Clustering
We will first introduce the cost function of discriminative clustering by applying the learning metrics principle to the classic vector quantization or K-means clustering. In vector quantization the goal is to find a set of prototypes or codebook vectors mj that minimizes the average distortion E caused when the data are
represented by the prototypes:

E = Σ_j ∫_{V_j} D(x, m_j) p(x) dx ,   (2)
where D(x, mj ) is the distortion caused by representing x by mj , and Vj is the Voronoi region of the cell j. The Voronoi region Vj consists of all points that are closer to mj than to any other model, that is, x ∈ Vj if D(x, mj ) ≤ D(x, mk )
(3)
for all k. The learning metrics principle is applied to (2) by introducing a set of distributional prototypes ψ_j, one for each partition j, and by measuring distortions of representing the distributions p(c|x) by the prototypes ψ_j. The average distortion is

E_KL = Σ_j ∫_{V_j} D_KL(p(c|x), ψ_j) p(x) dx ,   (4)
where distortion between distributions has been measured by the Kullback-Leibler divergence. Note that the Voronoi regions V_j are still kept local in the primary data space by defining them with respect to the Euclidean distortion (3). The cost (4) is minimized with respect to both sets of prototypes, m_j and ψ_j. The optimization is discussed further in Section 5. It can be shown that minimizing (4) maximizes the mutual information between the auxiliary data and the clusters, considered as a random variable [2]. This holds even for the soft variant discussed in Section 5.

2.2 Asymptotic Connection to Learning Metrics
In this section we aim to clarify the motivation behind discriminative clustering, by deriving a connection between it and the learning metrics principle of using (1) as the distance measure. The connection is only theoretical in that it holds only for the asymptotic limit of a large number of clusters, whereas in practice the number of clusters will be small. The asymptotic connection can be derived under some simplifying assumptions. It is assumed that almost all Voronoi regions become increasingly local when their number increases. (In singular cases, the data samples are identified with their equivalence classes having zero mutual distance.) There are always some non-compact and therefore inevitably non-local Voronoi regions at the borders of the data manifold, but it is assumed that the probability mass within them can be made arbitrarily small by increasing the number of regions. Assume further that the densities p(c|x) are differentiable. Then the class distributions p(c|x) can be made arbitrarily close to linear within each region Vj by increasing the number of Voronoi regions.
Let E_{V_j} denote the expectation over the Voronoi region V_j with respect to the probability density p(x). At the optimum of the cost E_KL we have ψ_j = E_{V_j}[p(c|x)], i.e. the parameters ψ_j are equal to the means of the conditional distribution within the Voronoi regions (see [2]; this holds even for the soft clusters). Since p(c|x) is linear within each Voronoi region, there exists a linear operator L_j for each V_j for which p(c|x) = L_j x. The distributional prototypes then become

ψ_j = E_{V_j}[p(c|x)] = E_{V_j}[L_j x] = L_j E_{V_j}[x] ≡ L_j m̃_j = p(c|m̃_j) ,

and the cost function becomes

E_KL = Σ_j ∫_{V_j} D_KL(p(c|x), p(c|m̃_{j(x)})) p(x) dx .

That is, given a locally linear p(c|x), there exists a point m̃_j = E_{V_j}[x] for each Voronoi region such that the Kullback-Leibler divergence appearing in the cost function can be measured with respect to the distribution p(c|m̃_{j(x)}) instead of the average over the whole Voronoi region. Since the Kullback-Leibler divergence is locally equal to a quadratic form with the Fisher information matrix, we may expand the divergence around m̃_j to get

E_KL = Σ_j ∫_{V_j} (x − m̃_{j(x)})^T J(m̃_{j(x)}) (x − m̃_{j(x)}) p(x) dx ,   (5)

where J(m̃_{j(x)}) is the Fisher information matrix evaluated at m̃_{j(x)}. Note that the Voronoi regions V_j are still defined by the parameters m_j and in the original, usually Euclidean, metric. In summary, discriminative clustering or maximization of mutual information asymptotically finds a partitioning from the family of local Euclidean Voronoi partitionings for which the within-cluster distortion in the Fisher metric is minimized. In other words, discriminative clustering asymptotically performs vector quantization in the Fisher metric by Euclidean Voronoi regions: the Euclidean metric defines the family of Voronoi partitionings {V_j}_j over which the optimization is done, and the Fisher metric is used to measure distortion inside the regions.
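A small sketch of this picture (ours, with invented variable names): given sample points, their conditional class distributions, and prototypes, it assigns points to Euclidean Voronoi regions, sets each ψ_j to the within-region mean of p(c|x) — the optimum noted above — and returns a Monte Carlo estimate of the distortion (4). It assumes p(c|x) can be evaluated and is strictly positive.

import numpy as np

def ekl(X, P, M):
    # X: (N, d) sample points, P: (N, C) rows p(c|x_i), M: (k, d) prototypes m_j
    d2 = ((X[:, None, :] - M[None, :, :]) ** 2).sum(-1)
    region = d2.argmin(1)                    # Euclidean Voronoi assignment
    cost = 0.0
    for j in range(len(M)):
        Pj = P[region == j]
        if len(Pj) == 0:
            continue
        psi = Pj.mean(0)                     # optimal psi_j: within-region mean of p(c|x)
        cost += np.sum(Pj * np.log(Pj / psi)) / len(X)
    return cost                              # Monte Carlo estimate of (4)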
3 Estimation from Finite Data

3.1 Maximum Likelihood
Note that for finite data minimizing the cost function (4) is equivalent to maximizing

L = Σ_j Σ_{x ∈ V_j} log ψ_{j,c(x)} ,   (6)
where c(x) is the index of the class of the sample x. This is the log likelihood of a piece-wise constant conditional density estimator. The estimator predicts
the distribution of C to be ψ_j within the Voronoi region j. The likelihood is maximized with respect to both the ψ_j and the partitioning, under the defined constraints.

3.2 Maximum a Posteriori
The natural extension of maximum likelihood estimation is to introduce a prior and to find the maximum a posteriori (MAP) estimate. The Bayesian framework is particularly natural for discriminative clustering since we are actually interested only in the resulting clusters, not the distribution of the auxiliary data within them. The class distributions can therefore be conveniently integrated out from the posterior (although seemingly paradoxical, the auxiliary data of course guides the clustering). Denote the observed auxiliary data set by D^(c), and the primary data set by D^(x). We then wish to find the set of clusters {m} which maximizes the posterior

p({m}|D^(c), D^(x)) = ∫_{ψ} p({m}, {ψ}|D^(c), D^(x)) d{ψ} ,

or equivalently log p({m}|D^(c), D^(x)). Here the integration is over all ψ_j. Denote the number of classes by N_c, the number of clusters by k, and the total number of samples by N. Denote the part of the data assigned to cluster j by D_j^(c), and the number of data samples of class i in cluster j by n_ji. Further denote N_j = Σ_i n_ji. Assume the improper and separable prior p({m}, {ψ}) ∝ p({ψ}) = ∏_j p(ψ_j). Then,

p({m}|D^(c), D^(x)) ∝ ∫_{ψ} p(D^(c)|{m}, {ψ}, D^(x)) p({ψ}) d{ψ}
= ∏_j ∫_{ψ_j} p(D_j^(c)|{m}, ψ_j, D^(x)) p(ψ_j) dψ_j
= ∏_j ∫_{ψ_j} ∏_i ψ_ji^{n_ji} p(ψ_j) dψ_j ≡ ∏_j Q_j .
We will use a conjugate (Dirichlet) prior, p(ψ_j) ∝ ∏_i ψ_ji^{n_i^0 − 1}, where n^0 = {n_i^0}_i are the prior parameters common to all j, and N^0 = Σ_i n_i^0. Then the "partition-specific" density p(D_j^(c)|{m}, ψ_j) p(ψ_j) is Dirichlet with respect to ψ, and the factors Q_j of the total posterior become

Q_j = ∫_{ψ_j} p(D_j^(c)|{m}, ψ_j, D^(x)) p(ψ_j) dψ_j ∝ ∫_{ψ_j} ∏_i ψ_ij^{n_i^0 + n_ji − 1} dψ_j = ∏_i Γ(n_i^0 + n_ji) / Γ(N^0 + N_j) .

The log of the posterior probability then is

log p({m}|D^(c), D^(x)) = Σ_ij log Γ(n_i^0 + n_ji) − Σ_j log Γ(N^0 + N_j) .   (7)
In MAP estimation this function needs to be maximized.

3.3 Asymptotic Connection to Maximization of Mutual Information
It is shown that for a fixed number of clusters, the cost function (7) of the new method approaches mutual information as the number of data samples increases. Denote s_ji ≡ n_i^0 + n_ji − 1, S_j ≡ Σ_i s_ji = N^0 + N_j − N_c, and S = Σ_j S_j. Then,

log p({m}|D^(c), D^(x)) = Σ_ij log Γ(s_ji + 1) − Σ_j log Γ[(S_j + N_c − 1) + 1] .   (8)

It is straightforward to show using the Stirling approximation and Taylor approximations (Appendix A) that

(1/S) log p({m}|D^(c), D^(x)) = Σ_ij (s_ji/S) log( (s_ji/S) / (S_j/S) ) + O( N_c k (log S + 1) / S ) ,   (9)

where s_ji/S approaches p_ji, the probability of class i in cluster j, and S_j/S approaches p_j as the number of data samples increases. Hence, (9) approaches the mutual information, added by a constant.
4
Discriminative Clustering Optimizes Contingency Tables
Contingency tables (see [4]) are classical methods for measuring statistical dependency between discrete-valued (categorical) random variables. The categories are fixed before the analysis, and for two variables the co-occurrences of the categories in a sample are tabulated into a two-dimensional table. A classic example due to Fisher is to measure whether the order of adding milk and tea affects the taste. The first variable indicates the order of adding the ingredients, and the second whether the taste is better or worse. In medicine the other variable could indicate health status and the other one demographic groups. The resulting contingency table is tested for dependency between the row and column variables. The literature for various kinds of tests and uses of contingency tables is extensive, see for example [4, 5, 6, 7]. The effect of small sample sizes and/or small cell frequencies has been the subject of much controversy. Bayesian methods are principled means for coping with small data sets; below we will derive a connection between the Bayesian approach presented in [7], and our discriminative clustering method. Given discrete-valued auxiliary data, the result of any clustering method can be analyzed as a contingency table. The possible values of the auxiliary variable correspond to columns and the clusters to rows. Clustering compresses
a potentially large number of multivariate continuous-valued observations into a manageable number of categories, and the contingency table can, at least in principle, be tested for dependency. Note that the difference from the traditional use of contingency tables is that the row categories are not fixed but clustering tries to find a suitable categorization. The question here is, is discriminative clustering a good way of constructing such contingency tables? The answer is that it is optimal in the sense introduced below. Good [7] derived a "Bayesian test" for dependency in contingency tables by computing the Bayes factor against H,

P({n_ij}|H̄) / P({n_ij}|H) ,   (10)

where H is the hypothesis of statistical independence of the row and column categories. The probabilities are derived assuming mixtures of Dirichlet distributions as priors. In the special case of one fixed margin (the auxiliary data) in the contingency table, and the prior defined in Section 3.2^1, the Bayes factor is

P({n_ij}|{n(c_i)}, H̄) / P({n_ij}|{n(c_i)}, H)
= [ Γ(N^0)^k Γ(kn^0)^{N_c} Γ(N + kN^0) ∏_{i,j} Γ(n_ji + n^0) ] / [ Γ(n^0)^{N_c k} Γ(kN^0) ∏_i Γ(n(c_i) + kn^0) ∏_j Γ(N_j + N^0) ]
= p({m}|D^(c), D^(x)) × const. ,   (11)

where the constant does not depend on N_j or n_ij. Here n(c_i) denotes the number of samples in the (auxiliary) class c_i. MAP estimation for discriminative clustering is thus equivalent to constructing a dependency table that results in a maximal Bayes factor, under the constraints of the model.
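The {m}-dependent part of (7) — and, by (11), the log Bayes factor up to a constant — is easy to score for a given cluster-versus-class table. A minimal sketch of ours, assuming a symmetric Dirichlet prior with n^0 per class:

import numpy as np
from scipy.special import gammaln

def log_posterior(counts, n0=1.0):
    # counts: (k, Nc) contingency table of clusters (rows) versus classes (columns)
    counts = np.asarray(counts, dtype=float)
    Nc = counts.shape[1]
    N0 = Nc * n0
    Nj = counts.sum(axis=1)
    return gammaln(counts + n0).sum() - gammaln(N0 + Nj).sum()   # eq. (7), up to a constant

# a strongly dependent table scores higher than a perfectly independent one
print(log_posterior([[20, 0], [0, 20]]) > log_posterior([[10, 10], [10, 10]]))   # True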
5
Algorithms
Optimization of both variants of discriminative clustering, the finite data version (7) and the infinite-data version (4), is hard since the gradient is zero except on the Voronoi borders. Hence gradient-based optimization algorithms are not applicable. We have earlier [2] proposed a “smoothed” infinite-data variant which can be optimized by an on-line algorithm, reviewed below. A similar smoothed variant will be introduced for MAP estimation as well. 5.1
Algorithm for Large Data Sets
Smooth parameterized membership functions y_j(x; {m}) were introduced to the cost function (4). Their values vary between 0 and 1, and Σ_j y_j(x) = 1. The
^1 In contrast to [7], we used priors with equal total amount of "prior data" for both hypotheses.
smoothed cost function is

E_KL = Σ_j ∫ y_j(x; {m}) D_KL(p(c|x), ψ_j) p(x) dx .   (12)
The membership functions can be for instance normalized Gaussians, y_j(x) = Z^{−1}(x) e^{−‖x − m_j‖²/σ²}, where Z normalizes the sum to unity for each x. The cost function can be minimized by the following stochastic approximation algorithm. Denote the i.i.d. data pair at the on-line step t by (x(t), c(t)) and index the (discrete) value of c(t) by i, that is, c(t) = c_i. Draw two clusters, j and l, independently with probabilities given by the membership functions {y_k(x(t))}_k. Reparameterize the distributional prototypes by the "soft-max", log ψ_ji = γ_ji − log Σ_m exp(γ_jm), to keep them summed up to unity. Adapt the prototypes by

m_j(t + 1) = m_j(t) − α(t) [x(t) − m_j(t)] log( ψ_li(t) / ψ_ji(t) )   (13)
γjm (t + 1) = γjm (t) − α(t) [ψjm (t) − δmi ] ,
(14)
where δmi is the Kronecker delta. Due to the symmetry between j and l, it is possible (and apparently beneficial) to adapt the parameters twice for each t by swapping j and l in (13) and (14) for the second adaptation. Note that no updating takes place if j = l, i.e. then mj (t + 1) = mj (t). During learning the parameter α(t) decreases gradually toward zero according to a schedule that fulfills the conditions of the stochastic approximation theory. 5.2
MAP Algorithm for Finite Data Sets
In an analogous fashion to the infinite-data variant we postulate smooth membership functions y_j(x; {m}) that govern the assignment of the data x to the clusters. Then the smoothed "number" of samples of class i within cluster j becomes n_ij = Σ_{x: c(x)=i} y_j(x), and the MAP cost function (7) becomes

log p({m}|D^(c), D^(x)) = Σ_ij log Γ( n_i^0 + Σ_{x: c(x)=i} y_j(x) ) − Σ_j log Γ( N_j^0 + Σ_x y_j(x) ) .   (15)

For normalized Gaussian membership functions the gradient of the cost function with respect to the jth model vector is (Appendix B)

σ² ∂/∂m_j log p({m}|D^(c), D^(x)) = Σ_{x,l} (x − m_j) y_l(x) y_j(x) (L_{c(x),j} − L_{c(x),l}) ,   (16)

where L_ij ≡ Ψ(n_ji + n_i^0) − Ψ(N_j + N_j^0) .
Here Ψ is the digamma function, the derivative of the logarithm of Γ. The MAP estimate can then be solved with general-purpose nonlinear optimization methods. We have used the conjugate gradient algorithm. Note that Ψ approaches the logarithm when its argument grows, and hence for large data sets the gradient approaches the average of (13) over the data and the lth membership function, with ψ_ji ≈ n_ij/N_j.
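For reference, the gradient (16) is straightforward to evaluate numerically. The sketch below is ours, with invented names; it assumes the symmetric prior of Section 3.2, so that N_j^0 = N^0 = N_c n^0, and integer class labels.

import numpy as np
from scipy.special import digamma

def map_gradient(X, c, M, sigma, n0=1.0):
    # X: (N, d) data, c: (N,) integer class labels, M: (k, d) cluster prototypes
    d2 = ((X[:, None, :] - M[None, :, :]) ** 2).sum(-1)
    Y = np.exp(-d2 / sigma ** 2)
    Y /= Y.sum(1, keepdims=True)                          # normalized Gaussian memberships y_j(x)
    Nc = int(c.max()) + 1
    n = np.array([Y[c == i].sum(0) for i in range(Nc)])   # smoothed counts n_ji, shape (Nc, k)
    L = digamma(n + n0) - digamma(n.sum(0) + Nc * n0)     # L_ij = Psi(n_ji + n0_i) - Psi(N_j + N0)
    Lx = L[c]                                             # row x holds L_{c(x), .}
    grad = np.zeros(M.shape, dtype=float)
    for j in range(M.shape[0]):
        w = (Y[:, j:j + 1] * Y * (Lx[:, j:j + 1] - Lx)).sum(1)   # inner sum over l in (16)
        grad[j] = ((X - M[j]) * w[:, None]).sum(0) / sigma ** 2
    return grad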
6
Empirical Results
The algorithm is first demonstrated with a toy example in Figure 1. The data (10,000 samples) comes from a two-dimensional spherically symmetric Gaussian distribution. The two-class auxiliary data changes only in the vertical dimension, indicating that only the vertical dimension is relevant. The algorithm learns to model only the relevant dimension. As far as we know there do not exist alternative methods for precisely the same task, partitioning the primary data space to clusters that are homogeneous in terms of the auxiliary data. We have earlier compared the older mutual information maximizing variant (Section 5.1) with two clustering methods: the plain mixture of Gaussians and MDA2 [8, 9], a mixture model for the joint distribution of primary and auxiliary data. For gene expression data our algorithm outperformed the alternatives [2]. Here we will add the new variant (Section 5.2) to the comparison. A random half of the Landsat satellite data set from the UCI Machine Learning Repository (36 dimensions, six classes, and 6435 samples) was partitioned into 2-10 clusters, using the six-fold class indicator as the auxiliary data. For each number of clusters, solutions were computed for 30 values of the smoothing parameter σ, ranging from two to 100 on the logarithmic scale. All the prior parameters n_i^0 were set to unity. The models were evaluated by computing the log-posterior probability (7) of the left-out data.
Fig. 1. A demonstration of the MAP algorithm. The probability density function of the data is shown in shades of gray and the cluster centers with circles. The conditional density of one of the two auxiliary classes is shown in the inset. (Here σ = 0.4)
[Figure 2 panels, one per number of clusters: 5 clusters, 6 clusters, 7 clusters, 8 clusters, 9 clusters, 10 clusters]
Fig. 2. The performance of the conjugate-gradient MAP algorithm (solid line) compared to the older discriminative clustering algorithm (dashed line), plain mixture of Gaussians (dotted line) and MDA2, a mixture model for the joint distribution of primary and auxiliary data (dash-dotted line). Sets of clusters were computed with each method with several values of the smoothing parameter σ, and the posterior log-probability (7) of the validation data is shown for a hard assignment of each sample to exactly one cluster. Results measured with empirical mutual information (not shown) are qualitatively similar. The smallest visible value corresponds to assigning all samples to the same cluster
The log-posterior probabilities of the validation set are presented in Figure 2. For all numbers of clusters the new algorithm performed better, having a larger edge at smaller numbers of clusters. Surprisingly, in contrast to earlier experiments with other data sets, for this data set the alternative clustering methods seem to outperform the older variant of discriminative clustering. For 4–7 clusters, the models were compared by ten-fold cross-validation. The best value for σ was chosen with validation data, in preliminary tests. The new model was significantly better for all cluster numbers (paired t test, p < 0.001).
7
Conclusions
In summary, we have applied the learning metrics principle to clustering, and coined the approach discriminative clustering. It was shown that discriminative clustering is asymptotically, in the limit of a large number of clusters, equivalent
to clustering in Fisher metrics, with the additional constraint that the clusters are (Euclidean) Voronoi regions in the primary data space. In the earlier work [1] Fisher metrics were derived from explicit conditional density estimators for clustering with Self-Organizing Maps; discriminative clustering has the advantage that the (arbitrary) density estimator is not required. We have derived a finite-data discriminative clustering method that maximizes the posterior probability of the cluster centroids. There exist related methods for infinite data, proposed by us and others, derived by maximizing the mutual information [2, 10, 11]. For discrete primary data there exist also finite-data generative models [12, 13]; the main difference in our methods is the ability to derive a metric to continuous primary data spaces. Finally, we have shown that the cost function is equivalent to the Bayes factor of a contingency table with the marginal distribution of the auxiliary data fixed. The Bayes factor is the odds of the data likelihood given the hypothesis that the rows and columns are independent, vs. the alternative hypothesis of dependency. Hence, discriminative clustering can be interpreted to find a set of clusters that maximize the statistical dependency with the auxiliary data.
Acknowledgment This work was supported by the Academy of Finland, in part by the grants 50061 and 52123.
References [1] Kaski, S., Sinkkonen, J., Peltonen, J.: Bankruptcy analysis with self-organizing maps in learning metrics. IEEE Trans. Neural Networks 12 (2001) 936–947 419, 428 [2] Sinkkonen, J., Kaski, S.: Clustering based on conditional distributions in an auxiliary space. Neural Computation 14 (2002) 217–239 419, 420, 421, 424, 426, 428 [3] Kullback, S.: Information Theory and Statistics. Wiley, New York (1959) 418 [4] Agresti, A.: A survey of exact inference for contingency tables. Statistical Science 7 (1992) 131–153 423 [5] Fisher, R. A.: On the interpretation of χ2 from the contingency tables, and the calculation of p. J. Royal Stat. Soc. 85 (1922) 87–94 423 [6] Freeman, G. H., Halton, J. H.: Note on an exact treatment of contingency, goodness of fit and other problems of significance. Biometrika 38 (1951) 141–149 423 [7] Good, I. J.: On the application of symmetric Dirichlet distributions and their mixtures to contingency tables. Annals of Statistics 4 (1976) 1159–1189 423, 424 [8] Hastie, T., Tibshirani, R., Buja, A.: Flexible discriminant and mixture models. In Kay, J. Titterington, D. (eds): Neural Networks and Statistics. Oxford University Press (1995) 426 [9] Miller, D. J., Uyar, H. S.: A mixture of experts classifier with learning based on both labelled and unlabelled data. In Mozer, M., Jordan, M., Petsche, T. (eds): Advances in Neural Information Processing Systems 9. MIT Press, Cambridge, MA (1997) 571–577 426
[10] Becker, S.: Mutual information maximization: models of cortical self-organization. Network: Computation in Neural Systems 7 (1996) 7–31 428 [11] Tishby, N., Pereira, F. C., Bialek, W.: The information bottleneck method. In: 37th Annual Allerton Conference on Communication, Control, and Computing. Urbana, Illinois (1999) 428 [12] Hofmann, T., Puzicha, J., Jordan, M. I.: Learning from dyadic data. In: Kearns, M. S., Solla, S. A., Cohn, D. A. (eds): Advances in Neural Information Processing Systems 11. Morgan Kaufmann Publishers, San Mateo, CA (1998) 466–472 428 [13] Hofmann, T.: Unsupervised learning by probabilistic latent semantic analysis. Mach. Learning (2001) 177–196 428
A
Connection of MAP Estimation to Maximization of Mutual Information
The Stirling approximation log Γ(s + 1) = s log s − s + O(log s) applied to (8) yields

log p({m}|D^(c), D^(x)) = Σ_ij s_ji log s_ji − Σ_j S_j log(S_j + N_c − 1) − (N_c − 1) Σ_j log(S_j + N_c − 1) + O(N_c k (log S + 1)) .

The zeroth-order Taylor expansion log(S + n) = log S + O(n/S) gives after rearrangements, for S_j > 1,

log p({m}|D^(c), D^(x)) = Σ_ij s_ji log s_ji − Σ_j S_j log S_j + O(N_c k (log S + 1)) .

Division by S then gives (9).
B
Gradient of the MAP Cost Function
Denote for brevity t_ji = n_ji + n_i^0 and T_j = Σ_i t_ji. The gradient of (15) with respect to m_j is

∂/∂m_j log p({m}|D^(c), D^(x)) = Σ_il Σ_{x: c(x)=i} (∂/∂m_j y_l(x)) Ψ(t_li) − Σ_{x,l} (∂/∂m_j y_l(x)) Ψ(T_l)
= Σ_{x,l} (∂/∂m_j y_l(x)) [Ψ(t_{l,c(x)}) − Ψ(T_l)] .

It is straightforward to show that for normalized Gaussian membership functions

∂/∂m_j y_l(x) = (1/σ²) (x − m_j) (δ_lj − y_l(x)) y_j(x) .
Substituting this into the gradient gives

σ² ∂/∂m_j log p({m}|D^(c), D^(x)) = Σ_{x,l} (x − m_j) (δ_lj − y_l(x)) y_j(x) [Ψ(t_{l,c(x)}) − Ψ(T_l)] .   (17)

The final form (16) for the gradient results from applying the identity

Σ_l (δ_lj − y_l) y_j L_l = Σ_l y_l y_j (L_j − L_l)

to (17).
Boosting Density Function Estimators

Franck Thollard, Marc Sebban, and Philippe Ezequel

EURISE, Department of Computer Science
Université Jean Monnet de Saint-Etienne
{franck.thollard,marc.sebban,ezequel}@univ-st-etienne.fr
Abstract. In this paper, we focus on the adaptation of boosting to density function estimation, useful in a number of fields including Natural Language Processing and Computational Biology. Previously, boosting has been used to optimize classification algorithms, improving generalization accuracy by combining many classifiers. The core of the boosting strategy, in the well-known Adaboost algorithm [4], consists in updating the learning instance distribution, increasing (resp. decreasing) the weight of misclassified (resp. correctly classified) examples by the current classifier. Except in [17, 18], few works have attempted to exploit interesting theoretical properties of boosting (such as margin maximization) independently of a classification task. In this paper, we do not take into account classification errors to optimize a classifier, but rather density estimation errors to optimize an estimator (here a probabilistic automaton) of a given target density. Experimental results are presented showing the interest of our approach.
1
Introduction
Most of the machine learning algorithms in supervised learning aim at providing efficient classification rules, often by optimizing the success rate of a given classifier. However, in some other machine learning areas, such as in Natural Language Processing, the main objective rather consists in correctly estimating probability densities over strings. In such a context, the algorithms aim at modelling a target density from a learning sample, in order to assess the occurrence probability of a new instance. This way to proceed is particularly useful in machine learning areas such as shallow parsing, spelling correction, speech recognition [7, 8, 12, 19] and computational biology [2, 10]. Many algorithms are available for estimating these densities from learning data: Hidden Markov Models [7], probabilistic automata [9, 13, 21], Markov Models and their smoothing techniques [5], etc. Recently, some work has dealt with density estimation by combining some of these models [3, 20]. In this paper, we also use such a strategy by introducing a boosting approach to density estimation. Although during the last decade many papers have shown the interest of voting classification algorithms (such as boosting or bagging) (see e.g. [1, 11, 14]), to the best of our knowledge this is the first attempt to use the optimization properties of boosting in such a context. However, we think that this way to proceed
deserves further investigation. Thanks to the margin maximization principle, it not only would allow the optimisation of estimation performance, but also would avoid the tricky use of smoothing techniques, often crucial in density estimation. By combining many weak hypotheses (usually a hypothesis is a classifier), the algorithm Adaboost [4] generates a relevant final weighted classifier. Recently, many theoretical results have justified the relevance of the boosting in machine learning [15]. But so far, its use has been limited to optimize classification tasks, despite recent original extensions to prototype selection [18] and feature selection [17]. Using boosting for estimating density functions is much more difficult than a simple classification optimization. It requires modification of the weight update rule, the core of the Adaboost algorithm, that we recall in section 2. Here, we do not aim to generate an efficient classifier which minimizes the rate of misclassified examples, but rather we aim to automatically and correctly estimate target density function. We show in section 3 that the final combined estimator must then minimize estimation errors (i.e. over and under estimations) regarding the original distribution. For the experimental study, a probabilistic automaton is used during the boosting step as a weak estimator. We present it in section 4. Finally, its performance is tested on a language modelling task, and results are presented in section 5.
2
Properties of Boosting
Boosting consists in combining many (T) weak hypotheses produced from various distributions D_t(e) over the learning set (LS). The pseudo-code of the original boosting algorithm, called Adaboost [4], is described by Algorithm 1. At the beginning of the process, each instance e is initially distributed according to a uniform density D_1(e) (note that a given example can of course occur many times in LS, resulting in a higher density at this point). At each stage t, Adaboost decreases (resp. increases) the weight of the training instances, a priori labeled y(e), correctly (resp. incorrectly) classified by the current weak hypothesis h_t. Boosting thus forces the weak learner to learn the hardest examples. The weighted combination H(e) of all the weak hypotheses results in a better performing model. Schapire and Singer [16] proved that, in order to minimize the training error, one must seek to minimize Z_t (the normalization factor, i.e. the sum of the updates) on each round of boosting. It is easy to show that for minimizing the objective function Z_t, the confidence α_t of each weak hypothesis h_t (used in the final combined classifier H) is (1/2) log((1 − ε_t)/ε_t). In order to introduce our boosting approach to density estimation, we recall here notations already proposed in Schapire and Singer [16]. Suppose that y(e) ∈ {−1, +1} and that the range of each weak hypothesis h_t is restricted to −1, 0, +1.
Algorithm 1: Pseudo-code for AdaBoost.
Data: A learning sample LS, a number of iterations T, a weak learner WL
Result: An aggregated classifier H
Initialize distribution: ∀e ∈ LS, D_1(e) = 1/|LS|
for t = 1 to T do
    h_t = WL(LS, D_t)
    ε_t = Σ_{e: y(e) ≠ h_t(e)} D_t(e)
    α_t = (1/2) log((1 − ε_t)/ε_t)
    Distribution update: ∀e ∈ LS, D_{t+1}(e) = D_t(e) e^{−α_t y(e) h_t(e)} / Z_t
Return H s.t. H(e) = (1/T) ( Σ_{t=1}^T α_t h_t(e) )
Let W^{−1}, W^0 and W^{+1} be defined by

W^b = Σ_{e ∈ LS: y(e) h_t(e) = b} D_t(e) .

Using symbols + and − for +1 and −1, the following property is then satisfied:

W^+ + W^− + W^0 = 1 .

W^+ (resp. W^−) then describes the sum of the weights of the correctly (resp. incorrectly) classified instances. W^0 describes the part of the instances unclassified by the current classifier (for example, a point located on a linear separator).
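A compact numeric sketch (ours) of one round of Algorithm 1 together with the W quantities; labels and predictions are assumed to be numpy arrays over {−1, +1}.

import numpy as np

def adaboost_round(D_t, y, h):
    W_plus = D_t[y == h].sum()                   # weight of correctly classified examples
    W_minus = D_t[y != h].sum()                  # weighted error epsilon_t of h_t
    alpha = 0.5 * np.log((1.0 - W_minus) / W_minus)
    D_next = D_t * np.exp(-alpha * y * h)        # up-weight mistakes, down-weight hits
    return alpha, D_next / D_next.sum()          # D_{t+1}; here W_plus + W_minus = 1 when W^0 = 0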
3
Boosting a Density Estimator
In our framework, the weak hypothesis ht is not a classifier which labels an example e, by giving it a negative or positive label (−1 or +1). In Natural Language modelling, the examples are not split into two negative and positive classes. e is usually described by a symbol and its context, i.e. the beginning of the string in which it appears. Hence, ht is now a model which must provide an occurrence probability for a given example e. The objective is then to compare the current inferred distribution Dt with the original density D1 which is the target distribution. D1 is not the uniform distribution anymore (as in Adaboost), but rather the distribution over the data. D1 (e) describes in fact a conditional probability, to observe in the learning sample a symbol given its context. We aim to fit the original distribution D1 and the distribution Dt estimated by ht . In such a context, we cannot use the instance classes (y(e) in Adaboost). The use of the weight update rule according to the correct or incorrect prediction of h(t), is then impossible. To allow this adaptation of boosting to density estimation, we must solve the following problems:
1. We must redefine W^+ and W^−, which describe respectively the proportion of correctly and incorrectly classified examples in the standard Adaboost. What is now a good or a bad prediction of the weak hypothesis?
2. Are there examples not modelled by a given weak hypothesis (corresponding to the quantity W^0 in Adaboost)?
3. We must redefine the weight update rule, taking into account the quality of the current estimator.
As to the first problem, we can enumerate three mutually exclusive cases for each learning instance e (see the example described in Figure 1):
1. The weak hypothesis h_t provides a good estimate of the probability of e. The weights of such points will be described by W^+ in our approach, defined as follows: W^+ = Σ_{e: D_t(e) = D_1(e)} D_t(e).
2. The weak hypothesis h_t under-estimates the probability of e. The weights of such points will be described by W_1^−, the first part of W^−, the weighted sum of instances incorrectly treated by the weak hypothesis. W_1^− is then defined as follows: W_1^− = Σ_{e: (D_t(e) − D_1(e)) < 0} D_t(e).
3. The algorithm over-estimates the probability of e. The weights of such points will be denoted W_2^−, the second part of W^−. W_2^− is then defined as follows: W_2^− = Σ_{e: (D_t(e) − D_1(e)) > 0} D_t(e).
Contrary to Adaboost, which accepts instances unclassified by the weak hypothesis (those described by W^0), the estimator provides a given density for each learning example, resulting in W^0 = 0. That deals with the second problem. We then handle three quantities, which satisfy the following property:

W^+ + W_1^− + W_2^− = 1 .

Finally, in order to correct the estimation error of h_t, we will increase (resp. decrease) the density of examples under-estimated (resp. over-estimated) by the
Fig. 1. Estimation errors
hypothesis. The weight of the correctly estimated examples remains the same. We will then use the following general weight update rule:

D_{t+1}(e) = Dt(e) · e^{−αt (Dt(e) − D1(e))} / Zt

where Zt is the normalization factor: Zt = Σ_e Dt(e) · e^{−αt (Dt(e) − D1(e))}.
As in AdaBoost, the confidence level αt of the current weak hypothesis can be assessed by minimizing Zt. Actually, minimizing Zt results in minimizing the error rate of the weak hypothesis, since the main part of the Zt quantity is due to misclassified instances. In our adaptation to density estimation, misestimated instances can be either under- or over-estimated, and minimizing Zt would attribute more relevance to under-estimated instances than to over-estimated ones. In such a context, a better optimization consists in minimizing the following objective function:

Zt* = Σ_e Dt(e) · e^{−αt |Dt(e) − D1(e)|}
The confidence αt of the weak hypothesis is determined by minimizing Zt*:

∂Zt*/∂αt = − Σ_e |Dt(e) − D1(e)| · Dt(e) · e^{−αt |Dt(e) − D1(e)|}

Replacing e^{−αt |Dt(e) − D1(e)|} by its power series, we obtain

∂Zt*/∂αt = Σ_{n ≥ 0} Σ_e −(−αt)^n |Dt(e) − D1(e)|^{n+1} Dt(e) / n!
Since |Dt(e) − D1(e)| ∈ [0, 1], we can assume that, for n ≥ 2, |Dt(e) − D1(e)|^{n+1} is negligible, and then

∂Zt*/∂αt ≈ − Σ_e |Dt(e) − D1(e)| · Dt(e) + αt Σ_e |Dt(e) − D1(e)|² · Dt(e)

The value of αt for which ∂Zt*/∂αt = 0 is then

αt = Σ_e |Dt(e) − D1(e)| · Dt(e) / Σ_e |Dt(e) − D1(e)|² · Dt(e) = E(δt(e)) / E(δt(e)²)
where δt(e) = |Dt(e) − D1(e)|, E(δt(e)) is its first statistical moment, and E(δt(e)²) its second statistical moment. Despite our approximation, we can note that αt keeps the following interesting properties:
1. αt tends to ∞ when the estimated density tends towards the initial distribution. We then highly weight such a good weak hypothesis.
2. αt tends to 0 for estimated densities which have no overlap with the initial distribution.
The pseudo-code of our boosting algorithm, called PdfBoost, is presented in Algorithm 2.
Algorithm 2: PdfBoost.
D1(e) is the probability to observe e in the learning sample, ∀e ∈ LS;
for t = 2 to T do
    Build an automaton ht using Dt−1;
    Get an estimation Dt of the probability density;
    Compute the confidence αt = E(δt(e)) / E(δt(e)²), where δt(e) = |Dt(e) − D1(e)|;
    Update: ∀e ∈ LS, D_{t+1}(e) = Dt(e) e^{−αt (Dt(e) − D1(e))} / Zt   /* Zt is a normalization factor */;
Return the final model aggregating the T weighted distributions:
    D*(e) = (1 / Σ_t αt) Σ_{t=1}^{T} αt Dt(e)
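As an illustration only, the sketch below mirrors one PdfBoost distribution update and the final aggregation, using plain dictionaries mapping each example to a probability. The data representation is an assumption made for the sketch, not the authors' implementation.

import math

def pdfboost_confidence(D1, Dt):
    # alpha_t = E(delta_t) / E(delta_t^2), expectations taken under D_t
    num = sum(Dt[e] * abs(Dt[e] - D1[e]) for e in D1)
    den = sum(Dt[e] * abs(Dt[e] - D1[e]) ** 2 for e in D1)
    return num / den

def pdfboost_step(D1, Dt, alpha_t):
    # D_{t+1}(e) = D_t(e) * exp(-alpha_t * (D_t(e) - D_1(e))) / Z_t
    unnorm = {e: Dt[e] * math.exp(-alpha_t * (Dt[e] - D1[e])) for e in D1}
    Z = sum(unnorm.values())
    return {e: w / Z for e, w in unnorm.items()}

def pdfboost_aggregate(models, alphas):
    # D*(e) = (1 / sum_t alpha_t) * sum_t alpha_t * D_t(e)
    total = sum(alphas)
    return {e: sum(a * Dt[e] for a, Dt in zip(alphas, models)) / total
            for e in models[0]}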
4 Probabilistic Automaton as Weak Hypothesis
We recall that boosting can aggregate many weak hypotheses, i.e. many models. This feature induces two kinds of constraints on the type of model used: on the one hand, each model must be compact; on the other hand, its use must be very efficient. We decided to use, for the experimental study, Probabilistic Deterministic Finite State Automata, since they are, on the one hand, more compact than, say, the N-gram model, and on the other hand, the determinism makes them faster than non-deterministic probabilistic automata (i.e. Hidden Markov Models) when used in real applications. We present here the formal definition of the model and the inference algorithm. A Probabilistic Finite Automaton (PFA) A is a 7-tuple (Σ, Q^A, qI^A, ξ^A, δ^A, γ^A, F^A) where Σ is the alphabet, i.e. a finite set of symbols; Q^A is the set of states; qI^A ∈ Q^A is the initial state; ξ^A ⊂ Q^A × Σ × Q^A × (0,1] is a set of probabilistic transitions; F^A : Q^A → [0,1] is the “end of parsing” probabilistic function. The functions δ^A and γ^A, from Q^A × Σ to Q^A and (0,1] respectively, are defined as: δ^A(qi, σ) = qj iff ∃ p ∈ (0,1] : (qi, σ, qj, p) ∈ ξ^A, and γ^A(qi, σ) = p iff ∃ qj ∈ Q^A : (qi, σ, qj, p) ∈ ξ^A. These functions can be trivially extended to Q^A × Σ*. We require that for all states q, Σ_{σ,q'} ξ^A(q, σ, q') + F^A(q) = 1. We assume that all states are reachable from the start state with non-zero probability, and that the automaton terminates with probability one. This then defines a distribution over Σ*. Pr_A(x) = γ^A(qI, x) × F^A(δ^A(qI, x)) will be the probability of x w.r.t. the automaton A. These automata differ from the ones used in many papers in that they define a probability distribution over Σ* and not over Σ^n with n a constant. Let LS denote a positive sample, i.e. a set of strings belonging to the probabilistic language we are trying to model. Let PTA(LS) denote the prefix tree
acceptor built from a positive sample LS. The prefix tree acceptor is an automaton that only accepts the strings in the sample and in which common prefixes are merged together, resulting in a tree-shaped automaton. Let PPTA(LS) denote the probabilistic prefix tree acceptor. It is the probabilistic extension of the PTA(LS) in which each transition has a probability related to the number of times it is used while generating, or equivalently parsing, the positive sample. Let C(q) denote the count of state q, that is, the number of times the state q was used while generating LS from PPTA(LS). Let C(q, #) denote the number of times a string of LS ended on q. Let C(q, a) denote the count of the transition (q, a) in PPTA(LS). The PPTA(LS) is the maximal likelihood estimate built from LS. In particular, for PPTA(LS) the probability estimates are γ̂(q, a) = C(q, a) / C(q), a ∈ Σ ∪ {#}. Figure 2 exhibits a PPTA and the learning set it is built from.

Fig. 2. PPTA built with LS = {aac, λ, aac, abd, aac, aac, abd, abd, a, ab, λ}

We now present the second tool used by the generic algorithm: the state merging operation. This operation induces two modifications of the automaton: (i) it modifies the structure (figure 3, left) and (ii) the probability distribution (figure 3, right). It applies to two states. Merging two states can lead to non-determinism; the states that create non-determinism are then recursively merged. When state q results from the merging of the states q' and q'', the following equality must hold in order to keep an overall consistent model:

γ(q, a) = (C(q', a) + C(q'', a)) / (C(q') + C(q'')) , ∀a ∈ Σ ∪ {#}
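A small sketch of these counts, of the maximum likelihood estimate γ̂, and of the merge update, using plain dictionaries with string prefixes as states; the data layout is an illustrative assumption, not the authors' implementation.

from collections import defaultdict

END = "#"  # end-of-string marker

def ppta_counts(sample):
    # Counts C(q) and C(q, a) of the prefix tree acceptor; states are string prefixes.
    C_state, C_trans = defaultdict(int), defaultdict(int)
    for s in sample:
        for i in range(len(s) + 1):
            prefix = s[:i]
            C_state[prefix] += 1
            symbol = s[i] if i < len(s) else END
            C_trans[(prefix, symbol)] += 1
    return C_state, C_trans

def gamma_hat(C_state, C_trans, q, a):
    # Maximum likelihood estimate: gamma(q, a) = C(q, a) / C(q)
    return C_trans[(q, a)] / C_state[q]

def merged_gamma(C_state, C_trans, q1, q2, a):
    # Probability of symbol a after merging states q1 and q2
    return (C_trans[(q1, a)] + C_trans[(q2, a)]) / (C_state[q1] + C_state[q2])

# e.g. merging a transition seen 100/1000 times with one seen 20/105 times gives
# (100 + 20) / (1000 + 105) = 120/1105, closer to the better-supported estimate 0.1.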
One can note two properties of the update of the probabilities: (i) (C(q', a) + C(q'', a)) / (C(q') + C(q'')) is included in [C(q', a)/C(q'), C(q'', a)/C(q'')], which means that the probability of a transition after the merge has a value bounded by the two values of the transitions it comes from; (ii) the merge naturally weights more the probability of the transition that holds the more information. For instance, (20 + 100)/(1000 + 105) = 120/1105 is closer to 100/1000 than to 20/105. These remarks hold for each pair of transitions that takes part in the merge. Let us merge states qi and qj and define Pqi (resp. Pqj) as the probability distribution defined by considering state qi (resp. qj) as the initial state. Since the merge is recursively applied (see figure 3), the probability distribution after merging states qi and qj will be a kind of weighted mean between the distributions Pqi and Pqj.
Fig. 3. Merging states 5 and 2
We are now in a position to present the MDI algorithm itself, whose main features we recall below. The MDI algorithm [21] (Algorithm 3) takes two arguments: the learning set LS and a tuning parameter β. It looks for an automaton that is the result of a tradeoff between a small size and a small distance to the data. The distance measure used is the Kullback-Leibler divergence. The data are represented by the PPTA, as it is the maximum likelihood estimate of the data. While merging two states, the distance between the automaton and the data in general increases and, at the same time, the number of states and the number of transitions in general decreases. Two states are declared compatible if the impact of their merge in terms of divergence, divided by the gain in size, is smaller than the parameter β.

Algorithm 3: MDI (LS, β).
A ← Numbering_in_Breadth_First_Order(PPTA);
for qi = 1 to Nb_State(A) do
    for qj = 0 to i − 1 do
        if Compatible(A, qi, qj) < β then Merge(A, qi, qj);
Return A;
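The sketch below only mirrors the structure of Algorithm 3; the compatibility score (divergence increase divided by size gain) and the merge operation are left as hypothetical callables, since their implementation is given in [21] rather than here.

def mdi(ppta, beta, nb_states, compatible, merge):
    # Greedy MDI-style loop over states numbered in breadth-first order.
    # `compatible(A, qi, qj)` is assumed to return the divergence increase divided
    # by the size gain of merging qi and qj; `merge(A, qi, qj)` applies the merge
    # (with recursive determinization). Both are hypothetical helpers.
    A = ppta
    for qi in range(1, nb_states(A)):
        for qj in range(0, qi):
            if compatible(A, qi, qj) < beta:
                merge(A, qi, qj)
    return A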
The MDI algorithm tries to infer a small automaton that is close to the data. In fact, the bigger β, the more general (and small with respect to the number of states) the resulting automaton should be. The parameter β hence controls the level of generalization of the algorithm. Since boosting is known to overfit
Table 1. Probability updates, where Di("to") stands for Di("to" | "I'd like...C")

e                  D1  D2     D3     D4     D5     D̄1     D̄2     D̄3     D̄4     D̄5
C | I'd..D         1   .4961  .8754  .9012  1      1      .7407  .7864  .8155  .8532
to | I'd..C        1   .4304  .5931  .9254  .9265  .6061  .5157  .5419  .6391  .6977
Atlanta | I'd..to  1   .1682  .0910  .0621  .5640  .2791  .2220  .1775  .1483  .2331
with noisy data, having such a controlling parameter seems to be crucial in this context. Usually, a cross-validation protocol is used to assess the right value for the parameter β: the learner is then the algorithm with β estimated on a held-out development set. Another point of view is to consider that each value of β defines a particular algorithm that belongs to a family of learners parameterized by β. Note that the point here is not to define the value of the parameter β optimally, because the algorithm is still a heuristic method. Moreover, since boosting only needs a weak learner, we will consider the MDI algorithm, used with a fixed value of the parameter β, as a given weak learner. We will see in the next section that this choice deserves further investigation.
5 Experimental Issues
We used for this experimental study a very noisy database coming from a language modeling task, namely the ATIS task. The Air Travel Information System (ATIS) corpus [6] was developed under a DARPA speech and natural language program that focused on developing language interfaces for information retrieval systems. The corpus consists of speakers of American English making information requests such as: “I'd like to find the cheapest flight from Washington D C to Atlanta”. Since the probabilities of whole sentences are here very small (as products of all the conditional probabilities), we decided to deal with the conditional probabilities. Since the PPTA represents the maximum likelihood estimate of the data, it describes the target density D1 in PdfBoost. At each step t, Dt+1 is not an update of LS, but rather an update of the PPTA transition probabilities. Before studying the PdfBoost behavior on the whole database, we decided to test it on some conditional probabilities to see whether our intuitions are corroborated. The PdfBoost behavior is shown in Table 1, where we provide the evolution of some conditional probabilities taken from the example sentence given above (“... D C to Atlanta”). In this table, Di is the probability provided at iteration i, and D̄i is the probability provided by the first i aggregated models. As one can see, the probabilities tend rather quickly to the value of the training sample (column D1 in the table). This is true both for the non-aggregated models (columns Di) and for the aggregated ones (columns D̄i). One can note
that the aggregated models converge more slowly, hopefully leading to less overfitting than the models taken alone.
6 Behavior on the Full Task
The criterion used to estimate the quality of a model is the perplexity. Probabilistic models cannot be evaluated by classification error rate, as the fundamental problem has become the estimation of a probability distribution over the set of possible strings. The quality of the model is measured by the per-symbol log-likelihood of the strings x belonging to a test sample S, according to the distribution PA(x) defined by the hypothesis:

LL = − (1/||S||) Σ_{j=1}^{|S|} Σ_{i=1}^{|x^j|} log P(x^j_i | q^i)
where P(x^j_i | q^i) denotes the probability of generating x^j_i, the i-th symbol of the j-th string in S, given that the generation process was in state q^i. The test-sample perplexity PP is most commonly used for evaluating language models in speech applications. It is given by PP = 2^{LL}. The minimal perplexity PP = 1 is reached when the next symbol x^j_i is always predicted with probability 1 from the current state q^i (i.e. P(x^j_i | q^i) = 1), while PP = |Σ| corresponds to random guessing from an alphabet of size |Σ|. We aim here to check whether the behavior of PdfBoost is coherent with the behavior of the standard AdaBoost algorithm. With the standard algorithm, the classifier gets closer and closer to the training set. Figure 4 shows the training-set perplexity of the aggregated automata. As expected, the perplexity goes down as the number of iterations grows and stabilizes at around 100 iterations. This means that the update function chosen is adequate with respect to the perplexity.
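As a sketch (with made-up helper names, not tied to the authors' code), the per-symbol log-likelihood and the perplexity of a test sample can be computed as follows, given any function returning P(symbol | state) and the successor state of an automaton.

import math

def perplexity(test_sample, prob, step, initial_state):
    # PP = 2^LL with LL = -(1/||S||) * sum_j sum_i log2 P(x^j_i | q^i);
    # prob(state, symbol) and step(state, symbol) are assumed interfaces to the
    # (aggregated) automaton; "#" marks the end of a string and is counted as a symbol.
    log_sum, n_symbols = 0.0, 0
    for string in test_sample:
        state = initial_state
        for symbol in list(string) + ["#"]:
            log_sum += math.log2(prob(state, symbol))
            n_symbols += 1
            if symbol != "#":
                state = step(state, symbol)
    ll = -log_sum / n_symbols
    return 2.0 ** ll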
Fig. 4. Behavior of the aggregated model on the training set
Fig. 5. Behavior of the aggregated model on the development set
Figure 5 shows the behavior of the method on the development set, i.e. its behavior in generalization. As one can see, the two curves are rather different depending on the value of the tuning parameter β. It is interesting to notice that the parameter value which performs best on the training set performs worse on the development set. From our point of view, this means that having a parameter setting that prevents generalization causes over-fitting. Actually, during the first four boosting steps the development-set perplexity goes down, which shows the interest of our approach. The curves then rise, which, from our point of view, indicates over-fitting. We think that tuning the MDI parameter β at each boosting step could prevent over-fitting and thus lead to better results in generalization.
7 Conclusion and Further Work
The preliminary results presented in Figure 5 seem promising to us in that they tend to show that the behavior of PdfBoost is coherent with the one usually observed in classical boosting. The next step will be a complete study of the behavior on the development set, e.g. tuning the inference algorithm at each boosting step in order to prevent overfitting. Since boosting has already been applied to prototype selection [18], another direction for further work is to see whether there exists a unified boosting framework that could include the three settings known at the moment, i.e. classic boosting, boosting applied to prototype selection, and boosting of density function estimators.
Acknowledgments

We wish to thank Alexander Clark for useful remarks on a preliminary version of this paper.
References
[1] E. Bauer and R. Kohavi. An empirical comparison of voting classification algorithms: Bagging, boosting, and variants. Machine Learning, 36(1-2):105–139, 1999. 431
[2] P. Baldi, Y. Chauvin, T. Hunkapiller, and M. McClure. Hidden Markov models of biological primary sequence information. In Proceedings of the National Academy of Sciences, pages 1059–1063, USA, 1991. 431
[3] Andrew Brown and Geoffrey Hinton. Products of hidden Markov models. In T. Jaakkola and T. Richardson, editors, Artificial Intelligence and Statistics, pages 3–11. Morgan Kaufmann, 2001. 431
[4] Y. Freund and R. E. Schapire. A decision theoretic generalization of online learning and an application to boosting. Intl. Journal of Computer and System Sciences, 55(1):119–139, 1997. 431, 432
[5] Joshua Goodman. A bit of progress in language modeling. Technical report, Microsoft Research, 2001. 431
[6] L. Hirschman. Multi-site data collection for a spoken language corpus. In DARPA Speech and Natural Language Workshop, pages 7–14, 1992. 439
[7] Frederick Jelinek. Statistical Methods for Speech Recognition. The MIT Press, Cambridge, Massachusetts, 1998. 431
[8] D. Jurafsky and J. H. Martin. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice Hall, Englewood Cliffs, New Jersey, 2000. 431
[9] M. J. Kearns, Y. Mansour, D. Ron, R. Rubinfeld, R. E. Schapire, and L. Sellie. On the learnability of discrete distributions. In Proc. of the 25th Annual ACM Symposium on Theory of Computing, pages 273–282, 1994. 431
[10] Anders Krogh, Michael Brown, I. Saira Mian, Kimmen Sjolander, and David Haussler. Hidden Markov models in computational biology: Applications to protein modeling. Journal of Molecular Biology, 235:1501–1531, 1994. 431
[11] R. Maclin and D. Opitz. An empirical evaluation of bagging and boosting. In Proc. of the Fourteenth Natl. Conf. on Artificial Intelligence, pages 546–551, 1997. 431
[12] E. Roche and Yves Schabes. Finite-State Language Processing. MIT Press, 1997. 431
[13] D. Ron, Y. Singer, and N. Tishby. On the learnability and usage of acyclic probabilistic finite automata. In COLT'95, pages 31–40, Santa Cruz, 1995. 431
[14] E. Tjong Kim Sang. Text chunking by system combination. In CoNLL-2000 and LLL-2000, pages 151–153, Lisbon, Portugal, 2000. 431
[15] R. E. Schapire, Y. Freund, P. Bartlett, and W. Sun Lee. Boosting the margin: a new explanation for the effectiveness of voting methods. Annals of Statistics, 1998. 432
[16] R. E. Schapire and Y. Singer. Improved boosting algorithms using confidence-rated predictions. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, 1998. 432
[17] M. Sebban and R. Nock. Contribution of boosting in wrapper models. In Proc. of the Third European Conf. on Principles and Practice of Knowledge Discovery in Databases, pages 214–222, 1999. 431, 432
[18] M. Sebban, R. Nock, and S. Lallich. Boosting neighborhood-based classifiers. In Proc. of the Seventeenth Intl. Conf. on Machine Learning, 2001. 431, 432, 441
[19] A. Stolcke and S. Omohundro. Inducing probabilistic grammars by Bayesian model merging. In Second Intl. Colloquium on Grammatical Inference (ICGI-94), Lecture Notes in Artificial Intelligence 862, pages 106–118, 1994. 431
[20] F. Thollard. Improving probabilistic grammatical inference core algorithms with post-processing techniques. In Eighteenth Intl. Conf. on Machine Learning, pages 561–568, Williamstown, July 2001. Morgan Kaufmann. 431
[21] F. Thollard, P. Dupont, and C. de la Higuera. Probabilistic DFA inference using Kullback-Leibler divergence and minimality. In Pat Langley, editor, Seventeenth Intl. Conf. on Machine Learning, San Francisco, June 2000. Morgan Kaufmann. 431, 438
Ranking with Predictive Clustering Trees

Ljupčo Todorovski¹, Hendrik Blockeel², and Sašo Džeroski¹

¹ Department of Intelligent Systems, Jožef Stefan Institute, Jamova 39, SI-1000 Ljubljana, Slovenia
{Ljupco.Todorovski,Saso.Dzeroski}@ijs.si
² Department of Computer Science, Katholieke Universiteit Leuven, Celestijnenlaan 200A, B-3001 Heverlee, Belgium
Hendrik.Blockeel@cs.kuleuven.ac.be
Abstract. A novel class of applications of predictive clustering trees is addressed, namely ranking. Predictive clustering trees, as implemented in Clus, allow for predicting multiple target variables. This approach makes sense especially if the target variables are not independent of each other. This is typically the case in ranking, where the (relative) performance of several approaches on the same task has to be predicted from a given description of the task. We propose to use predictive clustering trees for ranking. As compared to existing ranking approaches which are instance-based, our approach also allows for an explanation of the predicted rankings. We illustrate our approach on the task of ranking machine learning algorithms, where the (relative) performance of the learning algorithms on a dataset has to be predicted from a given dataset description.
1 Introduction
In many cases, running an algorithm on a given task can be time consuming, especially when the algorithm is complex and complex tasks are involved. It is therefore desirable to be able to predict the performance of a given algorithm on a given task from a description (set of properties of the task) and without actually running the algorithm. The term “performance of an algorithm” is often used to denote the quality of the solution provided, the running time of the algorithm or some combination of the two. When several algorithms are available to solve the same type of task, the problem of choosing an appropriate algorithm for the particular task at hand arises. An appropriate algorithm would be an algorithm with a good performance on the given task. Being able to predict the performance of the algorithms, without actually starting them on a given task, will make the problem of choosing easier and less time consuming. We can view performance prediction as a multitarget prediction problem, where the same input (the task description) is used to predict several related targets (the performances of the different algorithms). In this context, it is the relative performance of the different algorithms that matters, and not so much the absolute performance of each of them. We are T. Elomaa et al. (Eds.): ECML, LNAI 2430, pp. 444–455, 2002. c Springer-Verlag Berlin Heidelberg 2002
thus interested in obtaining an ordering of the algorithms (called also ranking) in terms of their expected relative performance. Within the area of machine learning, many learning algorithms have been developed, especially for classification tasks. A classification task is specified by giving a table of data and indicating the target column: the pair is often referred to as a dataset. The task of predicting the performance of learning algorithms from dataset properties has been addressed within the StatLog project [6], while the task of ranking learning algorithms has been one of the major topics of study of the METAL project [1]. Both are treated as learning problems, where the results of applying selected learning algorithms on selected datasets (baselevel learning) constitute a dataset for meta-level learning. A typical meta-level dataset for ranking thus consists of two parts. The first set of columns (attributes) contains a description of the task at hand. In the case of ranking learning algorithms, it typically contains general and statistical properties of datasets (such as the number of examples and class value and the average kurtosis per numerical attribute). The second set of columns (class values) contains the performance figures for the learning algorithms on the given datasets (e.g., the classification error of C5.0, RIPPER, etc.). Many different variants of ranking have been studied within the METAL project. A prototypical ranker uses a case-based (nearest neighbor) approach. To produce a ranking of the learning algorithms on a new dataset, the most similar datasets from the meta-level dataset are chosen and the performances (rankings) of the algorithms on these datasets are averaged to obtain a prediction of the performance (ranking) on the new dataset [11]. In an alternative approach to ranking, proposed in [2], regression methods are used to estimate the (absolute) performance of each of the learning algorithms on a given task. These individual predictions are then used to obtain the ranking of the algorithms. In this paper, instead of using regression methods for predicting performances of individual algorithms, we propose the use of predictive clustering trees for ranking. In this case, a single predictive clustering tree has the ability to predict performances of all the learning algorithms at once. Thus, in addition to obtaining a ranking, we also obtain an explanation for it. The remainder of this paper is organized as follows. Section 2 describes in more detail the task of ranking of learning algorithms. Section 3 briefly describes predictive clustering trees and describes the particular formulation of the multitarget (relative) performance prediction used in our experiments. Section 4 describes the experimental setup and the results of evaluating our approach to ranking learning algorithms. Finally, Section 5 concludes with a summary and possible directions for future work.
2 Ranking of Learning Algorithms
This section describes in more detail the task of ranking of learning algorithms. This includes the machine learning algorithms ranked, the base-level datasets, the descriptions of the datasets, and the performance evaluation methodology.
Table 1. Eight machine learning algorithms for classification tasks used in our study

Acronym          Brief description
c50tree (c50t)   C5.0 - decision trees based classifier
c50rules (c50r)  decision rules extracted from a C5.0 tree
c50boost (c50b)  boosting C5.0 decision trees
ltree (lt)       linear discriminant decision trees
ripper (rip)     decision rules based classifier
mlcnb (nb)       naive Bayes classifier (MLC++)
mlib1 (nn)       1-NN nearest neighbor classifier (MLC++)
lindiscr (ld)    linear discriminant classifier

2.1 The Machine Learning Algorithms and Datasets
In this study, we analyze the relative performance of eight machine learning algorithms for classification tasks. The same set of classifiers and algorithms has been used in the related study of estimating the predictive performance of individual classifiers [2]. The set of algorithms is presented in Table 1: this is a subset of the set of ten algorithms used within the METAL project [1]. Representatives of different classification approaches are included in this set, such as decision trees, decision rules, naive Bayes, nearest neighbor and linear discriminant classifiers. The performance of these eight algorithms has been measured on a set of sixty-five classification tasks (datasets) from the UCI repository [3] and from the METAL project. The list of datasets is given in Table 2.

2.2 Dataset Descriptions
Finding a dataset characterization method that would provide a solid basis for predicting the performance of learning algorithms is probably the most important
Table 2. Sixty-five classification datasets used in our study . abalone, acetylation, agaricus-lepiota, allbp, allhyper, allhypo, allrep, australian, balance-scale, bands, breast-cancer-wisconsin, breast-cancer-wisconsin nominal, bupa, car, contraceptive, crx, dermatology, dis, ecoli, flag language, flag religion, flare c, flare c er, flare m, flare m er, flare x, flare x er fluid, german numb, glass, glass2, heart, hepatitis, hypothyroid, ionosphere, iris, kp, led24, led7, lymphography, monk1, monk2, monk3-full, mushrooms, new-thyroid, parity5 5, pima-indiansdiabetes, processed.cleveland 2, processed.cleveland 4, processed.hungarian 2, processed.hungarian 4, processed.switzerland 2, processed.switzerland 4, quisclas, sickeuthyroid, soybean-large, tic-tac-toe, titanic, tumor-LOI, vote, vowel, waveform40, wdbc, wpbc, yeast
Table 3. DCT dataset properties DCT nr examples nr num attributes nr classes missvalues total missvalues relative nr sym attributes lines with missvalues total lines with missvalues relative countattr count all value ndiscrimfunct fract cancor meanskew meankurtosis classentropy entropyattributes mutualinformation equivalent nr of attrs minattr multicorrel noisesignalratio avgattr multicorrel sdratio avgattr gini sym maxattr multicorrel maxattr gini sym avgattr relevance minattr gini sym maxattr relevance numattrswithoutliers minattr relevance minattr gfunction maxattr gfunction avgattr gfunction
aspect of meta-learning.1 Several different dataset descriptions have been used for meta-learning. One approach to dataset characterization, proposed within the StatLog project [6], is to use a set of general, statistical and information theory based measures of the dataset. The general properties include properties such as number of examples, classes and (symbolic and numeric) attributes in the dataset. Statistical properties are used to characterize numeric attributes in the dataset and they include measures such as average skewness and kurtosis of numeric attributes. Characteristics of discrete attributes are measured with information theory based measures such as average entropy and average mutual information between discrete attributes and the class. The StatLog approach gave rise to the development of the Data set Characterizing Tool (DCT) [9] within the METAL project. The set of DCT properties extends the initial set of StatLog properties. Table 3 presents the set of DCT properties used in this study. The DCT properties include also properties of the individual attributes in the dataset, such as kurtosis of each numerical attribute or entropy of each symbolic attribute. These properties cannot be directly used in propositional meta-learning, where the dataset description is a fixed-length vector of properties. In order to use the DCT properties of the individual attributes, we have to aggregate them using average, minimum or maximum function. Kalousis and Theoharis [8] have proposed an alternative approach to dataset characterization. They use histograms for fine grained aggregation of the DCT properties of the individual attributes. Histograms, used as an aggregation me1
Note that there is an important constraint on the complexity of the dataset characterization method: the dataset description should be generated faster than the performance of the learning algorithms on the dataset can be evaluated. Otherwise, the task of meta-level learning would be meaningless. However, an analysis of the computational complexity of different dataset description approaches is beyond the scope of this paper; it can be found in [7].
thod, preserve more information about the DCT properties of the individual attributes compared to the simple aggregation functions of average, minimum and maximum used in the DCT approach. For detailed description of how aggregations based on histograms are calculated see [8]. In this paper, we used the same set of histograms as the one used in [2]. This set includes histograms for four DCT properties of individual attributes and twelve DCT properties of the whole dataset. We refer to the histogram approach to dataset description as HISTO. Finally, in the landmarking approach to dataset description [10], the performances of a set of simple and fast learning algorithms, named landmarkers, are estimated and used as dataset properties. In the original study on using landmarkers for meta-learning, a set of seven landmarkers was proposed. This set includes simple classifiers, such as different versions of a decision node classifier (i.e., a decision tree with a single decision node), naive Bayes, linear discriminant and 1-nearest neighbor. However, three of the landmarkers are already included in the list of classifiers from Table 1 for which we predict the performance. Therefore, in the present study, we use the set of the remaining four landmarkers and we will refer to this approach to dataset description as LAND. 2.3
The Performance of a Learning Algorithm
When building a dataset for meta-learning, we also need an estimate of the performance of the learning algorithms on a given classification task. Most often, the performance of a learning algorithm a on a given classification task d is measured by the predictive error ERR(a, d), i.e., the percentage of incorrectly classified examples. To estimate the predictive error on test examples, unseen during the training of the classifier, a standard ten-fold cross validation method has been used.
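For illustration, ERR(a, d) can be estimated with a standard ten-fold cross-validation protocol; the sketch below uses scikit-learn purely as an example of such a protocol (the actual study ran the algorithms of Table 1, not this classifier, on the base-level datasets).

from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)             # stand-in for one base-level dataset d
clf = DecisionTreeClassifier(random_state=0)  # stand-in for one algorithm a
accuracy = cross_val_score(clf, X, y, cv=10, scoring="accuracy")
err = 1.0 - accuracy.mean()                   # ERR(a, d): mean ten-fold predictive error
print(f"estimated predictive error: {err:.3f}")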
2.4 The Performance of Ranking
The performance of ranking is measured by comparing the ranking predicted by the ranking method with the true ranking of the learning algorithms on a given dataset. We used a standard measure of similarity of two rankings, Spearman's rank correlation coefficient [11]:

r_s = 1 − 6 (Σ_{i=1}^{n} D_i²) / (n³ − n) ,   (1)

where D_i is the difference between the actual and predicted rank of the i-th algorithm and n is the number of learning algorithms. Again, to estimate the performance of the ranker on test datasets, unseen during the training of the ranker, a standard ten-fold cross validation method has been used.
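A direct computation of formula (1), as a minimal sketch; the example rank vectors are invented for illustration.

def spearman_rcc(true_rank, predicted_rank):
    # Spearman's rank correlation: r_s = 1 - 6 * sum(D_i^2) / (n^3 - n)
    n = len(true_rank)
    d2 = sum((t - p) ** 2 for t, p in zip(true_rank, predicted_rank))
    return 1.0 - 6.0 * d2 / (n ** 3 - n)

# toy example with n = 8 algorithms
true_rank      = [1, 2, 3, 4, 5, 6, 7, 8]
predicted_rank = [2, 1, 3, 5, 4, 6, 8, 7]
print(spearman_rcc(true_rank, predicted_rank))  # about 0.929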
3 Ranking with Predictive Clustering Trees
This section first briefly describes predictive clustering trees. It then discusses how they can be used to predict the errors of different learning algorithms on a given dataset simultaneously. It finally proposes to use the ranks calculated from the errors as the target variables, rather than the errors themselves.

3.1 Predictive Clustering Trees
Decision trees are most often used in the context of classification or single-target regression; i.e., they represent a model in which the value of a single variable is predicted. However, as a decision tree naturally identifies partitions of the data (coarse-grained at the top of the tree, fine-grained at the bottom), one can also consider a tree as a hierarchy of clusters. A good cluster hierarchy is one in which individuals that are in the same cluster are also similar with respect to a number of observable properties. This leads to a simple method for building trees that allow the prediction of multiple target attributes at once. If we can define a distance measure on tuples of target variable values, we can build decision trees for multi-target prediction. The standard TDIDT algorithm can be used: as a heuristic for selecting tests to include in the tree, we use the minimization of intra-cluster variance (and maximization of inter-cluster variance) in the created clustering. A detailed description of the algorithm can be found in [4]. An implementation is publicly available in the first-order learner Tilde that is included in the ACE tool [5]; however, for this paper we have used Clus, a downgrade of Tilde that works only on propositional data.
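A minimal sketch of the variance heuristic for multi-target vectors: for a candidate test, the total intra-cluster variance of the induced partition is computed, and the TDIDT loop would choose the test with the lowest value. The attribute/threshold representation is an assumption made for the sketch, not Clus code.

def cluster_variance(targets):
    # Sum of squared Euclidean distances to the cluster mean (multi-target vectors).
    n, k = len(targets), len(targets[0])
    mean = [sum(row[j] for row in targets) / n for j in range(k)]
    return sum(sum((row[j] - mean[j]) ** 2 for j in range(k)) for row in targets)

def split_score(examples, targets, attribute, threshold):
    # Total intra-cluster variance after splitting on `attribute < threshold`.
    left  = [t for x, t in zip(examples, targets) if x[attribute] < threshold]
    right = [t for x, t in zip(examples, targets) if x[attribute] >= threshold]
    score = 0.0
    for part in (left, right):
        if part:
            score += cluster_variance(part)
    return score  # the tree-building loop picks the test minimizing this score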
3.2 Ranking via Predicting Errors
The instance-based approaches to ranking predict rankings of algorithms on a dataset by predicting the errors of the algorithms on the dataset, then creating a ranking from these [11]. An instance here consists of a description of a dataset, plus the performance of eight different algorithms on that dataset. Based on these eight target values, an example can be positioned in an eight-dimensional space. In its standard mode of operation, Clus builds its trees so that the intra-cluster variance is minimized, where the variance is defined as Σ_{j=1}^{N} d(x_j, x̄)², where x̄ is the mean vector of the cluster, x_j is an element of the cluster, N is the number of elements in the cluster, and d is the Euclidean distance. So, what Clus does is try to create clusters in such a way that a given algorithm will perform similarly on all datasets in that cluster. Note that this is different from what we want: creating clusters in which several algorithms have the same relative performance. To illustrate this, suppose we have four algorithms which on two datasets score the following errors: {(0.1, 0.2, 0.3, 0.4), (0.5, 0.6, 0.7, 0.8)}
Clearly the relative performance of the four algorithms is exactly the same on these datasets, so they belong to the same cluster. However, the variance in this cluster is relatively large. Compare this to {(0.1, 0.2, 0.3, 0.4), (0.4, 0.3, 0.2, 0.1)}, which has a smaller variance than the previous cluster but is clearly worse: the relative performances are opposite.

3.3 Ranking Trees
A solution for this problem is to first rank the algorithms and to predict these ranks instead of the errors themselves. In this way, we obtain ranking trees. A ranking tree has leaves in which a ranking of the performance of different algorithms is predicted. This transformation removes fluctuations in the variance that are caused by differences in absolute rather than relative performance. Moreover, given the formula for Spearman's rank correlation coefficient (1), it is clear that a linear relationship between variance and expected Spearman correlation exists. Indeed, note that in the case when the ranks are predicted, the variance term d(x_j, x̄)² equals Σ_{i=1}^{n} D_i² from formula (1). This is true under the assumption that the exact rank number of each algorithm is predicted. This assumption is not fulfilled: instead of predicting exact ranks, the clustering tree predicts only approximations of rank numbers, e.g., (6.0, 6.4, 3.65, 6.1, 5.65, 3.5, 5.65, 3.7). Of course, by comparing these approximations we can easily obtain the following exact ranking of the eight algorithms: (6, 8, 2, 7, 4.5, 1, 4.5, 3). However, the aforementioned equivalence of variance and Spearman's correlation coefficient then no longer holds. Thus, minimizing intra-cluster variance should be seen as an approximation to maximizing Spearman's correlation coefficient. Note, however, that this approximation is far better than minimizing intra-cluster variance based on the error rates themselves.
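The conversion from approximate rank predictions to an exact ranking, with ties receiving the average rank as in the example above, can be sketched as follows; this is only one possible implementation.

def to_ranks(scores):
    # Turn predicted scores into ranks (1 = smallest); tied scores share the average rank.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # average of positions i..j, 1-based
        for pos in range(i, j + 1):
            ranks[order[pos]] = avg
        i = j + 1
    return ranks

print(to_ranks([6.0, 6.4, 3.65, 6.1, 5.65, 3.5, 5.65, 3.7]))
# -> [6.0, 8.0, 2.0, 7.0, 4.5, 1.0, 4.5, 3.0], matching the ranking given in the text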
4 Experiments
Our experiments investigate the performance of ranking with predictive clustering trees induced using the three different dataset characterization approaches presented in Section 2. Following the discussion from Section 3.3, we transformed the target error values into ranks. The remainder of this section first describes the experimental setup. It then presents the experimental results, including an example ranking tree and performance figures on the correlation between actual and predicted rankings.
4.1 Experimental Setup
Clus was run several times with the same system settings, but on different datasets that vary along two dimensions: – Language bias: DCT, HISTO, LAND, DEF – Targets: errors, ranks The first three language bias settings correspond to the three dataset characterization approaches described in Section 2. DCT uses set of properties of the whole dataset and aggregations of the properties of the individual attributes. HISTO uses more sophisticated aggregation method of histograms for aggregating the properties of the individual attributes. LAND uses estimated performances of four landmarkers for dataset description. Finally, DEF uses no information at all to induce a “default” model tree that consists of a single leaf (i.e., a model that just predicts the average of the performances encountered in the training set). For the first three cases, tests in the constructed tree are always of the form A < c, where A is a numerical attribute from one of the DCT, HISTO or LAND datasets (note that all meta-level attributes are numeric) and c some value for it (any value from A’s domain was allowed). The target values were either the errors themselves, which allows us to compare some results directly with [2], or the ranks, which (according to our explanation in Section 3.3) we hope to yield better results w.r.t. the Spearman’s rank correlation coefficient. Our evaluation is based on a ten-fold cross validation, in order to maximize comparability with the results in [2]. Unfortunately we could not use exactly the same cross validation folds. The Clus system has a number of parameters with which it can be tuned. One parameter that influences the results quite strongly is the so-called “ftest” parameter. Clus uses a stopping criterion that is based on a statistical F-test (the standard way to test whether the average intra-cluster variance after a split is significantly smaller than the original variance); the “significance level” at which this test is performed is the “ftest” parameter. Values close to 0 cause Clus to quickly stop adding nodes to the tree, yielding small trees; the value 1 yields trees of maximal size. Preliminary experiments with Clus and Tilde on this and similar datasets indicated that ftest=1 yielded best results. Therefore we adopted this setting for all the experiments described here. Except for this ftest parameter, the default values were used for all parameters. 4.2
Experimental Results
Table 4 shows the mean absolute deviations of the predicted error rates from the true ones for different learning algorithms. The left-hand side (Clus – errors) gives the results of clustering trees; these are compared to the results on the right-hand side, taken from [2]. We can see that on average, predictive clustering trees score approximately equally well as the Kernel or Cubist methods on the DCT
and HISTO meta-level datasets. Clustering trees perform worse on the LAND dataset. This is due to the fact that we decided to use a set of landmarkers that are disjoint with the set of target classifiers. In [2] seven landmarkers have been used, three of them being the same as the target classifiers. However, having meta-level attributes (landmarkers) that are the same to the meta-level class to be predicted (target classifiers) makes the task of predicting their performance trivial. Thus, the results on the LAND dataset are hard to compare. Note, however, that a single predictive clustering tree predicting the performance of all the learning algorithms at once has a very important advantage over the set of eight regression trees for predicting the performance of individual algorithms. A clustering tree provides a single model that can be easily interpreted. While the above MAD values are useful to compare our approach with previous approaches, our ultimate criterion is the Spearman’s rank correlation coefficient between the predicted ranking and the actual ranking of the methods. Spearman correlations are shown in Table 5.
Table 4. Mean absolute deviations (MADs) for a single predictive clustering tree (predicting all the error rates at once) induced with Clus compared to the MADs of a set of regression trees (one for each learning algorithm) induced with Kernel and Cubist methods. The Kernel and Cubist results are taken from [2]. Note that LAND and LAND* meta-level datasets are different

            Clus - errors                   Kernel                   Cubist
Classifier  DCT    HISTO  LAND   DEF    DCT    HISTO  LAND*    DCT    HISTO  LAND*
c50boost    0.105  0.114  0.139  0.136  0.112  0.123  0.050    0.103  0.128  0.033
c50rules    0.100  0.110  0.136  0.135  0.110  0.121  0.051    0.121  0.126  0.036
c50tree     0.101  0.109  0.137  0.139  0.110  0.123  0.054    0.114  0.130  0.044
lindiscr    0.119  0.124  0.126  0.139  0.118  0.129  0.063    0.118  0.140  0.054
ltree       0.106  0.107  0.123  0.134  0.105  0.113  0.041    0.114  0.121  0.032
mlcib1      0.120  0.124  0.144  0.155  0.120  0.138  0.081    0.150  0.149  0.067
mlcnb       0.124  0.135  0.145  0.149  0.121  0.143  0.064    0.126  0.149  0.044
ripper      0.135  0.114  0.138  0.147  0.113  0.128  0.056    0.128  0.131  0.041
Table 5. Spearman’s rank correlation coefficients (SRCCs) for the predictive clustering trees (predicting error rates and rankings) approach compared to SRCCs of other ranking approaches. Results for Cubist, Kernel and Zooming are taken from [2] Clus regression trees ranks errors Kernel Cubist Zooming DEF 0.372 0.349 0.330 0.330 0.330 0.399 0.380 0.435 0.083 0.341 DCT 0.174 0.371 HISTO 0.429 0.426 0.405 ∗ ∗ 0.266 0.197 0.090 0.190 LAND
A first observation is that for each meta-level dataset, ranking trees built from ranks score better than ranking trees built directly from error rates. This corresponds with our intuition, explained in Section 3.3. Furthermore, both clustering trees approaches have better scores than all the others, except for the kernel method with the DCT dataset, which has also the highest overall value. These experimental results provide support for the two effects we identified earlier as possibly positively influencing the results. First, predictive clustering trees capture dependencies between different algorithms better than separate predictive models for each algorithm can. Second, when using intra-cluster variance minimization as a heuristic, it is better to first convert values into ranks. We conclude this discussion with an example tree. Table 6 shows a predictive clustering tree induced on the DCT dataset with ranks as target values. Each leaf node in the tree predicts a ranking of the eight algorithms from Table 1. For example, first leaf node in the tree (marked with (*)) predicts that c50boost (c50b) will perform better than c50rules (c50r) that will perform better than ltree (lt) and so on. The tree indicates that the number of attributes with outliers is most influential for the ranking of the algorithms. It also indicates that the two properties of number of symbolic and numeric attributes in the dataset seem to have good predictive power. Further interpretation and analysis of the tree is possible but it is beyond the scope of this paper.
5 Summary and Further Work
We have used predictive clustering trees to rank (predict the relative performance of) classification algorithms according to the performance on a given dataset using dataset properties. Three different dataset descriptions were used. Two different tasks were considered: predicting actual performances and predicting relative performances (ranking). On the first task of predicting the performance of classifiers, a single clustering tree predicting performances of all classifiers at once performs as well as a set of regression trees, each of them predicting performances of an individual classifier. However, the important advantage of the clustering trees approach is that it provides a single interpretable model. On the second task of predicting ranking, the experimental results show that using ranks as target variables in clustering trees works better than using actual performances for all dataset descriptions. Ranking with a single clustering tree performs better than ranking with a set of regression trees for two out of three dataset description approaches. Finally, ranking with clustering trees outperforms also instance-based approach of Zooming. An immediate direction for further work is to extend our ranking approach to work with relational dataset descriptions, similar to the one presented in [12]. Following the relational approach, properties of individual attributes can be included in the dataset description without being aggregated using mean, maximal and minimal values or histograms. This can be easily done, due to the fact that
Table 6. An example ranking tree (see Table 1 for the legend of the algorithms' acronyms). Note that the symbol < in the leaves denotes "performs better than"
6 | | +-yes: Nr_sym_attributes > 10 | | | +-yes: ClassEntropy > 0.445 | | | | +-yes: c50b -0.068 | | | | +-yes: c50t 0 | +-yes: Nr_sym_attributes > 4 | | +-yes: c50b -1.064 +-yes: Nr_examples > 303 | +-yes: Nr_num_attributes > 0 | | +-yes: SDRatio > 1.085 | | | +-yes: c50b 1,728 | | +-yes: c50b 0.914 | | +-yes: Nr_sym_attributes > 9 | | | +-yes: c50t 3 | +-yes: c50r 215 | +-yes: ld 9.738 +-yes: MeanKurtosis > 2.891 | +-yes: Nr_examples > 303 | | +-yes: Nr_classes > 6 | | | +-yes: ld 3 +-yes: c50b
Tilde allows for relational tests to be used in the nodes of predictive clustering trees by using an appropriate language bias. Other directions for further work include consideration of additional dataset properties. Dataset properties based on the shape of decision trees induced from a datasets could be interesting in this respect. Finally, the ranking methodology proposed in the paper can be also used and evaluated on other ranking
tasks. A possible application would be ranking of the optimization algorithms performance on the basis of the optimization problem description.
Acknowledgments This work was supported in part by the METAL project (ESPRIT Framework IV LTR Grant Nr 26.357). Hendrik Blockeel is a post-doctoral fellow of the Fund for Scientific Research (FWO) of Flanders. The Clus system was implemented by Jan Struyf. Thanks to Alexandros Kalousis for providing the meta-level data.
References [1] ESPRIT METAL Project (project number 26.357): A Meta-Learning Assistant for Providing User Support in Machine Learning and Data Mining. http://www.metal-kdd.org/. 445, 446 [2] H. Bensusan and A. Kalousis. Estimating the predictive accuracy of a classifier. In Proc. of the Twelfth European Conference on Machine Learning, pages 25–36. Springer, Berlin, 2001. 445, 446, 448, 451, 452 [3] C. L. Blake and C. J. Merz. UCI repository of machine learning databases. http://www.ics.uci.edu/~mlearn/MLRepository.html. Irvine, CA: University of California, Department of Information and Computer Science, 1998. 446 [4] H. Blockeel, L. De Raedt, and J. Ramon. Top-down induction of clustering trees. In Proc. of the Fifteenth International Conference on Machine Learning, pages 55–63. Morgan Kaufmann, 1998. 449 [5] H. Blockeel, L. Dehaspe, B. Demoen, G. Janssens, J. Ramon, and H. Vandecasteele. Improving the efficiency of inductive logic programming through the use of query packs. Journal of Artificial Intelligence Research, 2002. In press. 449 [6] P. B. Brazdil and R. J. Henery. Analysis of results. In D. Michie, D. J. Spiegelhalter, and C. C. Taylor, editors, Machine learning, neural and statistical classification, pages 98–106. Ellis Horwood, Chichester, 1994. 445, 447 [7] A. Kalousis. Algorithm Selection via Meta-Learning. PhD Thesis. University of Geneva, Department of Computer Science, 2002. 447 [8] A. Kalousis and T. Theoharis. NEOMON: design, implementation and performance results of an intelligent assistant for classifier selection. Intelligent Data Analysis 3(5): 319–337, 1999. 447, 448 [9] G. Lindner and R. Studer. AST: Support for algorithm selection with a CBR approach. In Proc. of the ICML-99 Workshop on Recent Advances in Meta-Learning and Future Work, pages 38–47. J. Stefan Institute, Ljubljana, Slovenia, 1999. 447 [10] B. Pfahringer, H. Bensusan and C. Giraud-Carrier. Meta-Learning by Landmarking Various Learning Algorithms. In Proc. of the Seventeenth International Conference on Machine Learning: 743–750. Morgan Kaufmann, San Francisco, 2000. 448 [11] C. Soares and P. B. Brazdil. Zoomed ranking: Selection of classification algorithms based on relevant performance information. In Proc. of the Fourth European Conference on Principles of Data Mining and Knowledge Discovery, pages 126– 135. Springer, Berlin, 2000. 445, 448, 449 [12] L. Todorovski and S. Dˇzeroski. Experiments in meta-level learning with ILP. In Proc. of the Third European Conference on Principles of Data Mining and Knowledge Discovery, pages 98–106. Springer, Berlin, 1999. 453
Support Vector Machines for Polycategorical Classification

Ioannis Tsochantaridis and Thomas Hofmann

Department of Computer Science, Brown University, Box 1910, Providence, RI 02912, USA
{it,th}@cs.brown.edu

Abstract. Polycategorical classification deals with the task of solving multiple interdependent classification problems. The key challenge is to systematically exploit possible dependencies among the labels to improve on the standard approach of solving each classification problem independently. Our method operates in two stages: the first stage uses the observed set of labels to learn a joint label model that can be used to predict unobserved pattern labels purely based on inter-label dependencies. The second stage uses the observed labels as well as inferred label predictions as input to a generalized transductive support vector machine. The resulting mixed integer program is heuristically solved with a continuation method. We report experimental results on a collaborative filtering task that provide empirical support for our approach.
1 Introduction
The standard supervised classification setting of inferring a single discriminant function based on a finite sample of labeled patterns has been investigated for decades. More recently, the question of how to make use of additional unlabeled examples has received a lot of attention. Such methods include the Fisher kernel [12] and maximum entropy discrimination method [13], maximum likelihood estimation via EM in text categorization [15], co-training [3], transductive inference [14], and kernel expansion methods [21]. The general hope in this line of research is that unlabeled data provide useful information about the pattern distribution that can be exploited to improve the classification performance, either by inducing an improved pattern representation or by enabling a more robust estimation of the discriminant function. In most cases, certain assumptions have to be made to guarantee that unlabeled data help to improve the performance. In this paper, we investigate a more general setting, called polycategorical classification. Assume that we have multiple (binary) concepts represented by labeling processes P^j, 1 ≤ j ≤ k, i.e. each P^j denotes a joint probability distribution over labeled patterns (x, y) ∈ R^d × {−1, 1}. For each concept a sample set S^j is available, where samples in S^j have been generated i.i.d. according to P^j. The goal is to simultaneously learn all k binary classification tasks. Of course, if these tasks were unrelated then one would apply a standard classification method to each sample set S^j independently. However, we assume that there are non-trivial dependencies between the labeling processes.
Fig. 1. Illustration of the set relationships between patterns labeled by different labeling processes
Notice how this setting can be viewed as a generalization of supervised learning with unlabeled data. For each classification problem P^j we have the corresponding sample set S^j, but in addition we have patterns that occur in one or more of the other sample sets S^l, l ≠ j, and are thus annotated by labels from other labeling processes. This induces a rich set structure between the two extrema of examples that are correctly labeled according to the specific concept to be learned and examples that have not been labeled by any process (cf. Fig. 1). Intuitively, a pattern labeled by some other process P^l will on average be more useful w.r.t. P^j than an unlabeled pattern, in particular if there is some dependency between the two labeling processes. For example, if one knew a priori that two concepts are identical, P^i = P^l, then one could simply use the union S^i ∪ S^l for training of both concepts, which would drastically increase the number of available training examples. The goal in polycategorical classification is to exploit such dependencies to effectively augment the available training data in order to learn more accurate classification functions. As a motivating example of why these types of problems are actually of relevance in practice, consider the scenario of information filtering in multi-user or multi-agent systems: each user may define personalized categories for items such as text documents, movies or CDs. In particular, users may annotate items by whether an item is relevant (label +1) or irrelevant (label −1). These preferences or categories will be specific to a particular person, yet there might be similarities between user interests that induce dependencies among the category labels. For example, a document x_i labeled with y_i^j ∈ {−1, 1} by some user or agent u_j might provide evidence about how another user or agent u_l might label this example, in particular if both users have shown similar responses on items
in S j ∩ S l . There are thus two sources of evidence that are important in predicting yij given xi : the input space representation (which is ordinarily exploited in classification) and dependencies between the labeling processes. The latter is closely related to a technique known as collaborative filtering [7, 17, 20] which makes predictions or recommendations purely based on inter-label dependencies. The success of these techniques in (commercial) recommendation systems shows that a substantial amount of cross-information is contained in user profiles. In polycategorical classification, one aims at combining these two sources of evidence, the item’s feature vector representation and the dependencies between labels provided by different users. This problem has been discussed in the context of recommender systems as the problem of combining content-based and collaborative filtering, cf. for example [2, 16]. Yet, none of the methods proposed so far has shown how to generalize state-of-the-art discriminative methods to incorporate “collaborative” information. The approach we propose can be decomposed into two almost independent stages. The first stage, deals with the problem of learning a probabilistic model of inter-label dependencies. In other words, the goal of the first stage is to estimate the joint label probability P (yi = (yi1 , . . . , yij , . . . , yik )|y(xi )) for each pattern xi that occurs in one of the sample sets. Here y(xi ) denotes the set of known labels for pattern xi . By marginalization we will then obtain prior probabilities P (yij |y(xi )). Notice that these probabilities do not depend on the actual feature representation xi , but just on its observed (partial) label vector y(xi ). Since this estimate does not depend on the observation xi we will also refer to the latter as the prior label probability. In the second stage the sample sets S j are augmented by probabilistically labeled examples. The latter is then used as the input to a generalized transductive Support Vector Machine (SVM) to produce the desired classification functions. The challenge at this stage is how to combine the prior label estimates with the actual feature representations. The rest of the paper is organized as follows: Section 2 describes a statistical model and a corresponding learning algorithm to compute predictions for unobserved labels based on observed labels. Section 3 deals with the generalization of the transductive SVM, while section 4 presents an experimental evaluation on a real-world data set.
2 Modeling Inter-label Dependencies
In this section, we will completely ignore the pattern representation and solely focus on modeling inter-label dependencies. If we denote by m the total number of distinct patterns x_i, m ≡ |∪_j S^j|, then all labels can be arranged in an m × k matrix Y with entries y_i^j ∈ {−1, ?, 1}, referring to the label the j-th labeling process assigns to the i-th pattern. Here we suggestively use the special symbol '?' to denote missing entries. In most cases, this matrix will be sparse in the sense that N = Σ_j |S^j| ≪ m · k, i.e. only a very small fraction of the entries will actually be observed. The goal is to estimate a matrix Ŷ ∈ [−1, 1]^{m×k} with coefficients ŷ_i^j corresponding to the expected value of the label Y_i^j under
the model, where Y_i^j denotes the random variable associated with the label of the i-th pattern with respect to the j-th labeling process.

2.1 Log-Likelihood Function
As an objective function between the probabilistic estimates Ŷ and the observed matrix Y it is natural to consider the log-likelihood,

l(\hat{Y}; Y) = \sum_{i,j: y_i^j = 1} \log \frac{1 + \hat{y}_i^j}{2} + \sum_{i,j: y_i^j = -1} \log \frac{1 - \hat{y}_i^j}{2} ,    (1)
which we want to maximize. Notice that P(Y_i^j = ±1) = (1 ± E[Y_i^j])/2 and ŷ_i^j = E[Y_i^j] by definition, so this just measures the average log-probability of the true label under the model leading to the approximation Ŷ.

2.2 Probabilistic Latent Semantic Analysis Model
There are many possibilities to define a joint label model. In this paper, we investigate the use of the probabilistic latent semantic analysis (pLSA) approach presented in [8]. We have previously applied this model in the context of collaborative filtering [10, 9], so it seems to be a good starting point for polycategorical classification. The pLSA model can be written in the following form:

\hat{y}_i^j = \sum_{r=1}^{R} \phi_i^r \psi_j^r , \quad \text{with } \phi_i^r \in [-1, 1], \; \psi_j^r \in [0, 1] \text{ and } \sum_{r=1}^{R} \psi_j^r = 1 ,    (2)
Here R denotes the rank of the approximation, which we assume to be given for now. Notice that the total number of free parameters in the model is R · m + (R − 1) · k, which can be far less than m · k if R ≪ min{m, k}. Intuitively, we can think of (φ_i^r)_i for each r as a prototype vector with probabilistic labels for each pattern x_i and of the coefficients (ψ_j^r)_r as defining a convex combination of these vectors for the j-th classification problem. The pLSA model clearly bears a resemblance to soft-clustering models; concepts are probabilistically clustered into R groups, where each group corresponds to a super-concept that is characterized by a vector of probabilistic labels over patterns.

2.3 Expectation Maximization Algorithm
In fitting the above model, we would like to maximize the likelihood in Eq. (1) with respect to the parameters (φ, ψ). Explicitly inserting the model into the log-likelihood function and ignoring additive constants results in

l(\phi, \psi; Y) = \sum_{i,j: y_i^j = 1} \log \sum_r (1 + \phi_i^r) \psi_j^r + \sum_{i,j: y_i^j = -1} \log \sum_r (1 - \phi_i^r) \psi_j^r    (3)
Since the logarithm of a sum of terms is hard to optimize, we follow the standard Expectation-Maximization (EM) approach [5] of iteratively improving Eq. (3) until a local maximum is reached. We denote by φ(t), ψ(t) the parameter estimates at time step t of the EM procedure. The goal in step t + 1 is to improve on the estimate obtained in step t, which can be quantified in terms of the differential log-likelihood

l^{t+1} = \sum_{i,j: y_i^j = 1} \log \frac{\sum_r (1 + \phi_i^r(t+1)) \psi_j^r(t+1)}{\sum_r (1 + \phi_i^r(t)) \psi_j^r(t)} + \sum_{i,j: y_i^j = -1} \log \frac{\sum_r (1 - \phi_i^r(t+1)) \psi_j^r(t+1)}{\sum_r (1 - \phi_i^r(t)) \psi_j^r(t)} .    (4)
Using a concavity argument (Jensen's inequality) the differential log-likelihood can be lower bounded as follows

l^{t+1} \geq \sum_{i,j: y_i^j = 1} \sum_r h_{ij}^r(t) \log \frac{(1 + \phi_i^r(t+1)) \psi_j^r(t+1)}{(1 + \phi_i^r(t)) \psi_j^r(t)} + \sum_{i,j: y_i^j = -1} \sum_r h_{ij}^r(t) \log \frac{(1 - \phi_i^r(t+1)) \psi_j^r(t+1)}{(1 - \phi_i^r(t)) \psi_j^r(t)}    (5)

where

h_{ij}^r(t) \equiv \begin{cases} \dfrac{(1 + \phi_i^r(t)) \psi_j^r(t)}{\sum_s (1 + \phi_i^s(t)) \psi_j^s(t)} , & \text{if } y_i^j = 1 \\ \dfrac{(1 - \phi_i^r(t)) \psi_j^r(t)}{\sum_s (1 - \phi_i^s(t)) \psi_j^s(t)} , & \text{if } y_i^j = -1 . \end{cases}    (6)
After augmenting the lower bound in Eq. (5) by appropriate Lagrange multipliers to enforce the constraints on ψ(t + 1) one can set the gradient with respect to the new parameter estimates φ(t + 1) and ψ(t + 1) to zero. This yields explicit solutions of the following form:

\phi_i^r(t+1) = \frac{\sum_{j: y_i^j = 1} h_{ij}^r(t) - \sum_{j: y_i^j = -1} h_{ij}^r(t)}{\sum_{j: y_i^j = 1} h_{ij}^r(t) + \sum_{j: y_i^j = -1} h_{ij}^r(t)}    (7)

\psi_j^r(t+1) = \frac{\sum_{i: y_i^j = \pm 1} h_{ij}^r(t)}{\sum_s \sum_{i: y_i^j = \pm 1} h_{ij}^s(t)}    (8)
Eq. (6) corresponds to the E-step (expectation step), while Eqs. (7, 8) form the M-step (maximization step). As can be seen, the previous parameter values only enter the M-step equations through the h_{ij}^r variables. Hence one can maximize the log-likelihood by alternating E-steps and M-steps until convergence is reached. The fact that the EM algorithm converges follows from the fact that the log-likelihood is increased in every step, while being bounded from above.
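As a concrete illustration of the updates (6)-(8), the following sketch (our own, not the authors' code; variable names phi, psi and the rank R are ours) fits the pLSA label model to a sparse label matrix Y with entries in {-1, 0, +1}, where 0 plays the role of '?':

import numpy as np

def fit_plsa_labels(Y, R=5, n_iter=50, seed=0):
    """EM for the pLSA label model: Y is (m, k) with entries -1, 0 (missing), +1.
    Returns phi (m, R) with entries in [-1, 1] and psi (k, R) with rows summing to 1."""
    rng = np.random.default_rng(seed)
    m, k = Y.shape
    phi = rng.uniform(-0.1, 0.1, size=(m, R))
    psi = rng.dirichlet(np.ones(R), size=k)          # psi[j, r], convex weights over r
    obs = np.argwhere(Y != 0)                        # observed (i, j) pairs
    for _ in range(n_iter):
        num_phi = np.zeros((m, R)); den_phi = np.zeros((m, R))
        num_psi = np.zeros((k, R))
        for i, j in obs:
            s = Y[i, j]                              # +1 or -1
            h = (1 + s * phi[i]) * psi[j]            # E-step, Eq. (6), unnormalised
            h /= h.sum()
            num_phi[i] += s * h                      # numerator of Eq. (7)
            den_phi[i] += h                          # denominator of Eq. (7)
            num_psi[j] += h                          # numerator of Eq. (8)
        phi = np.where(den_phi > 0, num_phi / np.maximum(den_phi, 1e-12), phi)   # M-step (7)
        psi = num_psi / np.maximum(num_psi.sum(axis=1, keepdims=True), 1e-12)    # M-step (8)
    return phi, psi

def predict_labels(phi, psi):
    # Eq. (2): expected labels y_hat_i^j = sum_r phi_i^r psi_j^r
    return phi @ psi.T

The returned matrix predict_labels(phi, psi) plays the role of Ŷ and supplies the prior label probabilities used in Section 3.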
2.4 Comments
We have opted for the pLSA approach to model inter-label dependencies in this paper. However, we would like to point out that due to the modularity of our approach there are other options that could be employed and combined with the generalization of transductive inference SVMs presented in the subsequent section. For example, as an alternative to pLSA one could use graphical models such as Bayesian networks, where the label vector y^j for each labeling process can be treated as an instance and the Bayesian network consists of n nodes, one for every pattern x_i. This approach to collaborative filtering has been pursued in [11]. We plan to investigate this research direction in future work.
3 Transductive SVM with Probabilistic Labels
3.1 Support Vector Machines
The Support Vector Machine (SVM) [22] is a popular classification method that is based on the principle of margin maximization. SVMs generalize the linear discrimination method known as the maximum margin hyperplane. Assume we parameterize linear classifiers by a weight vector w and a bias term b, f(x) = sign(⟨w, x⟩ + b). For linearly separable data S, there are in general many hyperplanes that separate the training data perfectly. These hyperplanes form the so-called version space. The maximum margin principle suggests choosing w* and b* among the parameters in the version space so that they maximize the minimal distance (the margin) between the hyperplane and any of the training points. SVMs generalize this idea in two ways. First of all, in order to be able to deal with non-separable data sets one introduces slack variables ξ_i, one for every data point, and augments the objective function by an additional penalty term. The penalty term is usually proportional to the sum of the slack variables (L_1-norm); other choices include a squared error. With L_1-norm penalties one arrives at the following standard quadratic program for soft-margin SVMs:

\min_{w, b, \xi} \; \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} \xi_i    (9)
\text{subject to} \quad y_i (\langle w, x_i \rangle + b) \geq 1 - \xi_i , \quad \xi_i \geq 0 , \quad i = 1, \dots, n
Here n denotes the number of available training patterns. After introducing Lagrange parameters α_i for the inequality margin constraints one can explicitly solve for w, b and ξ_i to obtain the dual formulation (Wolfe dual, cf. [22])

\max_{\alpha} \; \theta(\alpha) = -\frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle + \sum_i \alpha_i    (10)
\text{subject to} \quad C \geq \alpha_i \geq 0 , \quad i = 1, \dots, n , \qquad \sum_i y_i \alpha_i = 0
Since the Gram matrix with coefficients k_{ij} = ⟨x_i, x_j⟩ is symmetric and positive semi-definite, the resulting problem is a convex quadratic minimization problem. Furthermore, in SVM learning, one can take advantage of the fact that the dual function only depends on the Gram matrix and replace the inner products between patterns in the input representation by an inner product computed via kernel functions K and simply define a new Gram matrix by k_{ij} = K(x_i, x_j). One then effectively gets a non-linear classification function in the original input representation. Details on kernel methods can be found in [19].
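For readers who want to experiment with this, the dual (10) with a precomputed Gram matrix can be handed to any off-the-shelf SVM solver. The sketch below is ours (it uses scikit-learn rather than anything from the paper) and trains a soft-margin SVM with an RBF kernel supplied as a precomputed Gram matrix:

import numpy as np
from sklearn.svm import SVC

def rbf_gram(A, B, gamma=0.5):
    # K(x, x') = exp(-gamma * ||x - x'||^2)
    d2 = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * d2)

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))
y = np.sign(X[:, 0] + 0.3 * rng.normal(size=40))     # toy labels in {-1, +1}

K = rbf_gram(X, X)
clf = SVC(C=1.0, kernel="precomputed")               # solves the dual (10) internally
clf.fit(K, y)

X_new = rng.normal(size=(5, 5))
pred = clf.predict(rbf_gram(X_new, X))               # Gram rows between test and training points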
3.2 Transductive SVMs
In transductive SVMs (TSVMs), one aims at incorporating additional unlabeled data to get more reliable estimates for the optimal discriminant. The key observation is that a discriminant function which results in small margins for an unlabeled data point will not achieve a good separation, no matter what the true label of the unlabeled data point is. This idea is formalized in TSVMs by introducing additional integer variables ȳ_i ∈ {−1, 1} to model the unknown labels and to optimize a joint objective over the integer variables and the parameters w, b or - equivalently - the dual parameters α. In the following, we use the primal formulation, mainly because it is more comprehensible for the purpose of this presentation. We assume for simplicity that the labeled patterns are numbered from 1, . . . , n and the unlabeled examples are numbered from n + 1, . . . , m.

\min_{w, \xi, \bar{y}} \; \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} \xi_i + \bar{C} \sum_{i=n+1}^{m} \xi_i    (11)
\text{subject to} \quad y_i (\langle w, x_i \rangle + b) \geq 1 - \xi_i , \quad i = 1, \dots, n
\qquad \bar{y}_i (\langle w, x_i \rangle + b) \geq 1 - \xi_i , \quad i = n+1, \dots, m
\qquad \xi_i \geq 0 , \quad i = 1, \dots, m ; \qquad \bar{y}_i \in \{-1, 1\} , \quad i = n+1, \dots, m

Alternative formulations of the TSVM problem that avoid the use of integer variables and result in non-convex optimization have been investigated in [4]. For large problems, there is (currently) no hope to find the exact solution to the above mixed integer quadratic program. Instead, one has to resort to optimization heuristics to compute an approximate solution. The heuristic proposed in [14] optimizes the integer variables in an outer loop and then solves the standard SVM-QP in the inner loop. Since [14] proposes to keep the proportion of positive and negative labels constant, labels ȳ_i and ȳ_j with ȳ_i ≠ ȳ_j are swapped
between pairs of unlabeled examples x_i, x_j, if this reduces the overall objective function. Finally, there is yet another outer loop which employs a continuation method to reduce the sensitivity of the optimization heuristic with respect to local minima. Starting from a small value for the penalty C̄, C̄ is iteratively increased until it reaches a given final value C̄* ≤ C. Notice that for small values of C̄, the labeled data dominate the objective function, so that the TSVM solution will be close to the SVM solution which can be computed exactly. As C̄
is increased, the penalty for having unlabeled data points close to the decision boundary increases and more attention will be paid to the configuration of the unlabeled data points and the imputed labels ȳ_i.
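The outer/inner loop structure of this heuristic can be summarised in a few lines. The following sketch is our own paraphrase of the procedure from [14] (label initialisation, pair swapping, and the C̄ continuation); the routine train_svm is a hypothetical placeholder standing in for the inner weighted SVM solver:

import numpy as np

def tsvm_heuristic(X_lab, y_lab, X_unl, train_svm, C=1.0, C_star=1.0):
    """Sketch of the TSVM heuristic of [14]; train_svm(X, y, weights) is assumed
    to return a decision function f(X)."""
    f = train_svm(X_lab, y_lab, np.full(len(y_lab), C))
    y_bar = np.sign(f(X_unl))                            # simplified initialisation
    C_bar = 1e-3 * C_star
    while True:
        changed = True
        while changed:
            weights = np.concatenate([np.full(len(y_lab), C), np.full(len(y_bar), C_bar)])
            f = train_svm(np.vstack([X_lab, X_unl]),
                          np.concatenate([y_lab, y_bar]), weights)
            slack = np.maximum(0.0, 1.0 - y_bar * f(X_unl))
            changed = False
            pos, neg = np.where(y_bar > 0)[0], np.where(y_bar < 0)[0]
            for i in pos:
                for j in neg:
                    if slack[i] + slack[j] > 2.0:        # swapping strictly reduces the objective
                        y_bar[i], y_bar[j] = y_bar[j], y_bar[i]
                        changed = True
                        break
                if changed:
                    break
        if C_bar >= C_star:
            return f, y_bar
        C_bar = min(2.0 * C_bar, C_star)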
3.3 SVM with Probabilistic Labels
In order to use the prior label estimates derived from inter-label dependencies, we propose to generalize TSVMs in a way that they can handle "uncertain" labels, where we think of labeled and unlabeled patterns as extreme cases of uncertain labels. Hence let us assume label probabilities ŷ_i^j for i = 1, . . . , n are given, where ŷ_i^j = y_i^j for observed labels. We will drop the superscript j to refer to a generic labeling process. Let us introduce binary integer variables ȳ_i ∈ {−1, 1} as in TSVMs and define the following optimization problem (using the same numbering convention as before)

\min \; \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} \xi_i + \bar{C} \sum_{i=n+1}^{m} \xi_i + D \, H(\bar{y}, \hat{y})    (12)

\text{where } H(\bar{y}, \hat{y}) = - \sum_{i=n+1}^{m} \left[ \frac{1 + \bar{y}_i}{2} \log \frac{1 + \hat{y}_i}{2} + \frac{1 - \bar{y}_i}{2} \log \frac{1 - \hat{y}_i}{2} \right]    (13)

\text{subject to} \quad y_i (\langle w, x_i \rangle + b) \geq 1 - \xi_i , \quad i = 1, \dots, n
\qquad \bar{y}_i (\langle w, x_i \rangle + b) \geq 1 - \xi_i , \quad i = n+1, \dots, m
\qquad \xi_i \geq 0 , \quad i = 1, \dots, m ; \qquad \bar{y}_i \in \{-1, 1\} , \quad i = n+1, \dots, m    (14)
The function H measures the cross entropy between the (deterministically) imputed labels ȳ_i and the predictions derived from the inter-label model, ŷ_i. It acts as a soft penalty that penalizes labels that deviate from the prior predictions. The relative weight D ∈ ℝ₊ controls the influence of this penalty relative to the margin penalty encoded in the slack variables ξ_i, thereby trading off the inter-label information encoded in ŷ_i with the information encoded in the feature representation x_i. In practice, one can use a cross-validation scheme to determine the optimal value for D. Notice that in the special case of ŷ_i = 0, i.e. in the case of a maximally entropic prior with a label uncertainty of one bit, our formulation reduces to TSVM, since the corresponding log-ratio term in H will reduce to a constant. For a given hyperplane, the update step for the labels ȳ_i is simple. First notice that the slack variable ξ_i will depend on ȳ_i, because the associated constraint involves ȳ_i. Since ξ_i is non-negative and large values are penalized, the optimal choice is given by

\xi_i = \max\{0, 1 - \bar{y}_i \gamma_i\} = 1 - \min\{1, \bar{y}_i \gamma_i\} , \quad \text{where } \gamma_i \equiv \langle w, x_i \rangle + b .    (15)
Notice that for data points that are strictly inside the margin tube, this value will be positive for both ȳ_i = 1 and ȳ_i = −1. It is now straightforward to
compute the optimal value for ȳ_i by comparing the cost induced by the two possible choices,

h_i^+ = \max\{0, 1 - \gamma_i\} - D \log \frac{1 + \hat{y}_i}{2}    (16)
h_i^- = \max\{0, 1 + \gamma_i\} - D \log \frac{1 - \hat{y}_i}{2}    (17)

h_i^- - h_i^+ = \min\{1, \gamma_i\} - \min\{1, -\gamma_i\} + D \log \frac{1 + \hat{y}_i}{1 - \hat{y}_i} = \begin{cases} (\gamma_i + 1) + D \log \frac{1 + \hat{y}_i}{1 - \hat{y}_i} , & \text{for } \gamma_i \geq 1 \\ (\gamma_i - 1) + D \log \frac{1 + \hat{y}_i}{1 - \hat{y}_i} , & \text{for } \gamma_i \leq -1 \\ 2 \gamma_i + D \log \frac{1 + \hat{y}_i}{1 - \hat{y}_i} , & \text{otherwise} \end{cases}    (18)

\bar{y}_i^* = \text{sign}(h_i^- - h_i^+)    (19)

Notice that if ŷ_i γ_i ≥ 0, both contributions are in agreement, i.e. the data point is on the +1/-1 side of the hyperplane and the prior probability for a +1/-1 label is higher. However, if ŷ_i γ_i < 0, these two contributions are in conflict, in which case the weighting factor D determines how to compare the log ratio with the margin difference and which one to favor, the prior belief or the location of the feature vector relative to the current decision boundary. The complete algorithm is described in pseudo-code in Fig. 2.

  Initialize C̄ to a small value
  Initialize integer variables ȳ_i = sign(ŷ_i)
  Repeat until C̄ = C̄*
      Repeat until convergence, i.e. no integer variable needs to be changed
          Compute the optimal hyperplane w, b, given the integer variables {ȳ_i}
          Re-compute the integer variables {ȳ_i} for given parameters w, b
      end
      C̄ = 2 · C̄
  end

Fig. 2. Generalized SVM algorithm for polycategorical classification
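To make the update rule concrete, here is a small self-contained sketch (ours, not the authors' implementation) of the label re-computation step (15)-(19) and of the outer loop of Fig. 2; as in the previous sketch, an inner weighted SVM solver is assumed to be available as a hypothetical train_svm routine:

import numpy as np

def update_labels(gamma, y_hat, D=1.0, eps=1e-6):
    """Eq. (16)-(19): choose y_bar_i in {-1, +1} given margins gamma_i = <w, x_i> + b
    and prior label estimates y_hat_i."""
    y_hat = np.clip(y_hat, -1 + eps, 1 - eps)          # keep the log-ratio finite
    h_plus = np.maximum(0.0, 1.0 - gamma) - D * np.log((1.0 + y_hat) / 2.0)    # cost of y_bar = +1
    h_minus = np.maximum(0.0, 1.0 + gamma) - D * np.log((1.0 - y_hat) / 2.0)   # cost of y_bar = -1
    y_bar = np.sign(h_minus - h_plus)                  # Eq. (19)
    y_bar[y_bar == 0] = 1
    return y_bar

def probabilistic_label_tsvm(X_lab, y_lab, X_unl, y_hat, train_svm, C=1.0, C_star=1.0, D=1.0):
    """Outer loop of Fig. 2 with prior label probabilities y_hat for the unlabeled points."""
    y_bar = np.sign(y_hat); y_bar[y_bar == 0] = 1
    C_bar = 1e-3 * C_star
    while True:
        for _ in range(100):                           # alternate SVM training and label updates
            weights = np.concatenate([np.full(len(y_lab), C), np.full(len(y_bar), C_bar)])
            f = train_svm(np.vstack([X_lab, X_unl]),
                          np.concatenate([y_lab, y_bar]), weights)
            new_y_bar = update_labels(f(X_unl), y_hat, D)
            if np.array_equal(new_y_bar, y_bar):
                break
            y_bar = new_y_bar
        if C_bar >= C_star:
            return f, y_bar
        C_bar = min(2.0 * C_bar, C_star)

Setting y_hat to zero everywhere recovers a plain TSVM update, which matches the remark above about the maximally entropic prior.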
4 Experiments and Results
4.1 Data Generation and Preprocessing
In order to experimentally verify the proposed method for polycategorical classification, we have used the well–known EachMovie [6] data set which contains about 1600 movies and more than 60,000 user profiles with a total number of approximately 2.8 million labels/votes. We have augmented this data set with movie synopses based on descriptions provided at [1]. The movie pages have been automatically crawled, parsed and indexed. Movies have then been represented as vectors xi in the standard term frequency representation used in the vector space model [18] for information retrieval. We have been able to obtain
Table 1. Classification accuracy results on the augmented EachMovie data set. The first row denotes accuracies obtained by ignoring the feature representation, the second row summarizes the results obtained by using the SVM. The first column refers to the case of no model for inter-label dependencies, the second column to the popularity model and the third column to the pLSA model

               independent classification   popularity baseline   pLSA model
no features    -                            66.0%                 73.0%
SVM            63.0%                        68.6%                 74.3%
descriptions for 1217 movies which constitutes the set of patterns used in the experiments. For computational reasons, we have subsampled the database and randomly selected a subset of 1000 user profiles among the profiles with at least 100 votes. The actual votes have been converted into binary labels by thresholding the ratings: 4-5 stars have been mapped to a +1 and 0-3 stars to a -1 label. For each user, the available labels have been randomly split into a training set (90%) and a test set (10%).
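For illustration, a minimal version of this preprocessing looks as follows (the file name and column layout are our assumptions, since the raw EachMovie distribution is not in a single canonical format):

import numpy as np
import pandas as pd

# Assumed columns: user_id, movie_id, rating, with ratings on a 0-5 star scale.
votes = pd.read_csv("eachmovie_votes.csv")

# keep users with at least 100 votes and subsample 1000 of them
counts = votes.groupby("user_id").size()
eligible = counts[counts >= 100].index
rng = np.random.default_rng(0)
users = rng.choice(eligible, size=1000, replace=False)
votes = votes[votes["user_id"].isin(users)].copy()

# threshold ratings into binary labels: 4-5 stars -> +1, 0-3 stars -> -1
votes["label"] = np.where(votes["rating"] >= 4, 1, -1)

# 90% / 10% train-test split of each user's labels
votes["rnd"] = rng.random(len(votes))
train = votes[votes["rnd"] < 0.9]
test = votes[votes["rnd"] >= 0.9]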
4.2 Experiments
We have performed the following experimental comparison. For each of the 1000 users, we have trained a SVM just based on the feature representation. In all the experiments we have restricted our attention to linear kernels. This provides a benchmark that is purely based on the extracted content information. Moreover, we have trained a pLSA model to predict unobserved labels based on the observed label matrix Y. We have chosen a model with R = 200, by coarse optimization based on the predictive log-likelihood. This provides a benchmark that is purely based on label dependencies. In addition, we have investigated the use of a simple popularity baseline model which estimates the expected label by uniformly averaging over the population of users. Finally, the pLSA predictions as well as the popularity predictions have been used as prior predictions for the polycategorical SVM algorithm. Table 1 summarizes the results in terms of classification accuracy. First of all, notice that the use of inter-label dependencies leads to a significant absolute improvement of more than 11% in terms of classification accuracy compared to the SVM learning. This clearly demonstrates that a lot can be gained by the polycategorical treatment compared to the straightforward approach of independently solving each classification problem. It also shows that in this particular example, the content features are relatively weak for discrimination between movies, at least given the available training sample size. It seems that individual words occurring in short movie summaries are rather weakly correlated with most users’ preferences. Secondly, notice that using the features representation
yields a small yet consistent improvement in the performance of both, the simple popularity model as well as the pLSA model for inter-label dependencies. Despite the fact that the “collaborative” information between labels seems to be more precise than the information encoded in the content descriptions, there is still extra information that can be gained from the feature representation. One also sees that the difference is larger in the case of the popularity baseline - 2.6% vs. 1.1% gain in accuracy. One also has to consider that previous experiments [9] have shown that pLSA is a highly competitive collaborative filtering technique, so improving upon it is not trivial. We speculate that the improvement will be larger in cases, where both the feature representation and the inter-label dependencies yield predictions of comparable accuracy. Along these lines, we are currently investigating ways to extract stronger features like genre information for movies.
5 Conclusion
We have presented a novel approach to jointly solving large scale classification problems with many interdependent concepts. The proposed method uses state-of-the-art classification methods, namely SVMs, to learn from feature representations. In order to incorporate inter-label dependencies the transductive SVM framework has been extended to deal with weak label information. An efficient optimization heuristic has been proposed to compute approximate solutions of the resulting mixed integer program. On a real-world data set, the proposed method outperforms both methods that are purely based on a feature representation and methods that only take into account inter-label dependencies.
Acknowledgments We would like to thank the Compaq Equipment Corporation for making the EachMovie data set available. This work has in part been sponsored by an NSF-ITR grant, award number IIS-0085836.
References [1] http://www.allmovie.com. 464 [2] C. Basu, H. Hirsh, and W. W. Cohen. Recommendation as classification: Using social and content-based information in recommendation. In Proceedings of the AAAI/IAAI, pages 714–720, 1998. 458 [3] A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. In Proc. 11th Annual Conf. Computational Learning Theory, pages 92–100, 1998. 456 [4] A. Demiriz and C. Bennett. Optimization approaches to semi-supervised learning. In M. C. Ferris, O. L. Mangasarian, and J. S. Pang, editors, Applications and Algorithms of Complementarity. Kluwer Academic Publishers, Boston, 2000. 462
[5] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, B, 39:1–38, 1977. 460 [6] http://research.compaq.com/SRC/eachmovie. 464 [7] D. Goldberg, D. Nichols, B. M. Oki, and D. Terry. Using collaborative filtering to weave an information tapestry. Communications of the ACM, 35(12):61–70, 1992. 458 [8] T. Hofmann. Unsupervised learning by probabilistic latent semantic analysis. Machine Learning Journal, 42(1):177–196, 2001. 459 [9] T. Hofmann. What people (don’t) want. In European Conference on Machine Learning (ECML), 2001. 459, 466 [10] T. Hofmann and J. Puzicha. Latent class models for collaborative filtering. In Proceedings of the International Joint Conference in Artificial Intelligence, 1999. 459 [11] C. Kardie J. Breese, D. Heckerman. Empirical analysis of predictive algorithms for collaborative filtering. In Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence, 1998. 461 [12] T. Jaakkola and D. Haussler. Exploiting generative models in discriminative classifiers. In Advances in Neural Information Processing Systems 11. MIT Press, 1998. 456 [13] T. Jaakkola, M. Meila, and T. Jebara. Maximum entropy discrimination. In Neural Information Processing Systems 12. MIT Press, 1999. 456 [14] T. Joachims. Transductive inference for text classification using support vector machines. In Proc. 16th International Conf. on Machine Learning, pages 200–209. Morgan Kaufmann, San Francisco, CA, 1999. 456, 462 [15] K. Nigam, A. McCallum, S. Thrun, and T. Mitchell. Text classification from labeled and unlabeled documents using EM. Machine Learning, 39(2-3):103–134, 2000. 456 [16] A. Popescul, L. H. Ungar, D. M. Pennock, and S. Lawrence. Probabilistic models for unified collaborative and content-based recommendation in sparse-data environments. In 17th Conference on Uncertainty in Artificial Intelligence, 2001. 458 [17] P. Resnick, N. Iacovou, M. Suchak, P. Bergstrom, and J. Riedl. Grouplens: An open architecture for collaborative filtering of netnews. In Proceedings of ACM 1994 Conference on Computer Supported Cooperative Work, 1994. 458 [18] G. Salton and M. J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, New York, 1983. 464 [19] B. Sch¨ olkopf and A. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization and Beyond. MIT Press, 2002. 462 [20] U. Shardanand and P. Maes. Social information filtering: Algorithms for automating ’word of mouth’. In Proceedings of the Computer Human Interaction Conference (CHI95), 1995. 458 [21] M. Szummer and T. Jaakkola. Kernel expansions with unlabeled examples. In Advances in Neural Information Processing Systems 13. MIT Press, 2000. 456 [22] V. N. Vapnik. The Nature of Statistical Learning Theory. Springer Verlag, Berlin, 1995. 461
Learning Classification with Both Labeled and Unlabeled Data

Jean-Noël Vittaut, Massih-Reza Amini, and Patrick Gallinari

Computer Science Laboratory of Paris 6 (LIP6), University of Pierre et Marie Curie
8 rue du capitaine Scott, 75015 Paris, France
{vittaut,amini,gallinari}@poleia.lip6.fr
Abstract. A key difficulty for applying machine learning classification algorithms for many applications is that they require a lot of handlabeled examples. Labeling large amount of data is a costly process which in many cases is prohibitive. In this paper we show how the use of a small number of labeled data together with a large number of unlabeled data can create high-accuracy classifiers. Our approach does not rely on any parametric assumptions about the data as it is usually the case with generative methods widely used in semi-supervised learning. We propose new discriminant algorithms handling both labeled and unlabeled data for training classification models and we analyze their performances on different information access problems ranging from text span classification for text summarization to e-mail spam detection and text classification.
1 Introduction
Semi-supervised learning has been introduced for training classifiers when large amounts of unlabeled data are available together with a much smaller amount of labeled data. It has recently been a subject of growing interest in the machine learning community. This paradigm particularly applies when large sets of data are produced continuously and when hand-labeling is unfeasible or particularly costly. This is the case for example for many of the semantic resources accessible via the web. In this paper, we will be particularly concerned with the application of semi-supervised learning techniques to the classification of textual data. Document and text-span classification has become one of the key techniques for handling and organizing text data. It is used to find relevant information on the web, to filter e-mails, to classify news stories, to extract relevant sentences, etc. Machine Learning techniques have been widely used for classifying textual data. Most algorithms rely on the supervised learning paradigm and require the labeling of very large amounts of documents or text-spans which is unrealistic for large corpora and for on-line learning. We introduce here a new semi-supervised approach for classification. The originality of our work lies in the design of a discriminant approach to semi-supervised learn-
ing whereas others mainly rely on generative classifiers. Compared to the latter approach, our method does not rely on any parametric assumptions about the data, and it allows for better performance than generative methods especially when there are few training data. It finally leads to very simple and fast implementation. This approach is generic and can be used with any type of data, however, we focus here in our experiments on textual data which is of particular interest to us. In our previous work, we were interested on the classification of text spans and more particularly sentences for text summarization using semi-supervised algorithms [1,2,3,4]. In [1,2], we have shown the link between CEM and the mean-squared error classification for simple linear classifiers and a sequential representation of documents. In [1] we gave a semi-supervised version of the algorithm and in [2] we extended the idea for unsupervised learning techniques. In [3] we adopted a vectorial representation of sentences rather than a sequential representation and presented a discriminant semi-supervised algorithm in a more general setting of logistic classifiers and considered a binary decision problem. In [4] we studied, in detail, the application of this method to text summarization seen as a sentence extraction task and analyzed its performances on two real world data sets. In this work, we extend this idea for any discriminant classifiers and for multi-class problems, and give results on more various information retrieval classification tasks such as e-mail filtering and web page classification. The paper is organized as follows; we first make a brief review of work on text classification, ranking and filtering, and on recent work in semi-supervised learning (section 2). In section 3, we describe our semi-supervised approach for classification and present the formal framework of the model. Finally we present a series of experiments on the UCI e-mail spam collection, on a collection from NIPS-2001 workshop for text classification and on TIPSTER SUMMAC collection for text summarization by sentence extraction. For the latter, text summarization can be considered as an instance of sentence classification where text spans are labeled relevant or irrelevant for the summary [4].
2 Related Work
2.1 Text Classification and Summarization
The field of text classification has been and is still very active. One of the early application of text classification was for author identification. The seminal work by [26] examined authorship of the different Federalist papers. More recently text classification has been applied to a wide variety of practical problems ranging from cataloging news articles to classifying web pages and book recommendation. An early and popular machine learning technique for text classification is the naive Bayes approach [20]. But a variety of other machine learning techniques has been applied to text classification. Recently support vector machines have attracted much work for this task [15,17]. Other authors have used neural networks [34] and a variety of boosting approaches [29]. But until now, no single technique has emerged as
clearly better than the others, though some recent evidence suggests that kNN and SVMs perform at least as well as other algorithms when there is a lot of labeled data for each class of interest [35]. Most studies use the simple bag-of-words document representation. Automated text summarization dates back to the fifties [21]. The different attempts in this field have shown that human-quality text summarization was very difficult since it encompasses discourse understanding, abstraction, and language generation [30]. Simpler approaches have been explored which consist in extracting representative text-spans, using statistical techniques and/or techniques based on surface, domain-independent linguistic analyses. Within this context, summarization can be defined as the selection of a subset of the document sentences which is representative of its content. This is done by ranking document sentences and selecting those with higher score and minimum overlap. This bears a close relationship to text classification. Text extraction for summarization has been cast in the framework of supervised learning for the first time in the seminal work of [19]. The authors propose a generic summarization model, which is based on a naive Bayes classifier operating on a synthetic representation of sentences. Different authors built on this idea [10,22]. All these approaches to text classification, ranking and filtering rely on the supervised learning paradigm and require labeling of text spans or documents which is performed manually. Manual tagging is often unrealistic or too costly for building textual resources so that semi-supervised learning is probably well adapted to many information retrieval tasks.

2.2 Semi-supervised Learning
The idea of combining labeled and unlabeled data came from the statistician community at the end of the 60's. The seminal paper [12] presents an iterative EM-like approach for learning the parameters of a mixture of two normals with known covariances from unlabeled data. Similar iterative algorithms for building maximum likelihood classifiers from labeled and unlabeled data followed [23,32]. [13] presented the theory of the EM algorithm, unifying in a common framework many of the previously suggested iterative techniques for likelihood maximization with missing data. All these approaches are generative, they start from a mixture density model where mixture components are identified to classes and attempt at maximizing the joint likelihood of labeled and unlabeled data. Since direct optimization is usually unfeasible, the EM algorithm is used to perform maximum likelihood estimation. Usually, for continuous variables, density components are assumed to be gaussian especially for performing asymptotic analysis. Practical algorithms may be used for more general settings, as soon as the different statistics needed for EM may be estimated, e.g. for discrete variables, non parametric techniques (e.g. histograms) are often used in practice. Using likelihood maximization of mixture models for combining labeled and unlabeled data for classification has only been recently rediscovered by the machine learning community and many papers now deal with this subject. [25] consider a mixture of experts when it is usually assumed that there is a one to one correspondence between classes and components. They propose different models
and an EM implementation. [27] propose an algorithm which is a particular case of the general semi-supervised EM described in [24], and present an empirical evaluation for text classification, they also extend their model to multiple components per class. [28] propose a kernel discrimination analysis which can be used for semi-supervised classification. [16] use EM to fill in missing feature values of examples when learning from incomplete data by assuming a mixture model. There have been considerably fewer works on discriminant semi-supervised approaches. [5] suggests to modify logistic regression, a well known classifier to incorporate unlabeled data. To do so, he maximizes the joint likelihood of labeled and unlabeled data. The co-training paradigm [8] which has been proposed independently is also related to discriminant semi-supervised training. In this approach it is supposed that data x may be described by two modalities which are conditionally independent given the class of x. Two classifiers are used, one for each modality, they operate alternatively as teacher and student. [11] present an interesting extension of a boosting algorithm which incorporates cotraining. The work of [14] also bears similarities with this technique. A transductive support vector machine [33] finds parameters for a linear separator when given labeled data and the data it will be tested on. [18] demonstrates the efficiency of this approach for several text classification tasks. [7] find small improvements on some UCI datasets with simpler variants of transduction. [36] argue both theoretically and experimentally that transductive SVMs are unlikely to be helpful for classification in general.
3 A New Discriminant Semi-supervised Algorithm for Classification
In this section, we present a new discriminant algorithm for semi-supervised learning. This algorithm is generic in the sense that it can be used with any classifier which estimates a posteriori class probabilities. We describe our algorithm in the general framework of the Classification EM (CEM) algorithm. For this we first introduce the general framework of the Classification Maximum Likelihood (CML) approach and the CEM algorithm [9,24] in Section 3.2; we then show in Section 3.3 how this framework can lead to a natural discriminant formulation. In the following we introduce briefly the framework of our work.

3.1 Framework
We consider a c-class decision problem and suppose available a set of m unlabeled data Du={xi | i = n+1,…,n+m} together with a set of labeled data Dl ={(xi,ti) | i = 1,…,n} where xi∈ℝd and ti = (t1i, …, tci) is the class indicator vector for xi. Data from Du are supposed drawn from a mixture of densities with c components {Ck}k=1,…, c in some unknown proportions {πk}k=1,…, c. We suppose that the unlabeled data have an associated missing indicator vector ti=(t1i, …, tci) for (i=n+1, …, n+m) which is a class indicator vector. We further consider that data is partitioned iteratively into c components {Ck}k=1,…, c. We will denote {Ck(j)}k=1,…, c the partition into c clusters computed by the algorithm at iteration j.
3.2 Classification Maximum Likelihood Approach
The classification maximum likelihood (CML) approach [31] is a general framework which encompasses many clustering algorithms [9,24]. It is only concerned with unsupervised learning. In Section 3.3 we will extend the CML criteria to semi-supervised learning. Samples are supposed to be generated by a mixture density:

p(x, \Theta) = \sum_{k=1}^{c} \pi_k f_k(x, \theta_k)    (1)
where the {f_k}_{k=1,…,c} are parametric densities with unknown parameters θ_k and π_k is the mixture proportion of the k-th component. The goal here is to cluster the samples into c components {C_k}_{k=1,…,c}. Under the mixture sampling scheme, samples x_i are taken from the mixture density p, and the CML criterion is [9, 24]:

\log L_{CML}(C, \pi, \theta) = \sum_{k=1}^{c} \sum_{i=n+1}^{n+m} t_{ki} \log \{ \pi_k f_k(x_i, \theta_k) \}    (2)
The CEM algorithm [9] is an iterative technique which has been proposed for maximizing (2); it is similar to the classical EM except for an additional C-step where each x_i is assigned to one and only one component of the mixture. The algorithm is described below.

CEM
Initialization: start from an initial partition P^(0).
For the j-th iteration, j ≥ 0:
  E-step. Estimate the posterior class probability that x_i belongs to C_k (i = n+1, …, n+m; k = 1, …, c):

  E[t_{ki}^{(j)} | x_i; C^{(j)}, \pi^{(j)}, \theta^{(j)}] = \frac{\pi_k^{(j)} f_k(x_i, \theta_k^{(j)})}{\sum_{k=1}^{c} \pi_k^{(j)} f_k(x_i, \theta_k^{(j)})}    (3)

  C-step. Assign each x_i to the cluster C_k^{(j+1)} with maximal posterior probability according to E[t|x].
  M-step. Estimate the new parameters (π^{(j+1)}, θ^{(j+1)}) which maximize log L_{CML}(C^{(j+1)}, π^{(j)}, θ^{(j)}).

Since the t_{ki} for the labeled data are known, this parameter is either 0 or 1 for examples in D_l, and CML can be easily modified to handle both labeled and unlabeled data [24]. The new criterion, denoted here L_C, becomes:
\log L_C(C, \pi, \theta) = \sum_{k=1}^{c} \sum_{x_i \in C_k} \log \{ \pi_k f_k(x_i, \theta_k) \} + \sum_{k=1}^{c} \sum_{i=n+1}^{n+m} t_{ki} \log \{ \pi_k f_k(x_i, \theta_k) \}    (4)
In this expression the first summation is over the labeled samples and the second is over the unlabeled samples.
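A compact numerical illustration of the semi-supervised CEM iteration may be helpful here. The sketch below is our own and uses spherical Gaussian components for simplicity (the experiments in Section 4 use normal densities in the generative case):

import numpy as np

def semi_supervised_cem(X_lab, t_lab, X_unl, c, n_iter=50, seed=0):
    """CEM for a c-component spherical Gaussian mixture; t_lab is an (n, c) 0/1 indicator matrix."""
    rng = np.random.default_rng(seed)
    X = np.vstack([X_lab, X_unl])
    n, m = len(X_lab), len(X_unl)
    t = np.vstack([t_lab, np.eye(c)[rng.integers(0, c, size=m)]])   # random initial partition of D_u
    for _ in range(n_iter):
        # M-step: proportions, means and a shared variance from the current partition
        pi = t.mean(axis=0)
        mu = (t.T @ X) / np.maximum(t.sum(axis=0)[:, None], 1e-12)
        resid = X - t @ mu
        var = max((resid**2).sum() / X.size, 1e-6)
        # E-step (Eq. 3), only needed for the unlabeled part
        d2 = ((X_unl[:, None, :] - mu[None, :, :])**2).sum(axis=2)
        logpost = np.log(pi + 1e-12) - d2 / (2 * var)
        # C-step: hard assignment of each unlabeled point
        z = logpost.argmax(axis=1)
        t[n:] = np.eye(c)[z]
    return pi, mu, var, t[n:]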
3.3 Semi-supervised Discriminant-CEM
The above generative approach indirectly computes posterior class probabilities {p(C_k|x)}_{k=1,…,c} via conditional density estimation. This could lead to poor estimates in high dimensions or when only few data are labeled, which is usually the case for semi-supervised learning. On the other hand, in high dimensions, the estimation is carried on a large number of parameters which is time consuming. Since we are dealing with a classification problem, a more natural approach is to directly estimate the posterior class probabilities p(C|x). This is known as the discriminant approach to classification. In this section, we first rewrite the semi-supervised CML criterion (4) in a suitable form which puts in evidence the role of posterior probabilities. We then show how it is possible to maximize this likelihood with discriminant classifiers. This leads to a modified CEM algorithm. Using Bayes rule, the CML criterion (4) can be rewritten as:

\log L_C(C, \Theta) = \log \tilde{L}_C(C, \Theta) + \sum_{i=1}^{n+m} \log p(x_i, \Theta)    (5)

where

\log \tilde{L}_C(C, \Theta) = \sum_{k=1}^{c} \sum_{x_i \in C_k} \log \{ p(C_k | x_i, \pi_k, \theta_k) \} + \sum_{k=1}^{c} \sum_{i=n+1}^{n+m} t_{ki} \log \{ p(C_k | x_i, \pi_k, \theta_k) \}    (6)

When using a discriminant classifier with parameters ω to estimate the posterior probabilities of classes, we make no assumption about the marginal distribution p(x, Θ); therefore the maximum likelihood estimate of ω is the same for \tilde{L}_C as for L_C [5,24]. In this case (6) can be written using only the parameters ω of the model:

\log \tilde{L}_C(C, \omega) = \sum_{k=1}^{c} \sum_{x_i \in C_k} \log \{ p_\omega(C_k | x_i) \} + \sum_{k=1}^{c} \sum_{i=n+1}^{n+m} t_{ki} \log \{ p_\omega(C_k | x_i) \}    (7)
where pω(C/x) is the estimation of the posterior probability computed by the model. The maximum likelihood of parameters ω, (7), can be used as a learning criterion for a discriminant classifier. The advantage of this expression is that it is simply expressed upon the output of the classifier which gives suitable properties for the maximization of (7). In the following we present a new semi-supervised algorithm for discriminant classifiers. With this algorithm, the aim is to maximize (7) with regard to the parameters ω of the classifier. Because we are interested only in classification, the E-step, which estimates the posterior probabilities using conditional densities in the generative approach is no more necessary. The discriminant-CEM algorithm is summarized below:
Discriminant-CEM
Initialization: Train a discriminant model M estimating the posterior class probabilities with parameters ω^(0) over D_l. Let O_k^(j) be the output of the classifier for the k-th class at the j-th iteration.
For the j-th iteration, j ≥ 0:
  C-step. Assign each x_i ∈ D_u to the cluster C_k^{(j+1)} with maximal posterior probability according to O_k^{(j)} = p_{ω^{(j)}}(C_k^{(j)} | x_i).
  M-step. Train M over D_l ∪ D_u with new parameters ω^(j+1) which maximize log \tilde{L}_C(C^{(j+1)}, ω^{(j)}).
We have used in our experiments, a stochastic gradient algorithm to maximize (7) in the M-step. An advantage of this method is that it requires only the first order derivatives at each iteration. It can easily be proved that this algorithm converges to a local maximum of the likelihood function (7) for semi-supervised training. The main difference here with the generative method is that instead of estimating class conditional densities, discriminant-CEM algorithm directly attempts to estimate the posterior class probabilities, which is the quantity we are interested in for classification. The above algorithm can be used with any discriminant classifiers which estimate the posterior class probabilities. We have performed experiments using neural networks and support vector machines but for the classification tasks considered here, they did not show any improvement over a simple logistic unit, which in turn performed slightly better than a pure linear classifier. In the following section we will present results for a series of text classification, filtering and ranking tasks, by using a simple logistic unit with the discriminant-CEM algorithm.
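The following sketch (ours, not the authors' code) spells out the discriminant-CEM loop for a binary problem with a single logistic unit trained by gradient ascent on (7); it is only meant to illustrate the alternation between the C-step and the M-step:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def discriminant_cem(X_lab, y_lab, X_unl, n_outer=20, n_grad=200, lr=0.1):
    """Binary discriminant-CEM with a logistic unit; y_lab contains 0/1 class indices."""
    d = X_lab.shape[1]
    w, b = np.zeros(d), 0.0
    # first pass trains on the labeled set only (the initialization step)
    X, y = X_lab, y_lab.astype(float)
    for _ in range(n_outer):
        # M-step: maximise the log-likelihood (7) over the current (pseudo-)labeled set
        for _ in range(n_grad):
            p = sigmoid(X @ w + b)
            g = y - p                                  # gradient of the Bernoulli log-likelihood
            w += lr * X.T @ g / len(X)
            b += lr * g.mean()
        # C-step: assign every unlabeled point to the class with maximal posterior
        y_unl = (sigmoid(X_unl @ w + b) >= 0.5).astype(float)
        X = np.vstack([X_lab, X_unl])
        y = np.concatenate([y_lab.astype(float), y_unl])
    return w, b

In our experiments a stochastic gradient version of this M-step was used; the batch update above is the simplest variant that exposes the structure of the algorithm.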
4 Experiments
In the following we will briefly present the data sets we have used (Section 4.1) and in Section 4.2 we describe our results.

4.1 Data Sets
We have used three datasets for text classification tasks: a) one of the classification collections of the NIPS-2001 competition, consisting of Web pages. This corpus is composed of 1000 documents where each document is encoded in a fixed vector size. b) the e-mail spam classification problem from the UCI repository. This collection is composed of 4601 e-mails, where each e-mail is represented using 57 terms (features are once again fixed).
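For the UCI spam data, for instance, a minimal loading routine looks as follows (the file name and the convention that the class label is the last column follow the usual UCI "spambase" distribution, but should be checked against the local copy):

import numpy as np

data = np.loadtxt("spambase.data", delimiter=",")   # 4601 rows, 57 features + 1 label column
X, y = data[:, :-1], data[:, -1].astype(int)

# 1/3 training, 2/3 test split, as used in the experiments below
rng = np.random.default_rng(0)
perm = rng.permutation(len(X))
n_train = len(X) // 3
train_idx, test_idx = perm[:n_train], perm[n_train:]
X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]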
For text summarization we have used the Computation and Language (cmp_lg) collection of TIPSTER SUMMAC. This corpus is composed of 183 scientific articles. To generate extract-based summaries from the abstract of each article in the collection, we have used the text span alignment method described by [6]. The evaluation is performed by generating a query q corresponding to the most frequent words in the training set. To represent each sentence, we considered a continuous version of the features proposed by Kupiec [19]: each sentence i, with length l(i), is represented by a 5-feature vector x_i = {φ1, φ2, φ3, φ4, φ5}, where φ1 is the normalized sentence length l(i) / Σ_j l(j), φ2 is the normalized frequency of cue words in sentence i, (frequency of cue words) / l(i), φ3 is the normalized number of terms within the query q and i, φ4 is the normalized frequency of acronyms in i, (frequency of acronyms) / l(i), and φ5 is the same paragraph feature as in [19]. In all cases, the data was randomly split into a training and a test set whose size was respectively 1/3 and 2/3 of the available data.
4.2 Results
For text summarization a compression ratio must be defined for extractive summaries. For the cmp_lg collection we followed the SUMMAC evaluation by using a 10% compression ratio. For evaluation we compared the extract of a system with the desired summary and used the average precision measure to evaluate our system, where the precision is defined as:

Precision = (# of sentences extracted by the system which are in the target summaries) / (total # of sentences extracted by the system)
For text classification, we followed the NIPS workshop evaluation which considered the Percentage of Good Classification (PGC) defined as:

PGC = (# of examples of the test set well classified by the system) / (# of examples in the test set)
For our experiments we used a logistic unit as the baseline classifier. Figures 1 and 2 show performance on the test sets for text classification and text summarization tasks, respectively. These figures plot a score for different ratios of labeled-unlabeled data in the training set. On the x-axis, 5% means that 5% of data in the training set were labeled for training, the 95% remaining being used as unlabeled training data.
http://www.itl.nist.gov/iaui/894.02/related_projects/tipster_summac/cmp_lg.html
For comparison, we have also performed tests with the logistic classifier trained in a supervised way using the same x% labeled data in the training set.
Fig. 1. Web Pages classification (a) and e-mail Spam detection (b) - Performance of two classifiers with respect to the ratio x of labeled data in the training set. The classifier is a logistic unit trained with labeled data in a supervised scheme (dashed bottom curves) and using the semi-supervised discriminant-CEM algorithm (solid top curves)
For both classification tasks, the logistic classifier trained only on x% labeled data performs well but is clearly below the discriminant-CEM algorithm, particularly in the regions of interest where the ratio of labeled data is small. For example, for web pages (Figure 1-a) at 10% labeled data, semi-supervised training reduces the classification error by more than 12% compared to the same classifier trained without unlabeled data. This shows empirically that unlabeled data do indeed contain relevant information and that the semi-supervised learning algorithm proposed here allows extracting part of this information. For text summarization, we have also compared in Figure 2 discriminant-CEM to the generative-CEM algorithm presented in Section 3.2. For the latter, we assume that the conditional density functions {f_k}_{k=1,…,c} are normal distributions.
Fig. 2. Average precision of 3 trainable summarizers with respect to the ratio of labeled sentences in the training set for the cmp_lg collection
This comparison was not possible for text classification, due to the numerical inversion problems of covariance matrices. This problem is frequently seen with sparse matrices of document representations in high dimensions. Discriminant-CEM uniformly outperforms generative-CEM in all regions for text summarization. This is particularly clear for SUMMAC cmp_lg, which is a small document set. In this case, the discriminant approach is clearly superior to the generative approach which suffers from estimation problems. Table 1 compares the Kupiec et al. summarizer system with the generative and discriminant CEM algorithms, all trained in a fully supervised way on the whole training set.

Table 1. Comparison between Kupiec et al.'s summarizer system and discriminant and generative CEM algorithms for the cmp_lg collection. All classifiers are trained in a fully supervised way

System             Average Precision (%)   PGC (%)
Kupiec's system    61.83                   63.48
Generative-CEM     74.12                   74.79
Discriminant-CEM   75.26                   76.92
The two CEM classifiers allow approximately 10% increase both in average precision and in accuracy over Kupiec et al.'s system. Another interesting result is that both discriminant and generative CEM trained in a semi-supervised learning scheme (using 10% of labeled sentences together with 90% of unlabeled sentences in the training set) gave similar performances to the fully supervised Kupiec et al. summarizer system.
5 Conclusion
We have introduced a new discriminant algorithm for training classifiers in presence of labeled and unlabeled data. This algorithm has been derived in the framework of CEM algorithms and is pretty general in the sense that it can be used with any discriminant classifier. We have provided experimental analysis of the proposed method for text classification and text summarization with regard to ratio of labeled data in the training set, and we have shown that the use of the unlabeled data for supervised learning can indeed increase the classifier accuracy. We have also compared discriminant and generative approaches to semi-supervised learning and the former has been found clearly superior to the latter especially for small collections.
References

1. Amini, M.-R., Gallinari P.: Learning for Text Summarization using labeled and unlabeled sentences. Proceedings of the 11th International Conference of Artificial Neural Networks, (2001), 1177-1184.
2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14. 15. 16. 17. 18.
Amini, M.-R., Gallinari P.: Automatic Text Summarization using Unsupervised and Semi-supervised Learning. Proceedings of the 5th European Conference on Principles of Data Mining and Knowledge Discovery, (2001) 16-28. Amini, M.-R., Gallinari P.: Semi-supervised Logistic Regression. Proceedings of the 15th European Conference on Artificial Intelligence, (2002), to appear. Amini, M.-R., Gallinari P.: The Use of the labeled data to Improve Supervised Learning for Text Summarization. Proceedings of the 25th International ACM SIGIR, (2002), to appear. Anderson, J. A., Richardson, S. C.: Logistic Discrimination and Bias correction in maximum likelihood estimation. Technometrics, Vol. 21. (1979) 71-78. Banko, M. Mittal V., Kantrowitz, M., Goldstein, J.: Generating Extraction-Based Summaries from Hand-written done by text alignment. Pac. Rim Conf. On Comp. (1999). Bennet, K., Demirez, A.: Semi-supervised Support Vector machines. In Kearns, Solla, and Cohn, editors. Advances in Neural Information Processing Systems 11. MIT Press (1998) 368-374. Blum, A., Mitchell, T.: Combining Labeled and unlabeled Data with CoTraining. Proceedings of the Conference on Computational Learning Theory (1998) 92-100. Celeux, G., Govaert, G.: A Classification EM algorithm for clustering and two stochastic versions. Computational Statistic and Data Analysis Vol. 14 (1992) 351-332. Chuang, W. T., Yang, J.: Extracting sentence segments for text summarization: a machine learning approach. Proceedings of the 23rd ACM SIGIR. (2000) 152159. Collins, M., Singer, Y.: Unsupervised models for named entity classification. In Proceedings of EMNLP (1999) Day N. E., Estimating the components of a mixture of normal distributions. Biometrika, Vol. 56, N° 3. (1969) 463-474. Dempster, A., Laird, N., Rubin, D.: Maximum likelihood from incomplete data via EM algorithm. Journal of the Royal Statistical Society, Vol. B, n°39 (1977) 1-38. De Sa, V. R.: Learning Classification with Unlabeled Data. Neural Information Processing Systems, Vol. 6 (1993) 112-119. Dumais, S. T., Platt J., Heckerman, D., Sahami M.: Inductive learning algorithms and representations for text categorization. CIKM. (1998) 148-155. Ghahramani, Z., Jordan M. I.: Supervised learning from incomplete data via EM approach. Advances in Neural Information Processing Systems, Vol. 6, (1994) 120-127. Joachims, T.: Text categorization with support vector machines: Learning with many relevant features. Tenth European Conference in Machine Learning (1998) 137-142. Joachims, T.: Transductive inference for text classification using support vector machines. Proceedings of sixteenth International Conference on Machine Learning (1999) 200-209.
19. Kupiec J., Pderson J., Chen F. A.: Trainable Document Summarizer. Proceedings of the 18th ACM SIGIR (1995) 68-73. 20. Lewis, D. D.: Naive (Bayes) at forty: The independence assumption in information retrieval. Tenth European Conference in Machine Learning (1998) 4-15. 21. Luhn, P. H.: Automatic creation of literature abstracts. IBM Journal (1958) 159165. 22. Mani, I., Bloedorn, E.: Machine Learning of Generic and User-Focused Summarization. Proceedings of the Fifteenth National Conference on AI. (1998) 821826. 23. McLachlan, G. J.: Iterative reclassification procedure for constructing asymptotically optimal rule of allocation in discriminant analysis. Journal of the American Statistical Association. Vol. 70, N° 350, (1975) 365-369. 24. McLachlan, G.J.: Discriminant Analysis and Statistical Pattern Recognition. John Willey and Sons, New York (1992) 25. Miller, D., Uyar, H.: A Mixture of Experts classifier with learning based on both labeled and unlabeled data. Advances in Neural Information Processing Systems 9 (1996) 571-577. 26. Mosteller, F., Wallace, D. L.: Inference and disputed authorship: The Federalist. Massachusetts: Addison-Wesley, (1964) 27. Nigam, K., McCallum, A.K., Thrun, S., Mitchell, T.: Text Classification from Labeled and Unlabeled Documents using EM. Machine Learning, Vol 39, N° 2/3, 103-104 (2000) 28. Roth V., Steinhage, V.: Nonlinear Discriminant Analysis using Kernel Functions. Advances in Neural Information Processing Systems, Vol. 12, (1999). 29. Schapire, R. E., Singer, Y.: BoosTexter: A Boosting-based system for text categorization. Machine Learning, Vol. 39, N° 2/3. (2000) 135-168. 30. Sparck Jones, K.: Discourse modeling for automatic summarizing. Technical report 29D, Computer laboratory, university of Cambridge. (1993). 31. Symons, M. J.: Clustering criteria and Multivariate Normal Mixture. Biometrics. Vol. 37 (1981) 35-43. 32. Titterington, D. M.: Updating a diagnostic system using unconfirmed cases. Applied Statistics, Vol. 25, N° 3, (1976) 238-247. 33. Vapnik, V.: Statistical learning theory. John Wiley, New York. 34. Wiener, E., Pederson, J. O., Weigend, A. S.: A neural network approach to topic spotting. Proceedings of the Fourth Annual Symposium on Document Analysis and Information Retrieval. (1995) 317-332. 35. Yang Y: An evaluation of statistical approaches to text categorization. Information Retrieval, Vol. 1, N° 2/3. (1999) 67-88. 36. Zhang, T., Oles, F. J.: A probability analysis on the value of unlabeled data for classification problems. Proceedings of the Seventeenth International Conference on Machine Learning (2000) 1191-1198.
An Information Geometric Perspective on Active Learning

Chen-Hsiang Yeang

Artificial Intelligence Lab, MIT, Cambridge, MA 02139, USA
{chyeang}@ai.mit.edu
Abstract. The Fisher information matrix plays a very important role in both active learning and information geometry. In a special case of active learning (nonlinear regression with Gaussian noise), the inverse of the Fisher information matrix – the dispersion matrix of parameters – induces a variety of criteria for optimal experiment design. In information geometry, the Fisher information matrix defines the metric tensor on model manifolds. In this paper, I explore the intrinsic relations of these two fields. The conditional distributions which belong to exponential families are known to be dually flat. Moreover, the author proves for a certain type of conditional models, the embedding curvature in terms of true parameters also vanishes. The expected Riemannian distance between current parameters and the next update is proposed to be the loss function for active learning. Examples of nonlinear and logistic regressions are given in order to elucidate this active learning scheme.
1 Introduction
Active learning is a subcategory of machine learning. The learner seeks new examples from a specific region of input space instead of passively taking the examples generated by an unknown oracle. It is crucial when the effort of acquiring output information is much more demanding than collecting the input data. When the objective is to learn the parameters of an unknown distribution, a data point (x, y) contains input variables x and output variables y. Actively choosing x distorts the natural distribution of p(x, y) but generates no bias on the conditional distribution p(y|x). Therefore, parameters of p(y|x) can be estimated without bias. One of the most well-known active learning schemes for parameter estimation is optimal experiment design ([7]). Suppose y ∼ N(θ · f(x), σ^2). Define the Fisher information matrix of n inputs (x_1, · · · , x_n) as M = \sum_{t=1}^{n} f(x_t) f(x_t)^T, and Y = \sum_{t=1}^{n} y_t f(x_t). Then the maximum likelihood estimator of θ is the linear estimator θ̂ = M^{-1} Y, and the dispersion matrix of the estimated parameters θ̂ is V(θ̂) = E{(θ̂ − θ)(θ̂ − θ)^T} = M^{-1}. The dispersion matrix measures the deviation of the estimated parameters from the true parameters. A variety of loss functions based on the dispersion matrix are proposed. An optimal experiment design scheme selects an input configuration (x_1, · · · , x_n) which minimizes the loss function.
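A small numerical illustration of these quantities may be useful. The sketch below is ours; the feature map f is a hypothetical example, and D-optimality (maximising det M) is only one of the standard design criteria built from the dispersion matrix:

import numpy as np

def f(x):
    # hypothetical feature map for a one-dimensional input
    return np.array([1.0, x, x**2])

def fisher_information(xs):
    # M = sum_t f(x_t) f(x_t)^T for a candidate design (x_1, ..., x_n)
    F = np.array([f(x) for x in xs])
    return F.T @ F

def d_optimality(xs):
    # D-optimal design: maximise det(M), i.e. shrink the volume of the dispersion M^{-1}
    return np.linalg.det(fisher_information(xs))

candidate_designs = [np.linspace(-1, 1, 6), np.linspace(0, 1, 6), np.array([-1, -1, 0, 0, 1, 1])]
best = max(candidate_designs, key=d_optimality)
dispersion = np.linalg.inv(fisher_information(best))   # asymptotic covariance of the ML estimate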
An Information Geometric Perspective on Active Learning
481
The Fisher information matrix of a probability density function p(x; ζ) is gij (ζ) = Ex {
∂ ∂ log p(x; ζ) log p(x; ζ)}. ∂ζi ∂ζj
(1)
where ζ is the parameter of the probability density function. This quantity constitutes the foundation of information geometry. A statistical model can be viewed as a manifold S imbedded in the high dimensional Euclidean space with the coordinate system ζ. The metric tensor on S in terms of coordinates ζ is the Fisher information matrix (equation 1). The metric tensor defines the distance of distributions on the manifold. A manifold in which the metric tensor is defined is called a Riemannian manifold. The presence of Fisher information matrix in both contexts is no coincidence. The inverse Fisher information matrix asymptotically tends to the dispersion matrix according to Cramer-Rao theorem ([7]). This matrix characterizes the deviation of the estimated parameters from the real one. Similarly, the metric tensor characterizes the “distance” between two distributions on the model manifold. Therefore, it is strongly motivated to devise a loss function for active learning from a geometric perspective. Previous works on information geometry focus on applying the notion of projection on various problems (for instance, [6], [3], and [1]). On the other hand, previous works on active learning are aimed at extending the scheme beyond parameter estimation (for example, [12]) or adopting a Bayesian approach to learning (for example, [11]). The contribution of this paper is to treat active learning under the framework of information geometry. I define the loss function as the expected Riemannian distance on the manifold between the current and the new estimated parameters. This loss function is closely related to the Kullback-Leibler divergence for exponential families. The structure of this paper is organized as follows. Section 2 introduces basic notions of information geometry and sets the form of input-output models in terms of information geometry. Section 3 states a sufficient condition for vanishing embedding curvature in terms of true parameters on the model manifold. Section 4 introduces the loss function and scheme for active learning. Section 5 illustrates the active learning by examples of nonlinear and logistic regressions. Section 6 summarizes the works in this paper and points out future works.
2
Information Geometry of Input-Output Models
In this paper, I focus on a special type of active learning problem: estimating the parameters of a conditional probability density p(y|x, θ). θ is the true parameter to be learned. The learner is free to choose input x. The output is generated according to the conditional density p(y|x, θ) (for convenience I write p(y|x, θ) = p(y; x, θ)). y, x, θ are all vectors. The conditional density in an exponential family can be written in the following form: p(y; x, θ) = exp{T(y) · F(x, θ) − ψ(x, θ) + k (y)},
(2)
482
Chen-Hsiang Yeang
where T(y) is the sufficient statistic of y, F(x, θ) = ζ is the natural parameter and ψ(x, θ) is the partition function of p(y; x, θ), k(y) is the Jacobi matrix of transforming from y to the sufficient statistic T(y). The natural parameter is a function of both inputs x and true parameters θ. We can view p(y; x, θ) as a curved exponential family where its natural parameters ζ = F(x, θ) are characterized by inputs x and true parameters θ. In order to study the properties on the model manifold some notions of differential geometry need to be introduced. The metric tensor gij (ζ) has the same form of equation 1 (substitute p(x, ζ) with p(y; ζ)). It defines the Riemannian distance between p(y; ζ) and an infinitesimally close density function p(y; ζ + dζ) on the manifold S: ds2 = gij (ζ)d ζ i d ζ j , where the Einstein’s summation of index convention is used in tensorial equations: summing over the indices which appear more than once in the formula. The value of the metric tensor changes when different coordinate systems are used, but the square distance ds2 is invariant under coordinate transformations. Another important quantity in information geometry is connection. The coni is a three-index quantity which defines the correspondence between nection Γjk vectors in different tangent spaces of the manifold ([5]). Notice this is a coordinate system-specific quantity. The actual rate of change of a vector field X on the manifold is the change of X along some coordinate curve, plus the change of the coordinate curve itself: ∂X i (ζ) DX i i = + Γjk X k (ζ). dζ j ∂ζ j This is called the covariant derivative of X. The geodesic on a manifold is a curve whose rate of change of its tangent vectors along the curve is zero. It corresponds to the notion of a straight line in a Euclidean space. Setting the ˙ covariant derivative of ζ(t) equal to zero, the equation of a geodesic curve ζ(t) is d2 ζ i (t) i ˙j + Γjk (3) ζ (t)ζ˙k (t) = 0. dt2 i = 0 equation 3 reduces to an ordinary second order differential Notice when Γjk equation, and the solution becomes
ζ(t) = ζ0 + t(ζ1 − ζ0 ). This corresponds to a straight line in an Euclidean space. Therefore, a manifold is flat when there exists some coordinate system which makes the connection vanish. The α-connection of an statistical manifold is defined by Amari ([5]): (α)
Γijk = Ey {∂i ∂j (y; ζ)∂k (y; ζ)} +
1−α 2 Ey {∂i (y; ζ)∂j (y; ζ)∂k (y; ζ)},
(4)
An Information Geometric Perspective on Active Learning
where (.) = log p(.), ∂i =
483
∂ m ∂ζi , and Γijk = Γij gmk , gmk is the inverse matrix (α) Γijk (ζ) = 0 for an exponential family p(y; ζ), thus
of g mk . When α = 1, then ζ(t) = ζ0 + t(ζ1 − ζ0 ) is a geodesic on the manifold. The manifold is called e-flat (α) in this case. Similarly, we can express Γijk in terms of expectation parameters (α)
η, where ηi = E{T (x)i }. Γijk (η) = 0 for mixture family distributions when α = −1. This is called m-flat ([5]). The dual flatness theorem was proved by Amari ([4]): a statistical manifold is e-flat if and only if it is m-flat. Therefore, both exponential and mixture families are dually flat manifolds. This theorem is important because it allow us to treat an exponential family manifold as a Euclidean space. However, it does not guarantee that the connection of the manifold vanishes for every coordinate system. The connection under a particular coordinate system is called the embedding curvature of the manifold. The embedding curvature of some coordinate system may not vanish when the manifold is both e-flat and m-flat. The manifold of multiple data points is the product space of manifolds of individual data points. For ordinary exponential families, let X = (x1 , · · · , xn ) be n samples drawn from the same distribution, then the joint probability p(X) forms a manifold Sn = S × · · · × S which is imbedded in an n × m dimensional Euclidean space. Here we are interested in the conditional densities of input-output models. Let X = (x1 , · · · , xn ) = (a1 , · · · , an ) be n fixed inputs and Y = (y1 , · · · , yn ) be their responses. The joint density of the n samples is n p(y 1 , · · · , yn , x1 , · · · , x1 , · · · , xn ; θ) = t=1 p(yt |xt )p(xt ) n = t=1 p(yt |xt = at )p(xt = at ) = exp{ nt=1 T(yt ) · F(xt , θ) − ψ(xt , θ) + k (yt ) + log p(xt = at )}. conditional Each fixed input xt = at induces a manifold S(at ) and the product of n densities forms a submanifold M of the product manifold Sn (a) = t=1 S(at ). Each S (at ) is imbedded in an r-dimensional Euclidean space because it is parameterized by θ. However, M is also parameterized by θ, thus it lives in a much more compact space than Sn (a).
3
Embedding Curvature on the True Parameter Coordinate Systems
The dual flatness theorem affirms the flatness of an exponential family manifold. Since M is a submanifold of an exponential family manifold, it is also flat. However, the α-connection of an exponential family vanishes only under the natural parameter coordinate systems; namely, ζ = (ζ1 , · · · , ζn ) = (F(x1 , θ), · · · , F(xn , θ)) = F(X, θ). When we use the coordinate system of true parameters θ = (θ1 , · · · , θr ), the α-connection does not guarantee to vanish. This means the curve θ(t) = θ0 + (θ1 − θ0 )t
484
Chen-Hsiang Yeang
is no longer a geodesic. Knowing the geodesics yields the advantages of evaluating distances efficiently. Therefore, it is important to understand the condition when the embedding curvature vanishes. Theorem 1 Let p(y; x, θ) be the conditional probability density in equation 2. ζ = F(x, θ) has the following form: F (x, θ) = (θ1 f1 (x), · · · , θr fr (x)).
(5)
where r is the dimension of y’s sufficient statistic. Let Bai = δai
∂(θi fi (x)) , ∂θa
(6)
be the Jacobian from θ to ζ. Then both metric tensor and α-connection are invariant under coordinate transformation: gab (x, θ) = Bai Bbj gij (ζ),
(7)
Γabc (x, θ) = Bai Bbj Bck Γijk (ζ).
(8)
(α)
(α)
i, j, k are indices of natural parameter ζ components and a, b, c are indices of true parameter θ components. Proof The first statement holds for any coordinate transformation (see [5]). From the definition of α-connection, (α)
Γabc (x, θ) = Ey {∂a ∂b ∂c } +
1 −α Ey {∂a ∂b ∂c } 2
Since ∂a = Bai ∂i , the second term conforms with equation 8 after coordinate transformation. Apply coordinate transformation to the first term. ∂a ∂b = ∂a (Bbj ∂j ) ∂(θ f (x)) = ∂a (δbj j∂θjb ∂j ) ∂θ f (x) = Bbj ∂a ∂j + δbj ∂a ( j∂θjb )∂j j = Bb ∂a ∂j + ∂a (fb (x))∂b .
(9)
Since fb (x) is a constant for θ, the second term of equation 9 vanishes. Thus both terms of the α-connection follow the form of equation 8. The theorem holds. Q.E.D. Theorem 1 preserves the α-connection form of the manifold under coordinate transformation θ = F −1 (x, ζ). This theorem holds under a specific type of inputoutput model (equation 5). Under this model each component of the natural parameter is decoupled into the effect of input fi (x) and the effect of parameter θi , and natural parameters ζ and true parameters θ have the same dimension. While the transformation is linear in true parameters, it does not need to be (α) linear in inputs. Since the connection on natural parameters Γijk (ζ) = 0, the (α)
connection on true parameters Γabc (x, θ) also vanishes. Therefore, the curve θ(t) = θ0 + t(θ1 − θ0 ) is a geodesic on the manifold. This property allows us to evaluate the Riemannian distance on the manifold efficiently.
An Information Geometric Perspective on Active Learning
4
485
Active Learning Schemes on Manifolds
With the notion of information geometry, active learning procedures can be viewed from a geometric perspective. The goal is to find the true parameter θ∗ of the underlying process. The learner selects inputs of the samples. Here the conventional myopic learning scheme is used ([13]). Under this scheme the learner chooses an input that optimizes the immediate loss based on current estimate of θ. Procedures of an active learning scheme are as follows: 1. Start with a collection of random samples D0 . Repeat the following steps. ˆ n ) based on Dn . 2. Find the maximum likelihood estimate θ(D The problem of maximum likelihood parameter estimation can be viewed as the projection from the data points to the model manifold along the mgeodesic ([6] and [3]). The manifold of θ changes with inputs (x1 , · · · , xn ) (denoted as M(θ; x1 , · · · , xn )). 3. For each candidate input for the next step xn+1 , evaluate some expected ˆ n ), θ(D ˆ n , (x, y)))}}. loss function Ex {Ey {L(θ(D This routine is usually the most time-consuming part of active learning. Suppose a new input-output pair (xn+1 , yn+1 ) is given, then one can compute ˆ n , (x, y)). A loss function L(θ(D ˆ n ), θ(D ˆ n , (xn+1 , yn+1 ))) the ML estimate θ(D is constructed to capture the deviation between the new estimated parameter and the original one. Since yn+1 is generated from the unknown process, ˆ n , (xn+1 , yn+1 )) is a random variable. The expected loss the resultant θ(D function is evaluated under the current estimate of the distribution of yn+1 and x: ˆ n ), θ(D ˆ n , (xn+1 , yn+1 )))}} = E {E {L(θ(D x yn+1 ˆ n ), θ(D ˆ n , (xn+1 , yn+1 )))dyn+1 dx. ˆ q(x)p(yn+1 ; xn+1 , θ(Dn ))L(θ(D ˆ n+1 which minimizes the expected loss function 4. Find the input x ˆ n ), θ(D ˆ n , (xn+1 , y)))}}. Ideally, the expected loss function of Ex {Ey {L(θ(D all inputs should be evaluated. Since it may not have simple analytic forms, a sampling strategy is usually adopted. ˆ . Incorporate this 5. Generate a sample by querying the output with input x sample into Dn to form Dn+1 . The crux of this scheme is the choice of the loss function. Various loss funcˆ (the inverse of Fisher information tions related to the dispersion matrix V (θ) matrix) are proposed ([7]): for example, the determinant of V (D-optimal), the trace of V , or the maximum of any ψV ψ T among normalized vector ψ (min-max optimal). While these loss functions might capture the dispersion of θˆ evaluated by adding new possible samples, they do not explicitly bear geometrical interpretations on the manifold. One sensible choice of the loss function is the Riemmanian (square) distance between the conditional density according to curˆ n )) and the new estimate by incorporating (xn+1 , yn+1 ) rent estimate p(y; x, θ(D ˆ (p(y; x, θ(Dn , (xn+1 , yn+1 )))) on the manifold M (x1 , · · · , xn+1 ) of conditional
486
Chen-Hsiang Yeang
distributions. The Riemannian distance between two points p0 and p1 on a Riemannian manifold is the square length of the geodesic C(t) connecting them: D(θ0 , θ1 ) = 0
1
dC i (t) dC j (t) dt gij (C(t)) dt dt
2 ,
(10)
where C(t) is parameterized such that C(0) = p0 and C(1) = p1 . This is in general a non-trivial task. On a dually flat manifold of statistical models, however, the Kullback-Leibler divergence is usually treated as a (quasi) distance metric. Amari ([5]) has proved the KL divergence between two infinitesimally close distributions is half of their Riemannian distance: DKL (p(x; θ) p(x; θ + dθ)) =
1 gij (θ)dθi dθj . 2
Moreover, it is also known that the KL divergence is the Riemannian distance under Levii-Civita connection ([4]). The computation of this distance uses different geodesic paths when traversing in opposite directions. From a point P to another point R it firstly projects P to a point Q along an m-geodesic then connects Q and R via an e-geodesic. Conversely from R to P it firstly finds the projection Q of R along the m-geodesic then connects Q and P by an e-geodesic. This property makes the KL divergence asymmetric. Here we are interested in the distance between two distributions on the model manifold of curved exponential families. Therefore the Riemannian distance under the connection in terms of true parameters θ is more appropriate. In most conditions when evaluating the Riemannian distance on the manifold is cumbersome, the KL divergence is a reasonable substitute for the distance between distributions. The Riemannian distance between the current estimator and the new estimator is a random variable because the next output value has not sampled yet. Therefore the true loss function is the expected Riemannian distance over potential output values. There are two possible ways of evaluating the expected loss. A local expectation fixes previous data Dn = {(x1 , y1 ), · · · , (xn , yn )} and varies ˆ n ) and xn+1 when performing parameter the next output yn+1 according to θ(D estimation at the next step: ˆ ˆ ˜n+1 ))))}. (11) Ey˜n+1 ∼p(y;xn+1 ,θ(D ˆ n )) {D(p(y; x, θ(Dn )) p(y; x, θ(Dn , (xn+1 , y A global expectation varies all the output values up to step n+1 when performing parameter estimation at the next step: ˆ Ey˜1 ,···,˜yn+1 ∼p(y;x1 ,···,xn+1 ,θ(D ˆ n )) {D(p(y; x, θ(Dn )) ˆ 1, y ˜1 ), · · · , (xn+1 , y ˜n+1 ))))}. p(y; x, θ((x
(12)
In both scenarios, the output values are assumed to be generated from the disˆ n ). tribution with parameters θ(D While the local expectation is much easier to compute, it has an inherent problem. The myopic nature of the learning procedure makes it minimizes (in
An Information Geometric Perspective on Active Learning
487
expectation) the distance between estimated parameter and previous parameters. Since all previous input-output pairs are fixed, the only possibility to make the estimated parameter at the next step differ from the current estimate is the fluctuation of yn+1 . However, as the number of data points grow, a single data point becomes immaterial. Empirically I found the local expectation scenario ends up sampling the same input over and over after a few steps. On the contrary, this problem is less serious in global expectation since we “pretend” to estimate the parameter at the next step using the regenerated samples. The expected distance in equation 12 is a function of input x. This is undesirable because we can enlarge or reduce the distance between two conditional densities by varying the input values even if their parameters are fixed. This discrepancy is due to the fact that D(p(y; x, θ1 ) p(y; x, θ2 )) is the Riemannian distance on the manifold M (x) which depends on the input values. To resolve this problem the true loss function is the expected loss in equation 12 over input values. L(θ1 , θ2 ) = Ex∼q(x) {D(p(y; x, θ1 ) p(y; x, θ2 ))}, where q(x) is the empirical distribution of input x. The sampling procedure in an active learning scheme distorts the input distribution, thus q(x) can only be obtained either from the observations independent of the active sampling or be arbitrarily determined (for instance, by setting it uniformly distributed). To sum up the active learning scheme can be expressed as the following optimization equation: ˆ ˆ ˜ x ˆn+1=arg min Ex∼q(x) {EY∼p(Y;X, ˆ n )) {D(p(y; x, θ(Dn )) p(y; x, θ(Dn+1 )))}}, ˜ θ(D xn+1
˜ = (˜ ˜ n+1 = {(x1 , y where X = (x1 , · · · , xn+1 ), Y y1 , · · · , y ˜n+1 ), and D ˜1 ), · · · , ˜n+1 )}. (xn+1 , y
5
Examples
In this section I use two examples to illustrate the new active learning scheme. 5.1
Nonlinear Regression
The first example is nonlinear regression with Gaussian noise. Assume the scalar output variable y is a nonlinear function of a vector input variable x plus a Gaussian noise e: y=
r
θi fi (x) + e,
i=1
where e ∼ N (0, σ 2 ) and σ is known. fi s comprise basis functions used to model y, for instance, polynomials. Here the index notation is a little abused such that
488
Chen-Hsiang Yeang
tensor indices (superscript and subscript), example indices (subscript) and iteration steps (subscript) are mixed. The distribution of y is Gaussian parametrized by θ and x: 1 −1 p(y; x, θ) = √ exp{ 2 (y − θ · f (x ))2 }. 2 2σ 2πσ To save space I write the sufficient statistics and the parameters in vector forms. The sufficient statistic, natural parameter, and partition function of y are T (y) = (y, y 2 )
ζ =(
ψ(ζ) =
θ · f (x ) −1 , 2 ). σ2 2σ
−1 2 −1 1 1 ζ1 ζ2 − log(−ζ2 ) + log π. 4 2 2
The Fisher information matrix in terms of ζ is −1 ζ1 gij (ζ) =
2ζ2 2ζ22 2 ζ1 1−ζ1 2ζ22 2ζ23
.
By applying theorem 1, the Fisher information matrix in terms of θ is gij (x, θ) =
1 fi (x)fj (x), σ2
which is a pure effect of x. Hence the metric tensor is a constant when the inputs are fixed. The differential Riemannian distance is ds2 = gij dθi dθj =
1 fi (x)fj (x)dθi dθj . σ2
Plugging it into equation 10, the (square) curve length along the geodesic θ(t) = θ0 + t(θ1 − θ0 ) becomes D2 (θ0 , θ1 ) =
1 1 fi (x)fj (x)(θ1i − θ0i )(θ1j − θ0j ) = 2 f T (x)(θ1 − θ0 )(θ1 − θ0 )T f (x), σ2 σ
where the second equation is written in the matrix form. It can be easily verified that this is the KL divergence of conditional Gaussian distributions. Let Mn =
n t=1
f (xt )f T (xt ), Yn =
n
yt f (xt )
t=1
be obtained from input-output pairs Dn up to step n. The maximum likelihood estimator is ˆ n ) = M −1 Yn . θˆn = θ(D n
An Information Geometric Perspective on Active Learning
489
˜ n+1 = {(x1 , y˜1 ), · · · , (xn+1 , y˜n+1 )} are the virtual data resampled from Assume D the distribution with parameter θˆn . Then the maximum likelihood estimator at ˆD ˜ n+1 ) satisfies the following conditions ([7]): the next step θˆn+1 = θ( E{θˆn+1 } = θˆn . −1 V {θˆn+1 } = σ 2 Mn+1 . The expected Riemannian distance between θˆn and θˆn+1 thus becomes Ey˜1 ,···,˜yn+1 {D(p(y; x, θˆn ) p(y; x, θˆn+1 ))} 1 = E{f T (x)(θˆn+1 − θˆn )(θˆn+1 − θˆn )T f (x)} 2σ 2 1 T f (x)V (θˆn+1 )f (x) = 2σ 2 1 −1 = f T (x)Mn+1 f (x). 2 The learning criteria becomes x ˆn+1 = arg min q(x)f T (x)(Mn + f (xn+1 )f T (xn+1 ))−1 f (x)dx. xn+1
(13)
Figure 1 shows the experiment results on the regression of the function y = θ0 + θ1 sin( π6 x) + θ2 sin( π4 x2 ) + θ3 sin( π3 x3 ) + e, where θ =[15 -13 -3 1]T and σ = 4. The average square error of the estimated parameters at each iteration over 500 experiments is plotted. The initial dataset D0 contains 5 random samples, and the learning curve of the 5 initial samples is not plotted. The results clearly indicate active learning schemes outperform passive learning when a small number of samples are allowed to draw from the unknown distribution. As the size of the data grows, their difference tends to decrease. I also compare the difference in terms of the loss function in active learning. The Riemannian distance loss function (the solid curve) performs slightly better than the trace of the dispersion matrix (the dash-dot curve), although the difference is not as significant as the difference between active and passive learning schemes. 5.2
Logistic Regression
Logistic regression is a standard distribution of modeling the influence of continuous inputs on discrete outputs. For simplicity here I only discuss the case of binary variables. Suppose y is a binary random variable which is affected by continuous input variables x. The conditional probability mass function of y can be expressed as p(y; x, θ) = exp{f T (x) · θδ(y = 1 ) − log(1 + ef
T
(x)·θ
)}.
By treating ζ = f T (x) · θ as the natural parameter of the exponential family, the metric tensor in terms of the true parameters θ can be obtained by coordinate
490
Chen-Hsiang Yeang
35 passive Riemannian distance trace D 30
square error
25
20
15
10
5
0
0
2
4
6
8
10 iteration step
12
14
16
18
20
Fig. 1. Active and passive learnings on nonlinear regression
transformation: gab (θ, x) =
∂ζ ∂ζ ∂ 2 ψ(ζ) eζ(x,θ) g11 (ζ) = fa (x)fb (x) = fa (x)fb (x) . a b 2 ∂θ ∂θ ∂ζ (1 + eζ(x,θ) )2
ζ(x,θ)
e Notice (1+e ζ(x,θ) )2 is a symmetric function of θ. The Riemannian distance between two parameters θ0 and θ1 can be computed from equation 10: 1 D2 (θ0 , θ1 ) = ( 0 gij (θ(t), x)(θ1 − θ0 )i (θ1 − θ0 )j dt)2 1 eζ(θ(t),x) T (x)(θ − θ )(θ − θ )T f (x)]dt)2 =( 0 1 0 1 0 ζ(θ(t),x) )2 [f (1+e
1 ζ(θ(t),x) e =( 0 dt)2 [f T (x)(θ1 − θ0 )(θ1 − θ0 )T f (x)] (1+eζ(θ(t),x) )2
= ( a2 [arctan(e
a+b 2
b
) − arctan(e 2 )])2 [f T (x)(θ1 − θ0 )(θ1 − θ0 )T f (x)],
r r where a = i=1 f (x)i (θ1i − θ0i ) and b = i=1 f (x)i θ0i . The key for simplification is because the metric tensor is a symmetric function of θ(t), hence t does not appear in individual components. If y takes more than two values, then the square distance has a complicated form. Although the distance function is considerably simplified, the active learning of logistic regression is still cumbersome. Unlike nonlinear regression with Gaussian noise, there is no analytic solution for maximum likelihood estimators in logistic regression. It is usually obtained by numerical or approximation methods such as gradient descent, Newton’s method or variational methods ([9]). Therefore, the expectation of the square distance between current estimate and the next estimate over possible outputs can only be computed by approximation or sampling. Due to the lack of time the numerical experiment for logistic regression is left for future works.
An Information Geometric Perspective on Active Learning
6
491
Conclusion
In this paper I propose an active learning scheme from the perspective of information geometry. The deviation between two distributions is measured by the Riemannian distance on the model manifold. The model manifold of exponential families is dually flat. Moreover, for the distributions whose log densities are linear in terms of parameters, the embedding curvature of their manifolds in terms of the true coordinate systems also vanishes. The active learning loss function is the expected Riemannian distance over the input and the output data (equation 13). This scheme is illustrated by two examples: nonlinear regression and logistic regression. There are abundant future works to be pursued. The active learning scheme is computationally intensive. More efficient algorithms for evaluating expected loss function need to be developed. Secondly, the Bayesian approach for parameter estimation is not yet incorporated into the framework. Moreover, for the model manifolds which are not dually flat, the KL divergence is no longer proportional to the Riemannian distance. How to evaluate the Riemannian distance efficient on a curved manifold needs to be studied.
Acknowledgement The author would like to thank professor Tommi Jaakkola for the discussion about information geometry and Jason Rennie for reviewing and commenting on the paper.
References [1] Amari, S. I. (2001). Information geometry of hierarchy of probability distributions, IEEE transactions on information theory, 47:50, 1701-1711. 481 [2] Amari, S. I. (1996). Information geometry of neural networks – a new Bayesian duality theory, International conference on neural information processing. [3] Amari, S. I. (1995). Information geometry of the EM and em algorithms for neural networks. Neural networks, 9, 1379-1408. 481, 485 [4] Amari, S. I. (1985). Differential geometrical methods in statistics, Springer Lecture Notes in Statistics, 28, Springer. 483, 486 [5] Amari, S. I. (1982). Differential geometry of curved exponential families – curvatures and information loss. Annals of statistics, 10:2, 357-385. 482, 483, 484, 486 [6] Csisz´ ar, I. and Tusn´ ady, G. (1984). Information geometry and alternating minimization procedures, Statistics & Decisions, Supplement Issue, 1, 205-237. 481, 485 [7] Fedorov, V. V. (1972). Theory of optimal experiments. New York: Academic Press. 480, 481, 485, 489 [8] MacKay, D. J. C. (1992). Information-based objective functions for active data selection. Neural computation, 4, 589-603. [9] Minka, T. P. (2001). Algorithms for maximum-likelihood logistic regression. CMU Statistics Technical Report 758. 490
492
Chen-Hsiang Yeang
[10] Sokolnikoff, I. S. (1964). Tensor analysis. New York:John Wiley & Sons. [11] Sung, K. K. and Niyogi, P. (1995). Active learning for function approximation. Advances in neural information processing systems, 7, 593-600. 481 [12] Tong, S. and Koller, D. (2001). Active learning for structure in Bayesian networks. International joint conference on artificial intelligence. 481 [13] Tong, S. and Koller, D. (2000). Active learning for parameter estimation in Bayesian networks. Advances in neural information processing systems, 13, 647653. 485
Stacking with an Extended Set of Meta-level Attributes and MLR ˇ Bernard Zenko and Saˇso Dˇzeroski Department of Intelligent Systems, Joˇzef Stefan Institute Jamova 39, SI-1000 Ljubljana, Slovenia {Bernard.Zenko,Saso.Dzeroski}@ijs.si
Abstract. We propose a new set of meta-level features to be used for learning how to combine classifier predictions with stacking. This set includes the probability distributions predicted by the base-level classifiers and a combination of these with the certainty of the predictions. We use these features in conjunction with multi-response linear regression (MLR) at the meta-level. We empirically evaluate the proposed approach in comparison to several state-of-the-art methods for constructing ensembles of heterogeneous classifiers with stacking. Our approach performs better than existing stacking approaches and also better than selecting the best classifier from the ensemble by cross validation (unlike existing stacking approaches, which at best perform comparably to it).
1
Introduction
An ensemble of classifiers is a set of classifiers whose individual predictions are combined in some way (typically by voting) to classify new examples. One of the most active areas of research in supervised learning has been to study methods for constructing good ensembles of classifiers [3]. The attraction that this topic exerts on machine learning researchers is based on the premise that ensembles are often much more accurate than the individual classifiers that make them up. Most of the research on classifier ensembles is concerned with generating ensembles by using a single learning algorithm [5], such as decision tree learning or neural network training. Different classifiers are generated by manipulating the training set (as done in boosting or bagging), manipulating the input features, manipulating the output targets or injecting randomness in the learning algorithm. The generated classifiers are then typically combined by voting or weighted voting. Another approach is to generate classifiers by applying different learning algorithms (with heterogeneous model representations) to a single data set (see, e.g., [8]). More complicated methods for combining classifiers are typically used in this setting. Stacking [15] is often used to learn a combining method in addition to the ensemble of classifiers. Voting is then used as a baseline method for combining classifiers against which the learned combiners are compared. Typically, much better performance is achieved by stacking as compared to voting. T. Elomaa et al. (Eds.): ECML, LNAI 2430, pp. 493–504, 2002. c Springer-Verlag Berlin Heidelberg 2002
494
ˇ Bernard Zenko and Saˇso Dˇzeroski
The work presented in this paper is set in the stacking framework. We propose a new set of meta-level features. We use them in conjunction with multi-response linear regression at the meta-level, and show that this combination does perform better than other combining approaches. We argue that selecting the best of the classifiers in an ensemble generated by applying different learning algorithms should be considered as a baseline to which the stacking performance should be compared. Our empirical evaluation of several recent stacking approaches shows that they perform comparably to the best of the individual classifiers as selected by cross validation, but not better. The approach we propose here performs better than selecting the best individual classifier. Section 2 first summarizes the stacking framework, then surveys some recent results and finally introduces our stacking approach based on classification via linear regression. The setup for the experimental comparison of several stacking methods, voting and selecting the best classifier is described in Section 3. Section 4 presents and discusses the experimental results and Section 5 concludes.
2
Stacking
We first give a brief introduction to the stacking framework, introduced by Wolpert [15]. We then summarize the results of several recent studies in stacking [8, 11, 12, 10, 13]. Motivated by these, we introduce a modified stacking approach based on classification via linear regression [11]. 2.1
The Stacking Framework
Stacking is concerned with combining multiple classifiers generated by using different learning algorithms L1 , . . . , LN on a single data set S, which consists of examples si = (xi , yi ), i.e., pairs of feature vectors (xi ) and their classifications (yi ). In the first phase, a set of base-level classifiers C1 , C2 , . . . CN is generated, where Ci = Li (S). In the second phase, a meta-level classifier is learned that combines the outputs of the base-level classifiers. To generate a training set for learning the meta-level classifier, a leave-oneout or a cross validation procedure is applied. For leave-one-out, we apply each of the base-level learning algorithms to almost the entire data set, leaving one example for testing: ∀i = 1, . . . , n : ∀k = 1, . . . , N : Cki = Lk (S −si ). We then use the learned classifiers to generate predictions for si : yˆik = Cki (xi ). The meta-level data set consists of examples of the form ((ˆ yi1 , . . . , yˆiN ), yi ), where the features are the predictions of the base-level classifiers and the class is the correct class of the example at hand. When performing, say, ten-fold cross validation, instead of leaving out one example at a time, subsets of size one-tenth of the original data set are left out and the predictions of the learned classifiers obtained on these. We use ten-fold cross validation in all our experiments for generating the meta-level training set. In contrast to stacking, no learning takes place at the meta-level when combining classifiers by a voting scheme (such as plurality, probabilistic or weighted
Stacking with an Extended Set of Meta-level Attributes and MLR
495
voting). The voting scheme remains the same for all different training sets and sets of learning algorithms (or base-level classifiers). The simplest voting scheme is the plurality vote. According to this voting scheme, each base-level classifier casts a vote for its prediction. The example is classified in the class that collects the most votes. 2.2
Recent Advances
The most important issues in stacking are probably the choice of the features and the algorithm for learning at the meta-level. Below we review some recent research on stacking that addresses the above issues. It is common knowledge that ensembles of diverse base-level classifiers (with weakly correlated predictions) yield good performance. Merz [8] proposes a stacking method called SCANN that uses correspondence analysis to detect correlations between the predictions of base-level classifiers. The original meta-level feature space (the class-value predictions) is transformed to remove the dependencies, and a nearest neighbor method is used as the meta-level classifier on this new feature space. Ting and Witten [11] use base-level classifiers whose predictions are probability distributions over the set of class values, rather than single class values. The meta-level attributes are thus the probabilities of each of the class values returned by each of the base-level classifiers. The authors argue that this allows to use not only the predictions, but also the confidence of the base-level classifiers. Multi-response linear regression (MLR) is recommended for meta-level learning, while several learning algorithms are shown not to be suitable for this task. Seewald and F¨ urnkranz [10] propose a method for combining classifiers called grading that learns a meta-level classifier for each base-level classifier. The metalevel classifier predicts whether the base-level classifier is to be trusted (i.e., whether its prediction will be correct). The base-level attributes are used also as meta-level attributes, while the meta-level class values are + (correct) and − (incorrect). Only the base-level classifiers that are predicted to be correct are taken and their predictions combined by summing up the probability distributions predicted. Todorovski and Dˇzeroski [12] introduce a new meta-level learning method for combining classifiers with stacking: meta decision trees (MDTs) have baselevel classifiers in the leaves, instead of class-value predictions. Properties of the probability distributions predicted by the base-level classifiers (such as entropy and maximum probability) are used as meta-level attributes, rather than the distributions themselves. These properties reflect the confidence of the base-level classifiers and give rise to very small MDTs, which can (at least in principle) be inspected and interpreted. Todorovski and Dˇzeroski [13] report that stacking with MDTs clearly outperforms voting and stacking with decision trees, as well as boosting and bagging of decision trees. On the other hand, MDTs perform only slightly better than ˇ SCANN and selecting the best classifier with cross validation (SelectBest). Zenko et al. [16] report that MDTs perform slightly worse as compared to stacking with
496
ˇ Bernard Zenko and Saˇso Dˇzeroski
MLR. Overall, SCANN, MDTs, stacking with MLR and SelectBest seem to perform at about the same level. It would seem natural to expect that ensembles of classifiers induced by stacking would perform better than the best individual base-level classifier: otherwise the extra work of learning a meta-level classifier doesn’t seem justified. The experimental results mentioned above, however, do not show clear evidence of this. This has motivated us to seek new stacking methods and investigate their performance relative to state-of-the-art stacking methods and SelectBest, in the hope of achieving performance that would be clearly superior to SelectBest. 2.3
Stacking with Multi-response Linear Regression
The experimental evidence mentioned above indicates that although SCANN, MDTs, stacking with MLR and SelectBest seem to perform at about the same level, stacking with MLR has a slight advantage over the other methods. It would thus seem as a suitable starting point in the search for better method for meta-level learning to be used in stacking. MLR is an adaptation of linear regression. For a classification problem with m class values {c1 , c2 , . . . cm }, m regression problems are formulated: for problem j, a linear equation LRj is constructed to predict a binary variable which has value one if the class value is cj and zero otherwise. Given a new example x to classify, LRj (x) is calculated for all j, and the class k is predicted for which LRk (x) is the highest. In seeking to improve upon stacking with MLR, we have explored two possible directions that correspond to the major issues in stacking. Concerning the choice of the algorithm for learning at the meta-level, we have explored the use of model trees instead of LR [6]since model trees naturally extend LR to construct piecewise linear approximations. In this paper, we consider the choice of the meta-level features used for stacking. 2.4
An Extended Set of Meta-level Features for Stacking
We assume that each base-level classifier predicts a probability distribution over the possible class values. Thus, the prediction of the base-level classifier C when applied to example x is a probability distribution: pC (x) = pC (c1 |x), pC (c2 |x), . . . pC (cm |x) , where {c1 , c2 , . . . cm } is the set of possible class values and pC (ci |x) denotes the probability that example x belongs to class ci as estimated (and predicted) by classifier C. The class cj with the highest class probability pC (cj |x) is predicted by classifier C. The meta-level attributes as proposed by [11] are the probabilities predicted for each possible class by each of the base-level classifiers, i.e., pCj (ci |x)
Stacking with an Extended Set of Meta-level Attributes and MLR
497
for i = 1, . . . , m and j = 1, . . . , N . In our approach, we use two additional sets of meta-level attributes: probability distributions multiplied by maximum probability m PCj = pCj (ci |x) × MC = pCj (ci |x) × max pCj (ci |x) i=1
for i = 1, . . . , m and j = 1, . . . , N and entropies of probability distributions EC = −
m
pC (ci |x) · log2 pC (ci |x).
i=1
Therefore the total number of meta-level attributes in our approach is N (2m+1). The motivation for considering these additional meta-level attributes is as follows. Already Ting and Witten [11] state that the use of probability distributions has the advantage of capturing not only the predictions of the base-level classifiers, but also their certainty. The attributes we have added try to capture the certainty of the predictions more explicitly (the entropies EC ) and combine them with the predictions themselves (the products PCj of the individual probabilities and the maximal probabilities MC in a predicted distribution). The attributes MC and EC have been used in the construction of meta decision trees [12]. It should be noted here that we have performed preliminary experiments using only the attributes PCj and EC (without the original probability distributions). The results of these experiments showed no significant improvement over using the original probability distributions only. We can therefore conclude that the synergy of all three sets of attributes is responsible for the performance improvement achieved by our approach.
3
Experimental Setup
In the experiments, we investigate the performance of stacking with multiresponse linear regression and the extended set of meta-level attributes. and in particular its relative performance as compared to existing state-of-the-art stacking methods and SelectBest. The Weka data mining suite [14] was used for all experiments, within which all the base-level and meta-level learning algorithms used in the experiments have been implemented. 3.1
Data Sets
In order to evaluate the performance of the different combining algorithms, we perform experiments on a collection of twenty data sets from the UCI Repository of machine learning databases [2]. These data sets have been widely used in other comparative studies. The data sets and their properties (number of examples, classes, (discrete/continuous) attributes, probability of the majority class, entropy of the class probability distribution) are listed in Table 1.
498
ˇ Bernard Zenko and Saˇso Dˇzeroski
Table 1. The data sets used and their properties (number of examples, classes, (discrete/continuous) attributes, probability of the majority class, entropy of the class probability distribution) Data set
3.2
Exs
Cls
(D/C)
Att
Maj
Ent
australian balance breast-w bridges-td car
690 625 699 102 1728
2 3 2 2 4
(8/6) (0/4) (9/0) (4/3) (6/0)
14 4 9 7 6
0.56 0.46 0.66 0.85 0.70
0.99 1.32 0.92 0.61 1.21
chess diabetes echo german glass
3196 768 131 1000 214
2 2 2 2 6
(36/0) (0/8) (1/5) (13/7) (0/9)
36 8 6 20 9
0.52 0.65 0.67 0.70 0.36
0.99 0.93 0.91 0.88 2.18
heart hepatitis hypo image ionosphere
270 155 3163 2310 351
2 2 2 7 2
(6/7) (13/6) (18/7) (0/19) (0/34)
13 19 25 19 34
0.56 0.79 0.95 0.14 0.64
0.99 0.74 0.29 2.78 0.94
iris soya vote waveform wine
150 683 435 5000 178
3 19 2 3 3
(0/4) (35/0) (16/0) (0/21) (0/13)
4 35 16 21 13
0.33 0.13 0.61 0.34 0.40
1.58 3.79 0.96 1.58 1.56
Base-Level Algorithms
We use three different learning algorithms at the base level: – J4.8: a Java re-implementation of the decision tree learning algorithm C4.5 [9], – IBk: the k-nearest neighbor algorithm of [1], and – NB: the naive Bayes algorithm of [7]. All algorithms are used with their default parameter settings, with the exceptions described below. IBk uses inverse distance weighting and k is selected with cross validation from the range of 1 to 77. The NB algorithm uses the kernel density estimator rather than assume normal distributions for numeric attributes. These settings were chosen in advance and were not tuned to our data sets. 3.3
Meta-level Algorithms
At the meta-level, we evaluate the performance of six different schemes for combining classifiers (listed below).
Stacking with an Extended Set of Meta-level Attributes and MLR
499
Table 2. Error rates (in %) of the learned ensembles of classifiers
Data set
Vote
Selb
Grad
Smdt
Smlr
Smlr-E
australian balance breast-w bridges-td car
13.81 8.91 3.46 15.78 6.49
13.78 8.51 2.69 15.78 5.83
14.04 8.78 3.69 15.10 6.10
13.77 8.51 2.69 16.08 5.02
14.16 9.47 2.73 14.12 5.61
13.93 6.40 2.58 14.80 4.11
chess diabetes echo german glass
1.46 24.01 29.24 25.19 29.67
0.60 25.09 27.63 25.69 32.06
1.16 24.26 30.38 25.41 30.75
0.60 24.74 27.71 25.60 31.78
0.60 23.78 28.63 24.36 30.93
0.60 24.51 27.71 25.53 31.64
heart hepatitis hypo image ionosphere
17.11 17.42 1.32 2.94 7.18
16.04 15.87 0.72 2.85 8.40
17.70 18.39 0.80 3.32 8.06
16.04 15.87 0.79 2.53 8.83
15.30 15.68 0.72 2.84 7.35
15.93 15.87 0.72 2.80 6.87
iris soya vote waveform wine
4.20 6.75 7.10 15.90 1.74
4.73 7.22 3.54 14.42 3.26
4.40 7.38 5.22 17.04 1.80
4.73 7.06 3.54 14.40 3.26
4.47 7.22 3.54 14.33 2.87
4.87 7.35 3.59 13.61 2.02
Average
11.98
11.74
12.19
11.68
11.44
11.27
– Vote: The simple plurality vote scheme (results of preliminary experiments showed that this performs better than the probability vote scheme). – Selb: The SelectBest scheme selects the best of the base-level classifiers by ten-fold cross validation. – Grad: Grading as introduced by Seewald and F¨ urnkranz [10] and briefly described in Section 2.2. – Smdt: Stacking with meta decision-trees as introduced by Todorovski and Dˇzeroski [12] and briefly described in Section 2.2. – Smlr: Stacking with multiple-response regression as used by Ting and Witten [11] and described in Sections 2.2 and 2.3. – Smlr-E: Stacking with multiple-response regression and extended set of meta-level attributes, as proposed by this paper and described in Section 2.3. 3.4
Evaluating and Comparing Algorithms
In all the experiments presented here, classification errors are estimated using ten-fold stratified cross validation. Cross validation is repeated ten times using
500
ˇ Bernard Zenko and Saˇso Dˇzeroski
Table 3. Relative improvement in accuracy (in %) of stacking with multiresponse linear regression (Smlr-E) as compared to other combining algorithms and its significance (+/– means significantly better/worse, x means insignificant) Data set
Vote
Selb
Grad
Smdt
Smlr
australian balance breast-w bridges-td car
-0.84 28.19 25.62 6.21 36.63
x -1.05 x 0.83 x -1.16 x 1.64 x + 24.81 + 27.14 + 24.81 + 32.43 + + 4.26 + 30.23 + 4.26 + 5.76 + x 6.21 x 1.95 x 7.93 x -4.86 x + 29.46 + 32.54 + 17.99 + 26.70 +
chess diabetes echo german glass
59.10 -2.06 5.22 -1.35 -6.61
+ 0.00 x 48.66 + x 2.33 + -1.02 x x -0.28 x 8.79 + x 0.62 x -0.47 x – 1.31 x -2.89 x
heart hepatitis hypo image ionosphere
6.93 8.89 45.35 4.57 4.37
+ 0.69 x 0.00 + 0.00 x 1.82 x 18.31
x x x x +
10.04 13.68 9.13 15.54 14.84
– – + + x
x x x + +
-10.61 0.40 31.28 20.17 -12.50
iris soya vote waveform wine Average W/L
-15.87 -8.89 49.51 14.45 -16.13 15.24
8+/3–
-2.82 -1.83 -1.30 5.63 37.93
0.00 0.95 0.00 0.27 0.44
x 0.00 x x -3.07 – x 3.20 x x -4.80 – x -2.27 x
+ 0.69 x -4.12 x + -0.00 x -1.23 x + 8.77 x 0.00 x + -10.60 – 1.37 x + 22.26 + 6.59 + x -2.82 x -4.15 + -1.30 + 5.53 x 37.93
x x x + +
-8.96 -1.83 -1.30 5.03 29.41
x x x + +
7.11
13.40
6.37
4.76
7+/0–
12+/0–
6+/1–
6+/2–
different random generator seeds resulting in ten different sets of folds. The same folds (random generator seeds) are used in all experiments. The classification error of a classification algorithm C for a given data set as estimated by averaging over the ten runs of ten-fold cross validation is denoted with error(C). For pair-wise comparisons of classification algorithms, we calculate the relative improvement and the paired t-test, as described below. In order to evaluate the accuracy improvement achieved in a given domain by using classifier C1 as compared to using classifier C2 , we calculate the relative improvement: 1−error(C1 )/error(C2 ). In Table 3, we compare the performance of Smlr-E to other approaches: C1 in this table thus refers to ensembles combined with SmlrE. The average relative improvement across all domains is calculated using the geometric mean of error reduction in individual domains: 1 − geometric mean(error(C1 )/error(C2 )). Note that this may be different from geometric mean(error(C2 )/error(C1 )) −1.
Stacking with an Extended Set of Meta-level Attributes and MLR
501
Table 4. The relative performance of ensembles with different combining methods in terms of wins+/loses–. The entry in row X and column Y gives the number of wins+/loses– of X over Y
Vote Selb Grad Smdt Smlr Smlr-E
Vote
Selb
Grad
Smdt
Smlr
Smlr-E
Total
/ 9+/7– 4+/6– 10+/6– 10+/5– 8+/3–
7+/9– / 3+/10– 2+/0– 4+/2– 7+/0–
6+/4– 10+/3– / 11+/1– 13+/2– 12+/0–
6+/10– 0+/2– 1+/11– / 4+/4– 6+/1–
5+/10– 2+/4– 2+/13– 4+/4– / 6+/2–
3+/8– 0+/7– 0+/12– 1+/6– 2+/6– /
27+/41– 21+/23– 10+/42– 28+/17– 33+/19– 39+/6–
The classification errors of C1 and C2 averaged over the ten runs of ten-fold cross validation are compared for each data set (error(C1 ) and error(C2 ) refer to these averages). The statistical significance of the difference in performance is tested using the paired t-test (exactly the same folds are used for C1 and C2 ) with significance level of 95%: +/− to the right of a figure in the tables with results means that the classifier C1 is significantly better/worse than C2 . At this place we have to say that we are fully aware of the weakness of our significance testing method described above. Namely, when we repeat ten-fold cross validation ten times we do not get ten independent accuracy assessments as required by the paired t-test. As a result we have a high risk of committing a type I error (incorrectly rejecting the null hypothesis). This means that it is likely that a smaller number of differences between classifiers are statistically significant than reported by our testing method. Due to this problem we have also tried using two significance testing methods proposed by Dietterich [4]: the tenfold cross validated paired t-test and the 5x2cv paired t-test. The problem with these two tests is that while they have smaller probability of type I error they are much less sensitive. According to these two tests, the differences between the simplest approach (Vote scheme) and a current state-of-the-art approach (stacking with MLR) are hardly significant. Therefore we have decided to use the above described significance testing.
4
Experimental Results
The error rates of the ensembles induced on the twenty data sets and combined with the different combining methods are given in Table 2. However, for the purpose of comparing the performance of different combining methods, Table 4 is of much more interest: it gives the number of significant wins/loses of X over Y for each pair of combining methods X and Y . Table 3 presents a more detailed comparison (per data set) of Smlr-E to the other combining methods. Below we highlight some of our findings.
502
ˇ Bernard Zenko and Saˇso Dˇzeroski
Inspecting Table 4, to examine the relative performance of Smlr-E to the other combining methods, we find that Smlr-E is in a league of its own. It clearly outperforms all the other combining methods, with a wins – loss difference of at least 4 and a relative improvement of at least 5% (see Table 3). As expected, the difference is smallest when compared to Smlr. Returning to Table 4, we find that we can partition the five existing combining algorithms into three groups. Vote and Grad are at the lower end of the performance scale, Selb and Smdt are in the middle, while Smlr performs best. While Smlr clearly outperforms Vote and Grad in one to one comparison, there is no difference when compared to Smdt (equal number of wins and losses). None of the existing stacking methods perform clearly better than Selb. Smlr and Smdt have a slight advantage (two more wins than losses), while Vote and Grad perform worse. Smlr-E, on the other hand, clearly outperforms Selb with seven wins, no losses, and an average relative improvement of 7%.
5
Conclusions and Further Work
We have proposed a new set of meta-level features to be used for combining heterogeneous classifiers with stacking. These include the probability distributions predicted by the base-level classifiers, their certainty (entropy), and a combination of both (the products of the individual probabilities and the maximal probabilities in a predicted distribution). In conjunction with the multi-response linear regression (MLR) algorithm at the meta-level, this approach outperforms existing stacking approaches. While the existing approaches perform (at best) comparably to selecting the best classifier from the ensemble by cross validation, the proposed approach clearly performs better. The use of the certainty features in addition to the probability distributions is obviously the key to the improved performance. A more detailed analysis of which of the new attributes are used and their relative importance is an immediate topic for further work. The same goes for the experimental evaluation of the proposed approach in a setting with seven base-level classifiers (as in [6]. Finally, ˇ combining the approach proposed here with that of Dˇzeroski and Zenko [6] (i.e., using both a new set of meta-level features and a new meta-level learning algorithm) should also be investigated. Some more general topics for further work ˇ are discussed below: these have been also discussed by Dˇzeroski and Zenko [6]. ˇ While conducting this study, the study of Dˇzeroski and Zenko [6], and a few other recent studies [16, 13], we have encountered quite a few contradictions between claims in the recent literature on stacking and our experimental results. For example, Merz [8] claims that SCANN is clearly better than the oracle selecting the best classifier (which should perform even better than SelectBest). Ting and Witten [11] claim that stacking with MLR clearly outperforms SelectBest. Finally, Seewald and F¨ urnkranz [10] claim that both grading and stacking with MLR perform better than SelectBest. A comparative study including the data sets in the recent literature and a few other stacking methods (such as SCANN)
Stacking with an Extended Set of Meta-level Attributes and MLR
503
should resolve these contradictions and provide a clearer picture of the relative performance of different stacking approaches. We believe this is a worthwhile topic to pursue in near-term future work. We also believe that further research on stacking in the context of base-level classifiers created by different learning algorithms is in order, despite the current focus of the machine learning community on creating ensembles with a single learning algorithm with injected randomness or its application to manipulated training sets, input features and output targets. This should include the pursuit for better sets of meta-level features and better meta-level learning algorithms.
Acknowledgements Many thanks to Ljupˇco Todorovski for the cooperation on combining classifiers with meta-decision trees and the many interesting and stimulating discussions related to this paper. Thanks also to Alexander Seewald for providing his implementation of grading in Weka.
References [1] D. Aha, D. W. Kibler, and M. K. Albert. Instance-based learning algorithms. Machine Learning, 6:37–66, 1991. 498 [2] C. L. Blake and C. J. Merz. UCI repository of machine learning databases, 1998. 497 [3] T. G. Dietterich. Machine-learning research: Four current directions. AI Magazine, 18(4):97–136, 1997. 493 [4] T. G. Dietterich. Approximate statistical test for comparing supervised classification learning algorithms. Neural Computation, 10(7):1895–1923, 1998. 501 [5] T. G. Dietterich. Ensemble methods in machine learning. In Proceedings of the First International Workshop on Multiple Classifier Systems, pages 1–15, Berlin, 2000. Springer. 493 ˇ [6] S. Dˇzeroski and B. Zenko. Is combining classifiers better than selecting the best one? In Proceedings of the Nineteenth International Conference on Machine Learning, San Francisco, 2002. Morgan Kaufmann. 496, 502 [7] G. H. John and P. Langley. Estimating continuous distributions in bayesian classifiers. In Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, pages 338–345, San Francisco, 1995. Morgan Kaufmann. 498 [8] C. J. Merz. Using correspondence analysis to combine classifiers. Machine Learning, 36(1/2):33–58, 1999. 493, 494, 495, 502 [9] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Francisco, 1993. 498 [10] A. K. Seewald and J. F¨ urnkranz. An evaluation of grading classifiers. In Advances in Intelligent Data Analysis: Proceedings of the Fourth International Symposium (IDA-01), pages 221–232, Berlin, 2001. Springer. 494, 495, 499, 502 [11] K. M. Ting and I. H. Witten. Issues in stacked generalization. Journal of Artificial Intelligence Research, 10:271–289, 1999. 494, 495, 496, 497, 499, 502
504
ˇ Bernard Zenko and Saˇso Dˇzeroski
[12] L. Todorovski and S. Dˇzeroski. Combining multiple models with meta decision trees. In Proceedings of the Fourth European Conference on Principles of Data Mining and Knowledge Discovery, pages 54–64, Berlin, 2000. Springer. 494, 495, 497, 499 [13] L. Todorovski and S. Dˇzeroski. Combining classifiers with meta decision trees. Machine Learning, In press, 2002. 494, 495, 502 [14] I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, San Francisco, 1999. 497 [15] D. Wolpert. Stacked generalization. Neural Networks, 5(2):241–260, 1992. 493, 494 ˇ [16] B. Zenko, L. Todorovski, and S. Dˇzeroski. A comparison of stacking with MDTs to bagging, boosting, and other stacking methods. In Proceedings of the First IEEE International Conference on Data Mining, pages 669–670, Los Alamitos, 2001. IEEE Computer Society. 495, 502
Finding Hidden Factors Using Independent Component Analysis Erkki Oja Helsinki University of Technology, Neural Networks Research Centre P.O.B. 5400, 02015 HUT, Finland {Erkki.Oja}@hut.fi
Abstract. Independent Component Analysis (ICA) is a computational technique for revealing hidden factors that underlie sets of measurements or signals. ICA assumes a statistical model whereby the observed multivariate data, typically given as a large database of samples, are assumed to be linear or nonlinear mixtures of some unknown latent variables. The mixing coefficients are also unknown. The latent variables are nongaussian and mutually independent, and they are called the independent components of the observed data. By ICA, these independent components, also called sources or factors, can be found. Thus ICA can be seen as an extension to Principal Component Analysis and Factor Analysis. ICA is a much richer technique, however, capable of finding the sources when these classical methods fail completely. In many cases, the measurements are given as a set of parallel signals or time series. Typical examples are mixtures of simultaneous sounds or human voices that have been picked up by several microphones, brain signal measurements from multiple EEG sensors, several radio signals arriving at a portable phone, or multiple parallel time series obtained from some industrial process. The term blind source separation is used to characterize this problem. The lecture will first cover the basic idea of demixing in the case of a linear mixing model and then take a look at the recent nonlinear demixing approaches. Although ICA was originally developed for digital signal processing applications, it has recently been found that it may be a powerful tool for analyzing text document data as well, if the documents are presented in a suitable numerical form. A case study on analyzing dynamically evolving text is covered in the talk.
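As a concrete illustration of the linear mixing model described in the abstract, the following minimal sketch (not part of the talk; the mixing matrix and source signals are invented for illustration) separates two artificially mixed signals with FastICA from scikit-learn:

    import numpy as np
    from sklearn.decomposition import FastICA

    # Two independent, nongaussian source signals (hypothetical example data).
    t = np.linspace(0, 8, 2000)
    s1 = np.sign(np.sin(3 * t))                          # square wave
    s2 = np.sin(5 * t) + 0.1 * np.random.randn(len(t))   # noisy sinusoid
    S = np.c_[s1, s2]

    # Observed data: unknown linear mixtures of the sources.
    A = np.array([[1.0, 0.5], [0.4, 1.0]])   # mixing matrix (unknown in practice)
    X = S @ A.T

    # Recover the independent components from the mixtures alone.
    ica = FastICA(n_components=2, random_state=0)
    S_est = ica.fit_transform(X)   # estimated sources
    A_est = ica.mixing_            # estimated mixing matrix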
Reasoning with Classifiers Dan Roth Department of Computer Science University of Illinois at Urbana-Champaign [email protected]
Abstract. Research in machine learning concentrates on the study of learning single concepts from examples. In this framework the learner attempts to learn a single hidden function from a collection of examples, assumed to be drawn independently from some unknown probability distribution. However, in many cases – as in most natural language and visual processing situations – decisions depend on the outcomes of several different but mutually dependent classifiers. The classifiers' outcomes need to respect some constraints that could arise from the sequential nature of the data or other domain specific conditions, thus requiring a level of inference on top of the predictions. We will describe research and present challenges related to Inference with Classifiers – a paradigm in which we address the problem of using the outcomes of several different classifiers in making coherent inferences – those that respect constraints on the outcome of the classifiers. Examples will be given from the natural language domain.
The emphasis of the research in machine learning has been on the study of learning single concepts from examples. In this framework the learner attempts to learn a single hidden function from a collection of examples, assumed to be drawn independently from some unknown probability distribution, and its performance is measured when classifying future examples. In the context of natural language, for example, work in this direction has allowed researchers and practitioners to address the robust learnability of predicates such as "the part-of-speech of the word can in the given sentence is noun", "the semantic sense of the word "plant" in the given sentence is "an industrial plant"", or determine, in a given sentence, the word that starts a noun phrase. In fact, a large number of disambiguation problems such as part-of-speech tagging, word-sense disambiguation, prepositional phrase attachment, accent restoration, word choice selection in machine translation, context-sensitive spelling correction, word selection in speech recognition and identifying discourse markers have been addressed using machine learning techniques – in each of these problems it is necessary to disambiguate two or more [semantically, syntactically or structurally]-distinct forms which have been fused together into the same representation in some medium; a stand-alone classifier can be learned to perform these tasks quite successfully [10].
Paper written to accompany an invited talk at ECML'02. This research is supported by NSF grants IIS-99-84168, ITR-IIS-00-85836 and an ONR MURI award.
However, in many cases – as in most natural language and visual processing situations – higher level decisions depend on the outcomes of several different but mutually dependent classifiers. Consider, for example, the problem of chunking natural language sentences where the goal is to identify several kinds of phrases (e.g. noun (NP), verb (VP) and prepositional (PP) phrases) in sentences, as in: [NP He ] [VP reckons ] [NP the current account deficit ] [VP will narrow ] [PP to ] [NP only $ 1.8 billion ] [PP in ] [NP September] . A task of this sort involves multiple predictions that interact in some way. For example, one way to address the problem is to utilize two classifiers for each type of phrase, one of which recognizes the beginning of the phrase, and the other its end. Clearly, there are constraints over the predictions; for instance, phrases cannot overlap and there may also be probabilistic constraints over the order of phrases and over their lengths. The goal is to optimize some global measure of accuracy, not necessarily to maximize the performance of each individual classifier involved in the decision [8]. As a second example, consider the problem of recognizing the kill (KFJ, Oswald) relation in the sentence "J. V. Oswald was murdered at JFK after his assassin, R. U. KFJ...". This task requires making several local decisions, such as identifying named entities in the sentence, in order to support the relation identification. For example, it may be useful to identify that Oswald and KFJ are people, and JFK is a location. In addition, it is necessary to identify that the action kill is described in the sentence. All of this information will help to discover the desired relation and identify its arguments. At the same time, the relation kill constrains its arguments to be people (or at least, not to be locations) and, in turn, helps to enforce that Oswald and KFJ are likely to be people, while JFK is not. Finally, consider the challenge of designing a free-style natural language user interface that allows users to request in-depth information from a large collection of on-line articles, the web, or other semi-structured information sources. Specifically, consider the computational processes required in order to "understand" a simple question of the form "what is the fastest automobile in the world?", and respond correctly to it. A straightforward keyword search may suggest that the following two passages contain the answer: ... will stretch Volkswagen's lead in the world's fastest growing vehicle market. Demand for cars is expected to soar... ... the Jaguar XJ220 is the dearest (415,000 pounds), fastest (217mph) and most sought after car in the world. However, "understanding" the question and the passages to a level that allows a decision as to which in fact contains the correct answer, and extracting it, is a very challenging task. Traditionally, the tasks described above have been viewed as inferential tasks [4, 7]; the hope was that stored knowledge about the language and the world will
allow inferring the syntactic and semantic analysis of the question and the candidate answers; background knowledge (e.g., Jaguar is a car company; automobile is synonymous with car) will then be used to choose the correct passage and to extract the answer. However, it has become clear that many of the difficulties in this task involve problems of context-sensitive ambiguities. These are abundant in natural language and occur at various levels of the processing, from syntactic disambiguation (is "demand" a Noun or a Verb?), to sense and semantic class disambiguation (what is a "Jaguar"?), phrase identification (importantly, "the world's fastest growing vehicle market" is a noun phrase in the passage above) and others. Resolving any of these ambiguities requires a lot of knowledge about the world and the language, but knowledge that cannot be written "explicitly" ahead of time. It is widely accepted today that any robust computational approach to these problems has to rely on a significant component of statistical learning, used both to acquire knowledge and to perform low level predictions of the type mentioned above. The inference component is still very challenging. This view suggests, however, that rather than a deterministic collection of "facts" and "rules", the inference challenge stems from the interaction of the large number of learned predictors involved. Inference of this sort is needed at the level of determining an answer to the question. An answer to the abovementioned question needs to be a name of a car company (predictor 1: identify the sought after entity; predictor 2: determine if the string Z represents a name of a car company) but also the subject of a sentence (predictor 3) in which a word equivalent to "fastest" (predictor 4) modifies (predictor 5) a word equivalent to "automobile" (predictor 6). Inferences of this sort are necessary also at other, lower levels of the process, as in the abovementioned problem of identifying noun phrases in a given sentence. Thus, decisions typically depend on the outcomes of several predictors and they need to be made in ways that provide coherent inferences that satisfy some constraints. These constraints might arise from the sequential nature of the data, from semantic or pragmatic considerations or other domain specific conditions. The examples described above exemplify the need for a unified theory of learning and inference. The purpose of this talk is to survey research in this direction, present progress and challenges. Earlier works in this direction have developed the Learning to Reason framework - an integrated theory of learning, knowledge representation and reasoning within a unified framework [2, 9, 12]. This framework addresses an important aspect of the fundamental problem of unifying learning and reasoning - it proves the benefits of performing reasoning on top of learned hypotheses. And, by incorporating learning into the inference process it provides a way around some knowledge representation and comprehensibility issues that have traditionally prevented efficient solutions. The work described here – on Inference with Classifiers – can be viewed as a concrete instantiation of the Learning to Reason framework; it addresses a second important aspect of a unified theory of learning and reasoning, the one which stems from the fact that, inherently, inferences in some domains involve
a large number of predictors that interact in different ways. The fundamental issue addressed is that of systematically combining, chaining and performing inferences with the outcomes of a large number of mutually dependent learned predictors. We will discuss several well known inference paradigms, and show how to use those for inference with classifiers. Namely, we will use these inference paradigms to develop inference algorithms that take as input outcomes of classifiers and provide coherent inferences that satisfy some domain or problem specific constraints. Some of the inference paradigms used are hidden Markov models (HMMs), conditional probabilistic models [8, 3], loopy Bayesian networks [6, 11], constraint satisfaction [8, 5] and Markov random fields [1]. Research in this direction may offer several benefits over direct use of classifiers or simply using traditional inference models. One benefit is the ability to directly use powerful classifiers to represent domain variables that are of interest in the inference stage. Advantages of this view have been observed in the speech recognition community when neural network based classifiers were combined within an HMM based inference approach, and have been quantified also in [8]. A second key advantage stems from the fact that only a few of the domain variables are actually of any interest at the inference stage. Performing inference with outcomes of classifiers allows for abstracting away a large number of the domain variables (which will be used only to define the classifiers' outcomes) and will be beneficial also computationally. Research in this direction offers several challenges to AI and Machine Learning researchers. One of the key challenges of this direction from the machine learning perspective is to understand how the presence of constraints on the outcomes of classifiers can be systematically analyzed and exploited in order to derive better learning algorithms and for reducing the number of labeled examples required for learning.
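To make the "inference on top of classifiers" idea concrete, here is a minimal sketch (our own illustration, not code from the talk) that combines per-position scores of two hypothetical classifiers – one detecting phrase beginnings, one detecting phrase ends – and searches for the highest-scoring set of non-overlapping phrases by dynamic programming:

    def best_phrases(begin_scores, end_scores, max_len=4):
        """Select non-overlapping [i, j] phrases maximizing the summed scores.

        begin_scores[i] and end_scores[j] are the (log-)scores of two
        classifiers for "a phrase begins at i" / "a phrase ends at j".
        """
        n = len(begin_scores)
        best = [0.0] * (n + 1)       # best[i] = best total score over the first i positions
        choice = [None] * (n + 1)
        for i in range(1, n + 1):
            best[i], choice[i] = best[i - 1], None   # position i-1 left outside any phrase
            for length in range(1, min(max_len, i) + 1):
                start = i - length
                score = best[start] + begin_scores[start] + end_scores[i - 1]
                if score > best[i]:
                    best[i], choice[i] = score, (start, i - 1)
        phrases, i = [], n
        while i > 0:                  # backtrack to recover the chosen phrases
            if choice[i] is None:
                i -= 1
            else:
                phrases.append(choice[i])
                i = choice[i][0]
        return list(reversed(phrases))

    # Toy scores for a 6-word sentence (purely illustrative numbers).
    print(best_phrases([2.0, -1.0, 1.5, -0.5, -1.0, 0.3],
                       [-1.0, 2.0, -0.5, 1.0, -1.0, 0.4]))

The non-overlap constraint is enforced structurally by the search rather than by the individual classifiers, which is the essence of the paradigm described above.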
References 1. D. Heckerman, D. M. Chickering, C. Meek, R. Rounthwaite, and C. M. Kadie. Dependency networks for inference, collaborative filtering, and data visualization. Journal of Machine Learning Research, 1:49–75, 2000. 2. R. Khardon and D. Roth. Learning to reason. Journal of the ACM, 44(5):697–725, Sept. 1997. 3. J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the 18th International Conference on Machine Learning, Williamstown, MA, 2001. 4. J. McCarthy. Programs with common sense. In R. Brachman and H. Levesque, editors, Readings in Knowledge Representation, 1985. Morgan-Kaufmann, 1958. 5. M. Munoz, V. Punyakanok, D. Roth, and D. Zimak. A learning approach to shallow parsing. In EMNLP-VLC’99, the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, pages 168–178, June 1999. 6. K. P. Murphy, Y. Weiss, and M. I. Jordan. Loopy belief propagation for approximate inference: An empirical study. In Proceedings of Uncertainty in AI, pages 467–475, 1999.
7. N. J. Nilsson. Logic and artificial intelligence. Artificial Intelligence, 47:31–56, 1991. 8. V. Punyakanok and D. Roth. The use of classifiers in sequential inference. In NIPS13; The 2000 Conference on Advances in Neural Information Processing Systems, pages 995–1001. MIT Press, 2001. 9. D. Roth. Learning to reason: The non-monotonic case. In Proc. of the International Joint Conference on Artificial Intelligence, pages 1178–1184, 1995. 10. D. Roth. Learning to resolve natural language ambiguities: A unified approach. In Proc. of the American Association of Artificial Intelligence, pages 806–813, 1998. 11. D. Roth and W.-T. Yih. Probabilistic reasoning for entity and relation recognition. In COLING 2002, The 19th International Conference on Computational Linguistics, 2002. 12. L. G. Valiant. Robust logic. In Proceedings of the Annual ACM Symp. on the Theory of Computing, 1999.
A Kernel Approach for Learning from almost Orthogonal Patterns
Bernhard Schölkopf1, Jason Weston1, Eleazar Eskin2, Christina Leslie2, and William Stafford Noble2,3
1 Max-Planck-Institut für biologische Kybernetik, Spemannstr. 38, D-72076 Tübingen, Germany
bernhard.schoelkopf, [email protected]
2 Department of Computer Science, Columbia University, New York
eeskin, cleslie, [email protected]
3 Columbia Genome Center, Columbia University, New York
Abstract. In kernel methods, all the information about the training data is contained in the Gram matrix. If this matrix has large diagonal values, which arises for many types of kernels, then kernel methods do not perform well. We propose and test several methods for dealing with this problem by reducing the dynamic range of the matrix while preserving the positive definiteness of the Hessian of the quadratic programming problem that one has to solve when training a Support Vector Machine.
1 Introduction
Support Vector Machines (SVM) and related kernel methods can be considered an approximate implementation of the structural risk minimization principle suggested by Vapnik (1979). To this end, they minimize an objective function containing a trade-off between two goals, that of minimizing the training error, and that of minimizing a regularization term. In SVMs, the latter is a function of the margin of separation between the two classes in a binary pattern recognition problem. This margin is measured in a so-called feature space H which is a Hilbert space into which the training patterns are mapped by means of a map

  Φ : X → H.                                                          (1)

Here, the input domain X can be an arbitrary nonempty set. The art of designing an SVM for a task at hand consists of selecting a feature space with the property that dot products between mapped input points, ⟨Φ(x), Φ(x′)⟩, can be computed in terms of a so-called kernel

  k(x, x′) = ⟨Φ(x), Φ(x′)⟩                                             (2)

which can be evaluated efficiently. Such a kernel necessarily belongs to the class of positive definite kernels (e.g. Berg et al. (1984)), i.e., it satisfies

  Σ_{i,j=1}^{m} a_i a_j k(x_i, x_j) ≥ 0                               (3)

for all a_i ∈ ℝ, x_i ∈ X, i = 1, …, m. The kernel can be thought of as a nonlinear similarity measure that corresponds to the dot product in the associated feature space. Using k, we can carry out all algorithms in H that can be cast in terms of dot products, examples being SVMs and PCA (for an overview, see Schölkopf and Smola (2002)). To train a hyperplane classifier in the feature space,
  f(x) = sgn(⟨w, Φ(x)⟩ + b),                                           (4)

where w is expanded in terms of the points Φ(x_j),

  w = Σ_{j=1}^{m} a_j Φ(x_j),                                          (5)

the SVM pattern recognition algorithm minimizes the quadratic form [footnote 4]

  ‖w‖² = Σ_{i,j=1}^{m} a_i a_j K_ij                                    (6)

subject to the constraints

  y_i [⟨Φ(x_i), w⟩ + b] ≥ 1,  i.e.,  y_i [Σ_{j=1}^{m} a_j K_ij + b] ≥ 1   (7)

and

  y_i a_i ≥ 0                                                          (8)

for all i ∈ {1, …, m}. Here,

  (x_1, y_1), …, (x_m, y_m) ∈ X × {±1}                                 (9)

are the training examples, and

  K_ij := k(x_i, x_j) = ⟨Φ(x_i), Φ(x_j)⟩                               (10)

is the Gram matrix. Note that the regularizer (6) equals the squared length of the weight vector w in H. One can show that ‖w‖ is inversely proportional to the margin of separation between the two classes, hence minimizing it amounts to maximizing the margin. Sometimes, a modification of this approach is considered, where the regularizer

  Σ_{i=1}^{m} a_i²                                                     (11)

is used instead of (6). Whilst this is no longer the squared length of a weight vector in the feature space H, it is instructive to re-interpret it as the squared length in a different feature space, namely in ℝ^m. To this end, we consider the feature map

  Φ_m(x) := (k(x, x_1), …, k(x, x_m))ᵀ,                                (12)

sometimes called the empirical kernel map (Tsuda, 1999; Schölkopf and Smola, 2002). In this case, the SVM optimization problem consists in minimizing

  ‖a‖²                                                                 (13)

subject to

  y_i [⟨Φ_m(x_i), a⟩ + b] ≥ 1                                          (14)

for all i ∈ {1, …, m}, where a = (a_1, …, a_m)ᵀ ∈ ℝ^m. In view of (12), however, the constraints (14) are equivalent to y_i [Σ_{j=1}^{m} a_j K_ij + b] ≥ 1, i.e. to (7), while the regularizer ‖a‖² equals (11).
Therefore, using the regularizer (11) and the original kernel essentially [footnote 5] corresponds to using a standard SVM with the empirical kernel map. This SVM operates in an m-dimensional feature space with the standard SVM regularizer, i.e., the squared weight of the weight vector in the feature space. We can thus train a classifier using the regularizer (11) simply by using an SVM with the kernel

  k_m(x, x′) := ⟨Φ_m(x), Φ_m(x′)⟩,                                     (15)

and thus, by definition of Φ_m, using the Gram matrix

  K_m = K Kᵀ,                                                          (16)

where K denotes the Gram matrix of the original kernel. The last equation shows that when employing the empirical kernel map, it is not necessary to use a positive definite kernel. The reason is that no matter what K is, the Gram matrix K Kᵀ is always positive definite [footnote 6], which is sufficient for an SVM. The remainder of the paper is structured as follows. In Section 2, we introduce the problem of large diagonals, followed by our proposed method to handle it (Section 3). Section 4 presents experiments, and Section 5 summarizes our conclusions.

Footnote 4: We are considering the zero training error case. Nonzero training errors are incorporated as suggested by Cortes and Vapnik (1995). Cf. also Osuna and Girosi (1999).
where K denotes the Gram matrix of the original kernel. The last equation shows that when employing the empirical kernel map, it is not necessary to use a positive de nite kernel. The reason is that no matter what K is, the Gram matrix KK > is always positive de nite,6 which is suÆcient for an SVM. The remainder of the paper is structured as follows. In Section 2, we introduce the problem of large diagonals, followed by our proposed method to handle it (Section 3). Section 4 presents experiments, and Section 5 summarizes our conclusions. 5 6
disregarding the positivity constraints (8) Here, as in (3), we allow for a nonzero null space in our usage of the concept of positive de niteness.
514
2
Bernhard Sch¨olkopf et al.
Orthogonal Patterns in the Feature Space
An important feature of kernel methods is that the input domain X does not have to be a vector space. The inputs might just as well be discrete objects such as strings. Moreover, the map might compute rather complex features of the inputs. Examples thereof are polynomial kernels (Boser et al., 1992), where computes all products (of a given order) of entries of the inputs (in this case, the inputs are vectors), and string kernels (Watkins, 2000; Haussler, 1999; Lodhi et al., 2002), which, for instance, can compute the number of common substrings (not necessarily contiguous) of a certain length n 2 N of two strings x; x0 in O(njxjjx0 j) time. Here, we assume that x and x0 are two nite strings over a nite alphabet . For the string kernel of order n, a basis for the feature space consists of the set of all strings of length n, n . In this case, maps a string x into a vector whose entries indicate whether the respective string of length n occurs as a substring in x. By construction, these will be rather sparse vectors | a large number of possible substrings do not occur in a given string. Therefore, the dot product of two dierent vectors will take a value which is much smaller than the dot product of a vector with itself. This can also be understood as follows: any string shares all substrings with itself, but relatively few substrings with another string. Therefore, it will typically be the case that we are faced with large diagonals. By this we mean that, given some training inputs x1 ; : : : ; xm ; we have7
k(xi ; xi ) >> jk(xi ; xj )j for xi 6= xj ; i; j 2 f1; : : : ; mg:
(17)
In this case, the associated Gram matrix will have large diagonal elements.8 Let us next consider an innocuous application which is rather popular with SVMs: handwritten digit recognition. We suppose that the data are handwritten characters represented by images in [0; 1]N (here, N 2 N is the number of pixels), and that only a small fraction of the images is ink (i.e. few entries take the value 1). In that case, we typically have hx; xi > hx; x0 i for x 6= x0 , and thus the polynomial kernel (which is what most commonly is used for SVM handwritten digit recognition) k(x; x0 ) = hx; x0 id (18)
satis es k (x; x) >> jk (x; x0 )j already for moderately large d | it has large diagonals. Note that as in the case of the string kernel, one can also understand this phenomenon in terms of the sparsity of the vectors in the feature space. It is 7 8
The diagonal terms k(xi ; xi ) are necessarily nonnegative for positive de nite kernels, hence no modulus on the left hand side. In the machine learning literature, the problem is sometimes referred to as diagonal dominance. However, the latter term is used in linear algebra for matrices where the absolute value of each diagonal element is greater than the sum of the absolute values of the other elements in its row (or column). Real diagonally dominant matrices with positive diagonal elements are positive de nite.
A Kernel Approach for Learning from almost Orthogonal Patterns
515
known that the polynomial kernel of order d eectively maps the data into a feature space whose dimensions are spanned by all products of d pixels. Clearly, if some of the pixels take the value zero to begin with, then an even larger fraction of all possible products of d pixels (assuming d > 1) will be zero. Therefore, the sparsity of the vectors will increase with d. In practice, it has been observed that SVMs do not work well in this situation. Empirically, they work much better if the images are scaled such that the individual pixel values are in [ 1; 1], i.e., that the background value is 1. In this case, the data vectors are less sparse and thus further from being orthogonal. Indeed, large diagonals correspond to approximate orthogonality of any two dierent patterns mapped into the feature space. To see this, assume that x 6= x0 and note that due to k (x; x) >> jk (x; x0 )j, cos(\((x); (x0 ))) = =
0
(x); (x )i ph(x)h ; (x)i h(x0 ); (x0 )i
0
pk(x;k(xx;)kx(x) 0 ; x0 ) 0
In some cases, an SVM trained using a kernel with large diagonals will memX as data matrix and Y as label vector, respectively:
orize the data. Let us consider a simple toy example, using
01 0 0 9 0 0 0 0 0 01 0 +1 1 BB 1 0 0 0 0 8 0 0 0 0 CC BB +1 CC B C B +1 C 1000000900 X=B BB 0 0 9 0 0 0 0 0 0 0 CCC ; Y = BBB 1 CCC @0 0 0 0 0 0 8 0 0 0A @ 1A 0000000009
1
The Gram matrix for these data (using the linear kernel k (x; x0 ) = hx; x0 i) is
0 82 1 1 0 0 0 1 BB 1 65 1 0 0 0 CC B 1 1 82 0 0 0 CC : K=B BB 0 0 0 81 0 0 CC @ 0 0 0 0 64 0 A 0 0 0 0 0 81
A standard SVM nds the solution f (x) = sgn(hw; xi + b) with
w = (0:04; 0; 0:11; 0:11; 0; 0:12; 0:12; 0:11; 0; 0:11)>; b = 0:02: It can be seen from the coeÆcients of the weight vector w that this solution has but memorized the data: all the entries which are larger than 0:1 in absolute value correspond to dimensions which are nonzero only for one of the training points. We thus end up with a look-up table. A good solution for a linear classi er, on the other hand, would be to just choose the rst feature, e.g., f (x) = sgn(hw; xi + b), with w = (2; 0; 0; 0; 0; 0; 0; 0; 0; 0)>; b = 1.
516
3
Bernhard Sch¨olkopf et al.
Methods to Reduce Large Diagonals
The basic idea that we are proposing is very simple indeed. We would like to use a nonlinear transformation to reduce the size of the diagonal elements, or, more generally, to reduce the dynamic range of the Gram matrix entries. The only diÆculty is that if we simply do this, we have no guarantee that we end up with a Gram matrix that is still positive de nite. To ensure that it is, we can use methods of functional calculus for matrices. In the experiments we will mainly use a simple special case of the below. Nevertheless, let us introduce the general case, since we think it provides a useful perspective on kernel methods, and on the transformations that can be done on Gram matrices. Let K be a symmetric m m matrix with eigenvalues in [min ; max ], and f a continuous function on [min ; max ]. Functional calculus provides a unique symmetric matrix, denoted by f (K ), with eigenvalues in [f (min ); f (max )]. It can be computed via a Taylor series expansion in K , or using the eigenvalue decomposition of K : If K = S > DS (with D diagonal and S unitary), then f (K ) = S > f (D)S , where f (D) is the diagonal matrix with f (D)ii = f (Dii ). The convenient property of this procedure is that we can treat functions of symmetric matrices just like functions on R; in particular, we have, for 2 R, and real continuous functions f; g de ned on [min ; max ],9 (f + g )(K ) = f (K ) + g (K ) (fg )(K ) = f (K )g (K ) = g (K )f (K ) kfk1;(K ) = kf (K )k (f (K )) = f ((K )):
In technical terms, the C -algebra generated by K is isomorphic to the set of continuous functions on (K ). For our problems, functional calculus can be applied in the following way. We start o with a positive de nite matrix K with large diagonals. We then reduce its dynamic range by elementwise application of a nonlinear function, such as '(x) = log(x + 1) or '(x) = sgn(x) jxjp with 0 < p < 1. This will lead to a matrix which may no longer be positive de nite. However, it is still symmetric, and hence we can apply functional calculus. As a consequence of (f (K )) = f ((K )), we just need to apply a function f which maps to R+0 . This will ensure that all eigenvalues of f (K ) are nonnegative, hence f (K ) will be positive de nite. One can use these observations to design the following scheme. For positive de nite K ,
p
1. compute the positive de nite matrix A := K 2. reduce the dynamic range of the entries of A by applying an elementwise transformation ', leading to a symmetric matrix A' 3. compute the positive de nite matrix K 0 := (A' )2 and use it in subsequent processing. The entries of K 0 will be the \eective kernel," which in this case is no longer given in analytic form.
9
Below, (K ) denotes the spectrum of K .
A Kernel Approach for Learning from almost Orthogonal Patterns
517
Note that in this procedure, if ' is the identity, then we have K = K 0 . Experimentally, this scheme works rather well. However, it has one downside: since we no longer have the kernel function in analytic form, our only means of evaluating it is to include all test inputs (not the test labels, though) into the matrix K . In other words, K should be the Gram matrix computed from the observations x1 ; : : : ; xm+n where xm+1 ; : : : ; xm+n denote the test inputs. We thus need to know the test inputs already during training. This setting is sometimes referred to as transduction (Vapnik, 1998). If we skip the step of taking the square root of K , we can alleviate this problem. In that case, the only application of functional calculus left is a rather trivial one, that of computing the square of K . The m m submatrix of K 2 which in this case would have to be used for training then equals the Gram matrix when using the empirical kernel map
m n (x) = (k(x; x ); : : : ; k(x; xm +
1
+n
))> :
(19)
For the purposes of computing dot products, however, this can approximately be replaced by the empirical kernel map in terms of the training examples only, i.e., (12). The justi cation for this is that for large r 2 N , 1r hr (x); r (x0 )i R kby 00 0 00 00 to be the distribution of the X (x; x )k(x ; x ) dP (x );1 where P is assumed 1 0 inputs. Therefore, we have m hm (x); m (x0 )i m+ n hm+n (x); m+n (x )i. Altogether, the procedure then boils down to simply training an SVM using the empirical kernel map in terms of the training examples and the transformed kernel function '(k (x; x0 )). This is what we will use in the experiments below.10
4
Experiments
4.1 Arti cial Data We rst constructed a set of arti cial experiments which produce kernels exhibiting large diagonals. The experiments are as follows: a string classi cation problem, a microarray cancer detection problem supplemented with extra noisy features and a toy problem whose labels depend upon hidden variables; the visible variables are nonlinear combinations of those hidden variables.
String Classi cation We considered the following classi cation problem. Two
classes of strings are generated with equal probability by two dierent Markov models. Both classes of strings consist of letters from the same alphabet of a = 20 letters, and strings from both classes are always of length n = 20. Strings from the negative class are generated by a model where transitions from any letter to any other letter are equally likely. Strings from the positive class are generated by a model where transitions from one letter to itself (so the next letter is the same as the last) have probability 0:43, and all other transitions have probability 0:03. For both classes the starting letter of any string is equally likely to be any 10
For further experimental details, cf. Weston and Scholkopf (2001).
518
Bernhard Sch¨olkopf et al.
letter of the alphabet. The task then is to predict which class a given string belongs to. To map these strings into a feature space, we used the string kernel described above, computing a dot product product in a feature space consisting of all subsequences of length l. In the present application, the subsequences are weighted by an exponentially decaying factor of their full length in the text, hence emphasizing those occurrences which are close to contiguous. A method of computing this kernel eÆciently using a dynamic programming technique is described by Lodhi et al. (2002). For our problem we chose the parameters l = 3 and = 14 . We generated 50 such strings and used the string subsequence kernel with = 0:25.11 We split the data into 25 for training and 25 for testing in 20 separate trials. We measured the success of a method by calculating the mean classi cation loss on the test sets. Figure 1 shows four strings from the dataset and the computed kernel matrix for these strings12 . Note that the diagonal entries are much larger than the o-diagonals because a long string has a large number of subsequences that are shared with no other strings in the dataset apart from itself. However, information relevant to the classi cation of the strings is contained in the matrix. This can be seen by computing the mean kernel value between two examples of the positive class which is equal to 0:0003 0:0011, whereas the mean kernel value between two examples of opposite classes is 0:00002 0:00007. Although the numbers are very small, this captures that the positive class have more in common with each other than with random strings (they are more likely to have repeated letters). string class qqbqqnshrtktfhhaahhh +ve abajahnaajjjjiiiittt +ve sdolncqni mmpcrioog -ve reaqhcoigealgqjdsdgs -ve Fig. 1.
0 0:6183 0:0133 0:0000 0:0000 1 B 0:0133 1:0000 0:0000 0:0000 C K=B @ 0:0000 0:0000 0:4692 0:0002 C A 0:0000 0:0000 0:0002 0:4292
Four strings and their kernel matrix using the string subsequence kernel with
= 0:25. Note that the diagonal entries are much larger than the o-diagonals because a long string has a large number of subsequences that are shared with no other strings in the dataset apart from itself.
If the original kernel is denoted as a dot product k (x; y ) = h(x); (y )i, then we employ the kernel k (x; y ) = h(x); (y )ip where 0 < p < 1 to solve the diagonal dominance problem. We will refer to this kernel as a subpolynomial one. As this kernel may no longer be positive de nite we use the method described in 11
12
We note that introducing nonlinearities using an RBF kernel with respect to the distances generated by the subsequence kernel can improve results on this problem, but we limit our experiments to ones performed in the linear space of features generated by the subsequence kernel. Note, the matrix was rescaled by dividing by the largest entry.
Table 1. Results of using the string subsequence kernel on a string classification problem (top row). The remaining rows show the results of using the subpolynomial kernel to deal with the large diagonal.

  kernel method                               classification loss
  original k, k(x, y) = ⟨Φ(x), Φ(y)⟩           0.36 ± 0.13
  k_emp(x, y) = ⟨Φ(x), Φ(y)⟩^p   p = 1         0.30 ± 0.08
                                 p = 0.9       0.25 ± 0.09
                                 p = 0.8       0.20 ± 0.10
                                 p = 0.7       0.15 ± 0.09
                                 p = 0.6       0.13 ± 0.07
                                 p = 0.5       0.14 ± 0.06
                                 p = 0.4       0.15 ± 0.07
                                 p = 0.3       0.15 ± 0.06
                                 p = 0.2       0.17 ± 0.07
                                 p = 0.1       0.21 ± 0.09
Section 1, employing the empirical kernel map to embed our distance measure into a feature space. Results of using our method to solve the problem of large diagonals are given in Table 1. The method provides, with the optimum choice of the free parameter, a reduction from a loss of 0.36 ± 0.13 with the original kernel to 0.13 ± 0.07 with p = 0.6. Although we do not provide methods for choosing this free parameter, it is straightforward to apply conventional techniques of model selection (such as cross validation) to achieve this goal. We also performed some further experiments which we will briefly discuss. To check that the result is a feature of kernel algorithms, and not something peculiar to SVMs, we also applied the same kernels to another algorithm, kernel 1-nearest neighbor. Using the original kernel matrix yields a loss of 0.43 ± 0.06 whereas the subpolynomial method again improves the results; using p = 0.6 yields 0.22 ± 0.08 and p = 0.3 (the optimum choice) yields 0.17 ± 0.07. Finally, we tried some alternative proposals for reducing the large diagonal effect. We tried using Kernel PCA to extract features as a pre-processing to training an SVM. The intuition behind using this is that features contributing to the large diagonal effect may have low variance and would thus be removed by KPCA. KPCA did improve performance a little, but did not provide results as good as the subpolynomial method. The best result was found by extracting 15 features (from the kernel matrix of 50 examples) yielding a loss of 0.23 ± 0.07.
Microarray Data With Added Noise. We next considered the microarray classification problem of Alon et al. (1999) (see also Guyon et al. (2001) for a treatment of this problem with SVMs). In this problem one must distinguish between cancerous and normal tissue in a colon cancer problem given the expression of genes measured by microarray technology. In this problem one does not encounter large diagonals, however we augmented the original dataset with extra noisy features to simulate such a problem. The original data has 62
examples (22 positive, 40 negative) and 2000 features (gene expression levels of the tissue samples). We added a further 10,000 features to the dataset, such that for each example a randomly chosen 100 of these features are chosen to be nonzero (taking a random value between 0 and 1) and the rest are equal to zero. This creates a kernel matrix with large diagonals. In Figure 2 we show the first 4 × 4 entries of the kernel matrix of a linear kernel before and after adding the noisy features. The problem is again an artificial one demonstrating the problem of large diagonals, however this time the feature space is rather more explicit than the implicit one induced by string kernels. In this problem we can clearly see the large diagonal problem is really a special kind of feature selection problem. As such, feature selection algorithms should be able to help improve generalization ability; unfortunately most feature selection algorithms work on explicit features rather than implicit ones induced by kernels. Performance of methods was measured using 10-fold cross validation, which was repeated 10 times. Due to the unbalanced nature of the number of positive and negative examples in this data set we measured the error rates using a balanced loss function with the property that chance level is a loss of 0.5, regardless of the ratio of positive to negative examples. On this problem (with the added noise) an SVM using the original kernel does not perform better than chance. The results of using the original kernel and the subpolynomial method are given in Table 2. The subpolynomial kernel leads to a large improvement over using the original kernel. Its performance is close to that of an SVM on the original data without the added noise, which in this case is 0.18 ± 0.15.
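The noisy-feature augmentation is straightforward to reproduce. A minimal sketch (our own illustration; X is assumed to be the 62 × 2000 expression matrix loaded elsewhere):

    import numpy as np

    def add_sparse_noise(X, n_extra=10_000, n_active=100, rng=None):
        """Append n_extra noise features; each example gets n_active random nonzeros in them."""
        rng = np.random.default_rng(rng)
        n = X.shape[0]
        noise = np.zeros((n, n_extra))
        for i in range(n):
            cols = rng.choice(n_extra, size=n_active, replace=False)
            noise[i, cols] = rng.uniform(0.0, 1.0, size=n_active)
        return np.hstack([X, noise])

    # X_aug = add_sparse_noise(X)  # linear kernel on X_aug yields a Gram matrix with a large diagonal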
Hidden Variable Problem. We then constructed an artificial problem where the labels can be predicted by a linear rule based upon some hidden variables. However, the visible variables are a nonlinear combination of the hidden variables combined with noise. The purpose is to show that the subpolynomial kernel is not only useful in the case of matrices with large diagonals: it can also improve results in the case where a linear rule already overfits. The data are generated as follows. There are 10 hidden variables: each class y ∈ {±1} is generated by a 10-dimensional normal distribution N(μ, σ²) with variance σ² = 1 and mean μ = y(0.5, 0.5, …, 0.5). We then add 10 more (noisy) features for each example, each generated with N(0, 1). Let us denote the 20-dimensional vector obtained
  K = ( 1.00  0.41  0.33  0.42 )      K′ = ( 39.20   0.41   0.33   0.73 )
      ( 0.41  1.00  0.17  0.39 )           (  0.41  37.43   0.26   0.88 )
      ( 0.33  0.17  1.00  0.61 )           (  0.33   0.26  31.94   0.61 )
      ( 0.42  0.39  0.61  1.00 )           (  0.73   0.88   0.61  35.32 )

Fig. 2. The first 4 × 4 entries of the kernel matrix of a linear kernel on the colon cancer problem before (K) and after (K′) adding 10,000 sparse, noisy features. The added features are designed to create a kernel matrix with a large diagonal.
Table 2. Results of using a linear kernel on a colon cancer classification problem with added noise (top row). The remaining rows show the results of using the subpolynomial kernel to deal with the large diagonal.

  kernel method                                          balanced loss
  original k, k(x, y) = ⟨x, y⟩                            0.49 ± 0.05
  k_emp(x, y) = sgn(⟨x, y⟩) |⟨x, y⟩|^p     p = 0.95       0.35 ± 0.17
                                           p = 0.9        0.30 ± 0.17
                                           p = 0.8        0.25 ± 0.18
                                           p = 0.7        0.22 ± 0.17
                                           p = 0.6        0.23 ± 0.17
                                           p = 0.5        0.25 ± 0.19
                                           p = 0.4        0.28 ± 0.19
                                           p = 0.3        0.29 ± 0.18
                                           p = 0.2        0.30 ± 0.19
                                           p = 0.1        0.31 ± 0.18
this way for example i as h_i. The visible variables x_i are then constructed by taking all monomials of degree 1 to 4 of h_i. It is known that dot products between such vectors can be computed using polynomial kernels (Boser et al., 1992), thus the dot product between two visible variables is

  k(x_i, x_j) = (⟨h_i, h_j⟩ + 1)⁴.

We compared the subpolynomial method to a linear kernel using balanced 10-fold cross validation, repeated 10 times. The results are shown in Table 3. Again, the subpolynomial kernel gives improved results. One interpretation of these results is that if we know that the visible variables are polynomials of some hidden variables, then it makes sense to use a subpolynomial transformation to obtain a Gram matrix closer to the one we could compute if we were given the hidden variables. In effect, the subpolynomial kernel can (approximately) extract the hidden variables.
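A sketch of this data-generating process (our own illustration; the degree-4 polynomial kernel on the hidden variables stands in for explicitly enumerating all monomials):

    import numpy as np

    def make_hidden_variable_data(n, rng=None):
        """Hidden vectors h in R^20 (10 informative + 10 noise dims) and labels y."""
        rng = np.random.default_rng(rng)
        y = rng.choice([-1, 1], size=n)
        informative = y[:, None] * 0.5 + rng.normal(size=(n, 10))
        noise = rng.normal(size=(n, 10))
        return np.hstack([informative, noise]), y

    H, y = make_hidden_variable_data(100, rng=0)
    # Gram matrix of the visible variables (monomials of degree up to 4 of h):
    K = (H @ H.T + 1.0) ** 4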
4.2 Real Data
Thrombin Binding Problem. In the thrombin dataset the problem is to predict whether a given drug binds to a target site on thrombin, a key receptor in blood clotting. This dataset was used in the KDD (Knowledge Discovery and Data Mining) Cup 2001 competition and was provided by DuPont Pharmaceuticals. In the training set there are 1909 examples representing different possible molecules (drugs), 42 of which bind. Hence the data is rather unbalanced in this respect. Each example has a fixed length vector of 139,351 binary features (variables) in {0, 1} which describe three-dimensional properties of the molecule. An important characteristic of the data is that very few of the feature entries are nonzero (0.68% of the 1909 × 139,351 training matrix; see (Weston et al., 2002) for
Table 3. Results of using a linear kernel on the hidden variable problem (top row). The remaining rows show the results of using the subpolynomial kernel to deal with the large diagonal.

  kernel method                                          classification loss
  original k, k(x, y) = ⟨x, y⟩                            0.26 ± 0.12
  k_emp(x, y) = sgn(⟨x, y⟩) |⟨x, y⟩|^p     p = 1          0.25 ± 0.12
                                           p = 0.9        0.23 ± 0.13
                                           p = 0.8        0.19 ± 0.12
                                           p = 0.7        0.18 ± 0.12
                                           p = 0.6        0.16 ± 0.11
                                           p = 0.5        0.16 ± 0.11
                                           p = 0.4        0.16 ± 0.11
                                           p = 0.3        0.18 ± 0.11
                                           p = 0.2        0.20 ± 0.12
                                           p = 0.1        0.19 ± 0.13
further statistical analysis of the dataset). Thus, many of the features somewhat resemble the noisy features that we added on to the colon cancer dataset to create a large diagonal in Section 4.1. Indeed, constructing a kernel matrix of the training data using a linear kernel yields a matrix with a mean diagonal element of 1377.9 ± 2825 and a mean off-diagonal element of 78.7 ± 209. We compared the subpolynomial method to the original kernel using 8-fold balanced cross validation (ensuring an equal number of positive examples were in each fold). The results are given in Table 4. Once again the subpolynomial method provides improved generalization. It should be noted that feature selection and transduction methods have also been shown to improve results, above that of a linear kernel on this problem (Weston et al., 2002).

Table 4. Results of using a linear kernel on the thrombin binding problem (top row). The remaining rows show the results of using the subpolynomial kernel to deal with the large diagonal.
  kernel method                            balanced loss
  original k, k(x, y) = ⟨x, y⟩              0.30 ± 0.12
  k_emp(x, y) = ⟨x, y⟩^p      p = 0.9       0.24 ± 0.10
                              p = 0.8       0.24 ± 0.10
                              p = 0.7       0.18 ± 0.09
                              p = 0.6       0.18 ± 0.09
                              p = 0.5       0.15 ± 0.09
                              p = 0.4       0.17 ± 0.10
                              p = 0.3       0.17 ± 0.10
                              p = 0.2       0.18 ± 0.10
                              p = 0.1       0.22 ± 0.15
Table 5. Results of using a linear kernel on the Lymphoma classification problem (top row). The remaining rows show the results of using the subpolynomial kernel to deal with the large diagonal.

  kernel method                                          balanced loss
  original k, k(x, y) = ⟨x, y⟩                            0.043 ± 0.08
  k_emp(x, y) = sgn(⟨x, y⟩) |⟨x, y⟩|^p     p = 1          0.037 ± 0.07
                                           p = 0.9        0.021 ± 0.05
                                           p = 0.8        0.016 ± 0.05
                                           p = 0.7        0.015 ± 0.05
                                           p = 0.6        0.022 ± 0.06
                                           p = 0.5        0.022 ± 0.06
                                           p = 0.4        0.042 ± 0.07
                                           p = 0.3        0.046 ± 0.08
                                           p = 0.2        0.083 ± 0.09
                                           p = 0.1        0.106 ± 0.09
Lymphoma Classification. We next looked at the problem of identifying large B-Cell Lymphoma by gene expression profiling (Alizadeh et al., 2000). In this problem the gene expression of 96 samples is measured with microarrays to give 4026 features. Sixty-one of the samples are in classes "DLCL", "FL" or "CLL" (malignant) and 35 are labelled "otherwise" (usually normal). Although the data does not induce a kernel matrix with a very large diagonal it is possible that the large number of features induce overfitting even in a linear kernel. To examine if our method would still help in this situation we applied the same techniques as before, this time using balanced 10-fold cross validation, repeated 10 times, and measuring error rates using the balanced loss. The results are given in Table 5. The improvement given by the subpolynomial kernel suggests that overfitting in linear kernels when the number of features is large may be overcome by applying special feature maps. It should be noted that (explicit) feature selection methods have also been shown to improve results on this problem, see e.g. Weston et al. (2001).
Protein Family Classification. We then focussed on the problem of classifying protein domains into superfamilies in the Structural Classification of Proteins (SCOP) database version 1.53 (Murzin et al., 1995). We followed the same problem setting as Liao and Noble (2002): sequences were selected using the Astral database (astral.stanford.edu), removing similar sequences using an E-value threshold of 10⁻²⁵. This procedure resulted in 4352 distinct sequences, grouped into families and superfamilies. For each family, the protein domains within the family are considered positive test examples, and the protein domains outside the family but within the same superfamily are taken as positive training examples. The data set yields 54 families containing at least 10 family members (positive training examples). Negative examples are taken from outside of the positive sequence's fold, and are randomly split into train and test sets in
the same ratio as the positive examples. Details about the various families are listed in (Liao and Noble, 2002), and the complete data set is available at www.cs.columbia.edu/compbio/svm-pairwise. The experiments are characterized by small positive (training and test) sets and large negative sets. Note that this experimental setup is similar to that used by Jaakkola et al. (2000), except the positive training sets do not include additional protein sequences extracted from a large, unlabeled database, which amounts to a kind of "transduction" (Vapnik, 1998) algorithm. [footnote 13] An SVM requires fixed length vectors. Proteins, of course, are variable-length sequences of amino acids and hence cannot be directly used in an SVM. To solve this task we used a sequence kernel, called the spectrum kernel, which maps strings into a space of features which correspond to every possible k-mer (sequence of k letters) with at most m mismatches, weighted by prior probabilities (Leslie et al., 2002). In this experiment we chose k = 3 and m = 0. This kernel is then normalized so that each vector has length 1 in the feature space; i.e.,

  k(x, x′) = ⟨x, x′⟩ / √(⟨x, x⟩ ⟨x′, x′⟩).                              (20)

An asymmetric soft margin is implemented by adding to the diagonal of the kernel matrix a value of 0.02 times the fraction of training set sequences that have the same label as the current sequence (see Cortes and Vapnik (1995); Brown et al. (2000) for details). For comparison, the same SVM parameters are used to train an SVM using the Fisher kernel (Jaakkola and Haussler (1999); Jaakkola et al. (2000), see also Tsuda et al. (2002)), another possible kernel choice. The Fisher kernel is currently considered one of the most powerful homology detection methods. This method combines a generative, profile hidden Markov model (HMM) and uses it to generate a kernel for training an SVM. A protein's vector representation induced by the kernel is its gradient with respect to the profile hidden Markov model, the parameters of which are found by expectation-maximization. For each method, the output of the SVM is a discriminant score that is used to rank the members of the test set. Each of the above methods produces as output a ranking of the test set sequences. To measure the quality of this ranking, we use two different scores: receiver operating characteristic (ROC) scores and the median rate of false positives (RFP). The ROC score is the normalized area under a curve that plots true positives as a function of false positives for varying classification thresholds. A perfect classifier that puts all the positives at the top of the ranked list will receive an ROC score of 1, and for these data, a random classifier will receive an ROC score very close to 0. The median RFP score is the fraction of negative test sequences that score as high or better
Footnote 13: We believe that it is this transduction step which may be responsible for much of the success of using the methods described by Jaakkola et al. (2000). However, to make a fair comparison of kernel methods we do not include this step, which could potentially be included in any of the methods. Studying the importance of transduction remains a subject of further research.
Table 6. Results of using the spectrum kernel with k = 3, m = 0 on the SCOP dataset (top row). The remaining rows (apart from the last one) show the results of using the subpolynomial kernel to deal with the large diagonal. The last row, for comparison, shows the performance of an SVM using the Fisher kernel.

  kernel method                                    RFP      ROC
  original k, k(x, y) = ⟨Φ(x), Φ(y)⟩               0.1978   0.7516
  k_emp(x, y) = ⟨Φ(x), Φ(y)⟩^p      p = 0.5        0.1697   0.7967
                                    p = 0.4        0.1569   0.8072
                                    p = 0.3        0.1474   0.8183
                                    p = 0.2        0.1357   0.8251
                                    p = 0.1        0.1431   0.8213
                                    p = 0.05       0.1489   0.8156
  SVM-FISHER                                       0.2946   0.6762
than the median-scoring positive sequence. RFP scores were used by Jaakkola et al. in evaluating the Fisher-SVM method. The results of using the spectrum kernel, the subpolynomial kernel applied to the spectrum kernel and the Fisher kernel are given in Table 6. The mean ROC and RFP scores are superior for the subpolynomial kernel. We also show a family-by-family comparison of the subpolynomial spectrum kernel with the normal spectrum kernel and the Fisher kernel in Figure 3. The coordinates of each point in the plot are the ROC scores for one SCOP family. The subpolynomial kernel uses the parameter p = 0.2. Although the subpolynomial method does not improve performance on every single family over the other two methods, there are only a small number of cases where there is a loss in performance. Note that explicit feature selection cannot readily be used in this problem, unless it is possible to integrate the feature selection method into the construction of the spectrum kernel, as the features are never explicitly represented. Thus we do not know of another method that can provide the improvements described here. Note though that the improvements are not as large as reported in the other experiments (for example, the toy string kernel experiment of Section 4.1). We believe this is because this application does not suffer from the large diagonal problem as much as the other problems. Even without using the subpolynomial method, the spectrum kernel is already superior to the Fisher kernel method. Finally, note that while these results are rather good, they do not represent the record results on this dataset: in (Liao and Noble, 2002), a different kernel (Smith-Waterman pairwise scores) [footnote 14] is shown to provide further improvements (mean RFP: 0.09, mean ROC: 0.89). It is also possible to choose other parameters of the spectrum kernel to improve its results. Future work will continue to investigate these kernels.
Footnote 14: The Smith-Waterman score technique is closely related to the empirical kernel map, where the (non-positive definite) effective "kernel" is the Smith-Waterman algorithm plus p-value computation.
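A compact sketch of the normalized k-mer spectrum kernel used above (our own illustration with k = 3 and no mismatches; this is a direct feature-counting implementation, not the optimized version of Leslie et al. (2002)):

    import numpy as np
    from collections import Counter

    def spectrum_features(seq, k=3):
        """Counts of all length-k substrings (k-mers) of a sequence."""
        return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

    def spectrum_kernel(s, t, k=3):
        """Normalized spectrum kernel: cosine similarity of k-mer count vectors, cf. (20)."""
        fs, ft = spectrum_features(s, k), spectrum_features(t, k)
        dot = sum(c * ft.get(m, 0) for m, c in fs.items())
        norm = np.sqrt(sum(c * c for c in fs.values()) * sum(c * c for c in ft.values()))
        return dot / norm if norm > 0 else 0.0

    print(spectrum_kernel("MKVLAAGLLV", "MKVLSAGLLV"))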
Fig. 3. Family-by-family comparison of the subpolynomial spectrum kernel with the normal spectrum kernel (left) and the Fisher kernel (right). The coordinates of each point in the plot are the ROC scores for one SCOP family. The spectrum kernel uses k = 3 and m = 0, and the subpolynomial kernel uses p = 0.2. Points above the diagonal indicate problems where the subpolynomial kernel performs better than the other methods. (In both panels the vertical axis is the ROC score of the subpolynomial spectrum kernel and the horizontal axis is the ROC score of the spectrum kernel, respectively the Fisher kernel, with both axes ranging from 0.2 to 1.)
5 Conclusion
It is a difficult problem to construct useful similarity measures for non-vectorial data types. Not only do the similarity measures have to be positive definite to be useable in an SVM (or, more generally, conditionally positive definite, see e.g. Schölkopf and Smola (2002)), but, as we have explained in the present paper, they should also lead to Gram matrices whose diagonal values are not overly large. It can be difficult to satisfy both needs simultaneously, a prominent example being the much celebrated (but so far not too much used) string kernel. However, the problem is not limited to sophisticated kernels. It is common to all situations where the data are represented as sparse vectors and then processed using an algorithm which is based on dot products. We have provided a method to deal with this problem. The method's upside is that it turns kernels such as string kernels into kernels that work very well on real-world problems. Its main downside so far is that the precise role and the choice of the function we apply to reduce the dynamic range has yet to be understood.
Acknowledgements We would like to thank Olivier Chapelle and André Elisseeff for very helpful discussions. We moreover thank Chris Watkins for drawing our attention to the problem of large diagonals.
Bibliography
A. A. Alizadeh et al. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature, 403:503–511, 2000. Data available from http://llmpp.nih.gov/lymphoma.
U. Alon, N. Barkai, D. Notterman, K. Gish, S. Ybarra, D. Mack, and A. Levine. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon cancer tissues probed by oligonucleotide arrays. Cell Biology, 96:6745–6750, 1999.
C. Berg, J. P. R. Christensen, and P. Ressel. Harmonic Analysis on Semigroups. Springer-Verlag, New York, 1984.
B. E. Boser, I. M. Guyon, and V. Vapnik. A training algorithm for optimal margin classifiers. In D. Haussler, editor, Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory, pages 144–152, Pittsburgh, PA, July 1992. ACM Press.
M. P. S. Brown, W. N. Grundy, D. Lin, N. Cristianini, C. Sugnet, T. S. Furey, M. Ares, and D. Haussler. Knowledge-based analysis of microarray gene expression data using support vector machines. Proceedings of the National Academy of Sciences, 97(1):262–267, 2000.
C. Cortes and V. Vapnik. Support vector networks. Machine Learning, 20:273–297, 1995.
I. Guyon, J. Weston, S. Barnhill, and V. Vapnik. Gene selection for cancer classification using support vector machines. Machine Learning, 2001.
D. Haussler. Convolutional kernels on discrete structures. Technical Report UCSC-CRL-99-10, Computer Science Department, University of California at Santa Cruz, 1999.
T. S. Jaakkola, M. Diekhans, and D. Haussler. A discriminative framework for detecting remote protein homologies. Journal of Computational Biology, 7:95–114, 2000.
T. S. Jaakkola and D. Haussler. Exploiting generative models in discriminative classifiers. In M. S. Kearns, S. A. Solla, and D. A. Cohn, editors, Advances in Neural Information Processing Systems 11, Cambridge, MA, 1999. MIT Press.
C. Leslie, E. Eskin, and W. S. Noble. The spectrum kernel: A string kernel for SVM protein classification. Proceedings of the Pacific Symposium on Biocomputing, 2002. To appear.
L. Liao and W. S. Noble. Combining pairwise sequence similarity and support vector machines for remote protein homology detection. Proceedings of the Sixth International Conference on Computational Molecular Biology, 2002.
H. Lodhi, C. Saunders, J. Shawe-Taylor, N. Cristianini, and C. Watkins. Text classification using string kernels. Journal of Machine Learning Research, 2:419–444, 2002.
A. G. Murzin, S. E. Brenner, T. Hubbard, and C. Chothia. SCOP: A structural classification of proteins database for the investigation of sequences and structures. Journal of Molecular Biology, 247:536–540, 1995.
E. Osuna and F. Girosi. Reducing the run-time complexity in support vector machines. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods: Support Vector Learning, pages 271–284, Cambridge, MA, 1999. MIT Press.
B. Schölkopf and A. J. Smola. Learning with Kernels. MIT Press, Cambridge, MA, 2002.
K. Tsuda. Support vector classifier with asymmetric kernel function. In M. Verleysen, editor, Proceedings ESANN, pages 183–188, Brussels, 1999. D Facto.
K. Tsuda, M. Kawanabe, G. Rätsch, S. Sonnenburg, and K.-R. Müller. A new discriminative kernel from probabilistic models. In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems, volume 14. MIT Press, 2002. To appear.
V. Vapnik. Estimation of Dependences Based on Empirical Data [in Russian]. Nauka, Moscow, 1979. (English translation: Springer-Verlag, New York, 1982.)
V. Vapnik. Statistical Learning Theory. John Wiley and Sons, New York, 1998.
C. Watkins. Dynamic alignment kernels. In A. J. Smola, P. L. Bartlett, B. Schölkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers, pages 39–50, Cambridge, MA, 2000. MIT Press.
J. Weston, A. Elisseeff, and B. Schölkopf. Use of the ℓ0-norm with linear models and kernel methods. Biowulf Technical Report, 2001. http://www.conclu.de/jason/.
J. Weston, F. Pérez-Cruz, O. Bousquet, O. Chapelle, A. Elisseeff, and B. Schölkopf. Feature selection and transduction for prediction of molecular bioactivity for drug design, 2002. http://www.conclu.de/jason/kdd/kdd.html.
J. Weston and B. Schölkopf. Dealing with large diagonals in kernel matrices. In New Trends in Optimization and Computational Algorithms (NTOC 2001), Kyoto, Japan, 2001.
Learning with Mixture Models: Concepts and Applications
Padhraic Smyth
Information and Computer Science, University of California, Irvine, CA 92697-3425, USA
smyth@ics.uci.edu
Abstract. Probabilistic mixture models have been used in statistics for well over a century as flexible data models. More recently these techniques have been adopted by the machine learning and data mining communities in a variety of application settings. We begin this talk with a review of the basic concepts of finite mixture models: what can they represent? how can we learn them from data? and so on. We will then discuss how the traditional mixture model (defined in a fixed-dimensional vector space) can be usefully generalized to model non-vector data, such as sets of sequences and sets of curves. A number of real-world applications will be used to illustrate how these techniques can be applied to large-scale real-world data exploration and prediction problems, including clustering of visitors to a Web site based on their sequences of page requests, modeling of sparse high-dimensional market basket data for retail forecasting, and clustering of storm trajectories in atmospheric science.
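To make the "how can we learn them from data" question concrete, here is a minimal sketch of EM for a two-component, one-dimensional Gaussian mixture. It illustrates the standard algorithm only and is not taken from the talk; the initialisation, the number of iterations, and the synthetic data are arbitrary illustrative choices.

import numpy as np

def em_gaussian_mixture(x, n_iter=50):
    # Minimal EM for a two-component, one-dimensional Gaussian mixture.
    mu = np.array([x.min(), x.max()], dtype=float)   # crude initialisation
    sigma = np.array([x.std(), x.std()])
    weights = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: responsibility of each component for each data point.
        dens = np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
        resp = weights * dens
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate mixing weights, means, and standard deviations.
        n_k = resp.sum(axis=0)
        weights = n_k / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / n_k
        sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / n_k)
    return weights, mu, sigma

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 0.5, 200)])
print(em_gaussian_mixture(x))

The same alternation between computing responsibilities and re-estimating component parameters carries over, with suitable component densities, to the non-vector data mentioned in the abstract.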
Author Index
Amini, Massih-Reza  468
Banerjee, Bikramjit  1
Bay, Stephen D.  10
Blockeel, Hendrik  444
Buntine, Wray  23
Carreras, Xavier  35
Dai, Honghua  48
Dengel, Andreas  195
Derbeko, Philip  60
Dupont, Pierre  185
Džeroski, Sašo  444, 493
Eibl, Günther  72
El-Yaniv, Ran  60
Engel, Yaakov  84
Eskin, Eleazar  511
Ezequel, Philippe  431
Frank, Eibe  161
Fürnkranz, Johannes  97
Gallinari, Patrick  468
Gammerman, Alex  345, 381
Góra, Grzegorz  111
Halck, Ole Martin  124, 207
Hall, Mark  161
Harris, Harlan D.  135
Hoche, Susanne  148
Hofmann, Thomas  456
Holmes, Geoffrey  161
Hüllermeier, Eyke  173
Hust, Armin  195
Junker, Markus  195
Kaski, Samuel  418
Kermorvant, Christopher  185
Kirkby, Richard  161
Klink, Stefan  195
Kononenko, Igor  219
Kramer, Stefan  405
Kråkenes, Tony  207
Kukar, Matjaž  219
Kushmerick, Nicholas  232
Kwek, Stephen  245
Langley, Pat  10
Lefaucheur, Patrice  319
Leslie, Christina  511
Li, Gang  48
Ludl, Marcus-Christopher  258
Mannor, Shie  84, 295
Màrquez, Lluís  35
Margineantu, Dragos D.  270
Martin, Mario  282
Meir, Ron  60, 84
Menache, Ishai  295
Morik, Katharina  307
Nguyen, Chau  245
Nikkilä, Janne  418
Nock, Richard  319
Nouretdinov, Ilia  381
Oja, Erkki  505
Ontañón, Santiago  331
Papadopoulos, Harris  345
Peña Castillo, Lourdes  357
Peng, Jing  1
Pfahringer, Bernhard  161
Pfeiffer, Karl Peter  72
Plaza, Enric  331
Precup, Doina  391
Preux, Philippe  369
Proedrou, Kostas  345, 381
Punyakanok, Vasin  35
Raedt, Luc De  405
Ratitch, Bohdana  391
Roth, Dan  35, 506
Rückert, Ulrich  405
Rüping, Stefan  307
Schölkopf, Bernhard  511
Sebban, Marc  431
Shapiro, Daniel G.  10
Shimkin, Nahum  295
Sinkkonen, Janne  418
Smyth, Padhraic  529
Stafford Noble, William  511
Thollard, Franck  431
Todorovski, Ljupčo  444
Tsochantaridis, Ioannis  456
Tu, Yiqing  48
Vittaut, Jean-Noël  468
Vovk, Volodya  345, 381
Weston, Jason  511
Widmer, Gerhard  258
Wojna, Arkadiusz  111
Wrobel, Stefan  148, 357
Yeang, Chen-Hsiang  480
Ženko, Bernard  493