Lecture Notes in Artificial Intelligence Edited by R. Goebel, J. Siekmann, and W. Wahlster
Subseries of Lecture Notes in Computer Science
5809
Ricard Gavaldà Gábor Lugosi Thomas Zeugmann Sandra Zilles (Eds.)
Algorithmic Learning Theory 20th International Conference, ALT 2009 Porto, Portugal, October 3-5, 2009 Proceedings
Series Editors

Randy Goebel, University of Alberta, Edmonton, Canada
Jörg Siekmann, University of Saarland, Saarbrücken, Germany
Wolfgang Wahlster, DFKI and University of Saarland, Saarbrücken, Germany

Volume Editors

Ricard Gavaldà
Universitat Politècnica de Catalunya
LARCA Research Group, Departament de Llenguatges i Sistemes Informàtics
Jordi Girona Salgado 1-3, 08034 Barcelona, Spain
E-mail: [email protected]

Gábor Lugosi
Pompeu Fabra Universitat, ICREA and Department of Economics
Ramon Trias Fargas 25-27, 08005 Barcelona, Spain
E-mail: [email protected]

Thomas Zeugmann
Hokkaido University, Division of Computer Science
N-14, W-9, Sapporo 060-0814, Japan
E-mail: [email protected]

Sandra Zilles
University of Regina, Department of Computer Science
Regina, Saskatchewan, Canada S4S 0A2
E-mail: [email protected]

Library of Congress Control Number: 2009934440

CR Subject Classification (1998): I.2, I.2.6, K.3.1, F.2, G.2, I.2.2, I.5.3

LNCS Sublibrary: SL 7 – Artificial Intelligence
ISSN 0302-9743
ISBN-10 3-642-04413-1 Springer Berlin Heidelberg New York
ISBN-13 978-3-642-04413-7 Springer Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law.

springer.com

© Springer-Verlag Berlin Heidelberg 2009
Printed in Germany

Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India
Printed on acid-free paper
SPIN: 12760312 06/3180 543210
Preface
This volume contains the papers presented at the 20th International Conference on Algorithmic Learning Theory (ALT 2009), which was held in Porto, Portugal, October 3–5, 2009. The conference was co-located with the 12th International Conference on Discovery Science (DS 2009). The technical program of ALT 2009 contained 26 papers selected from 60 submissions, and 5 invited talks. The invited talks were presented during the joint sessions of both conferences.

ALT 2009 was the 20th in the ALT conference series, established in Japan in 1990. The series Analogical and Inductive Inference is a predecessor of this series: it was held in 1986, 1989 and 1992, co-located with ALT in 1994, and subsequently merged with ALT. ALT maintains its strong connections to Japan, but has also been held in other countries, such as Australia, Germany, Hungary, Italy, Singapore, Spain, and the USA. The ALT series is supervised by its Steering Committee: Naoki Abe (IBM Thomas J. Watson Research Center, Yorktown, USA), Shai Ben-David (University of Waterloo, Canada), Phil Long (Google, Mountain View, USA), Gábor Lugosi (Pompeu Fabra University, Barcelona, Spain), Akira Maruoka (Ishinomaki Senshu University, Japan), Takeshi Shinohara (Kyushu Institute of Technology, Iizuka, Japan), Frank Stephan (National University of Singapore, Republic of Singapore), Einoshin Suzuki (Kyushu University, Fukuoka, Japan), Eiji Takimoto (Kyushu University, Fukuoka, Japan), György Turán (University of Illinois at Chicago, USA, and University of Szeged, Hungary), Osamu Watanabe (Tokyo Institute of Technology, Japan), Thomas Zeugmann (Chair, Hokkaido University, Japan), and Sandra Zilles (Publicity Chair, University of Regina, Canada). The ALT web pages have been set up (together with Frank Balbach and Jan Poland) and are maintained by Thomas Zeugmann.

The present volume contains the texts of the 26 papers presented at ALT 2009, divided into groups of papers on online learning, learning graphs, active learning and query learning, statistical learning, inductive inference, and semi-supervised and unsupervised learning. The volume also contains abstracts of the invited talks:

– Sanjoy Dasgupta (University of California, San Diego, USA): The Two Faces of Active Learning
– Hector Geffner (Universitat Pompeu Fabra, Barcelona, Spain): Inference and Learning in Planning
– Jiawei Han (University of Illinois at Urbana-Champaign, USA): Mining Heterogeneous Information Networks by Exploring the Power of Links
– Yishay Mansour (Tel Aviv University, Israel): Learning and Domain Adaptation
– Fernando C.N. Pereira (Google, Mountain View, USA): Learning on the Web

Papers presented at DS 2009 are contained in the DS 2009 proceedings.
The E. Mark Gold Award has been presented annually at the ALT conferences since 1999, for the most outstanding student contribution. This year, the award was given to Hanna Mazzawi for the paper Reconstructing Weighted Graphs with Minimal Query Complexity, co-authored by Nader Bshouty.

We would like to thank the many people and institutions who contributed to the success of the conference. Thanks to the authors of the papers for their submissions, and to the invited speakers for presenting exciting overviews of important recent research developments. We are very grateful to the sponsors of the conference for their generous financial support: University of Porto; Artificial Intelligence and Decision Support Laboratory; Center for Research in Advanced Computing Systems; Portuguese Science and Technology Foundation; Portuguese Artificial Intelligence Association; SAS; Alberta Ingenuity Centre for Machine Learning; and the Division of Computer Science, Hokkaido University.

We are grateful to the members of the Program Committee for ALT 2009. Their hard work in reviewing and discussing the papers made sure that we had an interesting and strong program. We also thank the subreferees assisting the Program Committee. Special thanks go to the local arrangements chair João Gama (University of Porto).

We would like to thank the Discovery Science conference for its ongoing collaboration with ALT, which makes it possible to provide a well-rounded picture of the current theoretical and practical advances in machine learning and the related areas. In particular, we are grateful to the conference chair João Gama (University of Porto) and Program Committee chairs Vítor Santos Costa (University of Porto) and Alípio Jorge (University of Porto) for their cooperation.

Last but not least, we thank Springer for their support in preparing and publishing this volume of the Lecture Notes in Artificial Intelligence series.

August 2009
Ricard Gavaldà Gábor Lugosi Thomas Zeugmann Sandra Zilles
Organization
Conference Chair Ricard Gavaldà
Universitat Politècnica de Catalunya, Barcelona, Spain
Program Committee

Peter Auer (University of Leoben, Austria)
José L. Balcázar (Universitat Politècnica de Catalunya, Barcelona, Spain)
Shai Ben-David (University of Waterloo, Canada)
Avrim Blum (Carnegie Mellon University, Pittsburgh, USA)
Nader Bshouty (Technion, Haifa, Israel)
Claudio Gentile (Università degli Studi dell'Insubria, Varese, Italy)
Peter Grünwald (Centrum voor Wiskunde en Informatica (CWI), Amsterdam, The Netherlands)
Roni Khardon (Tufts University, Medford, USA)
Phil Long (Google, Mountain View, USA)
Gábor Lugosi (ICREA and Pompeu Fabra University, Barcelona, Spain), Chair
Massimiliano Pontil (University College London, UK)
Alexander Rakhlin (UC Berkeley, USA)
Shai Shalev-Shwartz (Toyota Technological Institute at Chicago, USA)
Hans Ulrich Simon (Ruhr-Universität Bochum, Germany)
Frank Stephan (National University of Singapore, Singapore)
Csaba Szepesvári (University of Alberta, Edmonton, Canada)
Eiji Takimoto (Kyushu University, Fukuoka, Japan)
Sandra Zilles (University of Regina, Canada), Chair
Local Arrangements

João Gama (University of Porto, Portugal)
Subreferees

Jacob Abernethy, Andreas Argyriou, Marta Arias, John Case, Nicolò Cesa-Bianchi, Jiang Chen, Alexander Clark, Sanjoy Dasgupta, Tom Diethe, Ran El-Yaniv, Tim van Erven, Steve Hanneke, Kohei Hatano, Tamir Hazan, Colin de la Higuera, Jeffrey Jackson, Sanjay Jain, Sham Kakade, Jyrki Kivinen, Wouter Koolen, Timo Kötzing, Lucy Kuncheva, Steffen Lange, Alex Leung, Guy Lever, Tyler Lu, Eric Martin, Mario Martin, Samuel Moelius III, Rémi Munos, Francesco Orabona, Ronald Ortner, Dávid Pál, Joel Ratsaby, Nicola Rebagliati, Lev Reyzin, Sivan Sabato, Ohad Shamir, Robert Sloan, Jun'ichi Takeuchi, Christino Tamon, György Turán, Vladimir Vovk, Yiming Ying, Thomas Zeugmann
Sponsoring Institutions

University of Porto
Artificial Intelligence and Decision Support Laboratory
Center for Research in Advanced Computing Systems
Portuguese Science and Technology Foundation
Portuguese Artificial Intelligence Association
SAS
Alberta Ingenuity Centre for Machine Learning
Division of Computer Science, Hokkaido University
Table of Contents
Invited Papers

The Two Faces of Active Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Sanjoy Dasgupta
1
Inference and Learning in Planning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Hector Geffner
2
Mining Heterogeneous Information Networks by Exploring the Power of Links . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jiawei Han
3
Learning and Domain Adaptation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yishay Mansour
4
Learning on the Web . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Fernando C.N. Pereira
7
Regular Contributions

Online Learning

Prediction with Expert Evaluators' Advice . . . . . . . . . . . . . . . . . . . . . . . . . . Alexey Chernov and Vladimir Vovk
8
Pure Exploration in Multi-armed Bandits Problems . . . . . . . . . . . . . . . . . . Sébastien Bubeck, Rémi Munos, and Gilles Stoltz
23
The Follow Perturbed Leader Algorithm Protected from Unbounded One-Step Losses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Vladimir V. V’yugin
38
Computable Bayesian Compression for Uniformly Discretizable Statistical Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Łukasz Dębowski
53
Calibration and Internal No-Regret with Random Signals . . . . . . . . . . . . . Vianney Perchet
68
St. Petersburg Portfolio Games . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . László Györfi and Péter Kevei
83
Learning Graphs

Reconstructing Weighted Graphs with Minimal Query Complexity . . . . . Nader H. Bshouty and Hanna Mazzawi
97
Learning Unknown Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Nicolò Cesa-Bianchi, Claudio Gentile, and Fabio Vitale
110
Completing Networks Using Observed Data . . . . . . . . . . . . . . . . . . . . . . . . . Tatsuya Akutsu, Takeyuki Tamura, and Katsuhisa Horimoto
126
Active Learning and Query Learning

Average-Case Active Learning with Costs . . . . . . . . . . . . . . . . . . . . . . . . . . . Andrew Guillory and Jeff Bilmes
141
Canonical Horn Representations and Query Learning . . . . . . . . . . . . . . . . . Marta Arias and José L. Balcázar
156
Learning Finite Automata Using Label Queries . . . . . . . . . . . . . . . . . . . . . . Dana Angluin, Leonor Becerra-Bonache, Adrian Horia Dediu, and Lev Reyzin
171
Characterizing Statistical Query Learning: Simplified Notions and Proofs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Balázs Szörényi
186
An Algebraic Perspective on Boolean Function Learning . . . . . . . . . . . . . . Ricard Gavaldà and Denis Thérien
201
Statistical Learning

Adaptive Estimation of the Optimal ROC Curve and a Bipartite Ranking Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Stéphan Clémençon and Nicolas Vayatis
216
Complexity versus Agreement for Many Views: Co-regularization for Multi-view Semi-supervised Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Odalric-Ambrym Maillard and Nicolas Vayatis
232
Error-Correcting Tournaments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Alina Beygelzimer, John Langford, and Pradeep Ravikumar
247
Inductive Inference

Difficulties in Forcing Fairness of Polynomial Time Inductive Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . John Case and Timo Kötzing
263
Learning Mildly Context-Sensitive Languages with Multidimensional Substitutability from Positive Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ryo Yoshinaka
278
Uncountable Automatic Classes and Learning . . . . . . . . . . . . . . . . . . . . . . . Sanjay Jain, Qinglong Luo, Pavel Semukhin, and Frank Stephan
293
Iterative Learning from Texts and Counterexamples Using Additional Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Sanjay Jain and Efim Kinber
308
Incremental Learning with Ordinal Bounded Example Memory . . . . . . . . Lorenzo Carlucci
323
Learning from Streams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Sanjay Jain, Frank Stephan, and Nan Ye
338
Semi-supervised and Unsupervised Learning

Smart PAC-Learners . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Hans Ulrich Simon
353
Approximation Algorithms for Tensor Clustering . . . . . . . . . . . . . . . . . . . . . Stefanie Jegelka, Suvrit Sra, and Arindam Banerjee
368
Agnostic Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Maria Florina Balcan, Heiko Röglin, and Shang-Hua Teng
384
Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
399
The Two Faces of Active Learning Sanjoy Dasgupta University of California, San Diego
The active learning model is motivated by scenarios in which it is easy to amass vast quantities of unlabeled data (images and videos off the web, speech signals from microphone recordings, and so on) but costly to obtain their labels. Like supervised learning, the goal is ultimately to learn a classifier. But like unsupervised learning, the data come unlabeled. More precisely, the labels are hidden, and each of them can be revealed only at a cost. The idea is to query the labels of just a few points that are especially informative about the decision boundary, and thereby to obtain an accurate classifier at significantly lower cost than regular supervised learning. There are two distinct narratives for explaining when active learning is helpful. The first has to do with efficient search through the hypothesis space: perhaps one can always explicitly select query points whose labels will significantly shrink the set of plausible classifiers (those roughly consistent with the labels seen so far)? The second argument for active learning has to do with exploiting cluster structure in data. Suppose, for instance, that the unlabeled points form five nice clusters; with luck, these clusters will be pure and only five labels will be necessary! Both these scenarios are hopelessly optimistic. But I will show that they each motivate realistic models that can effectively be exploited by active learning algorithms. These algorithms have provable label complexity bounds that are in some cases exponentially lower than for supervised learning. I will also present experiments with these algorithms, to illustrate their behavior and get a sense of the gulf that still exists between the theory and practice of active learning. This is joint work with Alina Beygelzimer, Daniel Hsu, John Langford, and Claire Monteleoni.
Inference and Learning in Planning Hector Geffner ICREA & Universitat Pompeu Fabra C/Roc Boronat 138, E-08018 Barcelona, Spain
[email protected] http://www.tecn.upf.es/~hgeffner
Abstract. Planning is concerned with the development of solvers for a wide range of models where actions must be selected for achieving goals. In these models, actions may be deterministic or not, and full or partial sensing may be available. In the last few years, significant progress has been made, resulting in algorithms that can produce plans effectively in a variety of settings. These developments have to do with the formulation and use of general inference techniques and transformations. In this invited talk, I’ll review the inference techniques used for solving individual planning instances from scratch, and discuss the use of learning methods and transformations for obtaining more general solutions.
Mining Heterogeneous Information Networks by Exploring the Power of Links Jiawei Han Department of Computer Science University of Illinois at Urbana-Champaign
[email protected]
Abstract. Knowledge is power but for interrelated data, knowledge is often hidden in massive links in heterogeneous information networks. We explore the power of links at mining heterogeneous information networks with several interesting tasks, including link-based object distinction, veracity analysis, multidimensional online analytical processing of heterogeneous information networks, and rank-based clustering. Some recent results of our research that explore the crucial information hidden in links will be introduced, including (1) Distinct for object distinction analysis, (2) TruthFinder for veracity analysis, (3) Infonet-OLAP for online analytical processing of information networks, and (4) RankClus for integrated ranking-based clustering. We also discuss some of our on-going studies in this direction.
Learning and Domain Adaptation Yishay Mansour Blavatnik School of Computer Science, Tel Aviv University Tel Aviv, Israel
[email protected]
Abstract. Domain adaptation is a fundamental learning problem where one wishes to use labeled data from one or several source domains to learn a hypothesis performing well on a different, yet related, domain for which no labeled data is available. This generalization across domains is a very significant challenge for many machine learning applications and arises in a variety of natural settings, including NLP tasks (document classification, sentiment analysis, etc.), speech recognition (speakers and noise or environment adaptation) and face recognition (different lighting conditions, different population composition). The learning theory community has only recently started to analyze domain adaptation problems. In the talk, I will overview some recent theoretical models and results regarding domain adaptation. This talk is based on joint works with Mehryar Mohri and Afshin Rostamizadeh.
1 Introduction
It is almost standard in machine learning to assume that the training and test instances are drawn from the same distribution. This assumption is explicit in the standard PAC model [19] and other theoretical models of learning, and it is a natural assumption since when the training and test distributions substantially differ there can be no hope for generalization. However, in practice, there are several crucial scenarios where the two distributions are similar but not identical, and therefore effective learning is potentially possible. This is the motivation for domain adaptation. The problem of domain adaptation arises in a variety of applications in natural language processing [6,3,9,4,5], speech processing [11,7,16,18,8,17], computer vision [15], and many other areas. Quite often, little or no labeled data is available from the target domain, but labeled data from a source domain somewhat similar to the target as well as large amounts of unlabeled data from the target domain are at one's disposal. The domain adaptation problem then consists of leveraging the source labeled and target unlabeled data to derive a hypothesis performing well on the target domain. The first theoretical analysis of the domain adaptation problem was presented by [1], who gave VC-dimension-based generalization bounds for adaptation in
classification tasks. Perhaps the most significant contribution of that work was the definition and application of a distance between distributions, the dA distance, that is particularly relevant to the problem of domain adaptation and which can be estimated from finite samples for a finite VC dimension, as previously shown by [10]. This work was later extended by [2], who also gave a bound on the error rate of a hypothesis derived from a weighted combination of the source data sets for the specific case of empirical risk minimization. More refined generalization bounds which apply to more general tasks, including regression and general loss functions, appear in [12]. From an algorithmic perspective, it is natural to re-weight the empirical distribution to better reflect the target distribution; efficient algorithms for this re-weighting task were given in [12].

A more complex variant of this problem arises in sentiment analysis and other text classification tasks where the learner receives information from several domain sources that he can combine to make predictions about a target domain. As an example, often appraisal information about a relatively small number of domains such as movies, books, restaurants, or music may be available, but little or none is accessible for more difficult domains such as travel. This is known as the multiple source adaptation problem. Instances of this problem can be found in a variety of other natural language and image processing tasks.

The problem of adaptation with multiple sources was introduced and analyzed in [13,14]. The problem is formalized as follows. For each source domain i ∈ [1, k], the learner receives the distribution of the input points Qi, as well as a hypothesis hi with loss at most ε on that source. The task consists of combining the k hypotheses hi, i ∈ [1, k], to derive a hypothesis h with a loss as small as possible with respect to the target distribution P. Unfortunately, a simple convex combination of the k source hypotheses hi can perform very poorly; for example, there are cases where any such convex combination would incur a classification error of a half, even when each source hypothesis hi makes no error on its domain Qi (see [13]). In contrast, distribution weighted combinations of the source hypotheses, which are combinations of source hypotheses weighted by the source distributions, perform very well. In [13] it was shown that, remarkably, for any fixed target function, there exists a distribution weighted combination of the source hypotheses whose loss is at most ε with respect to any mixture P of the k source distributions Qi. For the case that the target distribution P is arbitrary, generalization bounds, based on Rényi divergence between the sources and the target distributions, were derived in [14].
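To make this concrete, the combining rule of [13] weights each source hypothesis at a point x in proportion to ui Qi(x). The following sketch is ours, not from the talk; the densities Qi, hypotheses hi, and mixture weights u are all assumed given:

```python
import numpy as np

def distribution_weighted_combination(x, hypotheses, densities, u):
    """Combine source hypotheses h_i at point x, weighting h_i by
    u_i * Q_i(x): its mixture weight times its source density at x.

    hypotheses: functions h_i(x) -> prediction in [0, 1]
    densities:  functions Q_i(x) -> density of source i at x
    u:          mixture weights, non-negative, summing to 1
    """
    q = np.array([u_i * Q_i(x) for u_i, Q_i in zip(u, densities)])
    if q.sum() == 0:               # x outside the support of every source
        return float(np.mean([h(x) for h in hypotheses]))
    weights = q / q.sum()          # point-dependent weights
    return float(sum(w_i * h(x) for w_i, h in zip(weights, hypotheses)))
```

Unlike a fixed convex combination, the weight of hi here varies with x and vanishes wherever its own source places no mass, which is what defeats the half-error example above.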
References

1. Ben-David, S., Blitzer, J., Crammer, K., Pereira, F.: Analysis of representations for domain adaptation. In: Proceedings of NIPS 2006 (2006)
2. Blitzer, J., Crammer, K., Kulesza, A., Pereira, F., Wortman, J.: Learning bounds for domain adaptation. In: Proceedings of NIPS 2007 (2007)
3. Blitzer, J., Dredze, M., Pereira, F.: Biographies, Bollywood, Boom-boxes and Blenders: Domain Adaptation for Sentiment Classification. In: ACL 2007 (2007)
4. Chelba, C., Acero, A.: Adaptation of maximum entropy capitalizer: Little data can help a lot. Computer Speech & Language 20(4), 382–399 (2006)
5. Daumé III, H., Marcu, D.: Domain adaptation for statistical classifiers. Journal of Artificial Intelligence Research 26, 101–126 (2006)
6. Dredze, M., Blitzer, J., Talukdar, P.P., Ganchev, K., Graca, J., Pereira, F.: Frustratingly Hard Domain Adaptation for Parsing. In: CoNLL 2007 (2007)
7. Gauvain, J.-L., Lee, C.-H.: Maximum a posteriori estimation for multivariate gaussian mixture observations of markov chains. IEEE Transactions on Speech and Audio Processing 2(2), 291–298 (1994)
8. Jelinek, F.: Statistical Methods for Speech Recognition. MIT Press, Cambridge (1998)
9. Jiang, J., Zhai, C.X.: Instance Weighting for Domain Adaptation in NLP. In: Proceedings of ACL 2007 (2007)
10. Kifer, D., Ben-David, S., Gehrke, J.: Detecting change in data streams. In: Proceedings of the 30th International Conference on Very Large Data Bases (2004)
11. Legetter, C.J., Woodland, P.C.: Maximum likelihood linear regression for speaker adaptation of continuous density hidden markov models. Computer Speech and Language, 171–185 (1995)
12. Mansour, Y., Mohri, M., Rostamizadeh, A.: Domain adaptation: Learning bounds and algorithms. In: COLT (2009)
13. Mansour, Y., Mohri, M., Rostamizadeh, A.: Domain adaptation with multiple sources. In: Proceedings of NIPS 2008 (2008)
14. Mansour, Y., Mohri, M., Rostamizadeh, A.: Multiple source adaptation and the Rényi divergence. In: Uncertainty in Artificial Intelligence, UAI (2009)
15. Martínez, A.M.: Recognizing imprecisely localized, partially occluded, and expression variant faces from a single sample per class. IEEE Trans. Pattern Anal. Mach. Intell. 24(6), 748–763 (2002)
16. Della Pietra, S., Della Pietra, V., Mercer, R.L., Roukos, S.: Adaptive language modeling using minimum discriminant estimation. In: HLT 1991: Proceedings of the Workshop on Speech and Natural Language, pp. 103–106 (1992)
17. Roark, B., Bacchiani, M.: Supervised and unsupervised PCFG adaptation to novel domains. In: Proceedings of HLT-NAACL (2003)
18. Rosenfeld, R.: A Maximum Entropy Approach to Adaptive Statistical Language Modeling. Computer Speech and Language 10, 187–228 (1996)
19. Valiant, L.G.: A theory of the learnable. Communications of the ACM 27(11), 1134–1142 (1984)
Learning on the Web Fernando C.N. Pereira University of Pennsylvania, USA
It is commonplace to say that the Web has changed everything. Machine learning researchers often say that their projects and results respond to that change with better methods for finding and organizing Web information. However, not much of the theory, or even the current practice, of machine learning take the Web seriously. We continue to devote much effort to refining supervised learning, but the Web reality is that labeled data is hard to obtain, while unlabeled data is inexhaustible. We cling to the iid assumption, while all the Web data generation processes drift rapidly and involve many hidden correlations. Many of our theory and algorithms assume data representations of fixed dimension, while in fact the dimensionality of data, for example the number of distinct words in text, grows with data size. While there has been much work recently on learning with sparse representations, the actual patterns of sparsity on the Web are not paid much attention. Those patterns might be very relevant to the communication costs of distributed learning algorithms, which are necessary at Web scale, but little work has been done on this. Nevertheless, practical machine learning is thriving on the Web. Statistical machine translation has developed non-parametric algorithms that learn how to translate by mining the ever-growing volume of source documents and their translations that are created on the Web. Unsupervised learning methods infer useful latent semantic structure from the statistics of term co-occurrences in Web documents. Image search achieves improved ranking by learning from user responses to search results. In all those cases, Web scale demanded distributed algorithms. I will review some of those practical successes to try to convince you that they are not just engineering feats, but also rich sources of new fundamental questions that we should be investigating.
Prediction with Expert Evaluators’ Advice Alexey Chernov and Vladimir Vovk Computer Learning Research Centre, Department of Computer Science Royal Holloway, University of London, Egham, Surrey TW20 0EX, UK {chernov,vovk}@cs.rhul.ac.uk
Abstract. We introduce a new protocol for prediction with expert advice in which each expert evaluates the learner’s and his own performance using a loss function that may change over time and may be different from the loss functions used by the other experts. The learner’s goal is to perform better or not much worse than each expert, as evaluated by that expert, for all experts simultaneously. If the loss functions used by the experts are all proper scoring rules and all mixable, we show that the defensive forecasting algorithm enjoys the same performance guarantee as that attainable by the Aggregating Algorithm in the standard setting and known to be optimal. This result is also applied to the case of “specialist” experts. In this case, the defensive forecasting algorithm reduces to a simple modification of the Aggregating Algorithm.
1 Introduction
We consider the problem of online sequence prediction. A process generates outcomes ω1, ω2, . . . step by step. At each step t, a learner tries to guess this step's outcome announcing his prediction γt. Then the actual outcome ωt is revealed. The quality of the learner's prediction is measured by a loss function: the learner's loss at step t is λ(γt, ωt). Prediction with expert advice is a framework that does not make any assumptions about the generating process. The performance of the learner is compared to the performance of several other predictors called experts. At each step, each expert gives his prediction γtn, then the learner produces his own prediction γt (possibly based on the experts' predictions at the last step and the experts' predictions and outcomes at all the previous steps), and the accumulated losses are updated for the learner and for the experts. There are many algorithms for the learner in this framework; for a review, see [1].

In practical applications of the algorithms for prediction with expert advice, choosing the loss function is often difficult. There may be no natural quantitative measure of loss, just the vague concept that the closer the prediction to the outcome the better. In such cases one usually selects from among several common loss functions, such as the square loss function (reflecting the idea of least squares methods) or the log loss function (which has an information theory background). A similar issue arises when experts themselves are prediction algorithms that optimize some losses internally. Then it is unfair to these experts when the learner competes with them according to a "foreign" loss function.
This paper introduces a new version of the framework of prediction with expert advice where there is no single fixed loss function but some loss function is linked to each expert. The performance of the learner is compared to the performance of each expert according to the loss function linked to that expert. Informally speaking, each expert has to be convinced that the learner performs almost as well as, or better than, that expert himself. We prove that a known algorithm for the learner, the defensive forecasting algorithm [2], can be applied in the new setting and gives the same performance guarantee as that attainable in the standard setting, provided all loss functions are proper scoring rules. Another framework to which our methods can be fruitfully applied is that of “specialist experts”: see, e.g., [3]. We generalize some of the known results in the case of mixable loss functions. To keep presentation as simple as possible, we restrict ourselves to binary outcomes from {0, 1}, predictions from [0, 1], and a finite number of experts. We formulate our results for mixable loss functions only. However, these results can be easily transferred to more general settings (non-binary outcomes, arbitrary prediction spaces, countably many experts, second-guessing experts, etc.) where the methods of [2] work. For a fuller version of this paper, see [4].
2 Prediction with Simple Experts' Advice
In this preliminary section we recall the standard protocol of prediction with expert advice and some known results. Let {0, 1} be the set of possible outcomes ω, [0, 1] be the set of possible predictions γ, and λ : [0, 1] × {0, 1} → [0, ∞] be the loss function. The loss function λ and parameter N (the number of experts) specify the game of prediction with expert advice. The game is played by Learner, Reality, and N experts, Expert 1 to Expert N, according to the following protocol.

Prediction with expert advice
L0 := 0.
Ln0 := 0, n = 1, . . . , N.
FOR t = 1, 2, . . .:
  Expert n announces γtn ∈ [0, 1], n = 1, . . . , N.
  Learner announces γt ∈ [0, 1].
  Reality announces ωt ∈ {0, 1}.
  Lt := Lt−1 + λ(γt, ωt).
  Lnt := Lnt−1 + λ(γtn, ωt), n = 1, . . . , N.
END FOR

The goal of Learner is to keep his loss Lt smaller or at least not much greater than the loss Lnt of Expert n, at each step t and for all n = 1, . . . , N.
We only consider loss functions that have the following properties:

Assumption 1: λ(γ, 0) and λ(γ, 1) are continuous in γ ∈ [0, 1] for the standard (Aleksandrov's) topology on [0, ∞].
Assumption 2: There exists γ ∈ [0, 1] such that λ(γ, 0) and λ(γ, 1) are both finite.
Assumption 3: There exists no γ ∈ [0, 1] such that λ(γ, 0) and λ(γ, 1) are both infinite.

The superprediction set for a loss function λ is

Σλ := {(x, y) ∈ [0, ∞)2 | ∃γ: λ(γ, 0) ≤ x and λ(γ, 1) ≤ y} .   (1)
By Assumption 2, this set is non-empty. For each learning rate η > 0, let Eη : [0, ∞]2 → [0, 1]2 be the homeomorphism defined by Eη(x, y) := (e−ηx, e−ηy). The loss function λ is called η-mixable if the set Eη(Σλ) is convex. It is called mixable if it is η-mixable for some η > 0.

Theorem 1 (Vovk and Watkins). If a loss function λ is η-mixable, then there exists a strategy for Learner that guarantees that in the game of prediction with expert advice with N experts and the loss function λ it holds, for all T and for all n = 1, . . . , N, that

LT ≤ LnT + (1/η) ln N .   (2)
The bound is optimal: if λ is not η-mixable, then no strategy for Learner can guarantee (2). For the proof and other details, see [1], [5], [6], or [7, Theorem 8]; one of the algorithms guaranteeing (2) is the Aggregating Algorithm (AA). As shown in [2], one can take the defensive forecasting algorithm instead of the AA in the theorem.
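For concreteness, here is a minimal sketch of the AA (ours, not from the paper), specialized to the log loss function defined in Sect. 3 below with η = 1; for that loss the weighted average of the experts' predictions happens to be a valid substitution step, which is not the case for a general mixable loss:

```python
import numpy as np

def log_loss(gamma, omega):
    # log loss of predicting gamma in (0, 1) against outcome omega in {0, 1}
    return -np.log(gamma if omega == 1 else 1.0 - gamma)

def aggregating_algorithm(expert_preds, outcomes, eta=1.0):
    """Aggregating Algorithm for the log loss (eta = 1).

    expert_preds: (T, N) array of expert predictions in (0, 1)
    outcomes:     length-T array of outcomes in {0, 1}
    Returns Learner's predictions gamma_1, ..., gamma_T.
    """
    expert_preds = np.asarray(expert_preds, dtype=float)
    T, N = expert_preds.shape
    log_w = np.zeros(N)              # uniform initial weights, log domain
    learner = []
    for t in range(T):
        w = np.exp(log_w - log_w.max())
        w /= w.sum()
        # substitution step: for the log loss with eta = 1 the weighted
        # average of the experts' predictions attains the generalized
        # prediction exactly
        learner.append(float(w @ expert_preds[t]))
        # weight update: w_n is multiplied by exp(-eta * loss of Expert n)
        log_w -= eta * np.array([log_loss(g, outcomes[t])
                                 for g in expert_preds[t]])
    return np.array(learner)
```

On any data sequence, the cumulative log loss of these predictions exceeds no expert's cumulative log loss by more than ln N, which is (2) with η = 1.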
3 Proper Scoring Rules
A loss function λ is a proper scoring rule if for any π, π′ ∈ [0, 1] it holds that

πλ(π, 1) + (1 − π)λ(π, 0) ≤ πλ(π′, 1) + (1 − π)λ(π′, 0) .

The interpretation is that the prediction π is an estimate of the probability that ω = 1. The definition says that the expected loss with respect to a probability distribution is minimal if the prediction is the true probability of 1. Informally, a proper scoring rule encourages a forecaster (Learner or one of the experts) to announce his true subjective probability that the next outcome will be 1. (See [8] and [9] for detailed reviews.) Simple examples of proper scoring rules are provided by the two most common loss functions: the log loss function

λ(γ, ω) := − ln(ωγ + (1 − ω)(1 − γ))
(i.e., λ(γ, 0) = − ln(1 − γ) and λ(γ, 1) = − ln γ) and the square loss function λ(γ, ω) := (ω − γ)2. A trivial but important for us generalization of the log loss function is

λ(γ, ω) := −(1/η) ln(ωγ + (1 − ω)(1 − γ)) ,   (3)
where η is a positive constant. The generalized log loss function is also a proper scoring rule (in general, multiplying a proper scoring rule by a positive constant we again obtain a proper scoring rule). It is well known that the log loss function is 1-mixable and the square loss function is 2-mixable (see, e.g., [1], Section 3.6), and it is easy to check that the generalized log loss function (3) is η-mixable. We will often say “proper loss function” meaning a loss function that is a proper scoring rule. Our main interest will be in loss functions that are both mixable and proper. Let L be the set of all such loss functions. It is geometrically obvious that any mixable loss function can be made proper by removing inadmissible predictions (i.e., predictions γ that are strictly worse than some other predictions) and reparameterizing the admissible predictions.
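As a quick numerical illustration of the definition (a sanity check of ours, not from the paper), one can verify that the expected square loss πλ(γ, 1) + (1 − π)λ(γ, 0) is minimized at γ = π:

```python
import numpy as np

pi = 0.3                                  # true probability that omega = 1
gammas = np.linspace(0.0, 1.0, 1001)
# expected square loss of predicting gamma when P(omega = 1) = pi
expected_loss = pi * (1 - gammas) ** 2 + (1 - pi) * gammas ** 2
print(gammas[np.argmin(expected_loss)])   # 0.3: the square loss is proper
```

The same experiment with the log loss or the generalized log loss (3) again returns γ = π, in line with all of them being proper scoring rules.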
4 Prediction with Expert Evaluators' Advice
In this section we consider a very general protocol of prediction with expert advice. The intuition behind special cases of this protocol will be discussed in the following sections.

Prediction with expert evaluators' advice
FOR t = 1, 2, . . .:
  Expert n announces γtn ∈ [0, 1], ηtn > 0, and ηtn-mixable λnt ∈ L, n = 1, . . . , N.
  Learner announces γt ∈ [0, 1].
  Reality announces ωt ∈ {0, 1}.
END FOR

The main mathematical result of this paper is the following.

Theorem 2. Learner has a strategy (e.g., the defensive forecasting algorithm described below) that guarantees that in the game of prediction with N expert evaluators' advice it holds, for all T and for all n = 1, . . . , N, that

∑_{t=1}^T ηtn (λnt(γt, ωt) − λnt(γtn, ωt)) ≤ ln N .
The description of the defensive forecasting algorithm and the proof of the theorem will be given in Sect. 7.
Corollary 1. For any η > 0, Learner has a strategy that guarantees

∑_{t=1}^T λnt(γt, ωt) ≤ ∑_{t=1}^T λnt(γtn, ωt) + (ln N)/η ,   (4)
for all T and all n = 1, . . . , N , in the game of prediction with N expert evaluators’ advice in which the experts are required to always choose η-mixable loss functions λnt . This corollary is more intuitive than Theorem 2 as (4) compares the cumulative losses suffered by Learner and each expert. In the following sections we will discuss two interesting special cases of Theorem 2 and Corollary 1.
5 Prediction with Constant Expert Evaluators' Advice
In the game of this section, as in the previous one, the experts are "expert evaluators": each of them measures Learner's and his own performance using his own loss function, supposed to be mixable and proper. The difference is that now each expert is linked to a fixed loss function. The game is specified by N loss functions λ1, . . . , λN.

Prediction with constant expert evaluators' advice
L(n)0 := 0, n = 1, . . . , N.
Ln0 := 0, n = 1, . . . , N.
FOR t = 1, 2, . . .:
  Expert n announces γtn ∈ [0, 1], n = 1, . . . , N.
  Learner announces γt ∈ [0, 1].
  Reality announces ωt ∈ {0, 1}.
  L(n)t := L(n)t−1 + λn(γt, ωt), n = 1, . . . , N.
  Lnt := Lnt−1 + λn(γtn, ωt), n = 1, . . . , N.
END FOR

There are two changes in the protocol as compared to the basic protocol of prediction with expert advice in Sect. 2. The accumulated loss Lnt of each expert is now calculated according to his own loss function λn. For Learner, there is no single accumulated loss anymore. Instead, the loss L(n)t of Learner is calculated separately against each expert, according to that expert's loss function λn. Informally speaking, each expert evaluates his own performance and the performance of Learner according to the expert's own (but publicly known) criteria. In the standard setting of prediction with expert advice it is often said that Learner's goal is to compete with the best expert in the pool. In the new setting, we cannot speak about the best expert: the experts' performance is evaluated by different loss functions and thus the losses may be measured on different scales. But it still makes sense to consider bounds on the regret L(n)t − Lnt for each n.
Theorem 2 immediately implies the following performance guarantee for the defensive forecasting algorithm in our current setting.

Corollary 2. Suppose that each λn is a proper loss function that is ηn-mixable for some ηn > 0, n = 1, . . . , N. Then Learner has a strategy that guarantees that in the game of prediction with N experts' advice and loss functions λ1, . . . , λN it holds, for all T and for all n = 1, . . . , N, that

L(n)T ≤ LnT + (ln N)/ηn .
Notice that Corollary 2 contains the bound (2) of Theorem 1 as a special case (the assumption that λ is proper is innocuous in the context of Theorem 1).

Multiobjective Prediction with Expert Advice

To conclude this section, let us consider another variant of the protocol with several loss functions. As mentioned in the introduction, sometimes we have experts' predictions, and we are not given a single loss function, but have several possible candidates. The most cautious way to generate Learner's predictions is to ensure that the regret is small against all experts and according to all loss functions. The following protocol formalizes this task. Now we have N experts and M loss functions λ1, . . . , λM.

Multiobjective prediction with expert advice
L(m)0 := 0, m = 1, . . . , M.
Ln,m0 := 0, n = 1, . . . , N and m = 1, . . . , M.
FOR t = 1, 2, . . .:
  Expert n announces γtn ∈ [0, 1], n = 1, . . . , N.
  Learner announces γt ∈ [0, 1].
  Reality announces ωt ∈ {0, 1}.
  L(m)t := L(m)t−1 + λm(γt, ωt), m = 1, . . . , M.
  Ln,mt := Ln,mt−1 + λm(γtn, ωt), n = 1, . . . , N and m = 1, . . . , M.
END FOR

Corollary 3. Suppose that each λm is an ηm-mixable proper loss function, for some ηm > 0, m = 1, . . . , M. There is a strategy for Learner that guarantees that, in the multiobjective game of prediction with N experts and the loss functions λ1, . . . , λM,

L(m)T ≤ Ln,mT + (ln M N)/ηm   (5)

for all T, all n = 1, . . . , N, and all m = 1, . . . , M.

Proof. This follows easily from Corollary 2. For each n ∈ {1, . . . , N}, let us construct M new experts (n, m). Expert (n, m) predicts as Expert n and is linked to the loss function λm. Applying Corollary 2 to these M N experts, we get the bound (5).
6 Prediction with Specialist Experts' Advice
The experts of this section are allowed to "sleep", i.e., abstain from giving advice to Learner at some steps. We will be assuming that there is only one loss function λ, although generalization to the case of N loss functions λ1, . . . , λN, each linked to an expert, is straightforward. The loss function λ does not need to be proper (but it is still required to be mixable). Let a be any object that does not belong to [0, 1]; intuitively, it will stand for an expert's decision to abstain.

Prediction with specialist experts' advice
L(n)0 := 0, n = 1, . . . , N.
Ln0 := 0, n = 1, . . . , N.
FOR t = 1, 2, . . .:
  Expert n announces γtn ∈ ([0, 1] ∪ {a}), n = 1, . . . , N.
  Learner announces γt ∈ [0, 1].
  Reality announces ωt ∈ {0, 1}.
  L(n)t := L(n)t−1 + I{γtn ≠ a} λ(γt, ωt), n = 1, . . . , N.
  Lnt := Lnt−1 + I{γtn ≠ a} λ(γtn, ωt), n = 1, . . . , N.
END FOR

The indicator function I{γtn ≠ a} of the event γtn ≠ a is defined to be 1 if γtn ≠ a and 0 if γtn = a. Therefore, L(n)t and Lnt refer to the cumulative losses of Learner and Expert n over the steps when Expert n is awake. Now Learner's goal is to do as well as each expert on the steps chosen by that expert.
Corollary 4. Let λ be a loss function that is η-mixable for some η > 0. Then Learner has a strategy that guarantees that in the game of prediction with N specialist experts' advice and loss function λ it holds, for all T and for all n = 1, . . . , N, that

L(n)T ≤ LnT + (ln N)/η .   (6)

Proof. Without loss of generality the loss function λ may be assumed to be proper (as we said earlier, this can be achieved by reparameterization of predictions). The protocol of this section then becomes a special case of the protocol of Sect. 4 in which at each step each expert outputs ηtn = η and either λnt = λ (when he is awake) or λnt = 0 (when he is asleep). (Alternatively, we could allow zero learning rates and make each expert output λnt = λ and either ηtn = η, when he is awake, or ηtn = 0, when he is asleep.)
7 Defensive Forecasting Algorithm and the Proof of Theorem 2
In this section we prove Theorem 2. Our proof is constructive: we explicitly describe the defensive forecasting algorithm achieving the bound in Theorem 2.
We will use the more intuitive notation πt, rather than γt, for the algorithm's predictions (to emphasize the interpretation of predictions as probabilities: cf. the discussion of proper scoring rules in Sect. 3).

The Algorithm

For each n = 1, . . . , N, let us define the function Qn : ([0, 1]N × (0, ∞)N × LN × [0, 1] × {0, 1})∗ → [0, ∞] by

Qn(γ1•, η1•, λ•1, π1, ω1, . . . , γT•, ηT•, λ•T, πT, ωT) := ∏_{t=1}^T e^{ηtn (λnt(πt, ωt) − λnt(γtn, ωt))} ,   (7)

where γtn are the components of γt•, ηtn are the components of ηt•, and λnt are the components of λ•t: γt• := (γt1, . . . , γtN), ηt• := (ηt1, . . . , ηtN), and λ•t := (λ1t, . . . , λNt). As usual, the empty product ∏_{t=1}^0 is interpreted as 1, so that Qn() = 1. The functions Qn will usually be applied to γt• := (γt1, . . . , γtN) the predictions made by all the N experts at step t, ηt• := (ηt1, . . . , ηtN) the learning rates chosen by the experts at step t, and λ•t := (λ1t, . . . , λNt) the loss functions used by the experts at step t. Notice that Qn does not depend on the predictions, learning rates, and loss functions of the experts other than Expert n. Set

Q := (1/N) ∑_{n=1}^N Qn

and

ft(π, ω) := Q(γ1•, η1•, λ•1, π1, ω1, . . . , γ•t−1, η•t−1, λ•t−1, πt−1, ωt−1, γt•, ηt•, λ•t, π, ω) − Q(γ1•, η1•, λ•1, π1, ω1, . . . , γ•t−1, η•t−1, λ•t−1, πt−1, ωt−1) ,   (8)

where (π, ω) ranges over [0, 1] × {0, 1}; the expression ∞ − ∞ is understood as, say, 0. The defensive forecasting algorithm is defined in terms of the functions ft.

Defensive forecasting algorithm
FOR t = 1, 2, . . .:
  Read the experts' predictions γt• = (γt1, . . . , γtN) ∈ [0, 1]N,
    learning rates ηt• = (ηt1, . . . , ηtN) ∈ (0, ∞)N,
    and loss functions λ•t = (λ1t, . . . , λNt) ∈ LN.
  Define ft : [0, 1] × {0, 1} → [−∞, ∞] by (8).
  If ft(0, 1) ≤ 0, predict πt := 0 and go to R.
  If ft(1, 0) ≤ 0, predict πt := 1 and go to R.
  Otherwise (if both ft(0, 1) > 0 and ft(1, 0) > 0),
    take any π satisfying ft(π, 0) = ft(π, 1) and predict πt := π.
  R: Read Reality's move ωt ∈ {0, 1}.
END FOR
The existence of a π satisfying ft(π, 0) = ft(π, 1), when required by the algorithm, will be proved in Lemma 1 below. We will see that in this case the function ft(π) := ft(π, 1) − ft(π, 0) takes values of opposite signs at π = 0 and π = 1. Therefore, a root of ft(π) = 0 can be found by, e.g., bisection (see [10], Chap. 9, for a review of bisection and more efficient methods, such as Brent's).
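For concreteness, here is a sketch of one step of the algorithm (ours, not from the paper) under the simplifying assumption that every expert uses the log loss with ηtn = 1. The array G holds the experts' running products from (7), so that Q is the mean of G and ft(π, ω) is the change of that mean:

```python
import numpy as np

def log_loss(gamma, omega):
    p = gamma if omega == 1 else 1.0 - gamma
    return -np.log(p) if p > 0 else np.inf

def df_step(G, expert_preds, eta=1.0, tol=1e-12):
    """One step of defensive forecasting for the log loss.

    G:            per-expert products from (7) over the past steps
    expert_preds: this step's expert predictions, all in (0, 1)
    Returns Learner's prediction pi_t.
    """
    losses = {w: np.array([log_loss(g, w) for g in expert_preds])
              for w in (0, 1)}

    def f(pi, omega):
        # f_t(pi, omega): increment of Q if Learner predicts pi
        # and Reality announces omega
        incr = np.exp(eta * (log_loss(pi, omega) - losses[omega])) - 1.0
        return float(np.mean(G * incr))

    if f(0.0, 1) <= 0:
        return 0.0
    if f(1.0, 0) <= 0:
        return 1.0
    lo, hi = tol, 1.0 - tol    # bisect f(pi, 1) - f(pi, 0) on (0, 1)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(mid, 1) - f(mid, 0) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2
```

After Reality announces ωt, one would multiply each G[n] by exp(eta * (log_loss(pi_t, omega_t) - log_loss(expert_preds[n], omega_t))); the Reductions subsection below explains why keeping the mean of G from growing yields the regret bound.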
Reductions

The most important property of the defensive forecasting algorithm is that it produces predictions πt such that the sequence

Qt := Q(γ1•, η1•, λ•1, π1, ω1, . . . , γt•, ηt•, λ•t, πt, ωt)   (9)

is non-increasing. This property will be proved later; for now, we will only check that it implies the bound on the regret term given in Theorem 2. Since the initial value Q0 of Q is 1, we have Qt ≤ 1 for all t. And since Qn ≥ 0 for all n, we have Qn ≤ N Q for all n. Therefore, Qnt, defined by (9) with Qn in place of Q, is at most N at each step t. By the definition of Qn this means that

∑_{t=1}^T ηtn (λnt(πt, ωt) − λnt(γtn, ωt)) ≤ ln N ,
which is the bound claimed in the theorem. In the proof of the inequalities Q0 ≥ Q1 ≥ · · · we will follow [2] (for a presentation adapted to the binary case, see [11]). The key fact we use is that Q is a game-theoretic supermartingale (see below). Let us define this notion and prove its basic properties. Let E be any non-empty set. A function S : (E × [0, 1] × {0, 1})∗ → (−∞, ∞] is called a supermartingale (omitting “game-theoretic”) if, for any T , any e1 , . . . , eT ∈ E, any π1 , . . . , πT ∈ [0, 1], and any ω1 , . . . , ωT −1 ∈ {0, 1}, it holds that πT S(e1 , π1 , ω1 , . . . , eT −1 , πT −1 , ωT −1 , eT , πT , 1) + (1 − πT )S(e1 , π1 , ω1 , . . . , eT −1 , πT −1 , ωT −1 , eT , πT , 0) ≤ S(e1 , π1 , ω1 , . . . , eT −1 , πT −1 , ωT −1 ) . (10) Remark 1. The standard measure-theoretic notion of a supermartingale is obtained when the arguments π1 , π2 , . . . in (10) are replaced by the forecasts produced by a fixed forecasting system. See, e.g., [12] for details. Game-theoretic supermartingales are referred to as “superfarthingales” in [13]. A supermartingale S is called forecast-continuous if, for all T ∈ {1, 2, . . .}, all e1 , . . . , eT ∈ E, all π1 , . . . , πT −1 ∈ [0, 1], and all ω1 , . . . , ωT ∈ {0, 1}, S(e1 , π1 , ω1 , . . . , eT −1 , πT −1 , ωT −1 , eT , π, ωT ) is a continuous function of π ∈ [0, 1]. The following lemma (proved and used in similar contexts by, e.g., Levin [14] and Takemura [15]) states the most important for us property of forecast-continuous supermartingales.
Lemma 1. Let S be a forecast-continuous supermartingale. For any T and for any values of the arguments e1 , . . . , eT ∈ E, π1 , . . . , πT −1 ∈ [0, 1], and ω1 , . . . , ωT −1 ∈ {0, 1}, there exists π ∈ [0, 1] such that, for both ω = 0 and ω = 1, S(e1 , π1 , ω1 , . . . , eT −1 , πT −1 , ωT −1 , eT , π, ω) ≤ S(e1 , π1 , ω1 , . . . , eT −1 , πT −1 , ωT −1 ) . Proof. Define a function f : [0, 1] × {0, 1} → (−∞, ∞] by f (π, ω) := S(e1 , π1 , ω1 , . . . , eT −1 , πT −1 , ωT −1 , eT , π, ω) − S(e1 , π1 , ω1 , . . . , eT −1 , πT −1 , ωT −1 ) (the subtrahend is assumed finite: there is nothing to prove when it is infinite). Since S is a forecast-continuous supermartingale, f (π, ω) is continuous in π and πf (π, 1) + (1 − π)f (π, 0) ≤ 0
(11)
for all π ∈ [0, 1]. In particular, f(0, 0) ≤ 0 and f(1, 1) ≤ 0. Our goal is to show that for some π ∈ [0, 1] we have f(π, 1) ≤ 0 and f(π, 0) ≤ 0. If f(0, 1) ≤ 0, we can take π = 0. If f(1, 0) ≤ 0, we can take π = 1. Assume that f(0, 1) > 0 and f(1, 0) > 0. Then the difference f(π) := f(π, 1) − f(π, 0) is positive for π = 0 and negative for π = 1. By the intermediate value theorem, f(π) = 0 for some π ∈ (0, 1). By (11) we have f(π, 1) = f(π, 0) ≤ 0.

The fact that the sequence (9) is non-increasing follows from the fact (see below) that Q is a forecast-continuous supermartingale (when restricted to the allowed moves for the players). The pseudocode for the defensive forecasting algorithm and the paragraph following it are extracted from the proof of Lemma 1, as applied to the supermartingale Q. The weighted sum of finitely many forecast-continuous supermartingales taken with positive weights is again a forecast-continuous supermartingale. Therefore, the proof will be complete if we check that Qn is a supermartingale under the restriction that λnt is ηtn-mixable for all n and t (it is forecast-continuous by Assumption 1). But before we can do this, we will need to do some preparatory work in the next subsection.

Geometry of Mixability and Proper Loss Functions

Assumption 1 and the compactness of [0, 1] imply that the superprediction set (1) is closed. Along with the superprediction set, we will also consider the prediction set Πλ := {(x, y) ∈ [0, ∞)2 | ∃γ: λ(γ, 0) = x and λ(γ, 1) = y}. In many cases (e.g., if λ is proper), the prediction set is the boundary of the superprediction set. The prediction set can also be defined as the set of points

Λγ := (λ(γ, 0), λ(γ, 1))   (12)

that belong to IR2, where γ ranges over the prediction space [0, 1].
Let us fix a constant η > 0. The prediction set of the generalized log loss function (3) is the curve {(x, y) | e−ηx + e−ηy = 1} in IR2. For each π ∈ (0, 1), the π-point of this curve is Λπ, i.e., the point

(−(1/η) ln(1 − π), −(1/η) ln π) .

Since the generalized log loss function is proper, the minimum of (1 − π)x + πy (geometrically, of the dot product of (1 − π, π) and (x, y)) on the curve e−ηx + e−ηy = 1 is attained at the π-point; in other words, the tangent of e−ηx + e−ηy = 1 at the π-point is orthogonal to the vector (1 − π, π). A shift of the curve e−ηx + e−ηy = 1 is the curve e−η(x−α) + e−η(y−β) = 1 for some α, β ∈ IR (i.e., it is a parallel translation of e−ηx + e−ηy = 1 by some vector (α, β)). The π-point of this shift is the point (α, β) + Λπ, where Λπ is the π-point of the original curve e−ηx + e−ηy = 1. This provides us with a coordinate system on each shift of e−ηx + e−ηy = 1 (π ∈ (0, 1) serves as the coordinate of the corresponding π-point).

It will be convenient to use the geographical expressions "Northeast" and "Southwest". A point (x1, y1) is Northeast of a point (x2, y2) if x1 ≥ x2 and y1 ≥ y2. A set A ⊆ IR2 is Northeast of a shift of e−ηx + e−ηy = 1 if each point of A is Northeast of some point of the shift. Similarly, a point is Northeast of a shift of e−ηx + e−ηy = 1 (or of a straight line with a negative slope) if it is Northeast of some point on that shift (or line). "Northeast" is replaced by "Southwest" when the inequalities are ≤ rather than ≥, and we add the attribute "strictly" when the inequalities are strict.

It is easy to see that the loss function is η-mixable if and only if for each point (a, b) on the boundary of the superprediction set there exists a shift of e−ηx + e−ηy = 1 passing through (a, b) such that the superprediction set lies to the Northeast of the shift. This follows from the fact that the shifts of e−ηx + e−ηy = 1 correspond to the straight lines with negative slope under the homeomorphism Eη: indeed, the preimage of ax + by = c, where a > 0, b > 0, and c > 0, is ae−ηx + be−ηy = c, which is the shift of e−ηx + e−ηy = 1 by the vector

(−(1/η) ln(c/a), −(1/η) ln(c/b)) .

A similar statement for the property of being proper is:

Lemma 2. Suppose the loss function λ is η-mixable. It is a proper loss function if and only if for each π the superprediction set is to the Northeast of the shift of e−ηx + e−ηy = 1 passing through Λπ (as defined by (12)) and having Λπ as its π-point.

Proof. The part "if" is obvious, so we will only prove the part "only if". Let λ be η-mixable and proper. Suppose there exists π such that the shift A1 of e−ηx + e−ηy = 1 passing through Λπ and having Λπ as its π-point has some superpredictions strictly to its Southwest. Let s be such a superprediction, and let A2 be the tangent to A1 at the point Λπ. The image Eη(A1) is a straight
line in [0, 1]2, and the curve Eη(A2) touches Eη(A1) at Eη(Λπ) and lies at the same side of Eη(A1) as Eη(s). Any point p in the open interval (Eη(s), Eη(Λπ)) that is close enough to Eη(Λπ) will be strictly Northeast of Eη(A2). The point Eη−1(p) will then be a superprediction (by the η-mixability of λ) that is strictly Southwest of A2. This contradicts λ being a proper loss function, since A2 is the straight line passing through Λπ and orthogonal to (1 − π, π).

Proof of the Supermartingale Property

Let E ⊆ ([0, 1]N × (0, ∞)N × LN) consist of sequences (γ1, . . . , γN, η1, . . . , ηN, λ1, . . . , λN) such that λn is ηn-mixable for all n = 1, . . . , N. We will only be interested in the restriction of Qn and Q to (E × [0, 1] × {0, 1})∗; these restrictions are denoted with the same symbols. The following lemma completes the proof of Theorem 2. We will prove it without calculations, unlike the proofs (of different but somewhat similar properties) presented in [2] (and, specifically for the binary case, in [11]).

Lemma 3. The function Qn defined on (E × [0, 1] × {0, 1})∗ by (7) is a supermartingale.

Proof. It suffices to check that it is always true that

πT exp(ηTn(λnT(πT, 1) − λnT(γTn, 1))) + (1 − πT) exp(ηTn(λnT(πT, 0) − λnT(γTn, 0))) ≤ 1 .
To simplify the notation, we omit the indices n and T ; this does not lead to any ambiguity. Using the notation (a, b) := Λπ = (λ(π, 0), λ(π, 1)) and (x, y) := Λγ = (λ(γ, 0), λ(γ, 1)), we can further simplify the last inequality to (1 − π) exp (η (a − x)) + π exp (η (b − y)) ≤ 1 . In other words, it suffices to check that the (super)prediction set lies to the Northeast of the shift
exp(−η(x − a − (1/η) ln(1 − π))) + exp(−η(y − b − (1/η) ln π)) = 1   (13)

of the curve e−ηx + e−ηy = 1. The vector by which (13) is shifted is

(a + (1/η) ln(1 − π), b + (1/η) ln π) ,

and so (a, b) is the π-point of that shift. This completes the proof of the lemma: by Lemma 2, the superprediction set indeed lies to the Northeast of that shift.
20
8
A. Chernov and V. Vovk
Defensive Forecasting for Specialist Experts and the AA
In this section we will find a more explicit version of defensive forecasting in the case of specialist experts. Our algorithm will achieve a slightly more general version of the bound (6); namely, we will replace the ln N in (6) by − ln pn where pn is an a priori chosen weight for Expert n: all pn are non-negative and sum to 1. Without loss of generality all pn will be assumed positive (our algorithm can always be applied to the subset of experts with positive weights). Let At be the set of awake experts at time t: At := {n ∈ {1, . . . , N } | γtn
= a}. Let λ be an η-mixable loss function. By the definition of mixability there exists a function Σ(u1 , . . . , uk , γ1 , . . . , γk ) (called a substitution function) such that: – the domain of Σ consists of all sequences (u1 , . . . , uk , γ1 , . . . , γk ), for all k = 0, 1, 2, . . ., of numbers ui ∈ [0, 1] summing to 1, u1 + · · · + uk = 1, and predictions γ1 , . . . , γk ∈ [0, 1]; – Σ takes values in the prediction space [0, 1]; – for any (u1 , . . . , uk , γ1 , . . . , γk ) in the domain of Σ, the prediction γ := Σ(u1 , . . . , uk , γ1 , . . . , γk ) satisfies ∀ω ∈ {0, 1} : e−ηλ(γ,ω) ≥
k
e−ηλ(γi ,ω) ui .
(14)
i=1
Fix such a function Σ. Notice that its value Σ() on the empty sequence can be chosen arbitrarily, that the case k = 1 is trivial, and that the case k = 2 in fact covers the cases k = 3, k = 4, etc. Defensive forecasting algorithm for specialist experts w0n := pn , n = 1, . . . , N . FOR t = 1, 2, . . . : Read the list At of awake experts and their predictions γtn ∈ [0, 1], n ∈ At .
n n n / n∈At wt−1 . Predict πt := Σ ut−1 n∈A , (γtn )n∈At , where unt−1 := wt−1 t
Read the outcome ωt ∈ {0, 1}. n n Set wtn := wt−1 eη(λ(πt ,ωt )−λ(γt ,ωt )) for all n ∈ At . END FOR
This algorithm is a simple modification of the AA, and it becomes the AA when the experts are always awake. Its main difference from the AA is in the way the experts’ weights are updated. The weights of the sleeping experts are not changed, whereas the weights of the awake experts are multiplied n by eη(λ(πt ,ωt )−λ(γt ,ωt )) . Therefore, Learner’s loss serves as the benchmark: the weight of an awake expert who performs better than Learner goes up, the weight of an awake expert who performs worse than Learner goes down, and the weight
Prediction with Expert Evaluators’ Advice
21
of a sleeping expert does not change. In the case of the log loss function, this algorithm was found by Freund et al. [3]; in this special case, Freund et al. derive the same performance guarantee as we do. Derivation of the Algorithm In this derivation we will need the following notation. For each history of the game, let An , n ∈ {1, . . . , N }, be the set of steps at which Expert n is awake: An := {t ∈ {1, 2, . . .} | n ∈ At } . For each positive integer k, [k] stands for the set {1, . . . , k}. The method of defensive forecasting (as used in the proof of Corollary 4) requires that at step T we should choose π = πT such that, for each ω ∈ {0, 1},
n
pn eη(λ(π,ω)−λ(γT ,ω))
n∈AT
+
n
eη(λ(πt ,ωt )−λ(γt ,ωt ))
t∈[T −1]∩An
pn
n∈AcT
n
eη(λ(πt ,ωt )−λ(γt ,ωt ))
t∈[T −1]∩An
≤
pn
n
eη(λ(πt ,ωt )−λ(γt ,ωt ))
t∈[T −1]∩An
n∈[N ]
where AcT stands for the complement of AcT in [N ]: AT := [N ] \ AT . This inequality is equivalent to
n
pn eη(λ(π,ω)−λ(γT ,ω))
n∈AT
n
eη(λ(πt ,ωt )−λ(γt ,ωt ))
t∈[T −1]∩An
≤
n∈AT
pn
n
eη(λ(πt ,ωt )−λ(γt ,ωt ))
t∈[T −1]∩An
and can be rewritten as
n
eη(λ(π,ω)−λ(γT ,ω)) unT −1 ≤ 1 ,
n∈AT
where unT −1 := wTn −1 /
n∈AT
wTn −1 := pn
wTn −1 are the normalized weights
n
eη(λ(πt ,ωt )−λ(γt ,ωt )) .
t∈[T −1]∩An
Comparing (15) and (14), we can see that it suffices to set π := Σ
unT −1 n∈AT , (γTn )n∈AT .
(15)
22
A. Chernov and V. Vovk
Acknowledgements The anonymous reviewers’ comments were very helpful in weeding out mistakes and improving presentation (although some of their suggestions could only be used for the full version of the paper [4], not restricted by the page limit). This work was supported in part by EPSRC grant EP/F002998/1. We are grateful to the anonymous Eurocrat who coined the term “expert evaluator”.
References 1. Cesa-Bianchi, N., Lugosi, G.: Prediction, Learning, and Games. Cambridge University Press, Cambridge (2006) 2. Chernov, A., Kalnishkan, Y., Zhdanov, F., Vovk, V.: Supermartingales in prediction with expert advice. In: Freund, Y., Gy¨ orfi, L., Tur´ an, G., Zeugmann, T. (eds.) ALT 2008. LNCS (LNAI), vol. 5254, pp. 199–213. Springer, Heidelberg (2008) 3. Freund, Y., Schapire, R.E., Singer, Y., Warmuth, M.K.: Using and combining predictors that specialize. In: Proceedings of the Twenty Ninth Annual ACM Symposium on Theory of Computing, New York, Association for Computing Machinery, pp. 334–343 (1997) 4. Chernov, A., Vovk, V.: Prediction with expert evaluators’ advice. Technical Report arXiv:0902.4127 [cs.LG], arXiv.org e-Print archive (2009) 5. Haussler, D., Kivinen, J., Warmuth, M.K.: Sequential prediction of individual sequences under general loss functions. IEEE Transactions on Information Theory 44, 1906–1925 (1998) 6. Vovk, V.: A game of prediction with expert advice. Journal of Computer and System Sciences 56, 153–173 (1998) 7. Vovk, V.: Derandomizing stochastic prediction strategies. Machine Learning 35, 247–282 (1999) 8. Dawid, A.P.: Probability forecasting. In: Kotz, S., Johnson, N.L., Read, C.B. (eds.) Encyclopedia of Statistical Sciences, vol. 7, pp. 210–218. Wiley, New York (1986) 9. Gneiting, T., Raftery, A.E.: Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association 102, 359–378 (2007) 10. Press, W.H., Flannery, B.P., Teukolsky, S.A., Vetterling, W.T.: Numerical Recipes in C, 2nd edn. Cambridge University Press, Cambridge (1992) 11. Vovk, V.: Defensive forecasting for optimal prediction with expert advice. Technical Report arXiv:0708.1503 [cs.LG], arXiv.org e-Print archive (August 2007) 12. Shafer, G., Vovk, V.: Probability and Finance: It’s Only a Game! Wiley, New York (2001) 13. Dawid, A.P., Vovk, V.: Prequential probability: principles and properties. Bernoulli 5, 125–162 (1999) 14. Levin, L.A.: Uniform tests of randomness. Soviet Mathematics Doklady 17, 337–340 (1976) 15. Vovk, V., Takemura, A., Shafer, G.: Defensive forecasting. In: Cowell, R.G., Ghahramani, Z. (eds.) Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics, Savannah Hotel, Barbados, Society for Artificial Intelligence and Statistics, January 6-8, pp. 365–372 (2005), http://www.gatsby.ucl.ac.uk/aistats/
Pure Exploration in Multi-armed Bandits Problems S´ebastien Bubeck1 , R´emi Munos1 , and Gilles Stoltz2,3 2
1 INRIA Lille, SequeL Project, France Ecole normale sup´erieure, CNRS, Paris, France 3 HEC Paris, CNRS, Jouy-en-Josas, France
Abstract. We consider the framework of stochastic multi-armed bandit problems and study the possibilities and limitations of strategies that perform an online exploration of the arms. The strategies are assessed in terms of their simple regret, a regret notion that captures the fact that exploration is only constrained by the number of available rounds (not necessarily known in advance), in contrast to the case when the cumulative regret is considered and when exploitation needs to be performed at the same time. We believe that this performance criterion is suited to situations when the cost of pulling an arm is expressed in terms of resources rather than rewards. We discuss the links between the simple and the cumulative regret. The main result is that the required exploration–exploitation trade-offs are qualitatively different, in view of a general lower bound on the simple regret in terms of the cumulative regret.
1 Introduction Learning processes usually face an exploration versus exploitation dilemma, since they have to get information on the environment (exploration) to be able to take good actions (exploitation). A key example is the multi-armed bandit problem [Rob52], a sequential decision problem where, at each stage, the forecaster has to pull one out of K given stochastic arms and gets a reward drawn at random according to the distribution of the chosen arm. The usual assessment criterion of a strategy is given by its cumulative regret, the sum of differences between the expected reward of the best arm and the obtained rewards. Typical good strategies, like the UCB strategies of [ACBF02], trade off between exploration and exploitation. Our setting is as follows. The forecaster may sample the arms a given number of times n (not necessarily known in advance) and is then asked to output a recommendation, formed by a probability distribution over the arms. He is evaluated by his simple regret, that is, the difference between the average payoff of the best arm and the average payoff obtained by his recommendation. The distinguishing feature from the classical multi-armed bandit problem is that the exploration phase and the evaluation phase are separated. We now illustrate why this is a natural framework for numerous applications. Historically, the first occurrence of multi-armed bandit problems was given by medical trials. In the case of a severe disease, ill patients only are included in the trial and the cost of picking the wrong treatment is high (the associated reward would equal a large negative value). It is important to minimize the cumulative regret, since the test and cure phases coincide. However, for cosmetic products, there exists a test phase R. Gavald`a et al. (Eds.): ALT 2009, LNAI 5809, pp. 23–37, 2009. c Springer-Verlag Berlin Heidelberg 2009
24
S. Bubeck, R. Munos, and G. Stoltz
separated from the commercialization phase, and one aims at minimizing the regret of the commercialized product rather than the cumulative regret in the test phase, which is irrelevant. (Here, several formulæ for a cream are considered and some quantitative measurement, like skin moisturization, is performed.) The pure exploration problem addresses the design of strategies making the best possible use of available numerical resources (e.g., as CPU time) in order to optimize the performance of some decision-making task. That is, it occurs in situations with a preliminary exploration phase in which costs are not measured in terms of rewards but rather in terms of resources, that come in limited budget. A motivating example concerns recent works on computer-go (e.g., the MoGo program of [GWMT06]). A given time, i.e., a given amount of CPU times is given to the player to explore the possible outcome of a sequences of plays and output a final decision. An efficient exploration of the search space is obtained by considering a hierarchy of forecasters minimizing some cumulative regret – see, for instance, the UCT strategy of [KS06] and the BAST strategy of [CM07]. However, the cumulative regret does not seem to be the right way to base the strategies on, since the simulation costs are the same for exploring all options, bad and good ones. This observation was actually the starting point of the notion of simple regret and of this work. A final related example is the maximization of some function f , observed with noise, see, e.g., [Kle04, BMSS09]. Whenever evaluating f at a point is costly (e.g., in terms of numerical or financial costs), the issue is to choose as adequately as possible where to query the value of this function in order to have a good approximation to the maximum. The pure exploration problem considered here addresses exactly the design of adaptive exploration strategies making the best use of available resources in order to make the most precise prediction once all resources are consumed. As a remark, it also turns out that in all examples considered above, we may impose the further restriction that the forecaster ignores ahead of time the amount of available resources (time, budget, or the number of patients to be included) – that is, we seek for anytime performance. The problem of pure exploration presented above was referred to as “budgeted multi-armed bandit problem” in [MLG04], where another notion of regret than simple regret is considered. [Sch06] solves the pure exploration problem in a minmax sense for the case of two arms only and rewards given by probability distributions over [0, 1]. [EDMM02] and [MT04] consider a related setting where forecasters
Parameters: K probability distributions for the rewards of the arms, ν1 , . . . , νK For each round t = 1, 2, . . . , (1) the forecaster chooses ϕt ∈ P{1, . . . , K} and pulls It at random according to ϕt ; (2) the environment draws the reward Yt for that action (also denoted by XIt ,TIt (t) with the notation introduced in the text); (3) the forecaster outputs a recommendation ψt ∈ P{1, . . . , K}; (4) If the environment sends a stopping signal, then the game takes an end; otherwise, the next round starts.
Fig. 1. The pure exploration problem for multi-armed bandits
Pure Exploration in Multi-armed Bandits Problems
25
perform exploration during a random number of rounds T and aim at identifying an ε–best arm. They study the possibilities and limitations of policies achieving this goal with overwhelming 1 − δ probability and indicate in particular upper and lower bounds on (the expectation of) T . Another related problem in the statistical literature is the identification of the best arm (with high probability). However, the binary assessment criterion used there (the forecaster is either right or wrong in recommending an arm) does not capture the possible closeness in performance of the recommended arm compared to the optimal one, which the simple regret does. Unlike the latter, this criterion is not suited for a distribution-free analysis.
2 Problem Setup, Notation We consider a sequential decision problem given by stochastic multi-armed bandits. K 2 arms, denoted by j = 1, . . . , K, are available and the j–th of them is parameterized by a fixed (unknown) probability distribution νj over [0, 1] with expectation μj ; at those rounds when it is pulled, its associated reward is drawn at random according to νj , independently of all previous rewards. For each arm j and all time rounds n 1, we denote by Tj (n) the number of times j was pulled from rounds 1 to n, and by Xj,1 , Xj,2 , . . . , Xj,Tj (n) the sequence of associated rewards. The forecaster has to deal simultaneously with two tasks, a primary one and an associated one. The associated task consists in exploration, i.e., the forecaster should indicate at each round t the arm It to be pulled. He may resort to a randomized strategy, which, based on past rewards, prescribes a probability distribution ϕt ∈ P{1, . . . , K} (where we denote by P{1, . . . , K} the set of all probability distributions over the indexes of the arms). In that case, It is drawn at random according to the probability distribution ϕt and the forecaster gets to see the associated reward Yt , also denoted by XIt ,TIt (t) with the notation above. The sequence (ϕt ) is referred to as an allocation strategy. The primary task is to output at the end of each round t a recommendation ψt ∈ P{1, . . . , K} to be used to form a randomized play in a one-shot instance if/when the environment sends some stopping signal meaning that the exploration phase is over. The sequence (ψt ) is referred to as a recommendation strategy. Figure 1 summarizes the description of the sequential game and points out that the information available to the forecaster for choosing ϕt , respectively ψt , is formed by the Xj,s for j = 1, . . . , K and s = 1, . . . , Tj (t − 1), respectively, s = 1, . . . , Tj (t). As we are only interested in the performances of the recommendation strategy (ψt ), we call this problem the pure exploration problem for multi-armed bandits and evaluate the strategies through their simple regrets. The simple regret rt of a recommendation ψt = (ψj,t )j=1,...,K is defined as the expected regret on a one-shot instance of the game, if a random action is taken according to ψt . Formally, rt = r ψt = μ∗ − μψt where μ∗ = μj ∗ = max μj j=1,...,K and μψt = ψj,t μj j=1,...,K
denote respectively the expectations of the rewards of the best arm j ∗ (a best arm, if there are several of them with same maximal expectation) and of the recommendation
26
S. Bubeck, R. Munos, and G. Stoltz
ψt . A useful notation in the sequel is the gap Δj = μ∗ − μj between the maximal expected reward and the one of the j–th arm ; as well as the minimal gap Δ = min Δj . j:Δj >0
A quantity n of related interest is the cumulative regret at round n, which is defined as Rn = t=1 μ∗ − μIt . A popular treatment of the multi-armed bandit problems is to construct forecasters ensuring that ERn = o(n), see, e.g., [LR85] or [ACBF02], and even Rn = o(n) a.s., as follows, e.g., from [ACBFS02, Theorem 6.3] together with a martingale argument. The quantities rt = μ∗ − μIt are sometimes called instantaneous regrets. They differ from the simple regrets rt and in particular, Rn = r1 + . . . + rn is in general not equal to r1 + . . . + rn . Theorem 1, among others, will however indicate some connections between rn and Rn . Goal and structure of the paper: We study the links between simple and cumulative regrets. Intuitively, an efficient allocation strategy for the simple regret should rely on some exploration–exploitation trade-off. Our main contribution (Theorem 1, Section 3) is a lower bound on the simple regret in terms of the cumulative regret suffered in the exploration phase, showing that the trade-off involved in the minimization of the simple regret is somewhat different from the one for the cumulative regret. It in particular implies that the uniform allocation is a good benchmark when n is large. In Sections 4 and 5, we show how, despite all, one can fight against this negative result. For instance, some strategies designed for the cumulative regret can outperform (for moderate values of n) strategies with exponential rates of convergence for their simple regret.
3 The Smaller the Cumulative Regret, the Larger the Simple Regret It is immediate that for the recommendation formed by the empirical distribution of plays of Figure 3, that is, ψn = (δI1 + . . . + δIn )/n, the regrets satisfy rn = Rn /n; therefore, upper bounds on ERn lead to upper bounds on Ern . We show here that upper bounds on ERn also lead to lower bounds on Ern : the smaller the guaranteed upper bound on ERn , the larger the lower bound on Ern , no matter what the recommendation strategies ψn are. This is interpreted as a variation of the “classical” trade-off between exploration and exploitation. Here, while the recommendation strategies ψn rely only on the exploitation of the results of the preliminary exploration phase, the design of the allocation policies ϕn consists in an efficient exploration of the arms. To guarantee this efficient exploration, past payoffs of the arms have to be considered and thus, even in the exploration phase, some exploitation is needed. Theorem 1 and its corollaries aim at quantifying the needed respective amount of exploration and exploitation. In particular, to have an asymptotic optimal rate of decrease for the simple regret, each arm should be sampled a linear number of times, while for the cumulative regret, it is known that the forecaster should not do so more than a logarithmic number of times on the suboptimal arms. Formally, our main result is as follows. It is strong in the sense that we get lower bounds for all possible sets of Bernoulli distributions {ν1 , . . . , νK } over the rewards.
Pure Exploration in Multi-armed Bandits Problems
27
Theorem 1 (Main result). For all allocation strategies (ϕt ) and all functions ε : {1, 2, . . .} → R such that for all (Bernoulli) distributions ν1 , . . . , νK on the rewards, there exists a constant C 0 with ERn Cε(n),
the simple regret of all recommendation strategies (ψt ) based on the allocation strategies (ϕt ) is such that for all sets of K 3 (distinct, Bernoulli) distributions on the rewards, all different from a Dirac distribution at 1, there exists a constant D 0 and an ordering ν1 , . . . , νK of the considered distributions with Δ −Dε(n) e . Ern 2
Corollary 1. For allocation strategies (ϕt ), all recommendation strategies (ψt ), and all sets of K 3 (distinct, Bernoulli) distributions on the rewards, there exist two constants β > 0 and γ 0 such that, up to the choice of a good ordering of the considered distributions, Ern β e−γn . Theorem 1 is proved below and Corollary 1 follows from the fact that the cumulative regrets are always bounded by n. To get further the point of the theorem, one should keep in mind that the typical (distribution-dependent) rate of growth of the cumulative regrets of good algorithms, e.g., UCB1 of [ACBF02], is ε(n) = ln n. This, as asserted in [LR85], is the optimal rate. But the recommendation strategies based on such allocation strategies are bound to suffer a simple regret that decreases at best polynomially fast. We state this result for the slight modification UCB(p) of UCB1 stated in Figure 2; its proof relies on noting that it achieves a cumulative regret bounded by a large enough distribution-dependent constant times ε(n) = p ln n. Corollary 2. The allocation strategy (ϕt ) given by the forecaster UCB(p) of Figure 2 ensures that for all recommendation strategies (ψt ) and all sets of K 3 (distinct, Bernoulli) distributions on the rewards, there exist two constants β > 0 and γ 0 (independent of p) such that, up to the choice of a good ordering of the considered distributions, Ern β n−γp . Proof. The intuitive version of the proof of Theorem 1 is as follows. The basic idea is to consider a tie case when the best and worst arms have zero empirical means; it happens often enough (with a probability at least exponential in the number of times we pulled these arms) and results in the forecaster basically having to pick another arm and suffering some regret. Permutations are used to control the case of untypical or naive forecasters that would despite all pull an arm with zero empirical mean, since they force a situation when those forecasters choose the worst arm instead of the best one. Formally, we fix the allocation strategies (ϕt ) and a corresponding function ε such that the assumption of the theorem is satisfied. We consider below a set of K 3 (distinct) Bernoulli distributions; actually, we only use below that their parameters are (up to a first ordering) such that 1 > μ1 > μ2 μ3 . . . μK 0 and μ2 > μK (thus, μ2 > 0).
28
S. Bubeck, R. Munos, and G. Stoltz
Another layer of notation is needed. It depends on permutations σ of {1, . . . , K}. To have a gentle start, we first describe the notation when the permutation is the identity, σ = id. We denote by P and E the probability and expectation with respect to the K-tuple of distributions overs the arms ν1 , . . . , νK . For i = 1 (respectively, i = K), we denote by Pi,id and Ei,id the probability and expectation with respect to the Ktuples formed by δ0 , ν2 , . . . , νK (respectively, δ0 , ν2 , . . . , νK−1 , δ0 ), where δ0 denotes the Dirac measure on 0. For a given permutation σ, we consider similar notation up to a reordering. Pσ and Eσ refer to the probability and expectation with respect to the K-tuple of distributions over the arms formed by the νσ−1 (1) , . . . , νσ−1 (K) . Note in particular that the j–th best arm is located in the σ(j)–th position. Now, we denote for i = 1 (respectively, i = K) by Pi,σ and Ei,σ the probability and expectation with respect to the K-tuple formed by the νσ−1 (j) , except that we replaced the best of them, located in the σ(1)–th position, by a Dirac measure on 0 (respectively, the best and worst of them, located in the σ(1)–th and σ(K)–th positions, by Dirac measures on 0). We provide a proof in six steps. Step 1. Lower bounds by an average the maximum of the simple regrets obtained by reordering, max Eσ rn σ
1 μ1 − μ2 Eσ rn Eσ 1 − ψσ(1),n , K! σ K! σ
where we used that under Pσ , the index of the best arm is σ(1) and the minimal regret for playing any other arm is at least μ1 − μ2 . Step 2. Rewrites each term of the sum over σ as the product of three simple terms. We use first that P1,σ is the same as Pσ , except that it ensures that arm σ(1) has zero reward throughout. Denoting by Tj (n)
Cj,n =
Xj,t
t=1
the cumulative reward of the j–th till round n, one then gets Eσ 1 − ψσ(1),n Eσ 1 − ψσ(1),n I{Cσ(1),n =0}
= Eσ 1 − ψσ(1),n Cσ(1),n = 0 × Pσ Cσ(1),n = 0 = E1,σ 1 − ψσ(1),n Pσ Cσ(1),n = 0 . Second, iterating the argument from P1,σ to PK,σ , E1,σ
1 − ψσ(1),n
and therefore,
1 − ψσ(1),n Cσ(K),n = 0 P1,σ Cσ(K),n = 0 = EK,σ 1 − ψσ(1),n P1,σ Cσ(K),n = 0
E1,σ
Pure Exploration in Multi-armed Bandits Problems
29
Eσ 1 − ψσ(1),n EK,σ 1 − ψσ(1),n P1,σ Cσ(K),n = 0 Pσ Cσ(1),n = 0 . (1) Step 3. Deals with the second term in the right-hand side of (1), T (n) E T (n) (1 − μK ) 1,σ σ(K) , P1,σ Cσ(K),n = 0 = E1,σ (1 − μK ) σ(K) where the equality can be seen by conditioning on I1 , . . . , In and then taking the expectation, whereas the inequality is a consequence of Jensen’s inequality. Now, the expected number of times the sub-optimal arm σ(K) is pulled under P1,σ is bounded by the regret, by the very definition of the latter: (μ2 − μK ) E1,σ Tσ(K) (n) E1,σ Rn . Since by hypothesis (and by taking the maximum of K! values), there exists a constant C such that for all σ, E1,σ Rn C ε(n), we finally get P1,σ Cσ(K),n = 0 (1 − μK )Cε(n)/(μ2 −μK ) . Step 4. Lower bounds the third term in the right-hand side of (1) as Pσ Cσ(1),n = 0 (1 − μ1 )Cε(n)/μ2 . We denote by Wn = (I1 , Y1 , . . . , In , Yn ) the history of actions pulled and obtained payoffs up to time n. What follows is reminiscent of the techniques used in [MT04]. We are interested in realizations wn = (i1 , y1 , . . . , in , yn ) of the history such that whenever σ(1) was played, it got a null reward. (We denote above by tj (t) is the realization of Tj (t) corresponding to wn , for all j and t.) The likelihood of such a wn under Pσ is (1 − μ1 )tσ(1) (n) times the one under P1,σ . Thus, Pσ {Wn = wn } Pσ Cσ(1),n = 0 = t (n) T (n) = (1 − μ1 ) σ(1) P1,σ {Wn = wn } = E1,σ (1 − μ1 ) σ(1) where the sums are over those histories wn such that the realizations of the payoffs obtained by the arm σ(1) equal xσ(1),s = 0 for all s = 1, . . . , tσ(1) (n). The argument is concluded as before, first by Jensen’s inequality and then, by using that μ2 E1,σ Tσ(1) (n) E1,σ Rn C ε(n) by definition of the regret and the hypothesis put on its control. Step 5. Resorts to a symmetry argument to show that as far as the first term of the right-hand side of (1) is concerned, σ
K! . EK,σ 1 − ψσ(1),n 2
Since PK,σ only depends on σ(2), . . . , σ(K − 1), we denote by Pσ(2),...,σ(K−1) the common value of these probability distributions when σ(1) and σ(K) vary (and a similar notation for the associated expectation). We can thus group the permutations σ two by two according to these (K −2)–tuples, one of the two permutations being defined by
30
S. Bubeck, R. Munos, and G. Stoltz
σ(1) equal to one of the two elements of {1, . . . , K} not present in the (K − 2)–tuple, and the other one being such that σ(1) equals the other such element. Formally, ⎡ ⎤ EK,σ ψσ(1),n = Ej2 ,...,jK−1 ⎣ ψj,n ⎦ σ
j2 ,...,jK−1
j2 ,...,jK−1
j∈{1,...,K}\{j2 ,...,jK−1 }
K! , Ej2 ,...,jK−1 1 = 2
where the summations over j2 , . . . , jK−1 are over all possible (K −2)–tuples of distinct elements in {1, . . . , K}. Step 6. Simply puts all pieces together and lower bounds max Eσ rn by σ
μ1 − μ2 K!
EK,σ 1 − ψσ(1),n
Pσ Cσ(1),n = 0 P1,σ Cσ(K),n = 0
σ
ε(n) μ1 − μ2 (1 − μK )C/(μ2 −μK ) (1 − μ1 )C/μ2 . 2
4 Upper Bounds on the Simple Regret In this section, we aim at qualifying the implications of Theorem 1 by pointing out that is should be interpreted as a result for large n only. For moderate values of n, strategies not pulling each arm a linear number of the times in the exploration phase can have interesting simple regrets. To do so, we consider only two natural and well-used allocation strategies. The first one is the uniform allocation, which we use as a simple benchmark; it pulls each arm a linear number of times. The second one is UCB(p) (a variant of UCB1 where the quantile factor may be a parameter); it is designed for the classical exploration–exploitation dilemma (i.e., its minimizes the cumulative regret) and pulls suboptimal arms a logarithmic number of times only. Of course, fancier allocation strategies should also be considered in a second time but since the aim of this paper is to study the links between cumulative and simple regrets, we restrict our attention to the two discussed above. In addition to these allocation strategies we consider three recommendation strategies, the ones that recommend respectively the empirical distribution of plays, the empirical best arm, or the most played arm). They are formally defined in Figures 2 and 3. Table 1 summarizes the distribution-dependent and distribution-free bounds we could prove so far (the difference between the two families of bounds is whether the constants can depend or not on the unknown distributions νj ). It shows that two interesting couple of strategies are, on one hand, the uniform allocation together with the choice of the empirical best arm, and on the other hand, UCB(p) together with the choice of the most played arm. The first pair was perhaps expected, the second one might be considered more surprising. We only state here upper bounds on the simple regrets of these two pairs and omit the other ones. The distribution-dependent lower bound is stated in Corollary 1 and the distribution-free lower bound follows from a straightforward adaptation of the proof of the lower bound on the cumulative regret in [ACBFS02].
Pure Exploration in Multi-armed Bandits Problems
31
Parameters: K arms Uniform allocation — Plays all arms one after the other For each round t = 1, 2, . . . , use ϕt = δ[t mod K] , where [t mod K] denotes the value of t modulo K. UCB(p) — Plays each arm once and then the one with the best upper confidence bound Parameter: quantile factor p For rounds t = 1, . . . , K, play ϕt = δt For each round t = K + 1, K + 2, . . . , (1) compute, for all j = 1, . . . , K, the quantities μ j,t−1 =
1 Tj (t − 1)
Tj (t−1)
Xj,s ;
s=1
p ln(t − 1) Tj (t − 1) (ties broken by choosing, for instance, the arm with smallest index).
∗ , where (2) use ϕt = δjt−1
∗ jt−1
∈ argmax μ j,t−1 + j=1,...,K
Fig. 2. Two allocation strategies
Table 1. Distribution-dependent (top) and distribution-free (bottom) bounds on the expected simple regret of the considered pairs of allocation (lines) and recommendation (columns) strategies. Lower bounds are also indicated. The symbols denote the universal constants, whereas the are distribution-dependent constants. Distribution-dependent EDP Uniform UCB(p) Lower bound
EBA
MPA
e−n (p ln n)/n n− n2(1−p) e−n
Distribution-free EDP
EBA MPA K ln K n pK ln n pK ln n √ n n p ln n K n
Table 1 indicates that while for distribution-dependent bounds, the asymptotic optimal rate of decrease in the number n of rounds √ for simple regrets is exponential, for distribution-free bounds, the rate worsens to 1/ n. A similar situation arises for the cumulative regret, see [LR85] (optimal ln n rate for distribution-dependent bounds) versus √ [ACBFS02] (optimal n rate for distribution-free bounds).
32
S. Bubeck, R. Munos, and G. Stoltz
Parameters: the history I1 , . . . , In of played actions and of their associated rewards Y1 , . . . , Yn , grouped according to the arms as Xj,1 , . . . , Xj,Tj (n) , for j = 1, . . . , n Empirical distribution of plays (EDP) Draws a recommendation using the probability distribution ψn =
n 1 δI . n t=1 t
Empirical best arm (EBA) Only considers arms j with Tj (n) 1, computes their associated empirical means μ j,n
Tj (n) 1 = Xj,s , Tj (n) s=1
and forms a deterministic recommendation (conditionally to the history), ψn = δJn∗
where Jn∗ ∈ argmax μ j,n j
(ties broken in some way). Most played arm (MPA) Forms a deterministic recommendation (conditionally to the history), ψn = δJn∗
where
Jn∗ ∈ argmax Tj (n) . j=1,...,N
(ties broken in some way).
Fig. 3. Three recommendation strategies
4.1 A Simple Benchmark: The Uniform Allocation Strategy As explained above, the combination of the uniform allocation with the recommendation indicating the empirical best arm, forms an important theoretical benchmark. This section states its theoretical properties: the rate of decrease of its simple regret is exponential in a√distribution-dependent sense and equals the optimal (up to a logarithmic term) 1/ n rate in the distribution-free case. In Proposition 1, we propose two distribution-dependent bounds, the first one is sharper in the case when there are few arms, while the second one is suited for large n. Their simple proof is omitted; it relies on concentration inequalities, namely, Hoeffding’s inequality and McDiarmid’s inequality. The distribution-free bound of Corollary 3 is obtained not as a corollary of Proposition 1, but as a consequence of its proof. Its simple proof is also omitted. Proposition 1. The uniform allocation strategy associated to the recommendation given by the empirical best arm ensures that the simple regrets are bounded as follows: 2 Ern Δj e−Δj n/K/2 for all n K ; j:Δj >0
Pure Exploration in Multi-armed Bandits Problems
Ern
max Δj
j=1,...,K
1n 2 Δ exp − 8 K
for all n
8 ln K 1+ Δ2
33
K.
Corollary 3. The uniform allocation strategy associated to the recommendation given by the empirical best arm (at round Kn/K) ensures that the simple regrets are bounded in a distribution-free sense, for n K, as 2K ln K sup Ern 2 . n ν1 ,...,νK 4.2 Analysis of UCB(p) Combined with MPA A first (distribution-dependent) bound is stated in Theorem 2; the bound does not involve any quantity depending on the Δj , but it only holds for rounds n large enough, a statement that does involve the Δj . Its interest is first that it is simple to read, and second, that the techniques used to prove it imply easily a second (distribution-free) bound, stated in Theorem 3 and which is comparable to Corollary 3. Theorem 2. For p > 1, the allocation strategy given by UCB(p) associated to the recommendation given by the most played arm ensures that the simple regrets are bounded in a distribution-dependent sense by Ern
K 2p−1 2(1−p) n p−1
4Kp ln n and n K(K + 2). Δ2 The polynomial rate in the upper bound above is not a coincidence according to the lower bound exhibited in Corollary 2. Here, surprisingly enough, this polynomial rate of decrease is distribution-free (but in compensation, the bound is only valid after a distribution-dependent time). This rate illustrates Theorem 1: the larger p, the larger the (theoretical bound on the) cumulative regret of UCB(p) but the smaller the simple regret of UCB(p) associated to the recommendation given by the most played arm. for all n sufficiently large, e.g., such that n K +
Theorem 3. For p > 1, the allocation strategy given by UCB(p) associated to the recommendation given by the most played arm ensures that the simple regrets are bounded for all n K(K + 2) in a distribution-free sense by 4Kp ln n K 2p−1 2(1−p) Kp ln n + n . =O Ern n−K p−1 n Remark 1. We can rephrase the results of [KS06] as using UCB1 as an allocation strategy and forming a recommendation according to the empirical best arm. In particular, [KS06, Theorem 5] provides a distribution-dependent bound on the probability of not picking the best arm with this procedure and can be used to derive the following bound on the simple regret: 2 4 1 ρΔj /2 Ern Δj n j:Δj >0
34
S. Bubeck, R. Munos, and G. Stoltz
for all n 1. The leading constants 1/Δj and the distribution-dependant exponent make it not as useful as the one presented in Theorem 2. √ The best distribution-free bound we could get from this bound was of the order of 1/ ln n, to be compared to the √ asymptotic optimal 1/ n rate stated in Theorem 3. Proofs of Theorems 2 and 3 Lemma 1. For p > 1, the allocation strategy given by UCB(p) associated to the recommendation given by the most played arm ensures that the simple regrets are bounded in a distribution-dependent sense as follows. For all a1 , . . . , aK such that a1 + . . .+ aK = 1 and aj 0 for all j, with the additional property that for all suboptimal arms j and all optimal arms j ∗ , one has aj aj∗ , the following bound holds: Ern
1 (aj n)2(1−p) p−1 ∗ j =j
for all n sufficiently large, e.g., such that, for all suboptimal arms j, aj n 1 +
4p ln n Δ2j
and aj n K + 2 .
Proof. We first prove that whenever the most played arm Jn∗ is different from an optimal arm j ∗ , then at least one of the suboptimal arms j is such that Tj (n) aj n. To do so, we prove the converse and assume that Tj (n) < aj n for all suboptimal arms. Then, K K ai n = n = Ti (n) < Tj ∗ (n) + aj n i=1
j∗
i=1
j
where, in the inequality, the first summation is over the optimal arms, the second one, over the suboptimal ones. Therefore, we get aj ∗ n < Tj∗ (n) j∗
j∗
and there exists at least one optimal arm j ∗ such that Tj∗ (n) > aj∗ n. Since by definition of the vector (a1 , . . . , aK ), one has aj aj ∗ for all suboptimal arms, it comes that Tj (n) < aj n < aj∗ n < Tj ∗ (n) for all suboptimal arms, and the most played arm Jn∗ is thus an optimal arm. Thus, using that Δj 1 for all j, Ern = EΔJn∗ P Tj (n) aj n . j:Δj >0
A side-result extracted from the proof of [ACBF02, Theorem 1] states that for all suboptimal arms j and all rounds t K + 1, P It = j and Tj (t − 1) 2 t1−2p
whenever
4p ln n . Δ2j
(2)
This yields that for a suboptimal arm j and since by the assumptions on n and the aj , the choice = aj n − 1 satisfies K + 1 and (4p ln n)/Δ2j ,
Pure Exploration in Multi-armed Bandits Problems
35
n P Tj (n) aj n P Tj (t − 1) = aj n − 1 and It = j t=aj n
n
2 t1−2p
t=aj n
1 (aj n)2(1−p) p−1
(3)
where we used a union bound for the second inequality and (2) for the third inequality. A summation over all suboptimal arms j concludes the proof. Proof (of Theorem 2). We apply Lemma 1 with the uniform choice aj = 1/K and recall that Δ is the minimum of the Δj > 0. Proof (of Theorem 3). We start the proof by using that ψj,n = 1 and Δj 1 for all j, and can thus write Ern = EΔJn∗ =
K
Δj Eψj,n ε +
j=1
Δj Eψj,n .
j:Δj >ε
Since Jn∗ = j only if Tj (n) n/K, that is, ψj,n = I{Jn∗ =j} I{Tj (n)n/K} , we get Ern ε +
j:Δj >ε
Applying (3) with aj = 1/K leads to
n . Δj P Tj (n) K Ern ε +
j:Δj >ε
Δj K 2(p−1) n2(1−p) p−1
where ε is chosen such that for all Δj > ε, the condition = n/K − 1 (4p ln n)/Δ2j is satisfied (n/K − 1 K + 1 being satisfied by the assumption on n and K). The conclusion thus follows from taking, for instance, ε = (4pK ln n)/(n − K) and upper bounding all remaining Δj by 1.
5 Conclusions: Comparison of the Bounds, Simulation Study We now explain why, in some cases, the bound provided by our theoretical analysis in Lemma 1 is better than the bound stated in Proposition 1. The central point in the argument is that the bound of Lemma 1 is of the form n2(1−p) , for some distributiondependent constant , that is, it has a distribution-free convergence rate. In comparison, the bound of Proposition 1 involves the gaps Δj in the rate of convergence. Some care is needed in the comparison, since the bound for UCB(p) holds only for n large enough, but it is easy to find situations where for moderate values of n, the bound exhibited for the sampling with UCB(p) is better than the one for the uniform allocation. These situations typically involve a rather large number K of arms; in the latter case, the uniform allocation strategy only samples n/K each arm, whereas the UCB strategy focuses rapidly its exploration on the best arms. A general argument is proposed in the extended version [BMS09, Appendix B]. We only consider here one numerical example
36
S. Bubeck, R. Munos, and G. Stoltz ν =B(1/2),i=1..19; ν =B(0.66) i
ν =B(0.1),i=1..18; ν =B(0.5); ν =B(0.9)
20
i
0.15
19
20
0.25 UCB(2) with empirical best arm UCB(2) with most played arm Uniform sampling with empirical best arm
0.145
UCB(2) with empirical best arm UCB(2) with most played arm Uniform sampling with empirical best arm Expectation of the simple regret
Expectation of the simple regret
0.2 0.14 0.135 0.13 0.125 0.12
0.15
0.1
0.05 0.115 0.11 40
60
80
100 120 140 Allocation budget
160
180
200
0 40
60
80
100 120 140 Allocation budget
160
180
200
Fig. 4. Simple regret of different pairs of allocation and recommendation strategies, for K = 20 arms with Bernoulli distributions of parameters indicated on top of each graph; X–axis: number of samples, Y –axis: expectation of the simple regret (the smaller, the better)
extracted from there, see the right part of Figure 4. For moderate values of n (at least when n is about 6 000), the bounds associated to the sampling with UCB(p) are better than the ones associated to the uniform sampling. To make the story described in this paper short, we can distinguish three regimes: – for large values of n, uniform exploration is better (as shown by a combination of the lower bound of Corollary 2 and of the upper bound of Proposition 1); – for moderate values of n, sampling with UCB(p) is preferable, as discussed just above; – for small values of n, the best bounds to use seem to be the distribution-free bounds, which are of the same order of magnitude for the two strategies. Of course, these statements involve distribution-dependent quantifications (to determine which n are small, moderate, or large). We propose two simple experiments to illustrate our theoretical analysis; each of them was run on 104 instances of the problem and we plotted the average simple regrets. (More experiments can be found in [BMS09].) The first one corresponds in some sense to the worst case alluded at the beginning of Section 4. It shows that for small values of n (e.g., n 80 in the left plot of Figure 4), the uniform allocation strategy is very competitive. Of course the range of these values of n can be made arbitrarily large by decreasing the gaps. The second one corresponds to the numerical example described earlier in this section. We mostly illustrate here the small and moderate n regimes. (This is because for large n, the simple regrets are usually very small, even below computer precision.) Because of these chosen ranges, we do not see yet the uniform allocation strategy getting better than UCB–based strategies. This has an important impact on the interpretation of the lower bound of Theorem 1. While its statement is in finite time, it should be interpreted as providing an asymptotic result only.
Pure Exploration in Multi-armed Bandits Problems
37
6 Pure Exploration for Bandit Problems in Topological Spaces These results are of theoretical interest. We summarize them very briefly; statements and proofs can be found in the extended version [BMS09]. Therein, we consider the X –armed bandit problem with bounded payoffs of, e.g., [Kle04, BMSS09] and (re)define the notions of cumulative and simple regrets. The topological set X is a large possibly non-parametric space but the associated mean-payoff function is continuous. We show that, without any assumption on X , there exists a strategy with cumulative regret ERn = o(n) if and only if there exist an allocation and a recommendation strategy with simple regret Ern = o(1). We then use this equivalence to characterize the metric spaces X in which the cumulative regret ERn can always be made o(n): they are given by the separable spaces. Thus, here, in addition to its natural interpretation, the simple regret appears as a tool for proving results on the cumulative regret.
References [ACBF02] [ACBFS02] [BMS09]
[BMSS09]
[CM07] [EDMM02]
[GWMT06] [Kle04] [KS06]
[LR85] [MLG04]
[MT04]
[Rob52] [Sch06]
Auer, P., Cesa-Bianchi, N., Fischer, P.: Finite-time analysis of the multiarmed bandit problem. Machine Learning Journal 47, 235–256 (2002) Auer, P., Cesa-Bianchi, N., Freund, Y., Schapire, R.: The non-stochastic multiarmed bandit problem. SIAM Journal on Computing 32(1), 48–77 (2002) Bubeck, S., Munos, R., Stoltz, G.: Pure exploration for multi-armed bandit problems. Technical report, HAL report hal-00257454 (2009), http://hal.archives-ouvertes.fr/hal-00257454/en Bubeck, S., Munos, R., Stoltz, G., Szepesvari, C.: Online optimization in X – armed bandits. In: Advances in Neural Information Processing Systems, vol. 21 (2009) Coquelin, P.-A., Munos, R.: Bandit algorithms for tree search. In: Proceedings of the 23rd Conference on Uncertainty in Artificial Intelligence (2007) Even-Dar, E., Mannor, S., Mansour, Y.: PAC bounds for multi-armed bandit and Markov decision processes. In: Kivinen, J., Sloan, R.H. (eds.) COLT 2002. LNCS (LNAI), vol. 2375, pp. 255–270. Springer, Heidelberg (2002) Gelly, S., Wang, Y., Munos, R., Teytaud, O.: Modification of UCT with patterns in Monte-Carlo go. Technical Report RR-6062, INRIA (2006) Kleinberg, R.: Nearly tight bounds for the continuum-armed bandit problem. In: 18th Advances in Neural Information Processing Systems (2004) Kocsis, L., Szepesvari, C.: Bandit based Monte-carlo planning. In: F¨urnkranz, J., Scheffer, T., Spiliopoulou, M. (eds.) ECML 2006. LNCS (LNAI), vol. 4212, pp. 282–293. Springer, Heidelberg (2006) Lai, T.L., Robbins, H.: Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics 6, 4–22 (1985) Madani, O., Lizotte, D., Greiner, R.: The budgeted multi-armed bandit problem. In: Proceedings of the 17th Annual Conference on Computational Learning Theory, pp. 643–645 (2004); Open problems session Mannor, S., Tsitsiklis, J.N.: The sample complexity of exploration in the multiarmed bandit problem. Journal of Machine Learning Research 5, 623–648 (2004) Robbins, H.: Some aspects of the sequential design of experiments. Bulletin of the American Mathematics Society 58, 527–535 (1952) Schlag, K.: Eleven tests needed for a recommendation. Technical Report ECO2006/2, European University Institute (2006)
The Follow Perturbed Leader Algorithm Protected from Unbounded One-Step Losses Vladimir V. V’yugin Institute for Information Transmission Problems, Russian Academy of Sciences, Bol’shoi Karetnyi per. 19, Moscow GSP-4, 127994, Russia
[email protected]
Abstract. In this paper the sequential prediction problem with expert advice is considered for the case when the losses of experts suffered at each step can be unbounded. We present some modification of Kalai and Vempala algorithm of following the perturbed leader where weights depend on past losses of the experts. New notions of a volume and a scaled fluctuation of a game are introduced. We present an algorithm protected from unrestrictedly large one-step losses. This algorithm has the optimal performance in the case when the scaled fluctuations of onestep losses of experts of the pool tend to zero.
1
Introduction
Experts algorithms are used for online prediction or repeated decision making or repeated game playing. Starting with the Weighted Majority Algorithm (WM) of Littlestone and Warmuth [6] and Vovk’s [11] Aggregating Algorithm, the theory of Prediction with Expert Advice has rapidly developed in the recent times. Also, most authors have concentrated on predicting binary sequences and have used specific loss functions, like absolute loss, square and logarithmic loss. Arbitrary losses are less common. A survey can be found in the book of Lugosi, Cesa-Bianchi [7]. In this paper, we consider a different general approach - “Follow the Perturbed Leader FPL” algorithm, now called Hannan’s algorithm [3], [5], [7]. Under this approach we only choose the decision that has fared the best in the past - the leader. In order to cope with adversary some randomization is implemented by adding a perturbation to the total loss prior to selecting the leader. The goal of the learner’s algorithm is to perform almost as well as the best expert in hindsight in the long run. The resulting FPL algorithm has the same performance guarantees as WM-type √ algorithms for fixed learning rate and bounded one-step losses, save for a factor 2. Prediction with Expert Advice considered in this paper proceeds as follows. We are asked to perform sequential actions at times t = 1, 2, . . . , T . At each time step t, experts i = 1, . . . N receive results of their actions in form of their losses sit - non-negative real numbers. At the beginning of the step t Learner, observing cumulating losses si1:t−1 = i s1 + . . . + sit−1 of all experts i = 1, . . . N , makes a decision to follow one of these R. Gavald` a et al. (Eds.): ALT 2009, LNAI 5809, pp. 38–52, 2009. c Springer-Verlag Berlin Heidelberg 2009
The Follow Perturbed Leader Algorithm
39
experts, say Expert i. At the end of step t Learner receives the same loss sit as Expert i at step t and suffers Learner’s cumulative loss s1:t = s1:t−1 + sit . In the traditional framework, we suppose that one-step losses of all experts are bounded, for example, 0 ≤ sit ≤ 1 for all i and t. Well known simple example of a game with two experts shows that Learner can perform much worse than each expert: let the current losses of two experts on steps t = 0, 1, . . . , 6 be s10,1,2,3,4,5,6 = ( 12 , 0, 1, 0, 1, 0, 1) and s20.1,2,3,4,5,6 = (0, 1, 0, 1, 0, 1, 0). Evidently, the “Follow Leader” algorithm always chooses the wrong prediction. When the experts one-step losses are bounded, this problem has been solved using randomization of the experts cumulative losses. The method of following the perturbed leader was discovered by Hannan [3]. Kalai and Vempala [5] rediscovered this method and published a simple proof of the main result of Hannan. They called an algorithm of this type FPL (Following the Perturbed Leader). The FPL algorithm outputs prediction of an expert i which minimizes 1 si1:t−1 − ξ i , where ξ i , i = 1, . . . N , t = 1, 2, . . ., is a sequence of i.i.d random variables distributed according to the exponential distribution with the density p(x) = exp{−x}, and is a learning rate. Kalai and Vempala [5] show that the expected cumulative loss of the FPL algorithm has the upper bound log N , i=1,...,N where is a positive real number such that 0 < < 1 is a learning rate, N is the number of experts. Hutter and Poland [4] presented a further developments of the FPL algorithm for countable class of experts, arbitrary weights and adaptive learning rate. Also, FPL algorithm is usually considered for bounded one-step losses: 0 ≤ sit ≤ 1 for all i and t. Most papers on prediction with expert advice either consider bounded losses or assume the existence of a specific loss function (see [7]). We allow losses at any step to be unbounded. The notion of a specific loss function is not used. The setting allowing unbounded one-step losses do not have wide coverage in literature; we can only refer reader to [1], [2], [9]. Poland and Hutter [9] have studied the games where one-step losses of all experts at each step t are bounded from above by an increasing sequence Bt given in advance. They presented a learning algorithm which is asymptotically consistent for Bt = t1/16 . Allenberg et al. [2] have considered polynomially bounded one-step losses for a modified version of the Littlestone and Warmuth algorithm [6] under partial monitoring. In full information case, their algorithm has the expected regret √ 1 2 N ln N (T + 1) 2 (1+a+β ) in the case where one-step losses of all experts i = 1, 2, . . . N at each step t have the bound (sit )2 ≤ ta , where a > 0, and β > 0 is E(s1:t ) ≤ (1 + ) min si1:t +
40
V.V. V’yugin
a parameter of the algorithm.1 They have proved that this algorithm is Hannan consistent if T 1 i 2 max (st ) < cT a 1≤i≤N T t=1 for all T , where c > 0 and 0 < a < 1. In this paper, we consider also the case where the loss grows “faster than polynomial, but slower than exponential”. We present some modification of Kalai and Vempala [5] algorithm of following the perturbed leader (FPL) for the case of unrestrictedly large one-step expert losses sit not bounded in advance. This algorithm uses adaptive weights depending on past cumulative losses of the experts. We analyze the asymptotic consistency of our algorithms using nonstandard t scaling. We introduce new notions of the volume of a game vt = maxi sij and j=1
the scaled fluctuation of the game fluc(t) = Δvt /vt , where Δvt = vt − vt−1 . We show in Theorem 1 that the algorithm of following the perturbed leader with adaptive weights constructed in Section 2 is asymptotically consistent in the mean in the case when vt → ∞ and Δvt = o(vt ) as t → ∞ with a computable bound. Specifically, if fluc(t) ≤ γ(t) for all t, where γ(t) is a computable function γ(t) such that γ(t) = o(1) as t → ∞, our algorithm has the expected regret 2
(e2 − 1)(1 + ln N )
T (γ(t))1/2 Δvt , t=1
where e = 2.72 . . . is the base of the natural logarithm. In particular, this algorithm is asymptotically consistent (in the mean) in a modified sense 1 E(s1:T − min si1:T ) ≤ 0, (1) lim sup i=1,...N v T →∞ T where s1:T is the total loss of our algorithm on steps 1, 2, . . . T , and E(s1:T ) is its expectation. Proposition 1 of Section 2 shows that if the condition Δvt = o(vt ) is violated the cumulative loss of any probabilistic prediction algorithm can be much more than the loss of the best expert of the pool. In Section 2 we present some sufficient conditions under which our learning algorithm is Hannan consistent.2 In particular case, Corollary 1 of Theorem 1 says that our algorithm is asymptotically consistent (in the modified sense) in the case when one-step losses of all experts at each step t are bounded by ta , where a is a positive real number. We prove this result under an extra assumption that the volume of the game grows slowly, lim inf vt /ta+δ > 0, where δ > 0 is arbitrary. Corollary 1 shows that our t→∞
algorithm is also Hannan consistent when δ > 12 . 1 2
Allenberg et al. [2] considered losses −∞ < sit < ∞. This means that (1) holds with probability 1, where E is omitted.
The Follow Perturbed Leader Algorithm
41
At the end of Section 2 we consider some applications of our algorithm for the case of standard time-scaling.
2
The Follow Perturbed Leader Algorithm with Adaptive Weights
We consider a game of prediction with expert advice with unbounded one-step losses. At each step t of the game, all N experts receive one-step losses sit ∈ [0, +∞), i = 1, . . . N , and the cumulative loss of the ith expert after step t is equal to si1:t = si1:t−1 + sit . A probabilistic learning algorithm of choosing an expert outputs at any step t the probabilities P {It = i} of following the ith expert given the cumulative losses si1:t−1 of the experts i = 1, . . . N in hindsight. Probabilistic algorithm of choosing an expert FOR t = 1, . . . T Given past cumulative losses of the experts si1:t−1 , i = 1, . . . N , choose an expert i with probability P {It = i}. Receive the one-step losses at step t of the expert sit and suffer one-step loss st = sit of the master algorithm. ENDFOR The performance of this probabilistic algorithm is measured in its expected regret E(s1:T − min si1:T ), i=1,...N
where the random variable s1:T is the cumulative loss of the master algorithm, si1:T , i = 1, . . . N , are the cumulative losses of the experts algorithms and E is the mathematical expectation (with respect to the probability distribution generated by probabilities P {It = i}, i = 1, . . . N , on the first T steps of the game).3 In the case of bounded one-step expert losses, sit ∈ [0, 1], and a convex loss √ function, the well-known learning algorithms have expected regret O( T log N ) (see Lugosi, Cesa-Bianchi [7]). A probabilistic algorithm is called asymptotically consistent in the mean if lim sup T →∞
1 E(s1:T − min si1:T ) ≤ 0. i=1,...N T
(2)
A probabilistic learning algorithm is called Hannan consistent if 3
For simplicity, we suppose that the experts are oblivious, i.e., they cannot use in their work random actions of the learning algorithm. The inequality (12) and the limit (13) of Theorem 1 below can be easily reformulated and proved for non-oblivious experts.
42
V.V. V’yugin
1 lim sup T →∞ T
i s1:T − min s1:T ≤ 0 i=1,...N
(3)
almost surely, where s1:T is its random cumulative loss. In this section we study the asymptotical consistency of probabilistic learning algorithms in the case of unbounded one-step losses. Notice that when 0 ≤ sit ≤ 1 all expert algorithms have total loss ≤ T on first T steps. This is not true for the unbounded case, and there are no reasons to divide the expected regret (2) by T . We change the standard time scaling (2) and (3) on a new scaling based on a new notion of volume of a game. We modify the definition (2) of the normalized expected regret as follows. Define the volume of a game at step t t vt = max sij . j=1
i
Evidently, vt−1 ≤ vt for all t and maxi si1:t ≤ vt ≤ N maxi si1:t for all i and t. A probabilistic learning algorithm is called asymptotically consistent in the mean (in the modified sense) in a game with N experts if lim sup T →∞
1 E(s1:T − min si1:T ) ≤ 0. i=1,...N vT
(4)
A probabilistic algorithm is called Hannan consistent (in the modified sense) if 1 i lim sup s1:T − min s1:T ≤ 0 (5) i=1,...N T →∞ vT almost surely. Notice that the notions of asymptotic consistency in the mean and Hannan consistency may be non-equivalent for unbounded one-step losses. A game is called non-degenerate if vt → ∞ (or equivalently, maxi si1:t → ∞) as t → ∞. Denote Δvt = vt − vt−1 . The number fluc(t) =
Δvt maxi sit = , vt vt
(6)
is called scaled fluctuation of the game at the step t. By definition 0 ≤ fluc(t) ≤ 1 for all t (put 0/0 = 0). The following simple proposition shows that each probabilistic learning algorithm is not asymptotically optimal in some game such that fluc(t) → 0 as t → ∞. For simplicity, we consider the case of two experts. Proposition 1. For any probabilistic algorithm of choosing an expert and for any such that 0 < < 1 two experts exist such that vt → ∞ as t → ∞ and fluc(t) ≥ 1 − , 1 1 E(s1:t − min si1:t ) ≥ (1 − ) i=1,2 vt 2 for all t.
The Follow Perturbed Leader Algorithm
43
Proof. Given a probabilistic algorithm of choosing an expert and such that 0 < < 1, define recursively one-step losses s1t and s2t of expert 1 and expert 2 at any step t = 1, 2, . . . as follows. By s11:t and s21:t denote the cumulative losses of these experts incurred at steps ≤ t, let vt be the corresponding volume, where t = 1, 2, . . .. Define v0 = 1 and Mt = 4vt−1 / for all t ≥ 1. For t ≥ 1, define s1t = 0 and 2 st = Mt if P {It = 1} ≥ 12 , and define s1t = Mt and s2t = 0 otherwise. Let st be one-step loss of the master algorithm and s1:t be its cumulative loss at step t ≥ 1. We have E(s1:t ) ≥ E(st ) = s1t P {It = 1} + s2t P {It = 2} ≥
1 Mt 2
for all t ≥ 1. Also, since vt = vt−1 + Mt = (1 + 4/)vt−1 and min si1:t ≤ vt−1 , the i
normalized expected regret of the master algorithm is bounded from below 1 2/ − 1 1 ≥ (1 − ). E(s1:t − min si1:t ) ≥ i vt 1 + 4/ 2 for all t. By definition fluc(t) =
Mt 1 ≥1− = vt−1 + Mt 1 + /4
for all t. Let γ(t) be a computable non-increasing real function such that 0 < γ(t) < 1 for all t and γ(t) → 0 as t → ∞; for example, γ(t) = 1/tδ , where δ > 0. Define N ln 1+ln 1 e2 −1 1− and (7) αt = 2 ln γ(t) e2 − 1 αt μt = (γ(t)) = (γ(t))1/2 (8) 1 + ln N for all t, where e = 2.72 . . . is the base of the natural logarithm.4 Without loss of generality we suppose that γ(t) < min{1, (e2 − 1)/(1 + ln N )} for all t. Then 0 < αt < 1 for all t. We consider an FPL algorithm with a variable learning rate t =
1 , μt vt−1
(9)
where μt is defined by (8) and the volume vt−1 depends on experts actions on steps < t. By definition vt ≥ vt−1 and μt ≤ μt−1 for t = 1, 2, . . .. Also, by definition μt → 0 as t → ∞. 4
The choice of the optimal value of αt will be explained later. It will be obtained by minimization of the corresponding member of the sum (44).
44
V.V. V’yugin
Let ξt1 ,. . . ξtN , t = 1, 2, . . ., be a sequence of i.i.d random variables distributed according to the density p(x) = exp{−x}. In what follows we omit the lower index t. We suppose without loss of generality that si0 = v0 = 0 for all i and 0 = ∞. The FPL algorithm is defined as follows: FPL algorithm PROT FOR t = 1, . . . T Choose an expert with the minimal perturbed cumulated loss on steps < t It = argmini=1,2,...N {si1:t−1 −
1 i ξ }. t
(10)
Receive one-step losses sit for experts i = 1, . . . , N , and receive one-step loss sIt t of the master algorithm. ENDFOR T Let s1:T = sIt t be the cumulative loss of the FPL algorithm on steps ≤ T . t=1
The following theorem shows that if the game is non-degenerate and Δvt = o(vt ) as t → ∞ with a computable bound then the FPL-algorithm with variable learning rate (9) is asymptotically consistent. Theorem 1. Let γ(t) be a computable non-increasing positive real function such that γ(t) → 0 as t → ∞. Let also the game be non-degenerate and such that fluc(t) ≤ γ(t)
(11)
for all t. Then the expected cumulated loss of the FPL algorithm PROT with variable learning rate (9) for all t is bounded by E(s1:T ) ≤
min si1:T i
T 2 + 2 (e − 1)(1 + ln N ) (γ(t))1/2 Δvt .
(12)
t=1
Also, the algorithm PROT is asymptotically consistent in the mean lim sup T →∞
1 E(s1:T − min si1:T ) ≤ 0. i=1,...N vT
The algorithm PROT is Hannan consistent, i.e., 1 s1:T − min si1:T ≤ 0 lim sup i=1,...N T →∞ vT
(13)
(14)
almost surely, if ∞ t=1
(γ(t))2 < ∞.
(15)
The Follow Perturbed Leader Algorithm
45
Proof. In the proof of this theorem we follow the proof-scheme of [4] and [5]. Let αt be a sequence of real numbers defined by (7); recall that 0 < αt < 1 for all t. The analysis of optimality of the FPL algorithm is based on an intermediate predictor IFPL (Infeasible FPL) with the learning rate t defined by (16). IFPL algorithm FOR t = 1, . . . T Define the learning rate t =
1 , where μt = (γ(t))αt , μt vt
(16)
vt is the volume of the game at step t and αt is defined by (7). Choose an expert with the minimal perturbed cumulated loss on steps ≤ t 1 Jt = argmini=1,2,...N {si1:t − ξ i }. t Receive the one step loss sJt t of the IFPL algorithm. ENDFOR The IFPL algorithm predicts under the knowledge of si1:t , i = 1, . . . N (and vt ), which may not be available at beginning of step t. Using unknown value of t is the main distinctive feature of our version of IFPL. For any t, we have It = argmini {si1:t−1 − 1t ξ i } and Jt = argmini {si1:t − 1 ξ i } = t argmini {si1:t−1 + sit − 1 ξ i }. t The expected one-step and cumulated losses of the FPL and IFPL algorithms at steps t and T are denoted lt = E(sIt t ) and rt = E(sJt t ), l1:T =
T
lt and r1:T =
t=1
T
rt ,
t=1
respectively, where sIt t is the one-step loss of the FPL algorithm at step t and sJt t is the one-step loss of the IFPL algorithm, and E denotes the mathematical expectation. Lemma 1. The cumulated expected losses of the FPL and IFPL algorithms with rearning rates defined by (9) and (16) satisfy the inequality l1:T ≤ r1:T + (e2 − 1)
T
(γ(t))1−αt Δvt
t=1
for all T , where αt is defined by (7). Proof. Let c1 , . . . cN be nonnegative real numbers and 1 ci }, t 1 1 mj = min{si1:t − ci } = min{si1:t−1 + sit − ci }. i =j i =j t t mj = min{si1:t−1 − i =j
(17)
46
V.V. V’yugin
1 2 2 Let mj = sj1:t−1 − 1t cj 1 and mj = sj1:t − 1 cj2 = sj1:t−1 + sjt2 − 1 cj2 . By definition t t and since j2 = j we have
1 2 cj ≤ sj1:t−1 − t 1 1 1 2 sj1:t − cj2 + − t t
1 mj = sj1:t−1 −
1 1 2 cj2 ≤ sj1:t−1 + sjt2 − cj2 = t t 1 1 1 cj2 = mj + − cj2 . t t t
(18) (19)
We compare conditional probabilities P {It = j|ξ i = ci , i = j} and P {Jt = j|ξ i = ci , i = j}. The following chain of equalities and inequalities is valid: P {It = j|ξ i = ci , i = j} = 1 P {sj1:t−1 − ξ j ≤ mj |ξ i = ci , i = j} = t P {ξ j ≥ t (sj1:t−1 − mj )|ξ i = ci , i = j} = P {ξ j ≥ t (sj1:t−1 − mj ) + (t − t )(sj1:t−1 − mj )|ξ i = ci , i = j} ≤ P {ξ ≥ − mj ) + 1 2 + cj2 )|ξ i = ci , i = j} = (t − t )(sj1:t−1 − sj1:t−1 t 2 exp{−(t − t )(sj1:t−1 − sj1:t−1 )} × 1 = j} ≤ P {ξ j ≥ t (sj1:t−1 − mj ) + (t − t ) cj2 |ξ i = ci , i t 2 exp{−(t − t )(sj1:t−1 − sj1:t−1 )} × 1 1 cj2 ) + P {ξ j ≥ t (sj1:t − sjt − mj − − t t 1 (t − t ) cj2 |ξ i = ci , i = j} = t 2 exp{−(t − t )(sj1:t−1 − sj1:t−1 ) + t sjt } × j
exp −
1 μt vt−1
t (sj1:t
1 P {ξ j > (sj − mj )|ξ i = ci , i = j} ≤ μt vt 1:t
j j2 ) Δvt (s1:t−1 − s1:t−1 Δvt × exp − + μt vt vt−1 μt vt
exp
Δvt μt vt
1−
(21) (22) (23)
(24) (25) (26)
mj )|ξ i
P {ξ ≥ − = ci , i = j} = j 1 s 2 × (sj1:t−1 − sj1:t−1 − )+ t μt vt μt vt
P {ξ j >
(20)
t (sj1:t−1
j
(27)
(28)
1 (sj − mj )|ξ i = ci , i = j} = μt vt 1:t j
2 sj1:t−1 − s1:t−1 vt−1
P {Jt = 1|ξ i = ci , i = j}.
(29)
The Follow Perturbed Leader Algorithm
47
Here the inequality (20)-(21) follows from (18) and t ≥ t . We have used twice, in change from (21) to (22) and in change from (25) to (26), the equality P {ξ > a + b} = e−b P {ξ > a} for any random variable ξ distributed according to the exponential law. The equality (23)-(24) follows from (19). We have used in change from (27) to (28) the equality vt − vt−1 = Δvt and the inequality sjt ≤ Δvt for all j and t. The expression in the exponent (29) is bounded j2 sj 1:t−1 − s1:t−1 (30) ≤ 1, vt−1 si
since v1:t−1 ≤ 1 and si1:t−1 ≥ 0 for all t and i. t−1 Therefore, we obtain i = j} ≤ P {It = j|ξ = ci , i 2 Δvt i P {Jt = j|ξ = ci , i = j} ≤ exp μt vt exp{2(γ(t))1−αt }P {Jt = j|ξ i = ci , i = j}.
(31) (32)
Since, the inequality (32) holds for all ci , it also holds unconditionally P {It = j} ≤ exp{2(γ(t))1−αt }P {Jt = j}.
(33)
for all t = 1, 2, . . . and j = 1, . . . N . Using inequality exp{2x} ≤ 1 + (e2 − 1)x for all x such that 0 ≤ x ≤ 1, we obtain from (33) the lower bound N lt = E(sIt t ) = sjt P (It = j) ≤ j=1
exp{2(γ(t))1−αt }
N
sjt P (Jt = j) = exp{2(γ(t))1−αt }E(sJt t ) =
j=1
exp{2(γ(t))1−αt }rt ≤ (1 + (e2 − 1)(γ(t))1−αt )rt .
(34)
Since rt ≤ Δvt for all t, the inequality (34) implies l1:T ≤ r1:T + (e2 − 1)
T
(γ(t))1−αt Δvt
t=1
for all T . Lemma 1 is proved. The following lemma, which is an analogue of the result from [5], gives a bound for the IFPL algorithm. Lemma 2. The expected cumulative loss of the IFPL algorithm with the learning rate (16) is bounded by r1:T ≤ min si1:T + (1 + ln N ) i
for all T , where αt is defined by (7).
T t=1
(γ(t))αt Δvt
(35)
48
V.V. V’yugin
Proof. The proof is along the line of the proof from Hutter and Poland [4] with an exception that now the sequence t is not monotonic. Let in this proof, st = (s1t , . . . sN t ) be a vector of one-step losses and s1:t = 1 (s1:t , . . . sN 1:t ) be a vector of cumulative losses of the experts algorithms. Also, let ξ = (ξ 1 , . . . ξ N ) be a vector whose coordinates are random variables. Recall that t = 1/(μt vt ), μt ≤ μt−1 for all t, and v0 = 0, 0 = ∞. Define s˜1:t = s1:t − 1 ξ for t = 1, 2, . . .. Consider the vector of one-step losses t for the moment. s˜t = st − ξ 1 − 1 t t−1 For any vector s and a unit vector d denote M (s) = argmind∈D {d · s}, where D = {(0, . . . 1), (1, . . . 0)} is the set of N unit vectors of dimension N and “·” is the inner product of two vectors. We first show that T
M (˜ s1:t ) · s˜t ≤ M (˜ s1:T ) · s˜1:T .
(36)
t=1
For T = 1 this is obvious. For the induction step from T − 1 to T we need to show that s1:T ) · s˜1:T − M (˜ s1:T −1 ) · s˜1:T −1 . M (˜ s1:T ) · s˜T ≤ M (˜ This follows from s˜1:T = s˜1:T −1 + s˜T and M (˜ s1:T ) · s˜1:T −1 ≥ M (˜ s1:T −1 ) · s˜1:T −1 . We rewrite (36) as follows T
M (˜ s1:t ) · st ≤ M (˜ s1:T ) · s˜1:T +
t=1
T
M (˜ s1:t ) · ξ
t=1
1 1 − t t−1
.
(37)
By definition of M we have ξ M (˜ s1:T ) · s˜1:T ≤ M (s1:T ) · s1:T − = T ξ min{d · s1:T } − M (s1:T ) · . d∈D T The expectation of the last term in (38) is equal to The second term of (37) can be rewritten T t=1
M (˜ s1:t ) · ξ
1 1 − t t−1
=
T t=1
1 T
(38)
= μT vT .
(μt vt − μt−1 vt−1 )M (˜ s1:t ) · ξ.
(39)
The Follow Perturbed Leader Algorithm
49
We will use the inequality for mathematical expectation E 0 ≤ E(M (˜ s1:t ) · ξ) ≤ E(M (ξ) · ξ) = E(max ξ i ) ≤ 1 + ln N. i
(40)
The proof of this inequality uses ideas of Lemma 1 from [4]. We have for the exponentially distributed random variables ξ i , i = 1, . . . N , P {max ξ i ≥ a} = P {∃i(ξ i ≥ a)} ≤
N
i
P {ξ i ≥ a} = N exp{−a}.
(41)
i=1
Since for any non-negative random variable η, E(η) =
∞
P {η ≥ y}dy, by (41)
0
we have ∞
i
E(max ξ − ln N ) = i
P {max ξ i − ln N ≥ y}dy ≤ i
0
∞ N exp{−y − ln N }dy = 1. 0
Therefore, E(maxi ξ i ) ≤ 1 + ln N . By (40) the expectation of (39) has the upper bound T
E(M (˜ s1:t ) · ξ)(μt vt − μt−1 vt−1 ) ≤ (1 + ln N )
t=1
T
μt Δvt .
t=1
Here we have used the inequality μt ≤ μt−1 for all t, Since E(ξ i ) = 1 for all i, the expectation of the last term in (38) is equal to 1 ξ E M (s1:T ) · = = μT vT . (42) T T Combining the bounds (37)-(39) and (42), we obtain T M (˜ s1:t ) · st ≤ r1:T = E t=1
min si1:T − μT vT + (1 + ln N ) i
T
μt Δvt ≤
t=1
min si1:T + (1 + ln N ) i
Lemma is proved. . We finish now the proof of the theorem.
T t=1
μt Δvt .
(43)
50
V.V. V’yugin
The inequality (17) of Lemma 1 and the inequality (35) of Lemma 2 imply the inequality E(s1:T ) ≤ min si1:T + i
T
((e2 − 1)(γ(t))1−αt + (1 + ln N )(γ(t))αt )Δvt . (44)
t=1
for all T . The optimal value (7) of αt can be easily obtained by minimization of each member of the sum (44) by αt . In this case μt is equal to (8) and (44) is equivalent to (12). T We have t=1 Δvt = vT for all T , vt → ∞ and γ(t) → 0 as t → ∞. Then by Toeplitz lemma [10] T 1 2 (e2 − 1)(1 + ln N ) (γ(t))1/2 Δvt → 0 vT t=1 as T → ∞. Therefore, the FPL algorithm PROT is asymptotically consistent in the mean, i.e., the relation (13) of Theorem 1 is proved. We use some version of the strong law of large numbers to formulate a sufficient condition for Hannan consistency of the algorithm PROT. Lemma 3. Let g(x) be a positive nondecreasing real function such that x/g(x), g(x)/x2 are non-increasing for x > 0 and g(x) = g(−x) for all x. Let the assumptions of Theorem 1 hold and ∞ g(Δvt ) t=1
g(vt )
< ∞.
(45)
Then the FPL algorithm PROT is Hannan consistent, i.e., (5) holds as T → ∞ almost surely. Proof. We use Theorem 11 from Petrov [8] (Chapter IX, Section 2) which gives sufficient conditions in order that the strong law of large numbers holds for a sequence of independent unbounded random variables: Let at be a nondecreasing sequence of real numbers such that at → ∞ as t → ∞ and Xt be a sequence of independent random variables such that E(Xt ) = 0, for t = 1, 2, . . .. Let also, g(x) satisfies assumptions of Lemma 3. By Theorem 11 from Petrov [8] the inequality ∞ E(g(Xt )) t=1
g(at )
<∞
(46)
implies T 1 Xt → 0 aT t=1
as T → ∞ almost surely.
(47)
The Follow Perturbed Leader Algorithm
51
Put Xt = st − E(st ), where st is the loss of the FPL algorithm PROT at step t, and at = vt for all t. By definition |Xt | ≤ Δvt for all t. Then (46) is valid, and by (47) T 1 1 (s1:T − E(s1:T )) = (st − E(st )) → 0 vT vT t=1 as T → ∞ almost surely. This limit and the limit (13) imply (14). By Lemma 3 the algorithm PROT is Hannan consistent, since (15) implies (45) for g(x) = x2 . Theorem 1 is proved. Authors of [2] and [9] considered polynomially bounded one-step losses. We consider a specific example of the bound (44) for polynomial case. Corollary 1. Assume that sit ≤ ta for all t and i = 1, . . . N , and vt lim inf a+δ > 0, t→∞ t where a and δ are positive real numbers. Let also in the algorithm PROT, γ(t) = t−δ and μt = (γ(t))αt , where αt is defined by (7). Then – (i) the algorithm PROT is asymptotically consistent in the mean for any a > 0 and δ > 0; – (ii) this algorithm is Hannan consistent for any a > 0 and δ > 12 ; – (iii) the expected loss of this algorithm is bounded by 1 E(s1:T ) ≤ min si1:T + 2 (e2 − 1)(1 + ln N )T 1− 2 δ+a (48) i
as T → ∞. This corollary follows directly from Theorem 1, where condition (15) of Theorem 1 holds for δ > 12 . If δ = 1 the regret from (48) is asymptotically equivalent to the regret from Allenberg et al. [2] (see Section 1). For a = 0 we have the case of bounded loss function (0 ≤ sit ≤ 1 for all i and t). The FPL algorithm PROT is asymptotically consistent in the mean if vt ≥ β(t) for all t, where β(t) is an arbitrary positive unbounded non-decreasing computable function (we can get γ(t) = 1/β(t) in this case). This algorithm is Hannan consistent if (15) holds, i.e. ∞
(β(t))−2 < ∞.
t=1
For example, this condition be satisfied for β(t) = t1/2 ln t. Theorem 1 is also valid for the standard time scaling, i.e., when vT = T for all T , and when losses of experts are bounded, i.e., a = 0. Then the expected regret has the upper bound T (γ(t))1/2 ≤ 4 (e2 − 1)(1 + ln N )T 2 (e2 − 1)(1 + ln N ) t=1
which is similar to bounds from [4] and [5].
52
V.V. V’yugin
Acknowledgments This research was partially supported by Russian foundation for fundamental research: 09-07-00180-a and 09-01-00709a.
References 1. Cesa-Bianchi, N., Mansour, Y., Stoltz, G.: Improved second-order bounds for prediction with expert advice. Machine Learning 66(2-3), 321–352 (2007) 2. Allenberg, C., Auer, P., Gyorfi, L., Ottucsak, G.: Hannan consistency in on-Line learning in case of unbounded losses under partial monitoring. In: Balc´azar, J.L., Long, P.M., Stephan, F. (eds.) ALT 2006. LNCS (LNAI), vol. 4264, pp. 229–243. Springer, Heidelberg (2006) 3. Hannan, J.: Approximation to Bayes risk in repeated plays. In: Dresher, M., Tucker, A.W., Wolfe, P. (eds.) Contributions to the Theory of Games, vol. 3, pp. 97–139. Princeton University Press, Princeton (1957) 4. Hutter, M., Poland, J.: Prediction with expert advice by following the perturbed leader for general weights. In: Ben-David, S., Case, J., Maruoka, A. (eds.) ALT 2004. LNCS (LNAI), vol. 3244, pp. 279–293. Springer, Heidelberg (2004) 5. Kalai, A., Vempala, S.: Efficient algorithms for online decisions. In: Sch¨olkopf, B., Warmuth, M.K. (eds.) COLT/Kernel 2003. LNCS (LNAI), vol. 2777, pp. 26–40. Springer, Heidelberg (2003); Extended version in Journal of Computer and System Sciences 71, 291–307 (2005) 6. Littlestone, N., Warmuth, M.K.: The weighted majority algorithm. Information and Computation 108, 212–261 (1994) 7. Lugosi, G., Cesa-Bianchi, N.: Prediction, Learning and Games. Cambridge University Press, New York (2006) 8. Petrov, V.V.: Sums of independent random variables. Ergebnisse der Mathematik und ihrer Grenzgebiete, Band 82. Springer, Heidelberg (1975) 9. Poland, J., Hutter, M.: Defensive universal learning with experts. For general weight. In: Jain, S., Simon, H.U., Tomita, E. (eds.) ALT 2005. LNCS (LNAI), vol. 3734, pp. 356–370. Springer, Heidelberg (2005) 10. Shiryaev, A.N.: Probability. Springer, Berlin (1980) 11. Vovk, V.G.: Aggregating strategies. In: Fulk, M., Case, J. (eds.) Proceedings of the 3rd Annual Workshop on Computational Learning Theory, San Mateo, CA, pp. 371–383. Morgan Kaufmann, San Francisco (1990)
Computable Bayesian Compression for Uniformly Discretizable Statistical Models Łukasz Dębowski Centrum Wiskunde & Informatica, 1098 XG Amsterdam, The Netherlands
Abstract. Supplementing Vovk and V’yugin’s ‘if’ statement, we show that Bayesian compression provides the best enumerable compression for parameter-typical data if and only if the parameter is Martin-L¨ of random with respect to the prior. The result is derived for uniformly discretizable statistical models, introduced here. They feature the crucial property that given a discretized parameter, we can compute how much data is needed to learn its value with little uncertainty. Exponential families and certain nonparametric models are shown to be uniformly discretizable.
1
Introduction
Algorithmic information theory inspires an appealing interpretation of Bayesian inference [1,2,3,4]. Literally, a fixed individual parameter cannot have the property of being distributed according to a distribution but, when it is represented as a sequence of digits, the parameter is almost surely algorithmically random. Thus, if you believe that a parameter obeys a prior, it may rather mean that you suppose that the parameter is algorithmically random with respect to the prior. We want to argue that this interpretation is valid. We will assume that the parameter θ is, in some sense, effectively identifiable. Then one can disprove that a finite prefix of a fixed, not fully known θ is algorithmically random by estimating the prefix and showing that there exists a shorter description of that prefix. Hence, Bayesian beliefs seem admissible scientific hypotheses according to the Popperian philosophy, cf. [1]. Secondly, it follows that the Bayesian measure Pθ dQ(θ) gives the best enumerable compression of Pθ -typical data if and only if parameter θ is algorithmically random with respect to prior Q. This statement is useful when Pθ is not computable for a fixed θ. Moreover, once we know where Bayesian compression fails, we should systematically adjust the prior to our hypotheses about the algorithmic complexity of θ in an application. As we will show, this ‘if and only if ’ result can be foreseen using the chain rule for prefix Kolmogorov complexity of finite objects [5], [6, Theorem 3.9.1]. The chain rule allows to relate randomness deficiencies for finite prefixes of the data and of the parameter in some specific statistical models, which we call uniformly discretizable. That yields a somewhat weaker ‘if and only if ’ statement. Subsequently, the statement can be strengthened using the dual chain rule for impossibility levels of infinite sequences [1, Theorem 1] and extensions of Lambalgen’s R. Gavald` a et al. (Eds.): ALT 2009, LNAI 5809, pp. 53–67, 2009. c Springer-Verlag Berlin Heidelberg 2009
54
Ł. Dębowski
theorem for conditionally random sequences [7], [4, Theorem 4.2 and 5.3]. The condition of uniform discretization can be completely removed from the ‘if ’ part and relaxed to an effective identifiability of the parameter in the ‘only if ’ part. Namely, given a prefix of the parameter, we must be able to compute how much data is needed to learn its value with a fixed uncertainty. The organization of this paper is as follows. In Section 2, we discuss quality of Bayesian compression for individual parameters and we derive the randomness deficiency bounds for prefixes of the parameter and the parameter-typical data. These bounds hold for the newly introduced class of uniformly discretizable statistical models. In Section 3, we show that exponential families are uniformly discretizable. The assumptions on the prior and the proof look familiar to statisticians working in minimum description length (MDL) inference [8,9]. An example of a ‘nonparametric’ uniformly discretizable model appears in Section 4. In the final Section 5, we prove that countable mixtures of uniformly discretizable models are uniformly discretizable if the Bayesian estimator consistently chooses the right submodel for the data. The definition of uniformly discretizable models is given below. Condition (3) says that the parameter may be discretized to m ≥ μ(n) digits for the sake of approximating the ‘true’ probability of data xn . Condition (4) asserts that the parameter, discretized to m digits, can be predicted for all but finitely many m given data xn of length n ≥ ν(m). Functions μ and ν depend on a model. To fix our notation in advance, we use a countable alphabet X and a finite Y = {0, 1, ..., D − 1}, D > 1. The logarithm to base D is written as log. An italic x ∈ X+ is a string, a boldface x ∈ XN is an infinite sequence. The n-th symbol of x is written as xn ∈ X and xn is the prefix of x of length n: x = x1 x2 x3 ... and xn = x1 x2 ...xn . Capital boldface Y : X∗ → R denotes a distribution of strings normalized lengthwise, i.e., 0 ≤ Y (x), a Y (xa)1{|a|=n} = Y (x), and Y (λ) = 1 for the empty string λ. There is a unique measure on measurable sets of infinite sequences x ∈ XN , also denoted as Y , such that Y ({x : xn = x for n = |x|}) = Y (x). Quantifier ‘n-eventually’ means ‘for all but finitely many n ∈ N’. Definition 1. Fix a measurable subset Θ ⊂ YN . Let P : X∗ × Θ (x, θ) → Pθ (x) ∈ R be a probability kernel, i.e., Pθ : X∗ → R is a probability measure for each θ ∈ Θ and the mapping θ → Pθ is measurable. Let also Q : Y∗ → R be a probability measure on Θ, i.e., Q(Θ) = 1. A Bayesian statistical model (P , Q) is called (μ, ν)-uniformly discretizable if it satisfies the following. (i) Define the measure T : X∗ × Y∗ → R as T (x, θ) :=
A(θ)
Pθ (x)dQ(θ),
(1)
where A(θ) := {θ ∈ Θ : θ is the prefix of θ}, and denote its other marginal Y (x) := T (x, λ) =
Pθ (x)dQ(θ).
(2)
Computable Bayesian Compression
55
(ii) Function μ : N → R is nondecreasing and we require that for all θ ∈ Θ, Pθ -almost all x, and m ≥ μ(n), log [Q(θm )Pθ (xn )/T (xn , θm )] = 0. n→∞ log m lim
(3)
(iii) Function ν : N → R is nondecreasing and we require that for all θ ∈ Θ, Pθ -almost all x, and n ≥ ν(m), lim T (xn , θm )/Y (xn ) = 1.
m→∞
(4)
˜ with a kernel P˜ : X∗ × Θ ˜ → R and a meaRemark: A Bayesian model (P˜ , Q) ˜ on Θ ˜ will be called (ρ, μ, ν)-uniformly discretizable if (P , Q) is (μ, ν)sure Q ˜ → Θ, Pθ (x) := P˜ρ−1 (θ) (x), and uniformly discretizable for a bijection ρ : Θ −1 ˜ Q := Q ◦ ρ . We will write ‘(ρ, μ(n), ν(m))-uniformly discretizable’ when there are no convenient symbols for functions μ and ν. A few words of comment to this definition are due. By condition (3), the support of prior Q equals Θ, i.e., Q(θm ) > 0 for all m and θ ∈ Θ. Condition (4) admits a consistent estimator if there is a function σ : X∗ → N, where ν(σ(xn )) ≤ n, σ(xn+1 ) ≥ σ(xn ), and limn σ(xn ) = ∞. Define the discrete maximum likelihood estimator MLE(x; σ) := argmaxθ∈Ym T (x, θ) with m = σ(x). The estimator is n called consistent if MLE(xn ; σ) = θσ(x ) n-eventually for all θ ∈ Θ and Pθ almost all x. This property is indeed satisfied. Four models presented in Sections 3 and 4 satisfy a stronger condition. Definition 2. A (μ, ν)-uniformly discretizable model is called μ-uniformly discretizable if ν is recursive and μ(ν(m)) ≤ mα for an α > 0. These models feature log μ(n) close to the logarithm of Shannon redundancy − log Y (xn ) + log Pθ (xn ). A heuristic rationale is as follows. If we had μ ◦ ν = id, − log Q(θm ) = Ω(m), and we put n = ν(m) then |− log Y (xn ) + log Pθ (xn ) + log Q(θm )| = o(log m) and hence μ(n) = m = O(− log Y (xn ) + log Pθ (xn )). Whereas − log Q(θm ) = Ω(m) is a reasonable assumption, we rather observe μ(ν(m)) > m. The present approach allows only discrete data. We hope, however, that uniformly discretizable models can be generalized to nondiscrete data so that consistency and algorithmic optimality of Bayesian procedures in density estimation could be characterized in a similar fashion, cf. [10]. Another interesting path of development is to integrate the algorithmic perspective on Bayesianism with the present MDL framework [8,9], where normalized maximum likelihood codes are discussed. By the algorithmic optimality of Bayesian compression, the normalized maximum likelihood measure, if it can be defined properly, should converge to the Bayesian measure Pθ dQ(θ) in log-loss. We also suppose that reasonable luckiness functions, introduced to guarantee existence of modified normalized maximum likelihood codes [9, Section 11.3], may be close to algorithmic information between the parameter and the data.
56
2
Ł. Dębowski
Bounds for the Data and Parameter Complexity
We will use a universal computer with an oracle, which can compute certain functions R → R. To make it clear which these are, we adopt the following definitions, cf. [11], [6, Sections 1.7 and 3.1], [1, Section 2], [4, Section 5]: (i) A universal computer is an appropriate finite state machine that interacts with infinite tapes. The machine can move along the tapes in discrete steps, read and write on them single symbols from the finite set Y, and announce the end of computation. We fix three one-sided tapes. At the beginning of computation, tape α contains a program, i.e., a string from a prefix-free subset of Y+ , and tape β contains an oracle, i.e., an element of (0Y∗ ) ∪ (1YN ). At the end of computation, tape γ contains an output, i.e., a string from Y∗ . (ii) The prefix Kolmogorov complexity K(y) of a string y ∈ Y∗ is the minimal length of such a program on tape α that y is output on tape γ provided no symbol is read from tape β. (iii) The conditional complexity K(y|δ) for y ∈ Y∗ and δ ∈ Y∗ ∪ YN is the minimal length of such a program on tape α that y is output on tape γ given 0δ or 1δ, respectively, as an oracle on tape β. (iv) A function f : Y∗ ∪YN → Y∗ is recursive if there is such a program z ∈ Y+ that string f (y) is output for all oracles y ∈ Y∗ ∪ YN . (v) Function φ is a prefix code if it is an injection and its image is prefix-free. (vi) For certain prefix codes φW : W → Y∗ and φU : U → Y∗ ∪ YN and arbitrary w ∈ W and u ∈ U, we put K(w) := K(φW (w)) and K(w|u) := K(φW (w)|φW (u)). Fixing φY∗ and φY∗ ∪YN as identity functions, f : U → W is called recursive if so is φW ◦ f ◦ φ−1 U . (vii) Numbers obey special conventions. Integers are Elias-coded [12], whereas φQ (p/q) := φZ (p)N (q) for every irreducible fraction p/q. To convert a real number from (−∞, ∞) into aone-sided sequence, we assume that φR (r) = ∞ θ satisfies [1 + exp(−r)] = i=1 θi D−i . This solves the problem of real arguments. A real-valued function f : W → R is called enumerable if there is a recursive function g : W × N → Q nondecreasing in k such that limk g(w, k) = f (w). A stronger condition, the f is called recursive if there is a recursive function h : W × N → Q such that |f (w) − h(w, k)| < 1/k. (viii) Pairs (w, u) enjoy the code φW×U (w, u) := φW (w)φW (u). This code cannot be used if w is real. In the Proposition 2 of Section 3, where we need to string real vectors, Cantor’s code is used instead. (ix) The concepts mentioned above are analogously extended to partial functions. Special care must be taken to assume computability of their domains, which is important to guarantee that the inverse of the ShannonFano-Elias code, used in Theorem 1, is recursive. Last but not least, a semimeasure U is a function X∗ → R that satisfies 0 ≤ ∗ U (x), a U (xa)1{|a|=n} ≤ U (x), and U (λ) ≤ 1. Symbol < denotes inequality up to a multiplicative constant.
Computable Bayesian Compression
57
Impossibility level n
D−K(x ) I(x; Y ) := inf n∈N Y (xn )
(5)
is a natural measure of randomness deficiency for a sequence x ∈ XN with respect to a recursive measure Y , cf. [1], [6, Def. 4.5.10 and Thm. 4.5.5]. The respective set of Y -Martin-L¨of random sequences LY := {x : I(x; Y ) < ∞}
(6)
has two important properties. Firstly, LY is the maximal set of sequences on which no enumerable semimeasure outperforms a recursive measure Y more than by a multiplicative constant. Let M be the universal enumerable semimeasure [6, Section 4.5.1]. By [2, Theorem 1 and Lemma 3], we have ∗
I(x; Y ) < lim inf n→∞
M (xn ) ∗ M (xn ) ∗ < sup < [I(x; Y )]1+ n n) Y (x ) Y (x n∈N
(7)
∗
for a fixed > 0 and recursive Y . By the definition of M , U (xn ) < M (xn ) for any enumerable (semi)measure U . Hence supn∈N U (xn )/Y (xn ) < ∞ if x ∈ LY . Moreover, LY = LU if Y and U are mutually equivalent recursive measures, i.e., supn∈N U (xn )/Y (xn ) < ∞ ⇐⇒ supn∈N Y (xn )/U (xn ) < ∞ for all x ∈ XN . Secondly, the set LY has full measure Y . The fact is well-known, cf. e.g. [1, Remark 2], and it can be seen easily using the auxiliary statement below, which strengthens Barron’s result [13, Theorem 3.1]. Whereas Y (LY ) = 1 follows for |B(·)| = K(·), we shall use this lemma later also for |B(·)| = K(·|θ). Lemma 1 (no hypercompression). Let B : X∗ → Y+ be a prefix code. Then |B(xn )| + log Y (xn ) > 0
(8)
n-eventually for Y -almost all sequences x. Proof. Consider the function W (x) := D −|B(x)| . By the Markov inequality, W (xn ) W (xn ) ≥ 1 ≤ E x∼Y = Y ((8) is false) = Y 1{|x|=n} W (x). Y (xn ) Y (xn ) x Hence n Y ((8) is false) ≤ x D−|B(x)| ≤ 1 < ∞ by the Kraft inequality. The claim now follows by the Borel-Cantelli lemma. Now let Y be a recursive Bayesian measure (2). In a prototypical case, measures Pθ are not enumerable Q-almost surely. But the data that are almost surely typical for these measures can be optimally compressed with the effectively computable measure Y . That is, Pθ (LY ) = 1 holds Q-almost everywhere, as implied by the following statement.
58
Ł. Dębowski
Lemma 2 (cf. [14, Section 9]). Equality Y (X ) = 1 for Y = implies Pθ (X ) = 1 for Q-almost all θ.
Pθ dQ(θ)
Proof. Let Gn := {θ ∈ Θ : Pθ (X ) ≥ 1 − 1/n}. We have 1 = Y (X ) ≤ Q(Gn ) + Q(Θ \ Gn )(1 − 1/n) = 1 − n−1 Q(Θ \ Gn ). Thus Q(Gn ) = 1. By
σ-additivity, Q(G) = inf n Q(Gn ) = 1 follows for G := {θ ∈ Θ : Pθ (X ) = 1} = n Gn . Notably, the Bayesian compressor can be shown optimal exactly when the parameter is incompressible. Strictly speaking, we will obtain Pθ (LY ) = 1 if and only if θ is Martin-L¨of random with respect to Q. This holds, of course, under some tacit assumptions. For instance, if we take Pθ ≡ Y then Pθ (LY ) = 1 for all θ ∈ Θ. We may thus suppose that the ‘if and only if ’ statement holds provided the parameter can be effectively identified. The following two propositions form the first step to see what assumptions are needed exactly. Lemma 3. For a computer-dependent constant A, we have K(x|θ) ≤ A + K(x|θm , K(θm )) + K(K(θm )) + K(m).
(9)
Proof. A certain program for computing x given θ operates as follows. It first calls a subroutine of length K(m) to compute m and a subroutine of length K(K(θm )) to compute K(θm ). Then it reads the prefix θm of θ and passes θm and K(θm ) to a subroutine of length K(x|θm , K(θm )) which returns x. Theorem 1. Let (P , Q) be a Bayesian statistical model with a recursive prior Q : Y∗ → R and a recursive kernel P : X∗ × Θ → R. (i) If (3) holds for Pθ -almost all x then K(xn ) + log Y (xn ) ≥ K(θm ) + log Q(θm ) − 3 log m + o(log m)
(10)
is also true for Pθ -almost all x. (ii) If (4) holds for a recursive τ : Y∗ → N and n = τ (θm ) then K(xn ) + log Y (xn ) ≤ K(θm ) + log Q(θm ) + O(1).
(11)
Proof. (i) For Pθ -almost all x we have both (3) and K(xn |θ) + log Pθ (xn ) ≥ 0
(12)
n-eventually, by Lemma 1 for |B(·)| = K(·|θ). Applying Lemma 3 to these sequences yields K(xn |θm , K(θm )) + log T (xn , θm ) − log Q(θm ) ≥ −K(K(θm )) − K(m) + o(log m) = −2 log m + o(log m) because K(θm ) ≤ m + log m + o(log m) and K(m) ≤ log m + o(log m). Since K(xn |θm , K(θm )) + K(θm ) = K(xn , θm ) + O(1)
(13)
Computable Bayesian Compression
59
by the chain rule for prefix complexity [6, Theorem 3.9.1], we obtain K(xn , θm ) + log T (xn , θm ) ≥ K(θm ) + log Q(θm ) − 2 log m + o(log m). In the following, we apply (13) with xn and θm switched, and observe that K(θm |xn , K(xn )) ≤ A + K(m) − log
T (xn , θm ) Y (xn )
follows by conditional Shannon-Fano-Elias coding of θm of an arbitrary length given xn , cf. [15, Section 5.9]. Hence (10) holds for Pθ -almost all x. (ii) By conditional Shannon-Fano-Elias coding of xn given θm we obtain K(xn , θm ) ≤ A + K(θm ) − log
T (xn , θm ) . Q(θm )
(14)
(This time, we need not specify the length of xn separately since it can be computed from θm .) Substituting (4) into (14) and chaining the result with K(xn ) ≤ A + K(xn , θm ) yields (11). Theorem 1 applies to uniformly discretizable models if we plug in m ≥ μ(n) and τ (θm ) ≥ ν(m). Hence we obtain the first, less elegant dichotomy. Proposition 1. Let (P , Q) be a μ-uniformly discretizable model with a recursive prior Q : Y∗ → R and a recursive kernel P : X∗ × Θ → R. We have 1 if θ ∈ LQ,log n , Pθ (LY ,log μ(n) ) = (15) 0 if θ ∈ LQ,log n , where the sets of (Y , g(n))-random sequences are defined as
K(xn ) + log Y (xn ) > −∞ . LY ,g(n) := x : inf n∈N g(n)
(16)
In particular, LY ,1 = LY . Theorem 1(ii) suffices to prove Pθ (LY ) = 0 for θ ∈ LQ but to show Pθ (LY ) = 1 in the other case we need a stronger statement than Theorem 1(i). Here we can rely on the chain rule for conditional impossibility levels by Vovk and V’yugin [1, Theorem 1] and extensions of Lambalgen’s theorem for conditionally random sequences by Takahashi [4]. For a recursive kernel P , let us define by analogy the conditional impossibility level n
D−K(x |θ) I(x; P |θ) := inf n∈N Pθ (xn ) and the set of conditionally random sequences LP |θ := x ∈ XN : I(x; P |θ) < ∞ .
(17)
(18)
60
Ł. Dębowski
We have Pθ (LP |θ ) = 1 for all θ by Lemma 1, as used in (12). Adjusting the proof of [6, Theorem 4.5.5] to computation with an oracle, we can show that the definition of I(x; P |θ) given here is equivalent to the one given by [1], cf. [6, Def. 4.5.10]. Hence ∗ ∗ 1+ inf [I(x; P |θ) I(θ; Q)] < I(x; Y ) < inf I(x; P |θ) [I(θ; Q)]
θ∈Θ
θ∈Θ
(19)
holds for Y = Pθ dQ(θ) and > 0 by [1, Corollary 4]. Inequality (19) and Theorem 1(ii) imply the main claim of this article. Theorem 2. Let (P , Q) be a Bayesian statistical model with a recursive prior Q : Y∗ → R and a recursive kernel P : X∗ × Θ → R. Suppose that (4) holds for all θ ∈ Θ, Pθ -almost all x, and n = τ (θm ), where τ : Y∗ → N is recursive. Then we have 1 if θ ∈ LQ , Pθ (LY ) = (20) 0 if θ ∈ LQ . The upper part of (20) can be strengthened as decomposition LY = θ∈LQ LP |θ , which holds for all recursive P and Q [4, Cor. 4.3 & Thm. 5.3]. (Our definition of a recursive P corresponds to ‘uniformly computable’ in [4].) We suppose that, under the assumption of Theorem 2, sets LP |θ are disjoint for θ ∈ Θ. This would strengthen the lower part of (20).
3
The Case of Exponential Families
As shown in [16], k-parameter exponential families exhibit Shannon redundancy − log Y (xn ) + log Pθ (xn ) = k2 log n + Θ(log log n). Here we shall prove that these models are uniformly discretizable with μ(n) = k2 + log n respectively. The ˜ on Θ ˜ ⊂ Rk result is established under a familiar condition. Namely, a prior Q ˜ there is universally lower-bounded by the Lebesgue measure λ if for each ϑ ∈ Θ ˜ exists an open set C ϑ and a w > 0 such that Q(E) ≥ wλ(E) for every ˜ is the support of Q ˜ and is measurable E ⊂ C. This condition implies that Θ ˜ ˜ satisfied, in particular, if Q and λ restricted to Θ are mutually equivalent. Let us write the components of vectors ϑ, ϑ ∈ Rk as ϑ = (ϑ1 , ϑ2 , ..., ϑk ) and k 2 their Euclidean distance as |ϑ − ϑ| := l=1 (ϑl − ϑl ) . ˜ (x, ϑ) Example 1 (an exponential family). Let the kernel P˜ : X∗ × Θ → ˜ Pϑ (x) ∈ R represent a regular k-parameter exponential family. That is: (i) Certain functions p : X → (0, ∞) and T : X → Rk satisfy x∈X p(x) < ∞ k and ∀β∈Rk \0 ∀c∈R ∃x∈X l=1 βl Tl (x) = c (i.e., T has affinely independent components).
Computable Bayesian Compression
(ii) Let Z(β) :=
x∈X p(x) exp n
P˜β (x ) :=
n
61
k
β T (x) and define measures l l l=1
p(xi ) exp
i=1
k
βl Tl (x) − ln Z(β)
l=1
for β ∈ B := β ∈ Rk : Z(β ) < ∞ . (iii) We require that B is open. (It is not empty since 0 ∈ B.) Under this condition, ϑ(·) : B β → ϑ(β) := E x∼P˜β T (xi ) ∈ Rk is a twice differen˜ := ϑ(B) and put P˜ϑ := P˜β(ϑ) for tiable injection [17], [9]. Thus assume Θ β(·) := ϑ−1 (·). ˜ be universally lower-bounded by the Lebesgue meaAdditionally, let the prior Q k ˜ Θ) ˜ = 1. sure on R and let it satisfy Q( ˜ → (0, 1)k is Proposition 2. Use Cantor’s code ρ := ρs ◦ ρn , where ρn : Θ N a differentiable injection and ρs : (0, 1)k → Y satisfies ρ (y) = θ s 1 θ2 θ3 ... for any ∞ −i vector y ∈ (0, 1)k with components yl = θ D . Then the model i=1 (i−1)k+l k (2/k+)m ˜ ˜ (P , Q) is ρ, 2 + log n, D -uniformly discretizable for > 0. ˜ ◦ ρ−1 , and A(θ) := ˜ Pθ (x) := P˜ρ−1 (θ) (x), Q := Q Proof. Let Θ := ρ(Θ), {θ ∈ Θ : θ is the prefix of θ}. Consider a θ ∈ Θ. Firstly, let m ≥ k2 + log n. We have (21) for ϑ = ρ−1 (θ) and An = ρ−1 (A(θm )). Hence (3) holds by the Theorem 3(i) below. Secondly, let n ≥ D(2/k+)m . We have (23) for ϑ = ρ−1 (θ) and Bn = ρ−1 (A(θm )). Hence (4) follows by Theorem 3(ii). The statement below may look more familiar for statisticians. ˜ for the model specified in Example 1. Theorem 3. Fix a ϑ ∈ Θ ˜ which satisfy (i) If we take sufficiently small measurable sets An ⊂ Θ supϑ ∈An |ϑ − ϑ| √ =0 n→∞ n−1 ln ln n ˜ )/ ˜ ) then dQ(ϑ and put P˜n (x) := An P˜ϑ (x)dQ(ϑ An lim sup
log P˜n (xn ) − log P˜ϑ (xn ) =0 n→∞ ln ln n lim
for P˜ϑ -almost all x. (ii) On the other hand, if we take sufficiently large measurable sets ˜ : |ϑ − ϑ| ≥ n−1/2+α Bn ⊃ ϑ ∈ Θ for an arbitrary α ∈ (0, 1/2) then ˜ ) − log lim log P˜ϑ (xn )dQ(ϑ n→∞
for P˜ϑ -almost all x.
Bn
˜ ) = 0 P˜ϑ (xn )dQ(ϑ
(21)
(22)
(23)
(24)
62
Ł. Dębowski
ˆ n ) := n−1 n T (xi ) is the maximum likelihood estiProof. (i) Function ϑ(x i=1 ˜ yields mator of ϑ, in the usual sense. Thus the Taylor expansion for any ϑ ∈ Θ n ˜ n log P˜ϑ(x ˆ n ) (x ) − log Pϑ (x ) = n
k
l,m=1
Rlm (ϑ)Slm (ϑ),
(25)
ˆl (xn ))(ϑm − ϑ ˆm (xn )) and Rlm (ϑ) := 1 (1 − t)Ilm (tϑ + where Slm (ϑ) := (ϑl − ϑ 0 ˆ n ))dt, whereas the observed Fisher information matrix Ilm (ϑ) := (1 − t)ϑ(x −n−1 ∂ϑl ∂ϑm log P˜ϑ (xn ) does not depend on n and xn . Consequently, log P˜ϑ (xn ) − log P˜ϑ (xn ) = n kl,m=1 [Rlm (ϑ ) [Slm (ϑ) − Slm (ϑ)] + [Rlm (ϑ ) − Rlm (ϑ)] Slm (ϑ)] . ˜ and the smallest ball containing An and of Θ With Cn denote the intersection n n ˆ ˆ ϑ(x ). Let dn := ϑ − ϑ(x ) and an := supϑ ∈An |ϑ − ϑ|. Hence we bound + k + − |an (2dn + an ) + |Rlm − Rlm |d2n , log P˜n (xn ) − log P˜ϑ (xn ) ≤ n l,m=1 |Rlm + − where Rlm := supϑ ∈Cn Rlm (ϑ ) and Rlm := inf ϑ ∈Cn Rlm (ϑ ). By continuity of + − and Rlm tend to Ilm (ϑ) for Fisher information Ilm (ϑ) as a function of ϑ, Rlm n → ∞. On the other hand, the law of iterated logarithm
lim sup n→∞
ˆl (xn ) − ϑl ϑ √ =1 σl 2n−1 ln ln n
(26)
is satisfied for P˜ϑ -almost all x with variance σl2 := Varx∼P˜ϑ Tl (xi ) since the ˆ n ) = ϑ. Consequently, maximum likelihood estimator is unbiased, i.e., E x∼P˜ϑ ϑ(x we obtain (22) for (21). (ii) The proof applies Laplace approximation as in [18] or in the proof of Theorem 8.1 of [9, pages 248–251]. First of all, we have log
n
˜ ) − log P˜ϑ (x )dQ(ϑ
˜ ) P˜ϑ (xn )dQ(ϑ ˜ ˜ ) ≤ Θ\B . P˜ϑ (x )dQ(ϑ n ˜ ) P˜ϑ (xn )dQ(ϑ Bn n
Bn
In the following, we consider a sufficiently large n. Because of the law of iterated ˆ n ) belongs to Bn for P˜ϑ -almost all x. Hence the robustlogarithm (26), ϑ(x ness property and the convexity of Kullback-Leibler divergence for exponential families [9, Eq. (19.12) and Proposition 19.2] imply a bound for the numerator ˜ ) ≤ sup ˜ ˜ n ˜ n P˜ϑ (xn )dQ(ϑ ˜ ϑ ∈Θ\Bn Pϑ (x ) ≤ supϑ ∈∂Bn Pϑ (x ), Θ\B n where ∂Bn is the boundary of Bn . Using (25) gives further n − 2 supϑ ∈∂Bn P˜ϑ (xn ) ≤ P˜ϑ(x ˆ n ) (x ) exp −nR δ
Computable Bayesian Compression
63
k ˆ n )|2 and δ := inf ϑ ∈∂Bn with R− := inf ϑ ∈Bn R (ϑ )S (ϑ ) /|ϑ − ϑ(x lm lm l=1 ˆ n )|. Since the prior is universally lower-bounded by the Lebesgue mea|ϑ − ϑ(x sure, then (25) implies a bound for the denominator ˜ ) ≥ wP˜ˆ n (xn ) exp −nR+ |t|2 dt, P˜ (xn )dQ(ϑ ϑ(x ) Bn ϑ |t|<δ k ˆ n )|2 . Hence where w > 0 and R+ := supϑ ∈Bn R (ϑ )S (ϑ ) /|ϑ − ϑ(x lm lm l=1 we obtain an inequality for the ratio √ ˜ P˜ϑ (xn )dQ(ϑ) ˜ nR+ exp −nR− δ 2 /2 Θ\B n . ≤ ˜ n ˜ w |t|<δ√nR+ exp [−|t|2 ] dt B Pϑ (x )dQ(ϑ) n
The right-hand side tends to zero with n → ∞ since δ = Ω(n−1/2+α ) whereas R+ and R− tend to strictly positive constants by continuity and strictly positive definiteness of the Fisher information matrix.
4
Less Standard Examples
In this section we shall present less standard examples of statistical models. We begin with two very simple models. Example 2 (the data are the parameter). Put Pθ (xn ) := 1{xn =θn } for X = Y and let Q(θ) > 0 for θ ∈ Y∗ . This model is (n, m)-uniformly discretizable. Example 3 (a singleton model). Each parameter θ is random with respect to the prior Q concentrated on this parameter, Θ = {θ}. The respective singleton model (P , Q) is (0, 0)-uniformly discretizable. Now, a slightly more complex instance. Consider a class of stationary processes (Xi )i∈Z of form Xi := (Ki , θKi ), where the variables Ki are independent and distributed according to the hyperbolic distribution P (Ki = k) = p(k) :=
k −1/β , ζ(1/β)
k ∈ N,
(27)
with a fixed β ∈ (0, 1). This family of processes was introduced to model logical consistency of texts in natural language [19]. The distribution of variables Xi is equal to the measure P (Xi ∈ · ) = Pθ for the following Bayesian model. Example 4 (an accessible description model). Put Pθ (xn ) :=
n i=1
p(ki )1{zi =θk } i
for xi = (ki , zi ) ∈ N × Y and let Q(θ) > 0 for θ ∈ Y∗ .
(28)
64
Ł. Dębowski
For this model, Shannon information between the data and the parameter equals E (x,θ)∼T [− log Y (xn ) + log Pθ (xn )] = Θ(nβ ) asymptotically if Q(θ) = D−|θ| , cf. [19, Theorem 10]. As a consequence of the next statement, the accessible description model (28) is (nυ , m1/λ )-uniformly discretizable for υ > 2β/(1 − β) and λ < β. Proposition 3. For independent variables (Ki )i∈Z with the distribution (27), {K1 , K2 , ..., Kn } \ {1, 2, ..., nυ } = ∅, 1, 2, ..., nλ \ {K1 , K2 , ..., Kn } = ∅,
(29) (30)
n-eventually almost surely. Proof. To establish the first claim, put Un := nυ and observe = ∅) ≤ ∞ P ({K1 , K2 , ..., Kn } \ {1, 2, ..., Un } j=Un +1 P (j ∈ {K1 , K2 , ..., Kn }) ∞ ∞ n = j=Un +1 1 − (1 − p(j)) ≤ j=Un +1 np(j) ∞ 1−1/β n−1− n Un n ≤ for an > 0. k −1/β dk = ≤ ζ(1/β) Un ζ(1/β) 1/β − 1 ζ(1/β)(1/β − 1) ∞ Hence n=1 P ({K1 , K2 , ..., Kn } \ {1, 2, ..., Un } = ∅) < ∞ so (29) holds by the Borel-Cantelli lemma. As for the second claim, put Ln := nλ and observe n P ({1, 2, ..., Ln } \ {K1 , K2 , ..., Kn } = ∅) ≤ L ∈ {K1 , K2 , ..., Kn }) j=1 P (j L n n n = j=1 (1 − p(j)) ≤ Ln (1 − p(Ln )) = Ln exp [n log (1 − p(Ln ))] ≤ Ln exp [−np(Ln )] ≤ nβ exp [−n ] for an > 0. ∞ = ∅) < ∞ so (30) is also satisfied Hence n=1 P ({1, 2, ..., Ln } \ {K1 , K2 , ..., Kn } by the Borel-Cantelli lemma. To use the above statement for the Bayesian model, notice first that Pθ (xn ) > 0 for Pθ -almost all x. Hence equalities zi = θki and n m M T (xn , θm ) = yM ∈YM i=1 p(ki )1{zi =yki } k=1 1{θk =yk } Q(y ) M = Pθ (xn ) yM ∈YM k∈{k1 ,k2 ,...,kn }∪{1,2,...,m} 1{θk =yk } Q(y ) hold for Pθ -almost all x with M := max {m, k1 , k2 , ..., kn }. Consequently, Q(θm )Pθ (xn ) = T (xn , θm ) n
m
n
T (x , θ ) = Y (x )
if
{k1 , k2 , ..., kn } \ {1, 2, ..., m} = ∅,
(31)
if
{1, 2, ..., m} \ {k1 , k2 , ..., kn } = ∅.
(32)
Thus the model given in Example 4 is (nυ , m1/λ )-uniformly discretizable. The last example is not uniformly discretizable. It stems from the observation that any probability measure on X∞ can be encoded with a single sequence from Y∞ . Such parameter is not identifiable, however.
Computable Bayesian Compression
65
Example 5 (a model that contains all distributions). For simplicity let X = N and Y = {0, 1}. The link between θ and Pθ will be established by imposing equalities Pθ (λ) = 1 and ∞ n n−1 n−1 Pθ (x ) = Pθ (x )− Pθ (x y) · θφ(xn ,k) 2−k , (33) y<xn
k=1
where a recursive bijection φ : N+ × N → N is used. It is easy to see that Pθ is a probability measure on X∞ for each θ. Conversely, each probability measure on X∞ equals Pθ for at least one θ. −|θ| . Then the Bayesian measure Let the prior be the uniform measure Q(θ) := 2 Y = Pθ (x)dQ(θ) is recursive and equals n 1 n n−1 n−1 Y (x Y (x ) = )− Y (x y) =⇒ log2 Y (xn ) = − i=1 xi . 2 y<x n
Measure Y is not only optimal for all Q-random θ, in the sense of Pθ (LY ) = 1, but it is also optimal for a certain θ ∈ LQ that satisfies Pθ = Y . On the other hand, by the asymptotic equipartition property, Pθ (LY ) = 0 for stationary measures Pθ that have a different entropy rate than Y [15, Section 15.7].
5
Countable Unions of Models
Bayesian mixtures of uniformly discretizable models are uniformly discretizable under the additional condition (34), which says that Bayesian model selection is consistent for each θ ∈ Θ. Let us write θkm := θk θk+1 ...θm . Moreover, define T i and Y i via (1)–(2) for models (P i , Qi ) substituted for (P , Q) respectively. Theorem 4. Let models (P i , Qi ) be (μi , νi )-uniformly discretizable with kernels Pθi (x) for θ ∈ Θ i and i ∈ A, a countable set. For a prefix code c : A → Y+ , put Θ := i∈A c(i)Θ i . Consecutively, denote idx(θ) := i and trn(θ) := ϑ for idx(θ)
θ = c(i)ϑ ∈ Θ. Define the kernel Pθ (x) := Ptrn(θ) (x) for θ ∈ Θ and the prior Q := i∈A w(i)(Qi ◦ trn) for i∈A w(i) = 1 and w(i) > 0. The model (P , Q) is (μ, ν)-uniformly discretizable provided μ(n) := supi∈A (|c(i)| + μi (n)) < ∞, ν(m) := supi∈A νi (m − |c(i)|) < ∞, and lim Y (xn )/Y i (xn ) = w(i)
n→∞
(34)
for i = idx(θ), Pθ -almost all x, and all θ ∈ Θ. Remark: Assuming recursive models and mutually singular Pϑi , convergence (34) may fail only for θ that are not Q-random, cf. [20]. Put X :=
66
Ł. Dębowski
x : limn Y (xn )/Y i (xn ) = w(i) . By the ordinary martingale convergence, Y i (X ) = 1, whereas by convergence of recursive martingales [4, Theorem 3.1], X ⊃ LY i . Next, by [4, Cor. 4.3 & Thm. 5.3], we obtain LY i ⊃ LP i |ϑ for Qi -random ϑ. Hence Pθ (X ) = 1 if θ ∈ LQ and (35) holds true, in view of the Theorem 5 below. m Proof. Let i = idx(θ). Observe that T (xn , θm ) = w(i)T i (xn , θ|c(i)|+1 ) and m i m Q(θ ) = w(i)Q (θ|c(i)|+1 ) if m ≥ |c(i)|. Hence for Pθ -almost all x and m ≥ μ(n), we have m i m n Qi (θ|c(i)|+1 )Ptrn(θ) (xn ) Q(θ )P (x ) θ log = log = o(log m). m T (xn , θm ) T i (xn , θ|c(i)|+1 )
On the other hand, for Pθ -almost all x and n ≥ ν(m), " ! i n m T (xn , θm ) wi Y i (xn ) T (x , θ|c(i)|+1 ) lim = lim · = 1. m→∞ m→∞ Y (xn ) Y (xn ) Y i (xn ) A complementary result says that the set of random parameters with respect to the mixture is the union of the respective sets for the combined models. Theorem 5. Consider the models from Theorem 4 and suppose that Qi satisfy Qi (θk )/Qi (θm ) ≥ ack−m
(35)
for all k ≥ m ≥ 0 and certain constants c < 1 and a > 0. Then for g(n) = Ω(1) we have θ ∈ LQ,g(n) if and only if trn(θ) ∈ LQidx(θ) ,g(n) . Proof. Let i = idx(θ). The claim is true if |c(i)|+m |c(i)|+m K(θm ) + log Q(θm ) − K(θ|c(i)|+1 ) − log Qi (θ|c(i)|+1 ) = O(1) |c(i)|+m for m ≥ |c(i)|. The latter condition is satisfied since K(θm ) − K(θ|c(i)|+1 ) ≤ |c(i)|+m |c(i)| + O(1), whereas log Q(θm ) − log Qi (θ|c(i)|+1 ) ≤ |log w(i)| + O(|c(i)|) by
m ) and (35). Q(θm ) = w(i)Qi (θ|c(i)|+1
These propositions may be useful when we seek a compressor that is optimal for all random and certain nonrandom parameters with respect to a given prior. A possible solution is to find priors against which the originally considered nonrandom parameters are random. Suppose that these priors and the original prior yield uniformly discretizable models and consistent Bayesian selection among these models is feasible. Then Theorems 2, 4, and 5 guarantee that the Bayesian mixture of all considered models achieves the best enumerable compression for all requested parameters and no so many others!
Computable Bayesian Compression
67
Acknowledgement I would like to thank P. Gr¨ unwald, P. Harremo¨es, and J. Mielniczuk for discussions. Cordial acknowledgements are due to an anonymous referee for suggesting relevant references. They helped to improve this paper considerably. The research, supported under the PASCAL II Network of Excellence, IST-2002506778, was done during the author’s leave from the Institute of Computer Science, Polish Academy of Sciences.
References 1. Vovk, V.G., V’yugin, V.V.: On the empirical validity of the Bayesian method. J. Roy. Statist. Soc. B 55, 253–266 (1993) 2. Vovk, V.G., V’yugin, V.V.: Prequential level of impossibility with some applications. J. Roy. Statist. Soc. B 56, 115–123 (1994) 3. Vit´ anyi, P., Li, M.: Minimum description length induction, Bayesianism and Kolmogorov complexity. IEEE Trans. Inform. Theor. 46, 446–464 (2000) 4. Takahashi, H.: On a definition of random sequences with respect to conditional probability. Inform. Comput. 206, 1375–1382 (2008) 5. G´ acs, P.: On the symmetry of algorithmic information. Dokl. Akad. Nauk SSSR 15, 1477–1480 (1974) 6. Li, M., Vit´ anyi, P.M.B.: An Introduction to Kolmogorov Complexity and Its Applications, 2nd edn. Springer, Heidelberg (1997) 7. van Lambalgen, M.: Random Sequences. PhD thesis, Universiteit van Amsterdam (1987) 8. Barron, A., Rissanen, J., Yu, B.: The minimum description length principle in coding and modeling. IEEE Trans. Inform. Theor. 44, 2743–2760 (1998) 9. Gr¨ unwald, P.D.: The Minimum Description Length Principle. MIT Press, Cambridge (2007) 10. Yu, B., Speed, T.P.: Data compression and histograms. Probab. Theor. Rel. Fields 92, 195–229 (1992) 11. Hopcroft, J.E., Ullman, J.D.: Introduction to Automata Theory, Languages and Computation. Addison-Wesley, Reading (1979) 12. Elias, P.: Universal codeword sets and representations for the integers. IEEE Trans. Inform. Theor. 21, 194–203 (1975) 13. Barron, A.R.: Logically Smooth Density Estimation. PhD thesis, Stanford University (1985) 14. Dawid, A.: Statistical theory: The prequential approach. J. Roy. Statist. Soc. A 147, 278–292 (1984) 15. Cover, T.M., Thomas, J.A.: Elements of Information Theory. Wiley, Chichester (1991) 16. Li, L., Yu, B.: Iterated logarithmic expansions of the pathwise code lengths for exponential families. IEEE Trans. Inform. Theor. 46, 2683–2689 (2000) 17. Barndorff-Nielsen, O.E.: Information and Exponential Families. Wiley, Chichester (1978) 18. Jeffreys, H.: Theory of Probability, 3rd edn. Oxford University Press, Oxford (1961) 19. Dębowski, Ł.: On the vocabulary of grammar-based codes and the logical consistency of texts (2008) E-print, http://arxiv.org/abs/0810.3125 20. Csiszar, I., Shields, P.C.: The consistency of the BIC Markov order estimator. Ann. Statist. 28, 1601–1619 (2000)
Calibration and Internal No-Regret with Random Signals Vianney Perchet ´ Equipe Combinatoire et Optimisation, FRE 3232 CNRS, Universit´e Pierre et Marie Curie - Paris 6, 175 rue du Chevaleret, 75013 Paris
[email protected]
Abstract. A calibrated strategy can be obtained by performing a strategy that has no internal regret in some auxiliary game. Such a strategy can be constructed explicitly with the use of Blackwell’s approachability theorem, in an other auxiliary game. We establish the converse: a strategy that approaches a convex B-set can be derived from the construction of a calibrated strategy. We develop these tools in the framework of a game with partial monitoring, where players do not observe the actions of their opponents but receive random signals, to define a notion of internal regret and construct strategies that have no such regret.
1
Introduction
Consider an agent trying to predict a sequence of outcomes. For example, a meteorologist announces each day the probability that it will rain the following day. He will do this with a given accuracy (for instance, he chooses between {0, 0.1, 0.2, . . . , 1}). The predictions will be considered successful if on the days when the meteorologist forecasts 0.5, nearly half of these days are rainy and half sunny. And this should be true for every possible prediction. Foster and Vohra [6] called this property calibration and proved the existence of calibrated strategies, without any assumption on the sequence of outcomes and on the knowledge of the predictor. The first section deals with the connections between three tools: calibration, approachability and no-regret. The notion of regret in full monitoring has been introduced by Hannan [9]: a player has asymptotically no external regret if his average payoff could not have been better by knowing in advance the empirical distribution of moves of the other players. Hannan [9] proved the existence of such strategies and Blackwell [4] gave an alternative proof using his approachability theorem. Foster and Vohra [7] (see also Fudenberg and Levine [8]) extended Hannan’s result by proving the existence of strategies with no internal regret, which is a more precise notion: a player has asymptotically no internal regret, if for each of his action, he has no external regret on the set of stages where he played it. We refer to Cesa-Bianchi and Lugosi [5] for a survey on sequential prediction and regret. R. Gavald` a et al. (Eds.): ALT 2009, LNAI 5809, pp. 68–82, 2009. c Springer-Verlag Berlin Heidelberg 2009
Calibration and Internal No-Regret with Random Signals
69
A calibrated strategy can be obtained through the construction of a strategy with no internal regret in an auxiliary game (see Sorin [17]). And this construction can be done explicitly using Blackwell’s approachability theorem [3] for an orthant in IRd (see Hart and Mas-Colell [10]). We will provide a kind of converse result: we derive an explicit construction of an approachability strategy for a convex B-set through the use of a calibrated strategy, in some auxiliary game. In the second section, we consider repeated games with partial monitoring, where players do not observe the action of their opponents, but only receive random signals and we focus on strategies that have no regret, in the following sense. A player has asymptotically no external regret if his average payoff could not have been better by knowing in advance the empirical distribution of signals (see Rustichini [15]). The existence of strategies with no external regret was proved by Rustichini [15] and Lugosi, Mannor and Stoltz [14] constructed explicitly such strategies. Lehrer and Solan [13] defined a notion of internal regret in the partial monitoring framework and proved the existence of strategies with no such regret. We will generalize these results by constructing strategies that have no regret, for a more precise notion of regret.
2
Full Monitoring Case: Approachability Implies Calibration
This section is devoted to the full monitoring case. We recall the main results about calibration of Foster and Vohra [6], approachability of Blackwell [3] and regret of Hart and Mas-Colell [10]. We will prove some of these results in details, since they give the main ideas about the construction of strategies in the partial monitoring framework, given in section 4. 2.1
Calibration
Let S be a finite set of states. We consider a two-person repeated game where, at stage n ∈ IN, Nature (Player 2) chooses a state sn ∈ S and Predictor (Player 1) chooses μn ∈ Δ(S) the set of probabilities over S. We assume that μn belongs to a finite set M = {μ(l), l ∈ L}. Let ε > 0 such that for every probability μ ∈ Δ(S), there exists μ(l) ∈ M such that μ − μ(l) ≤ ε where Δ(S) is seen as a subset of IR|S| . Then M is called an ε-grid of Δ(S). With this notations, the prediction at stage n is the choice of an element ln ∈ L, called the type of that stage. The choices of ln and sn are functions of the past observations (or the finite history) hn−1 = (l1 , s1 , . . . , ln−1 , sn−1 ) and may be random. Explicitly, the set n 0 of finite histories is denoted by H = n∈IN (L × S) , with (L × S) = ∅ and a strategy σ of Player 1 (resp. τ of Player 2) is a function from H to Δ(L) (resp. Δ(S)) and σ(hn ) (resp. τ (hn )) is the law of ln+1 (resp. sn+1 ) after hn . A couple IN of strategies (σ, τ ) generates a probability, denoted by IPσ,τ , over H = (L × S) , the set of plays embedded with the cylinder σ-field.
70
V. Perchet
We will use the following notations. For any families {am ∈ IRd , lm ∈ L}m∈IN and n ∈ IN, Nn (l) = {1 ≤ m ≤ n, lm = l} is the set of stages of type l (before the n-th), an (l) = m∈Nn (l) am /|Nn (l)| is the average of {am } on this set and n an = m=1 am /n is the average over all the stages (before the n-th). Definition 1. Foster-Vohra [6] A strategy σ of Player 1 is calibrated (with respect to the ε-grid M) if for every l ∈ L and every strategy τ of Player 2: |Nn (l)| sn (l) − μ(l)2 − ε2 ≤ 0, IPσ,τ -as . lim sup n n→+∞ In words, a strategy of Player 1 is calibrated if, on the set of stages where μ(l) is forecast, the empirical distribution of states is asymptotically close to μ(l) (as long as the frequency of l is not too small). Foster-Vohra [6] proved the existence of such strategies with an algorithm based on the Expected Brier Score. 2.2
Approachability
We will prove that calibration will follow from no-regret and that no-regret will follow from approachability (following respectively Sorin [17] and Hart and Mas-Colell [10]). We present here the notion of approachability introduced by Blackwell [3]. Consider a two-person repeated game in discrete time with vector payoffs, where at stage n ∈ IN, Player 1 (resp. Player 2) chooses the action in ∈ I (resp. jn ∈ J), with both I and J finite. The corresponding vector payoff is ρn = ρ(in , jn ) where ρ : I × J → IRd . As usual, a strategy σ (resp.τ ) of Player 1 n (resp. Player 2) is a function from the set of finite histories H = n∈IN (I × J) to Δ(I) (resp. Δ(J)). For a closed set E ⊂ IRd and δ ≥ 0, we denote by E δ = {z ∈ IRd , dE (z) ≤ δ} the δ-neighborhood of E and ΠE (z) = {e ∈ E, dE (z) = z − e} the set of closest point to z in E, where dE (z) = inf e∈E z − e. Definition 2. i) A closed set E ⊂ IRd is approachable by Player 1 if for every ε > 0, there exists a strategy σ of Player 1 and N ∈ IN, such that for every strategy τ of Player 2 and every n ≥ N : Eσ,τ [dE (ρn )] ≤ ε and IP sup dE (ρn ) ≥ ε ≤ ε . n≥N
Such a strategy is called an approachability strategy of E. ii) A set E is excludable by Player 2, if there exists δ > 0 such that the complement of E δ is approachable by Player 2. In words, a set E ⊂ IRd is approachable by Player 1, if he has a strategy such that the average payoff converges almost surely to E, uniformly with respect to the strategies of Player 2. Blackwell [3] gave a sufficient geometric condition for a closed set E to be approachable by Player 1. Denote by P 1 (x) = {ρ(x, y), y ∈ Δ(J)}, the set of expected payoffs compatible with x ∈ Δ(I) and define similarly P 2 (y).
Calibration and Internal No-Regret with Random Signals
71
Definition 3. A closed subset E of IRd is a B-set, if for every z ∈ IRd , there exist p ∈ ΠE (z) and x (= x(z, p)) ∈ Δ(I) such that the hyperplane through p and perpendicular to z − p separates z from P 1 (x), or formally: ∀z ∈ IRd , ∃p ∈ ΠE (z), ∃x ∈ Δ(I), ρ(x, y) − p, z − p ≤ 0,
∀y ∈ Δ(J) .
(1)
Informally, from any point z outside E there is a closest point p and a probability x ∈ Δ(J) such that, whatever being the choice of Player 2, the expected payoff and z are on different sides of the hyperplane through p and perpendicular to z − p. In fact, this definition (and the following theorem) does not require that J is finite: one can assume that Player 2 chooses an outcome vector U ∈ [−1, 1]|I| so that the expected payoff is x, U . Theorem 1. Blackwell [3] If E is a B-set, then E is approachable by Player 1. Moreover, the strategy σ of Player 1 defined by σ(hn ) = x(ρn ) is such that, for every strategy τ of Player 2: 4B 8B 2 and IPσ,τ sup dE (ρn ) ≥ η ≤ 2 , (2) Eσ,τ [dE (ρn )] ≤ n η N n≥N with B = supi,j ρ(i, j)2 . In the case of a convex set C, there is a complete characterization: Corollary 1. Blackwell [3] A closed convex set C ⊂ IRd is approachable by Player 1 if and only if: P 2 (y) ∩ C = ∅,
∀y ∈ Δ(J) .
(3)
In particular, a closed convex set C is either approachable by Player 1, or excludable by Player 2. Remark 1. Corollary 1 implies that there are (at least) two different ways to prove that a convex set is approachable. The first one, called direct proof, consists in proving that C is a B-set while the second one, called undirect proof, consists in proving that C is not excludable by Player 2, which reduces to find, for every y ∈ Δ(J), some x ∈ Δ(I) such that ρ(x, y) ∈ C. 2.3
Approachability Implies Internal no-Regret
Consider a two-person repeated game in discrete time, where at stage n ∈ IN Player 1 chooses in ∈ I as above and Player 2 chooses a vector Un ∈ [−1, 1]c (with c = |I|). The associated payoff is Unin , the in -th coordinate of Un . The internal regret of the stage is the matrix Rn = R(in , Un ), where the function 2 R : I × [−1, 1]c → IRc is defined by: 0 if i = i (i ,j) = R(i, U ) j i U − U otherwise.
72
V. Perchet
With this definition, the average internal regret Rn is defined by: j
i |Nn (i)| m∈Nn (i) Um − Um j i Rn = = U n (i) − U n (i) j∈I . n n i∈I i,j∈I
Definition 4. Foster and Vohra [7]: A strategy σ of Player 1 is internally consistent if for any strategy τ of Player 2: lim sup Rn ≤ 0, n→∞
IPσ,τ -as .
The existence of such strategies have been proved by Foster and Vohra [7] and Fudenberg and Levine [8]. Theorem 2. There exist internally consistent strategies. Note that an internally consistent strategy can be obtained by constructing a 2 strategy that approaches the negative orthant Ω = IRc− in the auxiliary game where the vector payoff at stage n is Rn . The proof of Hart and Mas-Colell [10] of the fact that Ω is a B-set relies on the two followings lemmas: Lemma 1 gives a geometric property of Ω and Lemma 2 gives a property of the function R. 2
Lemma 1. Let ΠΩ (·) be the projection onto Ω. Then, for every A ∈ IRc : ΠΩ (A), A − ΠΩ (A) = 0 .
(4)
2
Proof. Note that since Ω = IRc− then A+ = A−ΠΩ (A) where A+ ij = max (Aij , 0) and similarly A− = ΠΩ (A). The result is just a rewriting of A− , A+ = 0. For every non-negative (c × c)-matrix A = (aij )i,j∈I , λ ∈ Δ(L) is an invariant probability of A if for every i ∈ I:
λ(j)aji = λ(i) aij . j∈I
j∈I
The existence of an invariant probability follows from the similar result for Markov chains. Lemma 2. Let A = (aij )i,j∈I be a non-negative matrix. Then for every λ, invariant probability of A, and every U ∈ IRc : A, Eλ [R(·, U )] = 0 .
(5)
Proof. The (i, j)-th coordinate of Eλ [R(·, U )] is λ(l) U j − U i , therefore:
aij λ(i) U j − U i A, Eλ [R(·, U )] =
i,j∈I
and the coefficient of each U i is j∈I aij λ(i) − j∈I aji λ(j) = 0, because λ is an invariant measure of A. Therefore A, Eλ [R(·, U )] = 0.
Calibration and Internal No-Regret with Random Signals
73
Proof of Theorem 2. Summing equations (4) (with A = Rn ) and (5) (with + A = Rn ) gives: Eλn [R(·, U )] − ΠΩ (Rn ), Rn − ΠΩ (Rn ) = 0 , +
for every λn invariant probability of Rn and every U ∈ [−1, 1]I . Define the strategy σ of Player 1 by σ(hn ) = λn . The expected payoff at stage n + 1 (given hn and Un+1 = U ) is Eλn [R(·, U )], so Ω is a B-set and is approachable by Player 1. Remark 2. The construction of the strategy is based on approachability properties therefore the convergence is uniform with respect to the strategies of Player 2. Theorem 1 implies that for every η > 0, and for every strategy τ of Player 2: |Nn (i)| 1 IPσ,τ ∃n ≥ N, ∃i, j ∈ i, U n (i)j − U n (i)i > η = O n η2 N + |Nn (l)| 1 and Eσ,τ sup . =O √ U n (i)j − U n (i)i n n i∈I 2.4
Internal Regret Implies Calibration
Sorin [17] proved that the construction of calibrated strategy can be reduced to the construction of internally consistent strategy. The proof relies on the following lemma: Lemma 3. Let (am )m∈IN be a sequence in IRd and α, β two points in IRd . Then for every n ∈ IN∗ : n 2 2 2 2 m=1 am − β2 − am − α2 = an − β2 − an − α2 , (6) n with · 2 the L2 -norm of IRd . Proof. Develop the sums in equation (6) to get the result.
Now, we can prove the following: Theorem 3. Foster and Vohra [6] For every finite grid of Δ(S), there exist calibrated strategies of Player 1. Proof. We start with the framework described in 2.1. Consider the auxiliary two-person game with vector payoff defined as follows. At stage n ∈ IN, Player 1 (resp. Player 2) chooses the action ln ∈ L (resp. sn ∈ S) which generates the payoff Rn = R(ln , Un ) ∈ IRd , where R is as in 2.3, with: 2 Un = − sn − μ(l)2 ∈ IRc . l∈L
74
V. Perchet
By definition of R and using Lemma 3, for every n ∈ IN∗ : 2 2 |Nn (l)| lk m∈Nn (l) sm − μ(l)2 − sm − μ(k)2 Rn = n |Nn (l)| |Nn (l)| 2 2 sn (l) − μ(l)2 − sn (l) − μ(k)2 . = n Let σ be an internally consistent strategy in this auxiliary game, then for every l ∈ L and k ∈ L: |Nn (l)| 2 2 lim sup sn (l) − μ(l)2 − sn (k) − μ(k)2 ≤ 0, IPσ,τ -as . n n→∞ Since {μ(k), k ∈ L} is a ε-grid of Δ(S), for every l ∈ L, and every n ∈ IN∗ , there exists k ∈ L such that sn (l) − μ(k)22 ≤ ε2 , hence: lim sup n→∞
|Nn (l)| 2 sn (l) − μ(l)2 − ε2 ≤ 0, n
IPσ,τ -as .
Remark 3. We have proved that σ is such that, for every l ∈ L, sn (l) is closer to μ(l) than to any other μ(k), as soon as |Nn (l)|/n is not too small. The fact that sn belongs to a finite set S and {μ(l)} are probabilities over S is irrelevant: one can show that for any finite set {a(l) ∈ IRd , l ∈ L}, Player 1 has a strategy σ such that for any bounded sequence (am )m∈IN in IRd and for every l and k : |Nn (l)| an (l) − a(l)2 − an (l) − a(k)2 ≤ 0 . lim sup n n→∞
3
Calibration Implies Approachability
The proof of Theorem 3 shows that the construction of a calibrated strategy can be obtained through an approachability strategy of an orthant in an auxiliary game. Conversely, we will show that the approachability of a convex B-set can be reduced to the existence of a calibrated strategy in an auxiliary game, and so give a new proof of Corollary 1. Alternative proof of Corollary 1. The idea of the proof is very natural: given ε > 0, we construct a finite covering {Y (l), l ∈ L} of Δ(J) and associate to Y (l) a probability x(l) ∈ Δ(I) such that ρ(x(l), y) ∈ C ε for every y ∈ Y (l). Player 1 will always choose his action accordingly to one of the {x(l)}. Assume that on the stages when Player 1 played x(l), the empirical action of Player 2 is in Y (l), then the average payoff on these stages is in the convex set C ε (by linearity of ρ). And if this property is true for every l ∈ L, then the average payoff is also in C ε (by convexity).
Calibration and Internal No-Regret with Random Signals
75
Formally, assume that condition (3) is satisfied and rephrased as: ∀y ∈ Δ(J), ∃x(= xy ) ∈ Δ(I), ρ(xy , y) ∈ C .
(7)
Since ρ is multilinear and therefore continuous on Δ(I) × Δ(J), for every ε > 0, there exists δ > 0 such that: ∀y, y ∈ Δ(J), y − y 2 ≤ 2δ ⇒ ρ(xy , y ) ∈ C ε . We introduce the auxiliary game Γ where Player 2 chooses action (or state) j ∈ J and Player 1 forecasts it, using {y(l), l ∈ L}, a finite grid of Δ(J) whose diameter is smaller than δ. Let σ be a calibrated strategy for Player 1, so that jn (l), the empirical distribution of actions of Player 2 on Nn (l), is asymptotically δ-close to y(l). Define the strategy of Player 1 in the initial game by performing σ and if ln = l by playing accordingly to xy(l) = x(l) ∈ Δ(I), as depicted in (7). Since the choices of actions of the two players are independent, ρn (l) will be close to ρ (x(l), jn (l)), hence close to ρ(x(l), y(l)) and finally close to C ε , as soon as |Nn (l)| is not too small. Indeed, by construction of σ, for every η > 0 there exists N 1 ∈ IN such that, for every strategy τ of Player 2: |Nn (l)| 2 jn (l) − y(l)2 − δ 2 ≤ η ≥ 1 − η . (8) IPσ,τ ∀l ∈ L, ∀n ≥ N 1 , n Hoeffding-Azuma inequality for sum of bounded martingale differences (see [2,11]) implies that for any η ∈ (0, 1) with probability at least 1 − η, 2 2 ln , |ρn (l) − ρ(x(l), jn (l)| ≤ |Nn (l)| η and therefore there exists N 2 ∈ IN such that for every l ∈ L: IPσ,τ ∀m ≥ n, |ρn (l) − ρ(x(l), jn (l))| ≤ η |Nn (l)| ≥ N 2 ≥ 1 − η .
(9)
Equations (8) and (9), taken with η ≤ ε/L, imply that, with probability at least 2 1 − 2ε, for every n ≥ max{N 1 , LN 2 /ε}, |ρn (l) − ρ(x(l), jn (l))| ≤ η ≤ ε, and 2 2 2 if Nn (l)/n ≥ ε/L then |Nn (l)| > N , so jn (l) − y(l) ≤ 2δ , and therefore dC (ρn (l)) ≤ 2ε. Since C is a convex set, dC (·) is convex and with probability at least 1 − 2ε:
|Nn (l)|
|Nn (l)| dC (ρn (l)) dC (ρn ) = dC ρn (l) ≤ n n l∈L
≤
l:Nn (l)/n≥ε/L
≤ 2ε + ε = 3ε.
l∈L
|Nn (l)| dC (ρn (l)) + n
l:Nn (l)/n<ε/L
|Nn (l)| n
76
V. Perchet
Therefore C is approachable by Player 1. On the other hand, if there exists y such that P 2 (y) ∩ C = ∅, then Player 2 can approach P 2 (y), by playing at every stage accordingly to y. Therefore C is not approachable by Player 1. Remark 4. Blackwell’s proof of this result is not explicit. He showed that the condition (7) implies that C is a B-set and his proof relies on the use of Von Neumann’s minmax theorem. In words, let z be a fixed point outside C. Assume that if Player 1 knows y ∈ Δ(J) the law of the action of Player 2, then there is a law xy ∈ Δ(I) such that the expected payoff ρ(xy , y) and z are in different sides of the hyperplane described in the definition of a B-set. The minmax theorem implies that there exists x ∈ Δ(I) such that for every y ∈ Δ(I), z and ρ(x, y) are on different sides and therefore C is a B-set. This gives the existence of an approachability strategy of C. One of the major interest in calibration, is that it transforms this implicit proof into an explicit constructive proof: while performing a calibrated strategy (in an auxiliary game where J plays the role of the set of states), Player 1 can enforce the property that, for every l ∈ L, the average move of Player 2 is almost y(l) on Nn (l). So he just has (and could not do better) to play xy(l) on these stages. Remark 5. 1) Hoeffding-Azuma’s inequality for sums of bounded martingale differences implies that for every strategy τ of Player 2: |Nn (l)| ln(n) Eσ,τ sup |ρn (l) − ρ (x(l), y n (l))| = O n n l∈L The strategy σ is based on approachability properties and on HoeffdingAzuma’s inequality, so one can show that: ln(n) . Eσ,τ [dC (ρn ) − ε] ≤ O n 2) To deduce that ρn is in C ε from the fact that ρn (l) is in C ε for every l ∈ L, it is necessary that C (or dC (·)) is convex.
4
Internal Regret in the Partial Monitoring Framework
Consider a two person repeated game in discrete time. At stage n ∈ IN, Player 1 (resp. Player 2) chooses in ∈ I (resp. jn ∈ J), which generates the payoff ρn = ρ(in , jn ) with ρ : I × J → IR. Player 1 does not observe this payoff, he just receives a signal sn ∈ S whose law is s(in , jn ) with s : I × J → Δ(S). The three sets I, J and S are finite, the two functions ρ and s are extended multilineary to Δ(I) × Δ(J) and we define s : Δ(J) → Δ(S)I by s(y) = (s(i, y))i∈I , where Δ(S)I is the set of vectors of probability over S. We call any such vector a flag. As usual, a strategy σ of Player 1 (resp. τ of Player 2) is a function from the
Calibration and Internal No-Regret with Random Signals
77
set of finite histories for Player 1, H 1 = n∈IN (I × S)n , to Δ(I) (resp. from n H 2 = n∈IN (I × S × J) to Δ(J)). A couple (σ, τ ) generates a probability IPσ,τ IN over H = (I × S × J) . 4.1
External Regret
Rustichini [15] defined the regret in the partial monitoring framework as follows: a strategy σ of Player 1 has no external regret if IPσ,τ -as: lim sup max
min ρ(x, y) − ρn ≤ 0 . y ∈ Δ(J), ⎩ s(y) = s(j ) n
⎧ n→+∞ x∈Δ(I) ⎨
where s(jn ) ∈ Δ(S)I is the average flag. In words, the average payoff of Player 1 could not have been better uniformly if he had known the average distribution of flags before the beginning of the game. In this framework, given a flag μ ∈ Δ(S)I , the function miny∈s−1 (μ) ρ(·, y) may not be linear. So the best response of Player 1 might not be a pure action in I, but a mixed action x ∈ Δ(I) and any pure action in the support of x may be a bad response. Note that this also appears in Rustichini’s definition, since the maximum is taken over Δ(I) and not just over I as in the usual definition of external regret in full monitoring. 4.2
Internal Regret
We consider here a generalization of the previous’s framework: At stage n ∈ IN, Player 2 chooses a flag μn ∈ Δ(S)I while Player 1 chooses an action in and receives a signal sn whose law is the in -th coordinate of μn . Given a flag μ and x ∈ Δ(I), Player 1 evaluates the payoff through an evaluation function G : Δ(I) × Δ(S)I → IR, which is not necessarily linear. There are two requirements to define internal regret: we have to define a finite partition of IN and for every element of that partition, Player 1 must choose a point in Δ(I) that is a best response (or at least an ε-best response) to some flag. Hence we have to distinguish the stages, not as a function of the action played, but as a function of the law of the action. We also assume that the strategy of Player 1 can be described by a finite family {x(l) ∈ Δ(I), l ∈ L} such that, at stage n ∈ IN, Player 1 chooses a type ln and the law of its action in is x(ln ). Definition 5. Lehrer-Solan [13] For every n ∈ IN and every l ∈ L, the average internal regret of type l at stage n is Rn (l) = sup [G(x, μn (l)) − G(ın (l), μn (l))] . x∈Δ(I)
A strategy σ of Player 1 is (L, ε)-internally consistent if for every strategy τ of Player 2: |Nn (l)| lim sup Rn (l) − ε ≤ 0, ∀l ∈ L, IPσ,τ -as . n n→+∞
78
V. Perchet
Remark 6. Note that this definition is not intrinsic (unlike in the full monitoring case) since it depends on the choice of {x(l), l ∈ L}, and is based uniquely on the potential observations (ie the sequences of flags (μn )n∈IN ) of Player 1. In order to construct (L, ε)-internally consistent strategies, some regularity over G is required: Assumption 1. For every ε > 0, there exist μ(l) ∈ Δ(S)I , x(l) ∈ Δ(I), l ∈ L two finite families and η, δ > 0 such that: 1. Δ(S)I ⊂ l∈L B(μ(l), δ); 2. For every l ∈ L, if x − x(l) ≤ 2η and μ − μ(l) ≤ 2δ, then x ∈ BRε (μ), where BRε (μ) = x ∈ Δ(I) : G(x, μ) ≥ supz∈Δ(I) G(z, μ) − ε is the set of ε best response to μ ∈ Δ(S)I and B(μ, δ) = μ ∈ Δ(S)I , μ − μ ≤ δ . In words, Assumption 1 implies that G is regular with respect to μ and with respect to x: given ε, the set of flags can be covered by a finite number of balls centered in {μ(l)}, such that x(l), a best response to μ(l), is an ε-best response to any μ in this ball. And if x is close enough to x(l), then x is also an ε-best response to any μ close to μ(l). Theorem 4. If G fulfills Assumption 1, there exist (L, ε)-internally consistent strategies. Some parts of the proof are quite technical, however the insight is very simple, so we give firstly the main ideas. First assume that, in the one stage game, μ ∈ Δ(S)I is observed by Player 1, then there exists x ∈ Δ(I) such that x ∈ BR (μ). Using an minmax argument, like Blackwell did for the proof of Corollary 1, one could prove that Player 1 has an (L, ε)-internally consistent strategy (as did Lehrer and Solan [13]). The idea is to use calibration, as in the alternative proof of Corollary 1, to transform this implicit proof into a constructive proof. Fix ε > 0 and assume for the moment that Player 1 observes each μn . Consider the game where Player 1 predicts the sequence (μn )n∈IN using the δ-grid {μ(l), l ∈ L} given by Assumption 1. A calibrated strategy of Player 1 chooses a sequences (ln )n∈IN in such a way that μn (l) is asymptotically δ-close to μ(l). Hence Player 1 just has to play accordingly to x(l) ∈ BRε (μ(l)) on these stages. Indeed, since the choices of action are independent, ın (l) will be asymptotically η-close to x(l) and the regularity of G will imply then that ın (l) ∈ BRε (μn (l)) and so the strategy will be (L, ε)-internally consistent. The only issue is that in the current framework the signal depends on the action of Player 1 who does not observe μn . The existence of calibrated strategies is therefore not straightforward. However, it is well known that, up to a slight perturbation of x(l), the information available to Player 1 after a long time is close to μn (l) (as in the multi-armed bandit problem, some calibration and no-regret frameworks, see chapter 6 in [5] for a survey on these techniques).
Calibration and Internal No-Regret with Random Signals
79
For every x ∈ Δ(I), define xη ∈ Δ(I), the η-perturbation of x by xη = (1 − η)x + ηu with u the uniform probability over I and for every stage n of type l, define sn by: sn = (0, . . . , 0,
sn , 0, . . . , 0) , xη (l)[in ]
with xη (l)[in ] the weight put by xη (l) on in and denote by sn (l), the average of { sm } on Nn (l): Lemma 4. For every θ > 0, there exists N ∈ IN such that, for every l ∈ L: sn (l) − μn (l) ≤ θ| Nn (l) ≥ N ) ≥ 1 − θ . IPσ,τ (∀m ≥ n, Proof. Since for every n ∈ IN, the choices of in and μn are independent:
s ,...,0 Eσ,τ [ sn | hn−1 , ln , μn ] = μin [s]xη (ln )[i] 0, . . . , xη (ln )[i] i∈I s∈S
0, . . . , μin , . . . , 0 μin [s] (0, . . . , s, . . . , 0) = = i∈I s∈S
= μ1n , . . . , μIn = μn .
i∈I
Therefore sn (l) is an unbiased estimator of μn (l) and Hoeffding-Azuma’s inequality implies that for every θ > 0 there exists N ∈ IN such that, for every l ∈ L: sn (l) − μn (l) ≤ θ| Nn (l) ≥ N ) ≥ 1−θ . IPσ,τ (∀m ≥ n,
Assume now that Player 1 uses a calibrated strategy to predict the sequences of sn (this game is in full monitoring), then he knows that asymptotically sn (l) is closer to μ(l) than to any μ(k) (as soon as the frequency of l is big enough), therefore it is δ-close to μ(l). Lemma 4 implies that μn (l) is asymptotically close to sn (l) and therefore 2δ-close to μ(l). Note that instead of trying to compute the sequence of payoffs from the signals, we consider an auxiliary game defined on the signal space (ie the observations) so that this new game is in fact (almost) in full monitoring. Proof of Theorem 4. Consider the families {x(l) ∈ Δ(I), μ(l) ∈ Δ(S)I , l ∈ L} and δ > 0 given by Assumption 1 for a fixed ε > 0. Let Γ be the auxiliary repeated game where at stage n Player 1 (resp Player 2) chooses ln ∈ L (resp. μn ∈ Δ(S)I ). Given these choices, in (resp. sn ) is drawn accordingly to xη (ln ) (resp. μinn ). By Lemma 4, for every θ > 0, there exists N 1 ∈ IN such that for every l ∈ L: (10) sn (l) − μn (l) ≤ θ| Nn (l) ≥ N 1 ≥ 1 − θ . IPσ,τ (∀m ≥ n, Let σ be a calibrated strategy associated to ( sn )n∈IN in Γ . For every θ > 0, 2 there exists N ∈ IN such that with IPσ,τ -probability greater than 1 − θ:
80
V. Perchet
|Nn (l)| ∀n ≥ N , ∀l, k ∈ L, n
2
2
sn (l) − μ(l) − sn (l) − μ(k)
2
≤θ .
(11)
Since {μ(k), k ∈ L} is a grid of Δ(S)I , for every n ∈ IN and l ∈ L, there exists k ∈ L such that sn (l) − μ(k) ≤ δ. Therefore, combining equation (10) and (11), for every θ > 0 there exists N 3 ∈ IN such that: |Nn (l)| 2 μn (l) − μ(l) − δ 2 ≤ θ, ≥ 1 − θ . (12) IPσ,τ ∀n ≥ N 3 , ∀l ∈ L, n For every stage of type l ∈ L, in is drawn accordingly to xη (l) and by definition xη (l) − x(l) ≤ η. Therefore Hoeffding-Azuma’s inequality implies that, for every θ > 0 there exists N 4 ∈ IN such that: |Nn (l)| ın (l) − x(l) − η ≤ θ, ≥ 1 − θ . (13) IPσ,τ ∀n ≥ N 4 , ∀l ∈ L, n Combining equation (12), (13) and using Assumption 1, for every θ > 0, there exists N ∈ IN such that for every strategy τ of Player 2: |Nn (l)| Rn (l) − ε ≤ θ, ≥ 1 − θ , IPσ,τ ∀n ≥ N, ∀l ∈ L, (14) n
and σ is (L, ε)-internally consistent.
Remark 7. The strategy constructed is based on δ-calibration and HoeffdingAzuma’s inequality, therefore one can show that: |Nn (l)| ln(n) Eσ,τ sup Rn (l) − ε . ≤O n n l∈L
4.3
Back to Payoff Space
Assumption 1 can be fulfilled with some continuity assumptions over G: Proposition 1. Let G : Δ(I) × Δ(S)I be such that for every μ ∈ Δ(S)I , G(·, μ) is continuous and the family of function {G(x, ·), x ∈ Δ(I)} is equicontinuous. Then G fulfills Assumption 1. Proof. Since {G(x, ·), x ∈ Δ(I)} is equicontinuous and Δ(S)I compact, for every ε > 0, there exists δ > 0 such that: ∀x ∈ Δ(I), ∀μ, μ ∈ Δ(S)I , μ − μ ≤ 2δ ⇒ |G(x, μ) − G(x, μ )| ≤
ε . 2
Let {μ(l), l ∈ L} be a finite δ-grid of Δ(S)I and for every l ∈ L, x(l) ∈ BR(μ(l)) so that G(x(l), μ(l)) = maxz∈Δ(I) G(z, μ(l)). Since G(x(l), ·) is continuous, there exists η(l) > 0 such that: x − x(l) ≤ η(l) ⇒ |G(x, μ(l)) − G(x(l), μ(l))| ≤ ε/2 .
Calibration and Internal No-Regret with Random Signals
81
Define η = minl∈L η(l) and let x ∈ Δ(I), μ ∈ Δ(S)I and l ∈ L such that x − x(l) ≤ η and μ − μ(l) ≤ δ, then: G(x, μ) ≥ G(x, μ(l)) −
ε ≥ G(x(l), μ(l)) − ε = max G(z, μ(l)) − ε , 2 z∈Δ(I)
and x ∈ BRε (μ).
This proposition implies that the evaluation function used by Rustichini fulfills Assumption 1 (Lugosi, Mannor and Stoltz [14]). Before proving that, we introduce S, the range of s, which is a closed convex subset of Δ(S)I , and ΠS (·) the projection onto it. Corollary 2. Define G : Δ(I) × Δ(S)I → IR by: inf y∈s−1 (μ) ρ(x, y) if μ ∈ S G(x, μ) = G (x, ΠS (μ)) otherwise. Then G fulfills Assumption 1.
Proof. The function s can be extended linearly to IR|J| by s(y) = j∈J y(j)s(j) where y = (y(j))j∈J . Therefore, by Aubin and Frankowska [1] (Theorem 2.2.1, page 57), the multivalued application s−1 : S → Δ(J)I is λ-Lipschitz, and since ΠS is 1-Lipschitz (because S is convex), G(x, ·) is also λ-Lipschitz, for every x ∈ Δ(I). Therefore, {G(x, ·), x ∈ Δ(I)} is equicontinuous. For every μ ∈ Δ(S)I , G(·, μ) is 1-Lipschitz (see [14]), therefore continuous. Hence, by Proposition 1, G fulfills Assumption 1. Concluding Remarks The definitions and proofs rely uniquely on Assumption 1: it is not relevant to assume that Player 1 faces only one opponent nor that the action set of its opponent is finite. The only requirement is that given his information (a probability in Δ(I) and a flag in Δ(S)I ), Player 1 can evaluate his payoff, no matter how this payoff is obtained: for example we could have assumed that Player 2 chooses at each stage an (unobserved) outcome vector U ∈ [−1, 1]|I| and Player 1 chooses a coordinate, which is his observed payoff. In the full monitoring framework, many improvements have been made in the past years about calibration and regret (see for instance [12,16,18]). Here, we aimed to clarify the links between the original notions of approachability, internal regret and calibration in order to extend applications (in particular, to get rid of the finiteness of J), to define the internal regret (with signals) as calibration over an appropriate space and to give a proof derived from no-internal regret (in full monitoring), itself derived from the approachability of an orthant in this space. Acknowledgments. I deeply thanks my advisor Sylvain Sorin for its great help and numerous comments. I also acknowledge helpful remarks from Eilon Solan and Gilles Stoltz.
82
V. Perchet
References 1. Aubin, J.-P., Frankowska, H.: Set-valued Analysis. Birkh¨ auser Boston Inc., Basel (1990) 2. Azuma, K.: Weighted sums of certain dependent random variables. Tˆ ohoku Math. J. 19(2), 357–367 (1967) 3. Blackwell, D.: An analog of the minimax theorem for vector payoffs. Pacific J. Math. 6, 1–8 (1956) 4. Blackwell, D.: Controlled random walks. In: Proceedings of the International Congress of Mathematicians, 1954, Amsterdam, vol. III, pp. 336–338 (1956) 5. Cesa-Bianchi, N., Lugosi, G.: Prediction, Learning, and Games. Cambridge University Press, Cambridge (2006) 6. Foster, D.P., Vohra, R.V.: Asymptotic calibration. Biometrika 85, 379–390 (1998) 7. Foster, D.P., Vohra, R.V.: Regret in the on-line decision problem. Games Econom. Behav. 29, 7–35 (1999) 8. Fudenberg, D., Levine, D.K.: Conditional universal consistency. Games Econom. Behav. 29, 104–130 (1999) 9. Hannan, J.: Approximation to Bayes risk in repeated play. In: Contributions to the theory of Games. Annals of Mathematics Studies, vol. 3(39), pp. 97–139. Princeton University Press, Princeton (1957) 10. Hart, S., Mas-Colell, A.: A simple adaptive procedure leading to correlated equilibrium. Econometrica 68, 1127–1150 (2000) 11. Hoeffding, W.: Probability inequalities for sums of bounded random variables. J. Amer. Statist. Assoc. 58, 13–30 (1963) 12. Lehrer, E.: A wide range no-regret theorem. Games Econom. Behav. 42, 101–115 (2003) 13. Lehrer, E., Solan, E.: Learning to play partially-specified equilibrium (manuscript, 2007) 14. Lugosi, G., Mannor, S., Stoltz, G.: Strategies for prediction under imperfect monitoring. Math. Oper. Res. 33, 513–528 (2008) 15. Rustichini, A.: Minimizing regret: the general case. Games Econom. Behav. 29, 224–243 (1999) 16. Sandroni, A., Smorodinsky, R., Vohra, R.V.: Calibration with many checking rules. Math. Oper. Res. 28, 141–153 (2003) 17. Sorin, S.: Lectures on Dynamics in Games. Unpublished Lecture Notes (2008) 18. Vovk, V.: Non-asymptotic calibration and resolution. Theoret. Comput. Sci. 387, 77–89 (2007)
St. Petersburg Portfolio Games L´ aszl´o Gy¨ orfi and P´eter Kevei Department of Computer Science and Information Theory Budapest University of Technology and Economics Magyar Tud´ osok k¨ or´ utja 2., Budapest, Hungary, H-1117
[email protected] Analysis and Stochastics Research Group Hungarian Academy of Sciences Aradi v´ertan´ uk tere 1, Szeged, Hungary, H-6720
[email protected]
Abstract. We investigate the performance of the constantly rebalanced portfolios, when the random vectors of the market process {Xi } are independent, and each of them distributed as (X (1) , X (2) , . . . , X (d) , 1), d ≥ 1, where X (1) , X (2) , . . . , X (d) are nonnegative iid random variables. Under general conditions we show that the optimal strategy is the uniform: (1/d, . . . , 1/d, 0), at least for d large enough. In case of St. Petersburg components we compute the average growth rate and the optimal strategy for d = 1, 2. In order to make the problem non-trivial, a commission factor is introduced and tuned to result in zero growth rate on any individual St. Petersburg components. One of the interesting observations made is that a combination of two components of zero growth can result in a strictly positive growth. For d ≥ 3 we prove that the uniform strategy is the best, and we obtain tight asymptotic results for the growth rate.
1
Constantly Rebalanced Portfolio
Consider a hypothetical investor who can access d financial instruments (asset, bond, cash, return of a game, etc.), and who can rebalance his wealth in each round according to a portfolio vector b = (b(1) , . . . , b(d) ). The j-th component b(j) of b denotes the proportion of the investor’s capital invested in financial instrument j. We assume that the portfolio vector b has nonnegative components and sum up to 1. The nonnegativity assumption means that short selling is not allowed, while the latter condition means that our investor does not consume nor deposit new cash into his portfolio, but reinvests it in each round. The set of portfolio vectors is denoted by ⎧ ⎫ d ⎨ ⎬ Δd = b = (b(1) , . . . , b(d) ); b(j) ≥ 0, b(j) = 1 . ⎩ ⎭ j=1
The first author acknowledges the support of the Computer and Automation Research Institute of the Hungarian Academy of Sciences. The work was supported in part by the Hungarian Scientific Research Fund, Grant T-048360.
R. Gavald` a et al. (Eds.): ALT 2009, LNAI 5809, pp. 83–96, 2009. c Springer-Verlag Berlin Heidelberg 2009
84
L. Gy¨ orfi and P. Kevei
The behavior of the market is given by the sequence of return vectors {xn }, (1) (d) (j) xn = (xn , . . . , xn ), such that the j-th component xn of the return vector xn denotes the amount obtained after investing a unit capital in the j-th financial instrument on the n-th round. Let S0 denote the investor’s initial capital. Then at the beginning of the first (j) round S0 b1 is invested into financial instrument j, and it results in return (j) (j) S0 b1 x1 , therefore at the end of the first round the investor’s wealth becomes S 1 = S0
d
(j) (j)
b1 x1 = S0 b1 , x1 ,
j=1
where · , · denotes inner product. For the second round b2 is the new portfolio and S1 is the new initial capital, so S2 = S1 · b2 , x2 = S0 · b1 , x1 · b2 , x2 . By induction, for the round n the initial capital is Sn−1 , therefore Sn = Sn−1 bn , xn = S0
n
bi , xi .
(1)
i=1
Of course the problem is to find the optimal investment strategy for a long run period, that is to maximize Sn in some sense. The best strategy depends on the optimality criteria. A naive attitude is to maximize the expected return in each round. This leads to the risky strategy to invest all the money into (j) (i) the financial instrument j, with EXn = max{EXn : i = 1, 2, . . . , n}, where (1) (2) (d) Xn = (Xn , Xn , . . . , Xn ) is the market vector in the n-th round. Since the (j) random variable Xn can be 0 with positive probability, repeated application of this strategy lead to quick bankrupt. The underlying phenomena is the simple fact that E(Sn ) may increase exponentially, while Sn → 0 almost surely. A more delicate optimality criterion was introduced by Breiman [3]: in each round we maximize the expectation E lnb, Xn for b ∈ Δd . This is the so-called logoptimal portfolio, which is optimal under general conditions [3]. If the market process {Xi } is memoryless, i.e., it is a sequence of independent and identically distributed (i.i.d.) random return vectors then the log-optimal portfolio vector is the same in each round: b∗ := arg max E{ln b , X1 }. b∈Δd
In case of constantly rebalanced portfolio (CRP) we fix a portfolio vector b ∈ Δd . In this special case, according to (1) we get Sn = S0 ni=1 b , xi , and so the average growth rate of this portfolio selection is n
1 1 1 ln Sn = ln S0 + ln b , xi , n n n i=1
St. Petersburg Portfolio Games
85
therefore without loss of generality we can assume in the sequel that the initial capital S0 = 1. The optimality of b∗ means that if Sn∗ = Sn (b∗ ) denotes the capital after round n achieved by a log-optimum portfolio strategy b∗ , then for any portfolio strategy b with finite E{ln b , X1 } and with capital Sn = Sn (b) and for any memoryless market process {Xn }∞ 1 , lim
n→∞
1 1 ln Sn ≤ lim ln Sn∗ n→∞ n n
almost surely
and maximal asymptotic average growth rate is lim
n→∞
1 ln Sn∗ = W ∗ := E{ln b∗ , X1 } n
almost surely.
The proof of the optimality is a simple consequence of the strong law of large numbers. Introduce the notation W (b) = E{ln b , X1 }. Then the strong law of large numbers implies that n
1 1 ln Sn = ln b , Xi n n i=1 n
=
n
1 1 E{ln b , Xi } + (ln b , Xi − E{ln b , Xi }) n i=1 n i=1 n
= W (b) + → W (b)
1 (ln b , Xi − E{ln b , Xi }) n i=1 almost surely.
Similarly, lim
n→∞
1 ln Sn∗ = W (b∗ ) = max W (b) almost surely. b n
In connection with CRP in a more general setup we refer to Kelly [8] and Breiman [3]. In the following we assume that the i.i.d. random vectors {Xi }, have the general form X = (X (1) , X (2) , . . . , X (d) , X (d+1) ), where X (1) , X (2) , . . . , X (d) are nonnegative i.i.d. random variables and X (d+1) is the cash, that is X (d+1) ≡ 1, and d ≥ 1. Then the concavity of the logarithm, and the symmetry of the first d components immediately imply that the log-optimal portfolio has the form b = (b, b, . . . , b, 1 − db), where of course 0 ≤ b ≤ 1/d. When does b = 1/d correspond to the optimal strategy; that is when should we play with all our money? In our special case W has the form d
(i) W (b) = E ln b X + 1 − bd . i=1
86
L. Gy¨ orfi and P. Kevei
Let denote Zd = di=1 X (i) . Interchanging the order of integration and differentiation, we obtain d
d d Zd − d (i) W (b) = E ln b . X + 1 − bd =E db db bZd + 1 − bd i=1 For b = 0 we have W (0) = E(Zd ) − d, which is nonnegative if and only if E(X (1) ) ≥ 1. This implies the intuitively clear statement that we should risk at all, if and only if the expectation of the game is not less than one. Otherwise the optimal strategy is to take all your wealth in cash. The function W (·) is concave, therefore the maximum is in b = 1/d if W (1/d) ≥ 0, which means that d E ≤ 1. (2) Zd According to the strong law of large numbers d/Zd → 1/E(X (1) ) a.s. as d → ∞, thus under some additional assumptions for the underlying variables E(d/Zd ) → 1/E(X (1) ), as d → ∞. Therefore if E(X (1) ) > 1, then for d large enough the optimal strategy is (1/d, . . . , 1/d, 0). In the latter computations we tacitly assumed some regularity conditions, that is we can interchange the order of differentiation and integration, and that we can take the L1 -limit instead of almost sure limit. One can show that these conditions are satisfied if the underlying random variables have strictly positive infimum. We skip the technical details.
2 2.1
St. Petersburg Game Iterated St. Petersburg Game
Consider the simple St. Petersburg game, where the player invests 1$ and a fair coin is tossed until a tail first appears, ending the game. If the first tail appears in step k then the the payoff X is 2k and the probability of this event is 2−k : P{X = 2k } = 2−k . The distribution function of the gain is 0, F (x) = P{X ≤ x} = 1 − 2log12 x = 1 −
(3)
2{log2 x} , x
if x < 2 , if x ≥ 2 ,
(4)
where x is the usual integer part of x and {x} stands for the fractional part. Since E{X} = ∞, this game has delicate properties (cf. Aumann [1], Bernoulli [2], Haigh [7], and Samuelson [10]). In the literature, usually the repeated St. Petersburg game (called iterated St. Petersburg game, too) means multi-period game such that it is a sequence of simple St. Petersburg games, where in each round the player invests 1$. Let Xn denote the payoff for the n-th simple game.
St. Petersburg Portfolio Games
87
∞ Assume that the sequence {X nn }n=1 is i.i.d. After n rounds the player’s gain in ¯ the repeated game is Sn = i=1 Xi , then
lim
n→∞
S¯n =1 n log2 n
in probability, where log2 denotes the logarithm with base 2 (cf. Feller [6]). Moreover, S¯n =1 lim inf n→∞ n log2 n a.s. and S¯n =∞ lim sup n→∞ n log2 n a.s. (cf. Chow and Robbins [4]). Introducing the notation for the largest payoff Xn∗ = max Xi 1≤i≤n
and for the sum with the largest payoff withheld Sn∗ =
n
Xi − Xn∗ = S¯n − Xn∗ ,
i=1
one has that
Sn∗ =1 n→∞ n log2 n a.s. (cf. Cs¨ org˝ o and Simons [5]). lim
2.2
Sequential St. Petersburg Game According to the previous results S¯n ≈ n log2 n. Next we introduce the sequential St. Petersburg game, having exponential growth. The sequential St. Petersburg game means that the player starts with initial capital S0 = 1$, and there is a sequence of simple St. Petersburg games, and for each simple game the player (c) reinvests his capital. If Sn−1 is the capital after the (n − 1)-th simple game then (c) (c) the invested capital is Sn−1 (1 − c), while Sn−1 c is the proportional cost of the simple game with commission factor 0 < c < 1. It means that after the n-th round the capital is Sn(c) = Sn−1 (1 − c)Xn = S0 (1 − c)n (c)
n
Xi = (1 − c)n
i=1
n
Xi .
i=1
(c)
Because of its multiplicative definition, Sn has exponential trend: (c)
Sn(c) = 2nWn ≈ 2nW
(c)
with average growth rate 1 log2 Sn(c) n and with asymptotic average growth rate Wn(c) :=
,
88
L. Gy¨ orfi and P. Kevei
W (c) := lim
n→∞
1 log2 Sn(c) . n
Let’s calculate the the asymptotic average growth rate. Because of
n 1 1 n log2 (1 − c) + log2 Xi , Wn(c) = log2 Sn(c) = n n i=1 the strong law of large numbers implies that n
1 log2 Xi = log2 (1 − c) + E{log2 X1 } n→∞ n i=1
W (c) = log2 (1 − c) + lim
a.s., so W (c) can be calculated via expected log-utility (cf. Kenneth [9]). A commission factor c is called fair if W (c) = 0, so the growth rate of the sequential game is 0. Let’s calculate the fair c: log2 (1 − c) = −E{log2 X1 } = −
∞
k · 2−k = −2,
k=1
i.e., c = 3/4. 2.3
Portfolio Game with One or Two St. Petersburg Components
Consider the portfolio game, where a fraction of the capital is invested in simple fair St. Petersburg games and the rest is kept in cash, i.e., it is a CRP problem with the return vector X = (X (1) , . . . , X (d), X (d+1) ) = (X1 , . . . , Xd , 1) (d ≥ 1) such that the first d i.i.d. components of the return vector X are of the form P{X = 2k−2 } = 2−k , (5) (k ≥ 1), while the last component is the cash. The main aim is to calculate the largest growth rate Wd∗ . Proposition 1. We have that W1∗ = 0.149 and W2∗ = 0.289. Proof. For d = 1, fix a portfolio vector b = (b, 1 − b), with 0 ≤ b ≤ 1. The asymptotic average growth rate of this portfolio game is W (b) = E{log2 b , X} = E{log2 (bX + 1 − b)} = E{log2 (b(X/4 − 1) + 1)}.
St. Petersburg Portfolio Games
89
The function log2 is concave, therefore W (b) is concave, too, so W (0) = 0 (keep everything in cash) and W (1) = 0 (the simple game is fair) imply that for all 0 < b < 1, W (b) > 0. Let’s calculate maxb W (b). We have that W (b) =
∞
log2 (b(2k /4 − 1) + 1) · 2−k
k=1
= log2 (1 − b/2) · 2
−1
+
∞
log2 (b(2k−2 − 1) + 1) · 2−k .
k=3
Figure 1 shows the curve of the average growth rate of the portfolio game. The function W (·) attains its maximum at b = 0.385, that is b∗ = (0.385, 0.615) , where the growth rate is W1∗ = W (0.385) = 0.149. It means that if for each round of the game one reinvests 38.5% of his capital such that the real investment is 9.6%, while the cost is 28.9%, then the growth rate is approximately 11%, i.e., the portfolio game with two components of zero growth rate (fair St. Petersburg game and cash) can result in growth rate of 10.9%.
0.14 0.12 0.1 0.08 0.06 0.04 0.02 0.2
0.4
0.6
0.8
1
Fig. 1. The growth rate for one St. Petersburg component
Consider next d = 2. At the end of Section 1 we proved that the log-optimal portfolio vector has the form b = (b, b, 1 − 2b), with 0 ≤ b ≤ 1/2. The asymptotic average growth rate of this portfolio game is W (b) = E{log2 b , X} = E{log2 (bX1 + bX2 + 1 − 2b)} = E{log2 (b((X1 + X2 )/4 − 2) + 1)}.
90
L. Gy¨ orfi and P. Kevei
0.25
0.2
0.15
0.1
0.05
0.1
0.2
0.3
0.4
0.5
Fig. 2. The growth rate for two St. Petersburg component
Figure 2 shows the curve of the average growth rate of the portfolio game. Numerically we can determine that the maximum is taken at b = 0.364, so b∗ = (0.364, 0.364, 0.272) , where the growth rate is W2∗ = W (0.364) = 0.289. 2.4
Portfolio Game with at Least Three St. Petersburg Components
Consider the portfolio game with d ≥ 3 St. Petersburg components. We saw that the log-optimal portfolio has the form b = (b, . . . , b, 1 − db) with b ≤ 1/d. Proposition 2. For d ≥ 3, we have that b∗ = (1/d, . . . , 1/d, 0). Proof. Using the notations at the end of Section 1, we have to prove the inequality d W (1/d) ≥ 0 . db According to (2) this is equivalent with d 1≥E . X1 + · · · + Xd For d = 3, 4, 5, numerically one can check this inequality. One has to prove the proposition for d ≥ 6, which means that 1 1 ≥ E 1 d . (6) i=1 Xi d
St. Petersburg Portfolio Games
91
We use induction. Assume that (6) holds until d − 1. Choose the integers d1 ≥ 3 and d2 ≥ 3 such that d = d1 + d2 . Then 1 d
i=1 Xi
1 d
= =
1 d
1
d1
i=1 Xi
d1 1 d d1
+
1 d
d
i=d1 +1
1
d1
i=1 Xi
+
d2 1 d d2
+
d2 d
Xi
d
i=d1 +1
Xi
,
therefore the Jensen inequality implies that 1 d
and so
E
1 d
i=1 Xi
1
1 d
d
i=1
Xi
≤
d1 d
1 d1
1 d1
i=1 Xi
1 d2
d
1
i=d1 +1
Xi
,
d2 1 1 d1 ≤E + d1 d d 1 d 1 i=1 Xi i=d +1 Xi d1 d2 1 d2 d1 1 1 + E 1 d2 = E 1 d1 d d i=1 Xi i=1 Xi d1 d2 ≤
d2 d1 + = 1, d d
where the last inequality follows from the assumption of the induction. 2.5
Portfolio Game with Many St. Petersburg Components
For d ≥ 3, the best portfolio is the uniform portfolio with asymptotic average growth rate d
d 1 1 = E log2 . X Xi Wd∗ = E log2 d i=1 i 4d i=1 First we compute this growth rate numerically for small values of d, then we determine the exact asymptotic growth rate for d → ∞. For d ≥ 2 arbitrary, by (3) we may write d
∞ log2 2i1 + 2i2 + · · · + 2id E log2 = Xi . 2i1 +i2 +···+id i=1 i ,i ,...,i =1 1
2
d
Straightforward calculation shows that for d ≤ 8, summing from 1 to 20 in each index independently, that is taking only 20d terms, the error is less then 1/1000. Here are the first few values: d 1 2 3 4 5 6 7 8 Wd∗ 0.149 0.289 0.421 0.526 0.606 0.669 0.721 0.765 Notice that W1∗ and W2∗ come from Section 2.3. Now we return to the asymptotic results.
92
L. Gy¨ orfi and P. Kevei
Theorem 1. For the asymptotic behavior of the average growth rate we have −
0.8 1 log2 log2 d + 4 ≤ Wd∗ − log2 log2 d + 2 ≤ . ln 2 log2 d ln 2 log2 d
Proof. Because of Wd∗
= E log2
d
1 Xi 4d i=1
d i=1
Xi d log2 d
= E log2
+ log2 log2 d − 2,
we have to show that
d log2 log2 d + 4 0.8 1 i=1 Xi ≤ E log2 ≤ . − ln 2 log2 d d log2 d ln 2 log2 d
Concerning the upper bound in the theorem, use the decomposition d d ˜ d Xi i=1 Xi i=1 Xi = log2 + log2 i=1 , log2 d ˜ d log2 d d log2 d i=1 Xi
where ˜i = X We prove that
E log2
and 0≤E
if Xi ≤ d log2 d , Xi , d log2 d, otherwise.
d
˜ i=1 Xi d log2 d d
log2 i=1 d i=1
≤
Xi ˜i X
log2 log2 d + 2 , ln 2 log2 d ≤
2 . ln 2 log2 d
(7)
(8)
For (8), we have that d d x i=1 Xi i=1 Xi P log2 d ≥ x = P d ≥2 ˜ ˜ i=1 Xi i=1 Xi ˜i} ≤ P{∃ i ≤ d : Xi ≥ 2x X x = P{∃ i ≤ d : Xi ≥ 2 min{Xi , d log2 d}} = P{∃ i ≤ d : Xi ≥ 2x d log2 d} ≤ d P{X ≥ 2x d log2 d} 2 , ≤d x 2 d log2 d where we used that P{X ≥ x} ≤ 2/x, which is an immediate consequence of (4). Therefore d d ∞ i=1 Xi i=1 Xi E log2 d = P log2 d ≥ x dx ˜ ˜ 0 i=1 Xi i=1 Xi ∞ 2 2 ≤ dx = , x 2 log2 d ln 2 log2 d 0
St. Petersburg Portfolio Games
93
and the proof of (8) is finished. For (7), put l = log2 (d log2 d). Then for the expectation of the truncated variable we have E(X˜1 ) =
l
2k
k=1
∞ 1 1 + d log d 2 k 2 2k k=l+1
= l + d log2 d
1 2l+1
2=l+
d log2 d ≤ l + 2. 2 log2 (d log2 d)
Thus, E log2
d
˜i X d log2 d
i=1
d ˜ 1 i=1 Xi = E ln ln 2 d log2 d d ˜i X 1 i=1 E −1 ≤ ln 2 d log2 d
˜1} 1 E{X −1 = ln 2 log2 d 1 l+2 −1 ≤ ln 2 log2 d 1
log2 (d log2 d) + 2 −1 = ln 2 log2 d 1 log2 d + log2 log2 d + 2 −1 ≤ ln 2 log2 d 1 log2 log2 d + 2 = . ln 2 log2 d
Concerning the lower bound in the theorem, consider the decomposition d log2
Xi = d log2 d i=1
d log2
Xi d log2 d
+
i=1
−
d log2
Xi d log2 d
−
i=1
On the one hand for arbitrary ε > 0, we have that d x i=1 Xi ≤2 ≤ P { for all i ≤ d, Xi ≤ 2x d log2 d} P d log2 d = P {X ≤ 2x d log2 d} d 1 ≤ 1− x 2 d log2 d ≤e
1 − 2x log
2d
1−ε , ≤1− x 2 log2 d
d
.
94
L. Gy¨ orfi and P. Kevei
for d large enough, where we used the inequality e−z ≤ 1 − (1 − ε)z, which holds for z ≤ − ln(1 − ε). Thus d 1−ε x i=1 Xi >2 , ≥ x P d log2 d 2 log2 d which implies that ⎧
+ ⎫ d d ⎬ ⎨ ∞ X X i i log2 i=1 > x dx P log2 i=1 = E ⎭ ⎩ d log2 d d log2 d 0 ∞ d x i=1 Xi > 2 dx = P d log2 d 0 ∞ 1−ε ≥ dx x log d 2 0 2 1 1−ε . = log2 d ln 2 Since ε is arbitrary we obtain ⎧
+ ⎫ d ⎨ ⎬ X 1 1 i log2 i=1 . E ≥ ⎩ ⎭ log2 d ln 2 d log2 d For the estimation of the negative part we use an other truncation method. Now we cut the variable at d, so put Xi , if Xi ≤ d , ˆ Xi = d, otherwise. d ˆ ˆ Introduce also the notations Sˆd = i=1 Xi and cd = E(X1 )/ log2 d. Similar computations as before show that d ˆ 1 ) = log d + E(X = log2 d + 2{log2 d} − {log2 d} and 2 log 2 2 d
2 1−{log2 d} {log2 d} ˆ 2 ≤ 2 2 log2 d − 1 + d ≤ 3d , < d 2 + 2 E X 1 2 log2 d
√ where we used that 2 2 ≤ 21−y + 2y ≤ 3 for y ∈ [0, 1]; this can be proved easily. Simple analysis shows again that 0.9 ≤ 2y − y ≤ 1 for y ∈ [0, 1], and so for cd − 1 we obtain 1 0.9 < cd − 1 < . log2 d log2 d ˆ i we have that Since di=1 Xi ≥ di=1 X ⎧ ⎧ ⎫
− ⎫ d d ˆ − ⎬ ⎨ ⎬ ⎨ Xi Xi log2 i=1 log2 i=1 E ≤E . ⎩ ⎭ ⎩ ⎭ d log2 d d log2 d
St. Petersburg Portfolio Games
95
Noticing that log2
2d Sˆd > log2 = 1 − log2 log2 d , d log2 d d log2 d
we obtain ⎧
− ⎫ 0 ⎬ ⎨ Sˆd Sˆd log2 ≤ x dx , P log2 = E ⎭ ⎩ d log2 d d log2 d − log2 log2 d thus we have to estimate the tail probabilities of Sˆd . According to Bernstein’s inequality, for x < 0 we have
ˆd ) E( S Sˆd Sˆd − E(Sˆd ) P log2 ≤x =P ≤ 2x − d log2 d d log2 d d log2 d Sˆd − E(Sˆd ) x ≤ 2 − cd =P d log2 d ⎧ ⎫ ⎨ ⎬ 2 2 x 2 d log2 d (cd − 2 ) ≤ exp − 2 x ⎩ 2 d E[(X) ˆ 2 ] + d log2 d (cd −2 ) ⎭ 3 2 x 2 log2 d (cd − 2 ) . ≤ exp − 6 + 23 log2 d (cd − 2x ) Let γ > 0 be fixed, we define it later. For x < −γ and d large enough the last −γ 2 upper bound ≤ d−(1−2 ) , therefore log2 log2 d Sˆd ≤ x dx ≤ (1−2 P log2 −γ )2 . d log d d − log2 log2 d 2
−γ
We give an estimation for the integral on [−γ, 0]: γ log22 d (cd − 2−x )2 Sˆd ≤ x dx ≤ dx P log2 exp − d log2 d 6 + 23 log2 d (cd − 2−x ) −γ 0 γ ln 2 1 log22 d (cd − e−x )2 = dx . exp − ln 2 0 6 + 23 log2 d (cd − e−x )
0
For arbitrarily fixed ε > 0 we choose γ > 0 such that 1 − x ≤ e−x ≤ 1 − (1 − ε)x, for 0 ≤ x ≤ γ ln 2. Using also our estimations for cd − 1 we may write exp −
log22 d (cd − e−x )2 6 + 23 log2 d (cd − e−x )
log2 d (0.9/ log2 d + (1 − ε)x)2 ≤ exp − 2 2 6 + 3 log2 d (1/ log2 d + x)
96
L. Gy¨ orfi and P. Kevei
and continuing the estimation of the integral we have γ ln 2 1 log22 d (0.9/ log2 d + (1 − ε)x)2 ≤ dx exp − ln 2 0 6 + 23 log2 d (1/ log2 d + x) log2 d γ ln 2 (0.9 + (1 − ε)x)2 1 1 dx exp − = ln 2 log2 d 0 6 + 23 (1 + x) ∞ 1 (0.9 + (1 − ε)x)2 1 ≤ dx exp − ln 2 log2 d 0 6 + 23 (1 + x) 1.7 1 , ≤ ln 2 log2 d where the last inequality holds if ε is small enough. Summarizing, we have ⎧ ⎫ d ˆ − ⎬ 0 ⎨ X Sˆd i i=1 log2 E ≤ x dx P log2 = ⎩ ⎭ d log2 d d log2 d − log2 log2 d log2 log2 d 1.7 1 + ln 2 log2 d d(1−2−γ )2 1.8 1 , ≤ ln 2 log2 d ≤
for d large enough. Together with the estimation of the positive part this proves our theorem.
References 1. Aumann, R.J.: The St. Petersburg paradox: A discussion of some recent comments. Journal of Economic Theory 14, 443–445 (1977) 2. Bernoulli, D.: Exposition of a new theory on the measurement of risk. Econometrica 22, 22–36 (1954); Originally published in 1738; translated by L. Sommer 3. Breiman, L.: Optimal gambling systems for favorable games. In: Proc. Fourth Berkeley Symp. Math. Statist. Prob., vol. 1, pp. 65–78. Univ. California Press, Berkeley (1961) 4. Chow, Y.S., Robbins, H.: On sums of independent random variables with infinite moments and “fair” games. Proc. Nat. Acad. Sci. USA 47, 330–335 (1961) 5. Cs¨ org˝ o, S., Simons, G.: A strong law of large numbers for trimmed sums, with applications to generalized St. Petersburg games. Statistics and Probability Letters 26, 65–73 (1996) 6. Feller, W.: Note on the law of large numbers and “fair” games. Ann. Math. Statist. 16, 301–304 (1945) 7. Haigh, J.: Taking Chances. Oxford University Press, Oxford (1999) 8. Kelly, J.L.: A new interpretation of information rate. Bell System Technical Journal 35, 917–926 (1956) 9. Kenneth, A.J.: The use of unbounded utility functions in expected-utility maximization: Response. Quarterly Journal of Economics 88, 136–138 (1974) 10. Samuelson, P.: The St. Petersburg paradox as a divergent double limit. International Economic Review 1, 31–37 (1960)
Reconstructing Weighted Graphs with Minimal Query Complexity Nader H. Bshouty and Hanna Mazzawi Technion - Israel Institute of Technology {bshouty,hanna}@cs.technion.ac.il
Abstract. In this paper we consider the problem of reconstructing a hidden weighted graph using additive queries. We prove the following: Let G be a weighted hidden graph with n vertices and m edges such that the weights on the edges are bounded between n−a and nb for any positive constants a and b. For any m there exists a non-adaptive algorithm that finds the edges of the graph using m log n O log m additive queries. This solves the open problem in [S. Choi, J. H. Kim. Optimal Query Complexity Bounds for Finding Graphs. Proc. of the 40th annual ACM Symposium on Theory of Computing , 749–758, 2008]. Choi and Kim’s proof holds for m ≥ (log n)α for a sufficiently large constant α and uses graph theory. We use the algebraic approach for the problem. Our proof is simple and holds for any m.
1
Introduction
In this paper we consider the following problem of reconstructing weighted graphs using additive queries: Let G = (V, E, w) be a weighted hidden graph where E ⊆ V × V , w : E → {i | n−a ≤ i ≤ nb } and n is the number of vertices in V . Denote by m the size of E. Suppose that the set of vertices V is known and the set of edges E is unknown. Given a set of vertices S ⊆ V , an additive query, Q(S), returns the sum of weights in the subgraph induces by S. That is, Q(S) =
w(e).
e∈E∩(S×S)
Our goal is to exactly reconstruct the set of edges using additive queries. One can distinguish between two types of algorithms to solve the problem. Adaptive algorithms are algorithms that take into account outcomes of previous queries where non-adaptive algorithms make all queries in advance, before any answer is known. In this paper, we consider non-adaptive algorithms for the problem. Our concern is the query complexity, that is, the number of queries needed to be asked in order to reconstruct the graph. R. Gavald` a et al. (Eds.): ALT 2009, LNAI 5809, pp. 97–109, 2009. c Springer-Verlag Berlin Heidelberg 2009
98
N.H. Bshouty and H. Mazzawi
The problem of reconstructing graphs using additive queries has been motivated by applications in bioinformatics. Assume we have a set of labeled chemicals, and we are able to tell how many pairs react when mixing several of those chemicals together. We can represent the problem as a graph, where the chemicals are the vertices and two chemicals that react with each other are connected with an edge. The goal is to reconstruct this reactions graph using as few experiments as possible. One concrete example for reconstructing a hidden graph is in genome sequencing. Obtaining the genome sequence is important for the study of organisms. To obtain the sequence, one common approach is to obtain short reads from the genome sequence. These reads are assembled to contigs, which are contiguous fragments that cover the genome sequence with possible gaps. Given these contigs, the goal is to determine their relative place in the genome sequence. The process of ordering the contigs is done using the multiplex PCR method. This method, given a group of contigs determines the number of adjacent contigs in the original genome sequence. Assuming that the genome sequence is circular, the problem of ordering the contigs using the multiplex PCR method is equivalent to reconstructing a hidden Hamiltonian cycle using queries [6,12]. The graph reconstructing problem has known a significant progress in the past decade. For unweighted graph the information theoretic lower bound gives 2 m log nm Ω log m for the query complexity for any adaptive algorithm for this problem. A tight upper bound was proved for some subclasses of unweighted graphs (Hamiltonian graphs, matching, stars and cliques etc.) [13,12,11,6], unweighted graphs with Ω(dn) edges where the degree of each vertex is bounded by d [11], graphs with Ω(n2 ) edges [11] and then the former was extended to d-degenerate unweighted graphs with Ω(dn) edges [13], i.e., graphs that their edges can be changed to directed edges where the out-degree of each vertex is bounded by d. A recent paper by Choi and Kim, [8], gave a tight upper bound for all unweighted graphs. For reconstructing weighted graphs, in [8], Choi and Kim proved the following: If m > (log n)α for sufficiently large α, then, there exists a non-adaptive algorithm for reconstructing a weighted graph where the weights are bounded between n−a and nb for any positive constants a and b using m log n O log m queries. In this paper, we close the gap in m and prove that for any m there exists a non-adaptive algorithm that reconstructs the hidden graph using m log n O log m queries. This matches the information theoretic lower bound.
Reconstructing Weighted Graphs with Minimal Query Complexity
99
In our analysis, we apply algebraic techniques for solving this problem. This simplifies the proofs of the correctness. The paper is organized as follows: In Section 2, we present notation, basic tools and some background. In Section 3, we prove the existence of an algorithm for the discretization of the problem. In Section 4, we present the algorithm for the problem and prove correctness. Finally, Section 5, contains open problems.
2
Preliminaries
In this section we present some background, basic tools and notation. 2.1
Notation and Preliminary Results
We denote by $\mathbb{R}$ the set of real numbers and by $\mathbb{R}^+$ the set of positive numbers. For an integer c, we denote by [c] the set $\{1, 2, \ldots, c\}$. For $s_1, s_2 \in \mathbb{R}$, we denote by $[s_1, s_2]$ the set of real numbers between $s_1$ and $s_2$, that is, $\{j \mid s_1 \le j \le s_2\}$. Given $t \in \mathbb{R}^+$ such that $s_1/t$ and $s_2/t$ are integers, we denote by $[s_1, t, s_2]$ the set $\{s_1, s_1 + t, s_1 + 2t, \ldots, s_2 - 2t, s_2 - t, s_2\}$.

Let $G = (V, E, w)$ be a weighted graph, where $E \subseteq V \times V$ and $w : E \to [s_1, s_2]$ for some $s_1, s_2 \in \mathbb{R}^+$. Throughout the paper, we denote by n the size of V and by m the size of E. Given a matrix or a vector M, the weight $wt(M)$ of M is the number of nonzero entries in M. Given an entry a of a matrix or a vector and $s \in \mathbb{R}^+$, we say that a is s-heavy if $|a| \ge s$. For a matrix or a vector x, we denote by $\psi_s(x)$ the number of entries in x that are s-heavy.

We now prove two lemmas that will be used in this paper.

Lemma 1. Let B be a symmetric matrix. There are at least
$$\sqrt{\psi_s(B)} - \frac{wt(B)}{d}$$
rows in B where each row is of weight at most d and contains an s-heavy entry.

Proof. There are at most $wt(B)/d$ rows of weight more than d. Also, there are at least $\sqrt{\psi_s(B)}$ rows where each row contains at least one s-heavy entry: since B is symmetric, if the s-heavy entries occupied only k rows then, by symmetry, they would also occupy only k columns, so $\psi_s(B) \le k^2$. Therefore, there are at least $\sqrt{\psi_s(B)} - wt(B)/d$ rows in B where each row is of weight at most d and contains an s-heavy entry.

Let $\iota$ be a function on non-negative integers defined as follows: $\iota(0) = 1$ and $\iota(i) = i$ for $i > 0$.

Lemma 2. Let $m_1, m_2, \ldots, m_t$ be integers in $[m] \cup \{0\}$ such that $m_1 + m_2 + \cdots + m_t = \ell \ge t$. Then
$$\prod_{i=1}^{t} \iota(m_i) \ge m^{(\ell - t)/(m-1)}.$$
Proof. Notice that when $1 < m_1 \le m_2 < m$ then $\iota(m_1 - 1)\iota(m_2 + 1) = (m_1 - 1)(m_2 + 1) < m_1 m_2 = \iota(m_1)\iota(m_2)$. Also, when $m_1 = 0$ and $1 < m_2 < m$ then $\iota(m_1 + 1)\iota(m_2 - 1) = m_2 - 1 < m_2 = \iota(m_1)\iota(m_2)$. Therefore the minimal value of $\iota(m_1)\iota(m_2)\cdots\iota(m_t)$ is obtained when for every $0 < i < j \le t$ we either have $m_i \in \{1, m\}$ or $m_j \in \{1, m\}$. This is equivalent to: all $m_i \in \{1, m\}$ except at most one. This implies that at least $(\ell - t)/(m - 1)$ of the $m_i$'s are equal to m.

2.2 Algebraic View of the Problem
In this subsection we show that our problem is equivalent to reconstructing a bilinear function $x^T A y$ from substitution queries with $x, y \in \{0,1\}^n$, where A is an $n \times n$ symmetric matrix with 2m nonzero entries [13].

Let $G = (V, E, w)$ be a non-directed weighted graph where $V = \{v_1, v_2, \ldots, v_n\}$, $E \subseteq V \times V$ and $w : E \to [s_1, s_2]$. Let $A_G = (a_{ij}) \in \mathbb{R}^{n\times n}$ be the adjacency matrix of G, that is, $a_{ij}$ equals $w((i,j))$ if $(i,j) \in E$ and equals zero otherwise. Given a set of vertices $S \subseteq V$, define the vector a where $a_i$ equals 1 if $v_i \in S$ and 0 otherwise. Then we have
$$Q(S) = \frac{a^T A_G a}{2}.$$
Now, let $z = x * y = (x_i y_i) \in \{0,1\}^n$ and $x_1 = x - z$, $y_1 = y - z$. Since $A_G$ is symmetric we have
$$x^T A_G y = \frac{x^T A_G x}{2} + \frac{y^T A_G y}{2} + \frac{(x_1 + y_1)^T A_G (x_1 + y_1)}{2} - x_1^T A_G x_1 - y_1^T A_G y_1.$$
Therefore, the problem of reconstructing the set of edges of the graph G using additive queries is equivalent to finding the non-zero entries in its adjacency matrix $A_G$ using queries of the form $f(x,y) = x^T A_G y$, where $x, y \in \{0,1\}^n$.
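As a sanity check on this identity, here is a small Python sketch (our own illustration; names such as `bilinear_from_additive` are hypothetical) that evaluates $x^T A_G y$ using only additive queries $Q(S)$, reading each quadratic form $a^T A_G a = 2Q(S_a)$ off the corresponding vertex set. Since $x_1$ and $y_1$ have disjoint supports, $x_1 + y_1$ is the indicator of their union.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random symmetric adjacency matrix with zero diagonal (hidden from the learner).
n = 6
A = np.triu(rng.integers(0, 3, size=(n, n)), k=1).astype(float)
A = A + A.T

def Q(S):
    """Additive query: total weight of edges inside S, i.e. a^T A a / 2."""
    a = np.zeros(n); a[list(S)] = 1.0
    return a @ A @ a / 2.0

def bilinear_from_additive(x, y):
    """Evaluate x^T A y via additive queries, using the identity above."""
    z = x * y
    x1, y1 = x - z, y - z          # disjoint supports
    support = lambda v: set(np.flatnonzero(v))
    return (Q(support(x)) + Q(support(y)) + Q(support(x1) | support(y1))
            - 2.0 * Q(support(x1)) - 2.0 * Q(support(y1)))

x = rng.integers(0, 2, size=n).astype(float)
y = rng.integers(0, 2, size=n).astype(float)
assert np.isclose(bilinear_from_additive(x, y), x @ A @ y)
```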
2.3 Basic Probability

In this subsection we give preliminary results in probability that will be used in this paper. We start with the Chernoff bound.

Lemma 3. Let $X_1, \ldots, X_t$ be independent Poisson trials such that $X_i \in \{0,1\}$ and $E[X_i] = p_i$. Let $P = \sum_{i=1}^t p_i$ and $X = \sum_{i=1}^t X_i$. Then
$$\Pr[X \le (1-\lambda)P] \le e^{-\lambda^2 P/2}.$$
The following can be derived from the Littlewood-Offord Theorem [14,10]. We prove it in the appendix for completeness.
Lemma 4. Let $a_1, a_2, \ldots, a_t$, $b_1, b_2, \ldots, b_t$ and $s \ge 0$ be real numbers such that $b_i - a_i \ge s$ for all i, and let $X_1, \ldots, X_t$ be independent random variables. Suppose that there is $p_i$ where $0 < p_i < 1$ such that $\Pr[X_i \le a_i] \ge p_i$ and $\Pr[X_i \ge b_i] \ge p_i$ for all i. Then, there is a constant c such that for any real number r and integer $\rho \ge 1$ we have
$$\Pr\left[r \le \sum_{i=1}^{t} X_i < r + \rho s\right] \le \frac{c\rho}{\sqrt{\sum_{i=1}^{t} p_i}}.$$

The following lemmas will be used in this paper.

Lemma 5. Let $a \in \mathbb{R}^n$ be a vector. Then, there is a constant c such that for any integer $\rho \ge 1$ and a randomly (uniformly) chosen vector $x \in \{0,1\}^n$ we have
$$\Pr\left[|a^T x| \le \rho s\right] \le \frac{c\rho}{\sqrt{\psi_s(a)}}.$$

Proof. Let $X_i = a_i x_i$. Then $X_i = a_i$ with probability 1/2 and $X_i = 0$ with probability 1/2. The lemma now follows immediately from Lemma 4.

Lemma 6. Let $b \in \mathbb{R}^n$ where $\psi_s(b) > 0$. Then, for a randomly (uniformly) chosen vector $x \in \{0,1\}^n$ we have $\Pr[|b^T x| \ge s/2] \ge 1/2$.

Proof. Suppose w.l.o.g. $|b_1| \ge s$. For any fixed $x_2, \ldots, x_n \in \{0,1\}$ we have $x^T b = b_1 x_1 + b_0$ for some $b_0 \in \mathbb{R}$. Now this takes the value $b_0$ for $x_1 = 0$ and $b_0 + b_1$ for $x_1 = 1$. Since $|(b_0 + b_1) - b_0| = |b_1| \ge s$, one of them is at least s/2 in absolute value.

Corollary 1. Let $B \in \mathbb{R}^{n\times n}$ be a matrix where $\psi_s(B) > 0$. Then, for randomly (uniformly) chosen vectors $x, y \in \{0,1\}^n$ we have $\Pr[|x^T B y| \ge s/4] \ge 1/4$.

Proof. By Lemma 6 the probability that $By$ contains an s/2-heavy entry is at least 1/2. Assuming it does, by Lemma 6 the probability that $|x^T B y| \ge s/4$ is again at least 1/2. This implies the result.
3 Reconstructing Graphs
In this section we give an upper bound for the discretization of the problem and then show how to solve the general problem. Let $\mathcal{G}$ be the set of all graphs with n vertices and m edges such that the weights of the edges are from the set $[s_1, s_1/8m^2, s_2]$, that is, the weights are bounded between $s_1$ and $s_2$ and are multiples of $s_1/8m^2$. For the class $\mathcal{G}$ we prove the following.
Theorem 1. There exists a set of queries $S = \{(x_1,y_1), (x_2,y_2), \ldots, (x_t,y_t)\}$ of size
$$t = O\left(\frac{m\log n + \log\frac{s_2}{s_1}}{\log m}\right)$$
where $x_i, y_i \in \{0,1\}^n$, such that for all $G, G' \in \mathcal{G}$ with $E(G) \ne E(G')$ there exists $i \in [t]$ such that $|x_i^T A_G y_i - x_i^T A_{G'} y_i| > s_1/(8m)$.

To prove Theorem 1 we will prove that there exists a set of queries S of size t such that for every $B = A_G - A_{G'}$, where $G, G' \in \mathcal{G}$ and $E(G) \ne E(G')$, there exists $(x,y) \in S$ such that $|x^T B y| > s_1/(8m)$. We divide into two cases. The first case is when the matrix B is a subtraction of adjacency matrices of graphs that are "close" to each other, i.e., B has only few heavy entries. The second case is when the matrix B is a subtraction of two adjacency matrices of graphs that are "far" from each other, i.e., B has many heavy entries.

First notice the following properties of B:
P1. Since G and G' contain at most m edges, we have $wt(B) \le 2m$.
P2. Since the weights of the edges are in $[s_1, s_1/(8m^2), s_2]$, the entries of B are in $[-s_2, s_1/(8m^2), s_2]$.
P3. Since $E(G) \ne E(G')$, at least one of the entries of B is $s_1$-heavy.

We denote by $\mathcal{B}$ the class of symmetric matrices that satisfy (P1)-(P3). Then the first case will be $\mathcal{B}_1 = \{B \in \mathcal{B} \mid \psi_{s_1/(8m)}(B) \le m^{3/4}\}$ and the second case will be $\mathcal{B}_2 = \{B \in \mathcal{B} \mid \psi_{s_1/(8m)}(B) > m^{3/4}\}$.

3.1 Eliminating All Close Graphs
In this subsection we analyze the case where the matrix B has few heavy entries. A similar analysis appears in [8] and is presented here for completeness. We prove the following.

Lemma 7. Let $\mathcal{B}_1 = \{B \in \mathcal{B} \mid \psi_{s_1/(8m)}(B) \le m^{3/4}\}$. There exists a set of queries $S = \{(x_1,y_1), (x_2,y_2), \ldots, (x_t,y_t)\}$ of size
$$t = O\left(\frac{m\log n + \log\frac{s_2}{s_1}}{\log m}\right)$$
such that for every $B \in \mathcal{B}_1$ there exists $i \in [t]$ such that $|x_i^T B y_i| > s_1/(8m)$.

Proof. We start by proving a weaker claim. Let
$$\mathcal{B}_1' = \left\{B' \in \mathcal{B}_1 \,\middle|\, B' \in \big([-s_2, s_1/(8m^2), -s_1/(8m)] \cup [s_1/(8m), s_1/(8m^2), s_2]\big)^{n\times n}\right\}.$$
We first prove that there exists a set of queries $S = \{(x_1,y_1), (x_2,y_2), \ldots, (x_t,y_t)\}$ such that for every $B' \in \mathcal{B}_1'$ we have $i \in [t]$ such that $|x_i^T B' y_i| \ge s_1/4$.

By (P3) and Corollary 1, for a randomly chosen query $(x, y)$ we have $\Pr[|x^T B' y| \le s_1/4] \le 3/4$. The size of $\mathcal{B}_1'$ is bounded by
$$|\mathcal{B}_1'| \le \binom{n^2}{m^{3/4}}\left(\frac{8 s_2 m^2}{s_1}\right)^{m^{3/4}} < (4/3)^t.$$
Therefore
$$\Pr\left[(\exists B' \in \mathcal{B}_1')(\forall i)\ |x_i^T B' y_i| < s_1/4\right] < (4/3)^t (3/4)^t < 1$$
and the weaker claim follows.

Now, we argue that S is the set of queries we are looking for. Let $B \in \mathcal{B}_1$ and remove all entries smaller than $s_1/(8m)$ in absolute value. Denote the new matrix by $B^*$. Notice that $B^* \in \mathcal{B}_1'$, therefore we have $i \in [t]$ such that
$$|x_i^T B^* y_i| \ge s_1/4. \qquad (1)$$
Also note that
$$|x_i^T B y_i| = |x_i^T B^* y_i + x_i^T (B - B^*) y_i|. \qquad (2)$$
Since $B - B^*$ has at most $2m - 1$ non-zero entries and each non-zero entry is bounded by $s_1/(8m)$ in absolute value, we have
$$|x_i^T (B - B^*) y_i| < (s_1/(8m))(2m - 1) = s_1/4 - s_1/(8m). \qquad (3)$$
By (1), (2) and (3) we get $|x_i^T B y_i| > s_1/(8m)$. Therefore the result follows.

3.2 Eliminating All Far Graphs
In this section we first prove the following.

Lemma 8. Let $U = \{u \mid u \in [-s_2, s_1/(8m^2), s_2]^n,\ wt(u) < m^{3/4}\ \text{and}\ \psi_{s_1/(8m)}(u) > 0\}$. Then, for every
$$t = \Omega\left(\frac{m\log n + \log\frac{s_2}{s_1}}{\log m}\right)$$
there exists a set of vectors $Y = \{y_1, y_2, \ldots, y_t\} \subset \{0,1\}^n$ such that for every $u \in U$ the size of $Y_u = \{i \mid |y_i^T u| > s_1/(16m)\}$ is greater than t/4.
Proof. By Lemma 6, for any $u \in U$ and a randomly chosen $y \in \{0,1\}^n$ we have $\Pr[|y^T u| \ge s_1/(16m)] \ge 1/2$. Therefore, if we randomly and independently choose $y_1, y_2, \ldots, y_t \in \{0,1\}^n$ we have $E[|Y_u|] \ge t/2$. By the Chernoff bound, the probability that $|Y_u| \le t/4$ is
$$\Pr[|Y_u| \le t/4] < e^{-t/16}.$$
The probability that for all $u \in U$ we have $|Y_u| > t/4$ is
$$\Pr[\forall u \in U : |Y_u| > t/4] = 1 - \Pr[\exists u \in U : |Y_u| \le t/4] \ge 1 - |U| e^{-t/16}.$$
Finally, note that
$$|U| < \binom{n}{m^{3/4}}\left(\frac{16 s_2 m^2}{s_1}\right)^{m^{3/4}} < e^{t/16},$$
and therefore $1 - |U| e^{-t/16} > 0$.

Lemma 9. Let $\mathcal{B}_2 = \{B \in \mathcal{B} \mid \psi_{s_1/(8m)}(B) > m^{3/4}\}$. There exists a set of queries $S = \{(x_1,y_1), (x_2,y_2), \ldots, (x_t,y_t)\}$ of size
$$t = O\left(\frac{m\log n + \log\frac{s_2}{s_1}}{\log m}\right)$$
such that for every $B \in \mathcal{B}_2$ there is $i \in [t]$ such that $|x_i^T B y_i| > s_1/(8m)$.

Proof. Define U as in Lemma 8, that is, $U = \{u \mid u \in [-s_2, s_1/(8m^2), s_2]^n,\ wt(u) < m^{3/4}\ \text{and}\ \psi_{s_1/(8m)}(u) > 0\}$. Let $Y = \{y_1, y_2, \ldots, y_t\}$ be the set of vectors that satisfies the condition in Lemma 8. By Lemma 1, for $d = m^{3/4}$, there are at least
$$\sqrt{\psi_{s_1/(8m)}(B)} - \frac{wt(B)}{m^{3/4}}$$
rows in B that are in U. By property P1 and since $B \in \mathcal{B}_2$, we have at least
$$\sqrt{\psi_{s_1/(8m)}(B)} - \frac{wt(B)}{m^{3/4}} \ge m^{3/8} - 2m^{1/4}$$
rows in B that are in U. Let $k = m^{3/8} - 2m^{1/4}$ and let $B_U$ be a $k \times n$ matrix whose rows are any k rows of B that are in U. By Lemma 8,
$$\sum_{i=1}^{t} \psi_{s_1/(16m)}(B_U y_i) \ge \frac{kt}{4}.$$
By Lemma 2,
$$\prod_{i=1}^{t} \iota\big(\psi_{s_1/(16m)}(B_U y_i)\big) \ge k^{(k-4)t/(4k-4)} \ge m^{c_1 t},$$
for some constant $c_1$. By Lemma 5, if we randomly and independently choose $x_1, x_2, \ldots, x_t$, the probability that none of the queries $(x_i, y_i)$ satisfies $|x_i^T B y_i| > s_1/(8m)$ is bounded by
$$\Pr\left[\forall i \in [t] : |x_i^T B y_i| \le 2\,\frac{s_1}{16m}\right] \le \prod_{i=1}^{t} \frac{c}{\sqrt{\iota\big(\psi_{s_1/(16m)}(B y_i)\big)}} = \frac{c^t}{\sqrt{\prod_{i=1}^{t} \iota\big(\psi_{s_1/(16m)}(B y_i)\big)}} \le \frac{c^t}{\sqrt{\prod_{i=1}^{t} \iota\big(\psi_{s_1/(16m)}(B_U y_i)\big)}} \le \prod_{i=1}^{t} \frac{c}{m^{c_1/2}} = m^{-\alpha t},$$
where $c > 1$ and $\alpha > 0$ are some constants. Since
$$m^{-\alpha t} < \frac{1}{\binom{n^2}{2m}\left(\frac{8 s_2 m^2}{s_1}\right)^{2m}} \le \frac{1}{|\mathcal{B}_2|},$$
the result follows.
4 The Algorithm
In the previous section we showed that there exists a set of queries $S = \{(x_1,y_1), (x_2,y_2), \ldots, (x_t,y_t)\}$ such that for every $G^*, G' \in \mathcal{G}$ with $E(G^*) \ne E(G')$ we have $i \in [t]$ such that $|x_i^T A_{G^*} y_i - x_i^T A_{G'} y_i| > s_1/(8m)$. Recall that $\mathcal{G}$ is the set of all graphs with n vertices and m edges, where the weights of the edges are from the set $[s_1, s_1/8m^2, s_2]$.

Now, for reconstructing the edges of the graph we use the same algorithm as in [8]. The algorithm is presented in Figure 1. The query complexity of the algorithm is obvious. As for the correctness, given a graph G, define $G' \in \mathcal{G}$ such that $G'$ is obtained from G by rounding each edge weight to the closest number that is a multiple of $s_1/8m^2$. Obviously, since $G - G'$ has at most m non-zero entries and each entry is bounded by $s_1/16m^2$, we have that for all $i \in [t]$
$$|x_i^T A_G y_i - x_i^T A_{G'} y_i| \le s_1 m / 16m^2 = s_1/(16m).$$
Algorithm Edge Reconstruct
1. For all $(x_i, y_i) \in S$
2.   Ask $x_i^T A_G y_i$.
3. End for.
4. For all $G' \in \mathcal{G}$
5.   Define $D(G') = (d_1, d_2, \ldots, d_t)$ where $d_i = |x_i^T A_G y_i - x_i^T A_{G'} y_i|$.
6.   If $\psi_{s_1/16m}(D(G')) = 0$
7.     Return $G'$.
8.   End if.
9. End for.
Fig. 1. Algorithm for reconstructing the set of edges of G
On the other hand, for any graph $G^* \in \mathcal{G}$ that differs from G in at least one edge, we have
$$|x_i^T A_G y_i - x_i^T A_{G^*} y_i| = |x_i^T A_G y_i - x_i^T A_{G'} y_i - (x_i^T A_{G^*} y_i - x_i^T A_{G'} y_i)|. \qquad (4)$$
Now, since $G', G^* \in \mathcal{G}$, we have $i \in [t]$ such that
$$|x_i^T A_{G^*} y_i - x_i^T A_{G'} y_i| > s_1/(8m). \qquad (5)$$
By (4) and (5), together with the fact that $|x_i^T A_G y_i - x_i^T A_{G'} y_i| \le s_1/(16m)$, we get $|x_i^T A_G y_i - x_i^T A_{G^*} y_i| > s_1/(16m)$.
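The argument above is information-theoretic: the search over $\mathcal{G}$ in Figure 1 is exhaustive, not efficient. For illustration only, here is a tiny brute-force Python sketch of the selection step (names such as `edge_reconstruct` are our own), feasible only for very small n and m.

```python
import itertools
import numpy as np

def adjacency(n, edges):
    """Symmetric adjacency matrix from {(i, j): weight}."""
    A = np.zeros((n, n))
    for (i, j), w in edges.items():
        A[i, j] = A[j, i] = w
    return A

def edge_reconstruct(n, m, weights, queries, answers, threshold):
    """Return the first candidate graph whose query answers are all within
    `threshold` of the observed ones, i.e. psi_threshold(D(G')) == 0."""
    all_pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    for support in itertools.combinations(all_pairs, m):
        for ws in itertools.product(weights, repeat=m):
            A_cand = adjacency(n, dict(zip(support, ws)))
            D = [abs(x @ A_cand @ y - ans) for (x, y), ans in zip(queries, answers)]
            if max(D) < threshold:
                return support, ws
    return None
```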
5 Conclusions and Open Problems
In this paper, we proved the existence of an optimal non-adaptive algorithm for reconstructing the edge set of a hidden weighted graph, given that the weights of the edges are real numbers bounded between $n^{-a}$ and $n^{b}$ for any constants a and b. An open question is: can we remove the condition on the weights? That is, is there an algorithm for reconstructing a hidden weighted graph where the weights of the edges are (unbounded) real numbers? Also, while the problem of finding an optimal constructive polynomial-time algorithm for reconstructing a hidden graph was solved for the adaptive case [15], the problem is still open in the non-adaptive case. That is, the problem of finding an explicit construction for algorithms in the non-adaptive setting is still open.
References
1. Aigner, M.: Combinatorial Search. John Wiley and Sons, Chichester (1988)
2. Alon, N., Asodi, V.: Learning a Hidden Subgraph. SIAM J. Discrete Math. 18(4), 697–712 (2005)
3. Alon, N., Beigel, R., Kasif, S., Rudich, S., Sudakov, B.: Learning a Hidden Matching. SIAM J. Comput. 33(2), 487–501 (2004)
4. Angluin, D., Chen, J.: Learning a Hidden Graph Using O(log n) Queries per Edge. In: Shawe-Taylor, J., Singer, Y. (eds.) COLT 2004. LNCS (LNAI), vol. 3120, pp. 210–223. Springer, Heidelberg (2004)
5. Angluin, D., Chen, J.: Learning a Hidden Hypergraph. Journal of Machine Learning Research 7, 2215–2236 (2006)
6. Bouvel, M., Grebinski, V., Kucherov, G.: Combinatorial Search on Graphs Motivated by Bioinformatics Applications: A Brief Survey. In: Kratsch, D. (ed.) WG 2005. LNCS, vol. 3787, pp. 16–27. Springer, Heidelberg (2005)
7. Bshouty, N.H.: Optimal Algorithms for the Coin Weighing Problem with a Spring Scale. In: COLT (2009)
8. Choi, S., Kim, J.H.: Optimal Query Complexity Bounds for Finding Graphs. In: STOC, pp. 749–758 (2008)
9. Du, D., Hwang, F.K.: Combinatorial Group Testing and Its Applications. Series on Applied Mathematics, vol. 3. World Scientific (1993)
10. Erdős, P.: On a lemma of Littlewood and Offord. Bulletin of the American Mathematical Society 51, 898–902 (1945)
11. Grebinski, V., Kucherov, G.: Optimal Reconstruction of Graphs Under the Additive Model. Algorithmica 28(1), 104–124 (2000)
12. Grebinski, V., Kucherov, G.: Reconstructing a Hamiltonian cycle by querying the graph: Application to DNA physical mapping. Discrete Applied Mathematics 88, 147–165 (1998)
13. Grebinski, V.: On the Power of Additive Combinatorial Search Model. In: Hsu, W.-L., Kao, M.-Y. (eds.) COCOON 1998. LNCS, vol. 1449, pp. 194–203. Springer, Heidelberg (1998)
14. Littlewood, J.E., Offord, A.C.: On the number of real roots of a random algebraic equation. III. Mat. Sbornik 12, 277–285 (1943)
15. Mazzawi, H.: Optimally Reconstructing Weighted Graphs Using Queries (manuscript)
16. Reyzin, L., Srivastava, N.: Learning and Verifying Graphs Using Queries with a Focus on Edge Counting. In: Hutter, M., Servedio, R.A., Takimoto, E. (eds.) ALT 2007. LNCS (LNAI), vol. 4754, pp. 285–297. Springer, Heidelberg (2007)
17. Sperner, E.: Ein Satz über Untermengen einer endlichen Menge. Math. Z. 27, 544–548 (1928)
6 Appendix
In this Appendix we prove Lemma 4. We first prove a few preliminary results.

Lemma 10. Let $X_1, X_2, \ldots, X_t$ be random variables such that there is $s_i > s$ where $\Pr[X_i = s_i] = 1/2$ and $\Pr[X_i = 0] = 1/2$. Let $\lambda_1, \ldots, \lambda_t \in \{-1, 1\}$. Then there is a constant c such that
$$\max_r \Pr[r \le \lambda_1 X_1 + \lambda_2 X_2 + \cdots + \lambda_t X_t < r + s] \le \frac{c}{\sqrt{t}}.$$
Proof. Consider the lattice $L = \prod_i \{0, s_i\}$ with the partial order $\prec = \prod_i \prec_i$, where $0 \prec_i s_i$ if $\lambda_i = 1$ and $s_i \prec_i 0$ if $\lambda_i = -1$. It is easy to see that the set of all solutions of $r \le X_1 + X_2 + \cdots + X_t < r + s$ is an anti-chain in L. The result follows.

Lemma 11. Let $X_1, X_2, \ldots, X_t$ be random variables such that there is $s_i > s$ where $\Pr[X_i = s_i] = p_i$ and $\Pr[X_i = 0] = 1 - p_i$. Let $\lambda_1, \ldots, \lambda_t \in \{-1, 1\}$. Then there is a constant c such that
$$\max_r \Pr[r \le \lambda_1 X_1 + \lambda_2 X_2 + \cdots + \lambda_t X_t < r + s] \le \frac{c}{\sqrt{\sum_{i=1}^{t} \min(p_i, 1 - p_i)}}.$$

Proof. We assume w.l.o.g. that $p_i < 1/2$ and therefore $\min(p_i, 1 - p_i) = p_i$; otherwise, we can replace $X_i$ with $Y_i = s_i - X_i$. Let $Y_i$ be a random variable that is equal to 1 with probability $2p_i$ and 0 with probability $1 - 2p_i$, and let $Z_i$ be a random variable that is equal to $s_i$ with probability 1/2 and 0 with probability 1/2. It is easy to see that $X_i = Y_i Z_i$. Let $Y = Y_1 + \cdots + Y_t$ and $P = \sum_{i=1}^{t} p_i$. Notice that $E[Y] = 2P$. Then by Lemma 10 and the Chernoff bound we have
$$\Pr[r \le \lambda_1 X_1 + \cdots + \lambda_t X_t < r + s] = \Pr[r \le \lambda_1 Y_1 Z_1 + \cdots + \lambda_t Y_t Z_t < r + s]$$
$$\le \Pr[r \le \lambda_1 Y_1 Z_1 + \cdots + \lambda_t Y_t Z_t < r + s \mid Y \ge P] + \Pr[Y < P] \le \frac{c}{\sqrt{P}} + e^{-P/4}.$$
This completes the proof.

Lemma 12. Let $X_1$ and $X_2$ be independent random variables and s be any fixed real number. Suppose $X_1$ takes values $x_1, x_2, \ldots, x_\ell$ with probabilities $p_1, \ldots, p_\ell$, respectively. Let $Y_1$ be a random variable that takes values $y, x_3, x_4, \ldots, x_\ell$ with probabilities $p_1 + p_2, p_3, \ldots, p_\ell$. Then for $y = x_1$ or $y = x_2$ we have
$$\max_r \Pr[r \le X_1 + X_2 < r + s] \le \max_r \Pr[r \le Y_1 + X_2 < r + s].$$
Proof. Suppose $\max_r \Pr[r \le X_1 + X_2 < r + s] = p_0$ and $\Pr[r_0 \le X_1 + X_2 < r_0 + s] = p_0$. Now we choose $y = x_2$ if
$$\Pr[r_0 - x_1 \le X_2 < r_0 - x_1 + s] \le \Pr[r_0 - x_2 \le X_2 < r_0 - x_2 + s] \qquad (6)$$
and $y = x_1$ otherwise. Suppose (6) is true. Then $y = x_2$ and
$$\max_r \Pr[r \le X_1 + X_2 < r + s] = \Pr[r_0 \le X_1 + X_2 < r_0 + s]$$
$$= \sum_x \Pr[X_1 = x]\Pr[r_0 - x \le X_2 < r_0 - x + s]$$
$$= p_1 \Pr[r_0 - x_1 \le X_2 < r_0 - x_1 + s] + p_2 \Pr[r_0 - x_2 \le X_2 < r_0 - x_2 + s] + \sum_{x \notin \{x_1, x_2\}} \Pr[X_1 = x]\Pr[r_0 - x \le X_2 < r_0 - x + s]$$
$$\le (p_1 + p_2)\Pr[r_0 - x_2 \le X_2 < r_0 - x_2 + s] + \sum_{x \notin \{x_1, x_2\}} \Pr[X_1 = x]\Pr[r_0 - x \le X_2 < r_0 - x + s]$$
$$\le \Pr[Y_1 = x_2]\Pr[r_0 - x_2 \le X_2 < r_0 - x_2 + s] + \sum_{x \notin \{x_1, x_2\}} \Pr[Y_1 = x]\Pr[r_0 - x \le X_2 < r_0 - x + s]$$
$$= \Pr[r_0 \le Y_1 + X_2 < r_0 + s] \le \max_r \Pr[r \le Y_1 + X_2 < r + s].$$
We will call this transformation a merging of the two values $x_1$ and $x_2$ of the random variable $X_1$ into one value in $Y_1$. We prove the following property of the merging transformation.

Lemma 13. Let X be a random variable such that $\Pr[X > s] \ge p$ and also $\Pr[X < 0] \ge p$. Then we can merge the values of X into two values $s_1$ and $s_2$ of a random variable Y such that
1. $s_2 - s_1 \ge s/2$;
2. $\Pr[Y = s_1] \ge p$ and $\Pr[Y = s_2] \ge p$.

Proof. We first merge all the values that are greater than or equal to s, then all the values that are less than or equal to 0, and then those that are between 0 and s. We get a random variable Z that takes three values $s'$, $s''$ and $s'''$, where $s' \le 0 < s'' < s \le s'''$, $\Pr[Z = s'] \ge p$ and $\Pr[Z = s'''] \ge p$. Now either $s'' - s' \ge s/2$ or $s''' - s'' \ge s/2$. If $s'' - s' \ge s/2$ then we merge $s''$ and $s'''$, and if $s''' - s'' \ge s/2$ we merge $s'$ and $s''$.

Now we prove our main result.

Lemma 14. Let $a_1, a_2, \ldots, a_t$, $b_1, b_2, \ldots, b_t$ and $s \ge 0$ be real numbers such that $b_i - a_i \ge s$ for all i, and let $X_1, \ldots, X_t$ be independent random variables. Suppose that there is $p_i$ with $0 < p_i < 1$ such that $\Pr[X_i \le a_i] \ge p_i$ and $\Pr[X_i \ge b_i] \ge p_i$ for all i. Then, there is a constant c such that for any real number r and integer $\rho \ge 1$ we have
$$\Pr\left[r \le \sum_{i=1}^{t} X_i < r + \rho s\right] \le \frac{c\rho}{\sqrt{\sum_{i=1}^{t} p_i}}.$$

Proof. By Lemma 13 we can merge the values of each $X_i$ into a new random variable $Y_i$ that takes two values $a_i$ and $b_i$ where $b_i - a_i \ge s/2$, $\Pr[Y_i = a_i] \ge p_i$, $\Pr[Y_i = b_i] \ge p_i$, and
$$\max_r \Pr\left[r \le \sum_{i=1}^{t} X_i < r + \frac{s}{2}\right] \le \max_r \Pr\left[r \le \sum_{i=1}^{t} Y_i < r + \frac{s}{2}\right].$$
We assume without loss of generality that $a_i = 0$ for all i; otherwise consider the random variables $Y_i - a_i$. Now by Lemma 11 we get the result.
Learning Unknown Graphs

Nicolò Cesa-Bianchi¹, Claudio Gentile², and Fabio Vitale³

¹ Dipartimento di Scienze dell'Informazione, Università degli Studi di Milano, Italy
[email protected]
² Dipartimento di Informatica e Comunicazione, Università dell'Insubria, Varese, Italy
[email protected]
³ Dipartimento di Scienze dell'Informazione, Università degli Studi di Milano, Italy
[email protected]
Abstract. Motivated by a problem of targeted advertising in social networks, we introduce and study a new model of online learning on labeled graphs where the graph is initially unknown, and the algorithm is free to choose the next vertex to predict. After observing that natural nonadaptive exploration/prediction strategies (like depth-first with majority vote) badly fail on simple binary labeled graphs, we introduce an adaptive strategy that performs well under the hypothesis that the vertices of the unknown graph (i.e., the members of the social network) can be partitioned into a few well-separated clusters within which labels are roughly constant (i.e., members in the same cluster tend to prefer the same products). Our algorithm is efficiently implementable and provably competitive against the best of these partitions.

Keywords: online learning, graph prediction, unknown graph, clustering.

1 Introduction
Consider the advertising problem of targeting each member of a social network (where ties between individuals indicate a certain degree of similarity in tastes and interests) with the product he/she is most likely to buy. Unlike previous approaches to this problem —see, e.g., [20]— we consider the more interesting scenario where the network and the preferences of network members for the products in a given set are initially unknown, apart from those of a single "seed member". We assume there exists a mechanism to explore the social network by discovering new members connected (i.e., with similar interests) to members that are already known. This mechanism can be implemented in different ways, e.g., by providing incentives or rewards to members with undiscovered connections. Alternatively, if the network is hosted by a social network service (like Facebook™), the service provider itself may release the needed pieces of information. Since each discovery of a new member is presumably costly, the goal of the marketing strategy is to
minimize the number of new members not being offered their preferred product. In this respect the task may then be formulated as the following sequential problem: at each step t, find the member $q_t$, among those whose preferred product we already know, who is most likely to have undiscovered connections with the same preferred product as $q_t$. Once this member $q_t$ is identified, we obtain (through the above-mentioned mechanism) a connection $i_t$ to whom we may advertise $q_t$'s preferred product. In order to make the problem easier for the advertising agent, we make the simplifying assumption that once a product is advertised to a member the agent may observe the member's true preference, and thus know whether the decision made was optimal.

This social network advertising task can be naturally cast as a graph prediction problem where an agent sequentially explores the vertices and edges of an unknown graph with unknown labels (i.e., product preferences) assigned to its vertices. The online exploration proceeds as follows: at each time step t, the agent selects a known node $q_t$ having unexplored edges, receives a new vertex $i_t$ adjacent to $q_t$, and is required to output a prediction $\hat y_t$ for the (unknown) label $y_t$ associated with $i_t$. Then $y_t$ is revealed, and the algorithm incurs a loss $\ell(\hat y_t, y_t)$ measuring the discrepancy between prediction and true label. Thus, in some sense, the agent is learning to explore the graph along directions that, given past observations, look easier to predict. Our basic measure of performance is the agent's cumulative loss $\ell(\hat y_1, y_1) + \cdots + \ell(\hat y_n, y_n)$ over a sequence of n predictions.

In order to leverage the assumption that connected members tend to prefer the same products [20], we design agent strategies that perform well to the extent that the underlying graph labeling $y = (y_1, \ldots, y_n)$ is regular. That is, the graph can be partitioned into a small number of weakly interconnected clusters (subgroups of network members) such that the labels in each cluster are all roughly similar. In the case of binary labels and zero-one loss, a common measure of label regularity for an n-vertex graph G with labels $y = (y_1, \ldots, y_n) \in \{-1,+1\}^n$ is the cutsize $\Phi_G(y)$. This is the number of edges (i, j) in G whose endpoint vertices have disagreeing labels, $y_i \ne y_j$. The cumulative loss bound we prove in this paper holds for general (real-valued) labels and is expressed in terms of a measure of regularity that, in the special case of binary labels, is often significantly smaller than the cutsize $\Phi_G(y)$, and never larger than $2\Phi_G(y)$. Furthermore, unlike $\Phi_G(y)$, which may be even quadratic in the number of nodes, our measure of label regularity is never vacuous (i.e., it is never larger than n). In the paper we also show that the algorithm achieving this bound is suitable for large scale applications because of its small time and memory requirements.

1.1 Related Work
Online prediction of labeled graphs has also been studied in a "transductive" learning model, different from the one studied here. In this model the graph G (without labels) is known in advance, and the task is to sequentially predict the unknown labels of an adversarially chosen permutation of G's vertices. A technique proposed in [10] for transductive binary prediction is to embed the graph into a linear space using the kernel defined by the Laplacian pseudoinverse —see [16, 19]— and then run the standard (kernel) Perceptron algorithm for predicting the vertex labels. This approach guarantees that the number of mistakes is bounded by a quantity that depends linearly on the cutsize $\Phi_G(y)$. Further results involving the prediction of node labels in graphs with known structure include [2, 3, 6, 9, 11, 12, 13, 14, 15, 17].

Our exploration/prediction model also bears some similarities to the graph exploration problem introduced in [8], where the measure of performance is the overall number of edge traversals sufficient to ensure that each edge has been traversed at least once. Unlike that approach, we do not charge any cost for visits of the same node beyond the first visit. Moreover, in our setting depth-first exploration performs badly on simple graphs with binary labels (see the discussion in Sect. 2), whereas depth-first traversal is optimal in the setting of [8] for any undirected graph —see [1]. Finally, as we explain in Sect. 3, our exploration/prediction algorithm incrementally builds a spanning tree whose total cost is equal to the algorithm's cumulative loss. The problem of constructing a minimum spanning tree online is also considered in [18], although only for graphs with random edge costs.
2 The Exploration/Prediction Model
Let G = (V, E) be an unknown undirected and connected graph with vertex set $V = \{1, 2, \ldots, n\}$ and edge set $E \subseteq V \times V$. We use $y = (y_1, \ldots, y_n) \in \mathcal{Y}^n$ to denote an unknown assignment of labels $y_i \in \mathcal{Y}$ to the vertices $i \in V$, where $\mathcal{Y}$ is a given label space, e.g., $\mathcal{Y} = \mathbb{R}$ or $\mathcal{Y} = \{-1,+1\}$. We consider the following protocol between a graph exploration/prediction algorithm and an adversary. Initially, the algorithm receives an arbitrary vertex $i_0 \in V$ and its corresponding label $y_0$. For all subsequent steps $t = 1, \ldots, n-1$, let $V_{t-1} \subseteq V$ be the set of vertices visited in the first t − 1 steps, where we conventionally set $V_0 = \{i_0\}$. Then:
1. The algorithm chooses node $q_t \in V_{t-1}$; at this time the algorithm knows that $q_t$ has unexplored edges (i.e., edges connecting $q_t$ to unseen nodes in $V \setminus V_{t-1}$), though the number and destination of such edges are currently unknown to the algorithm.
2. The adversary chooses a node $i_t \in V \setminus V_{t-1}$ that is adjacent to $q_t$.
3. All edges $(i_t, j) \in E$ connecting $i_t$ to previously visited vertices $j \in V_{t-1}$ are revealed (including edge $(q_t, i_t)$).
4. The algorithm predicts the label $y_t$ of $i_t$ with $\hat y_t$.
5. The label $y_t$ is revealed, and the algorithm incurs a loss.

At each step $t = 1, \ldots, n-1$, the loss of the algorithm is $\ell(\hat y_t, y_t)$, where $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}^+$ is a fixed and known function measuring the discrepancy between $\hat y_t$ and $y_t$. For example, if $\mathcal{Y} = \mathbb{R}$, then we may set $\ell(\hat y_t, y_t) = |\hat y_t - y_t|$. The algorithm's goal is to minimize its cumulative loss $\ell(\hat y_1, y_1) + \cdots + \ell(\hat y_n, y_n)$. Note that the edges $(q_t, i_t)$, for $t = 1, \ldots, n-1$, form a spanning tree for G. A sketch of this interaction loop appears below.
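The following minimal Python sketch (our own illustration; names like `run_protocol` and the uniform-at-random adversary are hypothetical choices, not part of the model) makes the interaction explicit: the learner picks a visited node with unexplored edges, the adversary reveals a new neighbor, and the learner predicts its label and suffers a loss.

```python
import random

def run_protocol(G, labels, choose_q, predict, loss):
    """G: dict node -> set of neighbors; labels: dict node -> label.
    choose_q(frontier, G_seen): learner's choice among nodes with unexplored edges.
    predict(q, seen_labels, G_seen): learner's prediction for the new node."""
    i0 = next(iter(G))
    visited, seen_labels, total = {i0}, {i0: labels[i0]}, 0.0
    G_seen = {i0: set()}                                    # observed subgraph
    for _ in range(len(G) - 1):
        frontier = [q for q in visited if G[q] - visited]   # nodes with unexplored edges
        q = choose_q(frontier, G_seen)
        i = random.choice(sorted(G[q] - visited))           # adversary's choice
        visited.add(i)
        G_seen[i] = G[i] & visited                          # revealed edges (incl. (q, i))
        for j in G_seen[i]:
            G_seen.setdefault(j, set()).add(i)
        y_hat = predict(q, seen_labels, G_seen)             # step 4: prediction
        seen_labels[i] = labels[i]                          # step 5: label revealed
        total += loss(y_hat, labels[i])
    return total
```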
It is important to note that standard nonadaptive graph exploration strategies (combined with simple prediction rules) are suboptimal in this setting. For this purpose, consider the strategy depthFirst, performing a depth-first visit of G (partially driven by the adversarial choice of $i_t$) and predicting the label of $i_t$ through the adjacent node $q_t$ in the spanning tree generated by the visit. In the binary classification case, when $\mathcal{Y} = \{-1,+1\}$ and $\ell(\hat y, y) = I_{\{\hat y \ne y\}}$ (zero-one loss), the graph cutsize $\Phi_G(y)$ is an obvious mistake bound achieved by such a strategy. Figure 1 shows an example where depthFirst makes $\Omega(|V|)$ mistakes. This high number of mistakes is not due to the choice of the prediction rule. Indeed, the same large number of mistakes is achieved by variants of depthFirst where the predicted label is determined by the majority vote of all labels (or just of the mistaken ones) among the adjacent nodes seen so far. This holds even when the graph labeling is consistent with the majority vote predictor based on the entire graph. Similar examples can be constructed to show that visiting the graph in breadth-first order can cause $\Omega(|V|)$ mistakes.
Fig. 1. A binary labeled graph with three clusters where depthFirst can make $\Omega(|V|)$ mistakes. Edges are either arrow edges or grey edges. Arrow edges indicate predictions, and numbers on such edges denote the adversarial order of presentation. For instance, edge 3 (connecting a −1 node to a +1 node) says that depthFirst uses the −1 label associated with the start node (the current $q_t$ node) to predict the +1 label associated with the end node (the current $i_t$ node). As a matter of fact, in this example depthFirst could also predict $\hat y_t$ through a majority vote of the labels of previously observed nodes that are adjacent to $i_t$. Dark grey nodes are the mistaken nodes (for simplicity, ties are mistakes in this figure). Notice that in the dotted area we could add as many (mistaken) nodes as we like, thus making the graph cutsize $\Phi_G(y)$ arbitrarily close to |V|. These nodes would still be mistaken even if the majority vote were restricted to previously mistaken (and adjacent) nodes. This is because depthFirst is forced to err on the left-most node of the right-most cluster.
These algorithms fail mainly because their exploration strategy is oblivious to the sequence of revealed labels. Next, we show an adaptive exploration strategy that takes advantage of the revealed structure of the labeled graph in order to make a substantially smaller number of mistakes. Our algorithm cga (Clustered Graph Algorithm) learns the next “good” node qt ∈ Vt−1 to explore, and achieves a cumulative loss bound based on a notion of cluster/labeling regularity called
merging degree. This notion arises naturally as a by-product of our analysis, and can be considered a natural measure of cluster similarity of independent interest.
3 Regular Partitions and the Clustered Graph Algorithm
We are interested in designing exploration/prediction strategies that work well to the extent that the underlying graph G can be partitioned into a small number of weakly connected regions (the "clusters") such that the labels of the vertices in each cluster are similar. Before defining this property formally, we need a few key auxiliary definitions. Given a path $s_1, \ldots, s_d$ in G, a notion of path length $\lambda(s_1, \ldots, s_d)$ can be defined which is naturally related to the prediction loss. A reasonable choice might be $\lambda(s_1, \ldots, s_d) = \max_{k=2,\ldots,d} \ell(s_{k-1}, s_k)$, where we conventionally write $\ell(s_{t-1}, s_t)$ instead of $\ell(y_{s_{t-1}}, y_{s_t})$ when the labeling is understood from the context. Note that, in the binary classification case, if the nodes $s_1, \ldots, s_d$ are either all positive or all negative, then $\lambda(s_1, \ldots, s_d) = 0$. In general, we say that $\lambda$ is a path length assignment if it satisfies
$$\lambda(s_1, \ldots, s_{d-1}, s_d) \ge \lambda(s_1, \ldots, s_{d-1}) \ge 0 \qquad (1)$$
for each path $s_1, \ldots, s_{d-1}, s_d$ in G. As we see in Sect. 5, condition (1) helps in designing efficient algorithms.

Given a path length assignment $\lambda$, denote by $P_t(i,j)$ the set of all paths connecting node i to node j in $G_t = (V_t, E_t)$, the subgraph containing all nodes $V_t$ and edges $E_t$ that have been observed during the first t steps. The distance $d_t(i,j)$ between i and j is the length of the shortest path between i and j in $G_t$, i.e., $d_t(i,j) = \min_{\pi \in P_t(i,j)} \lambda(\pi)$. A partition $\mathcal{P}$ of V into subsets (or clusters) C is regular if, for all $C \in \mathcal{P}$ and for all $i \in C$, $\max_{j \in C} d(i,j) < \min_{k \notin C} d(i,k)$, where d(i,j), without subscript, denotes the length of the shortest path between i and j in the whole graph G. See Fig. 2 for an example. In a regular partition each node is closer to every node in its cluster than to any node outside it. When $-d(\cdot,\cdot)$ is taken as the similarity function, our notion of regular partition becomes equivalent to the Apresjan clusters in [4] and to the strict separation property of [5]. It is easy to see that according to (1) all subgraphs induced by the clusters of a regular partition are connected. Note that every labeled graph G = (V, E) has at least two regular partitions, since both the trivial partitions $\mathcal{P} = \{V\}$ and $\mathcal{P} = \{\{1\}, \{2\}, \ldots, \{|V|\}\}$ are regular. Moreover, if labels are binary then the notion of regular partition is equivalent to the natural partition made up of the smallest number of clusters C, each one including only nodes with the same label.

We now introduce an algorithm, cga, that takes advantage of regular partitions. As we show in Sect. 4, the cumulative loss of cga can be expressed in terms of the best regular partition of G with respect to the unknown labeling $y \in \mathbb{R}^n$. At each time step t, cga sets $\hat y_t$ to be the (known) label $y_{q_t}$ of the selected vertex $q_t \in V_{t-1}$. Hence, the algorithm's cumulative loss is the cost of the spanning tree with edges $(q_t, i_t)$, $t = 1, \ldots, |V|-1$, where edge $(q_t, i_t)$ has cost $\ell(i,j) = \ell(y_i, y_j)$.
Fig. 2. Two copies of a graph with real labels $y_i$ associated with each vertex i. On the left, a shortest path connecting the two nodes enclosed in double circles is shown. The path length is $\max_k \ell(s_{k-1}, s_k)$, where $\ell(i,j) = |y_i - y_j|$. The thick black edge is incident to the nodes achieving the max in the path length expression. On the right, the vertices of the same graph have been clustered to form a regular partition. The diameter of a cluster C (the maximum of the pairwise distances between nodes of C) is denoted by d. Similarly, $\bar d$ denotes the minimum of the pairwise distances d(i, k), where $i \in C$ and $k \in V \setminus C$. Note that $\bar d$ is determined by one of the thick black edges connecting C with the rest of the graph, while d is determined by the two nodes incident to the thick gray edge. The partition is regular, hence $d < \bar d$ holds for each cluster.
The key to controlling this cost, however, is the specific rule the algorithm uses to select the next $q_t$ based on $G_{t-1}$. The approach we propose is simple. If there exists a regular partition of G with few elements, then it does not really matter how the spanning tree is built within each element, since the cost of all these different trees will be small anyway. What matters the most is the cost of the edges of the spanning tree that join two distinct elements of the partition. In order to keep this cost small, our algorithm learns to select $q_t$ so as to avoid going back to the same region many times. This is based on the following notions. Fix an arbitrary subset $C \subseteq V$. The inner border $\partial C$ of C is the set of all nodes $i \in C$ that are adjacent to a node $j \notin C$. The outer border $\bar\partial C$ of C is the set of all nodes $j \notin C$ that are adjacent to at least one node in the inner border of C. We are now ready to define the exploration/prediction rule of our algorithm. At each time t, cga selects and predicts the label of a node adjacent to the node in the inner border of $V_{t-1}$ which is closest to the previously predicted node $i_{t-1}$. Formally,
$$\hat y_t = y_{q_t} \quad \text{where} \quad q_t = \mathop{\mathrm{argmin}}_{q \in \partial V_{t-1}} d_{t-1}(i_{t-1}, q). \qquad (2)$$
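As a sketch of rule (2), the following Python fragment (our own hypothetical names: `G_seen` is an adjacency dict over the visited nodes, `labels_seen` their revealed labels, and `inner_border` the visited nodes with unexplored edges) selects $q_t$ by a Dijkstra-style search from $i_{t-1}$ restricted to the observed subgraph, under the max-loss path length with $\ell(a,b) = |a - b|$. Dijkstra remains correct for this "bottleneck" length precisely because of the monotonicity condition (1).

```python
import heapq

def select_q(G_seen, labels_seen, i_prev, inner_border):
    """Rule (2): return the inner-border node closest to i_prev in G_{t-1},
    under the path length lambda = max of per-edge losses |y_u - y_v|."""
    dist = {i_prev: 0.0}
    heap = [(0.0, i_prev)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue                      # stale heap entry
        if u in inner_border:
            return u                      # first border node popped is the closest
        for v in G_seen[u]:
            nd = max(d, abs(labels_seen[u] - labels_seen[v]))  # bottleneck update
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return None                           # unreachable if G_{t-1} is connected
```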
Fig. 3. The behavior of cga displayed on the binary labeled graph of Fig. 1. The length of a path $s_1, \ldots, s_d$ is measured by $\max_k \ell(s_{k-1}, s_k)$ and the loss is the zero-one loss. The pictorial conventions are as in Fig. 1. As in that figure, the cutsize $\Phi_G(y)$ of this graph can be made as close to |V| as we like, yet cga makes 4 mistakes. For the sake of comparison, recall that the various versions of depthFirst can be forced to err $\Phi_G(y)$ times on this graph.
We say that cluster C is exhausted at time t if at time t the algorithm has already selected all nodes in C together with its outer border, i.e., if $C \cup \bar\partial C \subseteq V_t$. In the special but important case when labels are binary and the path length is $\lambda(s_1, \ldots, s_d) = \max_k \ell(s_{k-1}, s_k)$ ($\ell$ being the zero-one loss), the choice of node $q_t$ in (2) can be described as follows. If the cluster C where $i_{t-1}$ lies is not exhausted at the beginning of time t, then cga picks any node $q_t$ connected to $i_{t-1}$ by a path all contained in $V_{t-1} \cap C$. On the other hand, if C is exhausted, cga chooses an arbitrary node in $V_{t-1}$. Figure 3 contains a pictorial explanation of the behavior of cga, as compared to depthFirst on the same binary labeled graph as in Fig. 1. As we argue in the next section (Lemma 1 in Sect. 4), a key property of cga is that when choosing $q_t$ causes the algorithm to move out of a cluster of a regular partition, then the cluster must have been exhausted. This suggests a fundamental difference between cga and simple algorithms like depthFirst. Evidence of that is provided by comparing Fig. 1 to Fig. 3: cga is seen to make a constant number of binary prediction mistakes on simple graphs where depthFirst makes order of |V| mistakes.

The next definition provides our main measure of graph label regularity, which we relate to cga's predictive ability. Given a regular partition $\mathcal{P}$ of the vertices V of an undirected, connected and labeled graph G = (V, E), for each $C \in \mathcal{P}$ the merging degree $\delta(C)$ of cluster C is defined as $\delta(C) = \min\{|\partial C|, |\bar\partial C|\}$. The overall merging degree of the partition, denoted by $\delta(\mathcal{P})$, is given by $\delta(\mathcal{P}) = \sum_{C \in \mathcal{P}} \delta(C)$.

The merging degree $\delta(C)$ of a cluster $C \in \mathcal{P}$ quantifies the amount of interaction between C and the remaining clusters in $\mathcal{P}$. For instance, in Fig. 3 the left-most cluster has merging degree 1, the middle one has merging degree 2, and the right-most one has merging degree 1. Hence this figure shows a case in which the mistake bound of our algorithm is tight. Note that the middle cluster has merging degree 2 no matter how we increase the number of negatively labeled nodes in the dotted area (together with the corresponding outbound edges). In the binary case, it is not difficult to compare the merging degree of a partition to the graph cutsize. Since at least one edge contributing to the cutsize $\Phi_G(y)$ must be incident to each node in an inner or outer border of a cluster, $\delta(\mathcal{P})$ is never larger than $2\Phi_G(y)$. On the other hand, as suggested for example by Fig. 3, $\delta(\mathcal{P})$ is often much smaller than $\Phi_G(y)$ (observe that $\delta(\mathcal{P})$ is never larger than n, while $\Phi_G(y)$ can even be quadratic on dense graphs). Finally, as hinted again by Fig. 3, $\delta(\mathcal{P})$ is typically more robust to noise as compared to $\Phi_G(y)$. For instance, if we flip the label of the left-most node of the cluster on the right, the merging degree of the depicted partition gets affected only by a small amount, whereas the cutsize can decrease significantly. The sketch below computes $\delta(\mathcal{P})$ on a toy graph.
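The following short Python sketch (our own illustration; function names are hypothetical) computes the merging degree directly from the definition.

```python
def merging_degree(G, cluster):
    """delta(C) = min(|inner border of C|, |outer border of C|)."""
    C = set(cluster)
    inner = {i for i in C if G[i] - C}               # nodes of C with an edge leaving C
    outer = {j for i in inner for j in G[i] if j not in C}
    return min(len(inner), len(outer))

def merging_degree_of_partition(G, partition):
    return sum(merging_degree(G, C) for C in partition)

# Example on a tiny two-cluster graph (G: dict node -> set of neighbors):
G = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
print(merging_degree_of_partition(G, [{0, 1, 2}, {3, 4, 5}]))  # 1 + 1 = 2
```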
4 Analysis
This section contains the analysis of cga's predictive performance. The computational complexity analysis is contained in Sect. 5. For the sake of presentation, we single out the binary classification case, since it is an important special case of our setting. Fix an undirected and connected graph G = (V, E). The following lemma states a key property of our algorithm.

Lemma 1. Assume cga is run on a graph G with labeling $y \in \mathcal{Y}^n$, and pick any time step t > 0. Let $\mathcal{P}$ be a regular partition and assume $i_{t-1} \in C$, where C is any cluster in $\mathcal{P}$. Then C is exhausted at time t − 1 if and only if $q_t \notin C$.

Proof. First, assume C is exhausted at time t − 1, i.e., $C \cup \bar\partial C \subseteq V_{t-1}$. Then all nodes in C have been visited, and no node in C has unexplored edges. This implies $C \cap \partial V_{t-1} = \emptyset$, and the selection rule (2) then makes the algorithm pick $q_t$ outside of C.

Assume now $q_t \notin C$. Since each cluster is a connected subgraph, if the labels are binary the prediction rule ensures that cluster C is exhausted. In the general case (when labels are not binary) we can prove by contradiction that C is exhausted by analyzing the following two cases:
1. There exists $j \in C \setminus V_{t-1}$. Since the subgraph in cluster C is connected, there is a path in C connecting $i_{t-1}$ to j such that at least one node $q \in C$ on this path: (a) has unexplored edges, (b) belongs to $V_{t-1}$ (i.e., $q \in \partial V_{t-1}$), and (c) is connected to $i_{t-1}$ by a path all contained in $C \cap V_{t-1}$. Since the partition is regular, q is closer to $i_{t-1}$ than any node outside of C. Hence, by construction —see (2)— the algorithm would choose this q instead of $q_t$ (due to (c) above), thereby leading to a contradiction.
2. There exists $j \in \bar\partial C \setminus V_{t-1}$. Again, since the subgraph in cluster C is connected, there is a path in C connecting $i_{t-1}$ to a node in $\partial C$ adjacent to j. Then we fall back into the previous case since at least one node q on this path: (a) has unexplored edges, (b) belongs to $V_{t-1}$, and (c) is connected to $i_{t-1}$ by a path all contained in $C \cap V_{t-1}$.

We begin by analyzing the special case of binary labels and zero-one loss.
Theorem 1. If cga is run on an undirected and connected graph G with binary labels, then the total number m of mistakes satisfies $m \le \delta(\mathcal{P})$, where $\mathcal{P}$ is the partition of V made up of the smallest number of clusters, each including only nodes with the same label.¹

¹ Note that such a $\mathcal{P}$ is a regular partition of V. Moreover, one can show that for this partition the bound in the theorem is never vacuous.

The key idea of the proof of this theorem is the following. Fix a cluster $C \in \mathcal{P}$. In each time step t when both $q_t$ and $i_t$ belong to C a mistake never occurs. The remaining time steps are of two kinds only: (1) incoming lossy steps, where node $i_t$ belongs to the inner border of C; (2) outgoing lossy steps, where $i_t$ belongs to the outer border of C. With each such step we can thus uniquely associate a node $i_t$ in either (inner or outer) border of C. The overall loss involving C, however, is typically much smaller than the sum of border cardinalities. This is because, in general, in each given cluster incoming and outgoing steps alternate, since the algorithm first enters and then leaves the cluster. Hence, incoming and outgoing steps must occur the same number of times, and their sum must then be at most twice the minimum of the sizes of the borders (what we called the merging degree of the cluster). The only exception to this alternating pattern occurs when a cluster gets exhausted. In this case an incoming step is not followed by any outgoing step for the exhausted cluster.

Proof of Theorem 1. Index by $1, \ldots, |\mathcal{P}|$ the clusters in $\mathcal{P}$. We abuse the notation and use $\mathcal{P}$ also to denote the set of cluster indices. Let k(t) be the index of the cluster which $i_t$ belongs to, i.e., $i_t \in C_{k(t)}$. We say that step t is a lossy step if $\hat y_t \ne y_t$, i.e., the label of $q_t$ is different from the label of $i_t$. A step t in which a mistake occurs is incoming for cluster i (denoted by $* \to i$) if $q_t \notin C_i$ and $i_t \in C_i$, and it is outgoing for cluster i (denoted by $i \to *$) if $q_t \in C_i$ and $i_t \notin C_i$. An outgoing step for cluster $C_i$ is regular if the previous step in which the algorithm made a mistake is incoming for $C_i$. All other outgoing steps are called irregular. Let $M_{\to i}$ ($M^{reg}_{i\to}$) be the set of all incoming (regular outgoing) lossy steps for cluster $C_i$. Also, let $M^{irr}_{i\to}$ be the set of all irregular outgoing lossy steps for $C_i$.

For each $i \in \mathcal{P}$, define an injective mapping $\mu_i : M^{reg}_{i\to} \to M_{\to i}$ as follows (see Fig. 4 for reference): each lossy step t in $M^{reg}_{i\to}$ is mapped to the previous step $t' = \mu_i(t)$ when a mistake occurred. Lemma 1 ensures that such a step must be incoming for i, since t is a regular outgoing step. This shows that $|M^{reg}_{i\to}| \le |M_{\to i}|$. Now, let t be any irregular outgoing step for some cluster, let $t'$ be the last lossy step occurring before time t, and set $j = k(t')$. The very definition of an irregular lossy step, combined with Lemma 1, allows us to conclude that $t'$ is the last lossy step involving cluster $C_j$. This implies that $t'$ cannot be followed by an outgoing lossy step $j \to *$. Hence $t'$ is not in the image of $\mu_j$, and the previous inequality for $|M^{reg}_{i\to}|$ can be refined as $|M^{reg}_{i\to}| \le |M_{\to i}| - I_i$. Here $I_i$ is the indicator function of the following event: "the very last lossy step t such that either $q_t$ or $i_t$ belongs to $C_i$ is incoming for $C_i$". We now claim that $\sum_{i\in\mathcal P} I_i \ge \sum_{i\in\mathcal P} |M^{irr}_{i\to}|$. In fact, if we let t be an irregular lossy step and i be the index of the cluster for which
the previous lossy step $t'$ is incoming, the fact that t is irregular implies that $C_i$ must be exhausted between time $t'$ and time t, which in turn implies that $I_i = 1$, since $t'$ must be the very last lossy step involving cluster $C_i$. Hence
$$m = \sum_{i\in\mathcal P} \left|M^{reg}_{i\to} \cup M^{irr}_{i\to}\right| \le \sum_{i\in\mathcal P} \left(|M_{\to i}| - I_i + |M^{irr}_{i\to}|\right) \le \sum_{i\in\mathcal P} |M_{\to i}|. \qquad (3)$$
Next, for each $i \in \mathcal{P}$ we define two further injective mappings that associate with each incoming lossy step $* \to i$ a vertex in the inner border of $C_i$ and a vertex in the outer border of $C_i$. This shows that $|M_{\to i}| \le \min\{|\partial C_i|, |\bar\partial C_i|\} = \delta(C_i)$ for each $i \in \mathcal{P}$. Together with (3) this would complete the proof (see again Fig. 4 for a pictorial explanation).
Fig. 4. Sequence (starting from the left) of incoming and regular outgoing lossy steps involving a given cluster Ci . We only show the border nodes contributing to lossy steps. We map injectively each regular outgoing lossy step t to the previous (incoming) lossy step μi (t). We also map injectively each incoming lossy step s to the node ν1 (s) in the inner border, whose label was predicted at time s. Finally, we map injectively s also to the node ν2 (s) in the outer border that caused the previous (outgoing) lossy step for the same cluster.
The first injective mapping $\nu_1 : M_{\to i} \to \partial C_i$ is easily defined: $\nu_1(t) = i_t \in \partial C_i$. This is an injection because the algorithm can incur loss on a vertex at most once. The second injective mapping $\nu_2 : M_{\to i} \to \bar\partial C_i$ is defined in the following way. Let $M_{\to i}$ be equal to $\{t_1, \ldots, t_k\}$, with $t_1 < \cdots < t_k$. If $t = t_1$ then $\nu_2(t)$ is simply $q_t \in \bar\partial C_i$. If instead $t = t_j$ with $j \ge 2$, then $\nu_2(t) = i_{t'} \in \bar\partial C_i$, where $t'$ is an outgoing lossy step $i \to *$ lying between $t_{j-1}$ and $t_j$. Note that cluster $C_i$ cannot be exhausted after step $t_{j-1}$, since another incoming lossy step $* \to i$ occurs at time $t_j > t_{j-1}$. Combined with Lemma 1, this guarantees the existence of such a $t'$. Moreover, no subsequent outgoing lossy step $i \to *$ can mispredict the same label $y_{i_{t'}}$.
As we already noted, the edges (qt , it ) produced during the online functioning of the algorithm form a spanning tree T for G. Therefore cga’s number of mistakes m is always equal to ΦT (y). This shows that an obvious lower bound on m is the total number of clusters |P|, i.e., the cost of the minimum spanning tree for G. In fact, it is not difficult to prove that an adaptive adversary can always force
any algorithm working within our learning protocol to make $\Omega(|\mathcal P|)$ mistakes. This simple observation can be strengthened so as to match the upper bound in Theorem 1.

Theorem 2. For all undirected and connected graphs G with n nodes and degree bounded by a constant, for all K < n, and for any (randomized) exploration/prediction strategy, there exists a labeling y of G's vertices such that the strategy makes at least K/2 mistakes (in expectation) with respect to the algorithm's internal randomization, while $\delta(\mathcal P) = O(K)$.

The above lower bound, whose proof is omitted due to space limitations, can actually be shown to hold even in cases when G does not have bounded degree nodes, like cliques or general trees.

We now turn to the general case of nonbinary labels. The following definitions are useful for expressing the cumulative loss bound of our algorithm. Let $\mathcal P$ be a regular partition of the vertex set V and fix a cluster $C \in \mathcal P$. We say that edge $(q_t, i_t)$ causes an inter-cluster loss at time t if one of the two nodes of this edge lies in $\partial C$ and the other lies in $\bar\partial C$. Edge $(q_t, i_t)$ causes an intra-cluster loss when both $q_t$ and $i_t$ are in C. We denote by $\ell(C)$ the largest inter-cluster loss in C, i.e.,
$$\ell(C) = \max_{i\in\partial C,\ j\in\bar\partial C,\ (i,j)\in E} \ell(y_i, y_j).$$
Also, $\ell^{\mathcal P}_{\max}$ is the maximum inter-cluster loss in the whole graph G, i.e., $\ell^{\mathcal P}_{\max} = \max_{C\in\mathcal P} \ell(C)$. We also set for brevity $\bar\ell_{\mathcal P} = |\mathcal P|^{-1} \sum_{C\in\mathcal P} \ell(C)$. Finally, we define $\varepsilon(C) = \max_{T_C} \sum_{(i,j)\in E(T_C)} \ell(y_i, y_j)$, where the max is over all spanning trees $T_C$ of C and $E(T_C)$ is the edge set of $T_C$. Note that $\varepsilon(C)$ bounds from above² the total loss incurred in all steps t where $q_t$ and $i_t$ both belong to C.

² cga's cumulative loss is $\sum_{t=1}^{|V|-1} \ell(q_t, i_t)$, where the edges $(q_t, i_t)$, $t = 1, \ldots, |V|-1$, form a spanning tree for G; hence the subset of such edges which are incident to nodes in C forms a spanning forest for C. Our definition of $\varepsilon(C)$ takes into account that the total loss associated with the edge set of a spanning tree $T_C$ for C is at least as large as the total loss associated with the edge set $E(\mathcal F)$ of any spanning forest $\mathcal F$ for C such that $E(\mathcal F) \subseteq E(T_C)$.

In the above definitions, $\ell(C)$ is a measure of connectivity of C to the remaining clusters, $\varepsilon(C)$ is a measure of "internal cohesion" of C, while $\ell^{\mathcal P}_{\max}$ and $\bar\ell_{\mathcal P}$ give global distance measures among the clusters within $\mathcal P$. The following theorem shows that cga's cumulative loss can be bounded in terms of the regular partition $\mathcal P$ that best trades off total intra-cluster loss (expressed by $\varepsilon(C)$) against total inter-cluster loss (expressed by $\delta(C)$ times the largest inter-cluster loss $\ell(C)$). It is important to stress that cga never explicitly computes this optimal partition: it is the selection rule for $q_t$ in (2) that guarantees this optimal behavior.

Theorem 3. If cga is run on an undirected and connected graph G with arbitrary real labels, then the cumulative loss can be bounded as
$$\sum_{t=1}^{n} \ell(\hat y_t, y_t) \le \min_{\mathcal P}\left\{ |\mathcal P|\left(\ell^{\mathcal P}_{\max} - \bar\ell_{\mathcal P}\right) + \sum_{C\in\mathcal P}\Big(\varepsilon(C) + \ell(C)\,\delta(C)\Big)\right\}, \qquad (4)$$
where the minimum is over all regular partitions $\mathcal P$ of V.

Remark 1. If $\ell$ is the zero-one loss, then the bound in (4) reduces to
$$\sum_{t=1}^{n} \ell(\hat y_t, y_t) \le \min_{\mathcal P} \sum_{C\in\mathcal P}\big(\varepsilon(C) + \delta(C)\big). \qquad (5)$$
This shows that in the binary case the total number of mistakes can also be bounded by the maximum number of edges connecting different clusters that can be part of a spanning tree for G. In the binary case, (5) achieves its minimum either on the trivial partition $\mathcal P = \{V\}$ or on the partition made up of the smallest number of clusters C, each one including only nodes with the same label (as in Theorem 1). In most cases, the nontrivial regular partition is the minimizer of (5), so that the intra-cluster term $\varepsilon(C)$ disappears. Then the bound only includes the sum of merging degrees (w.r.t. that partition), thereby recovering the bound in Theorem 1. However, in certain degenerate cases, the trivial partition $\mathcal P = \{V\}$ turns out to be the best one. In such a case, the right-hand side of (5) becomes $\varepsilon(V)$ which, in turn, is bounded by $\Phi_G(y)$.

The proof of Theorem 3 is similar to the one for the binary case, hence we only emphasize the main differences. Let $\mathcal P$ be a regular partition of V. Clearly, no matter how each $C \in \mathcal P$ is explored, the contribution to the total loss of $\ell(q_t, i_t)$ for $q_t, i_t \in C$ is bounded by $\varepsilon(C)$. The remaining losses contributed by any cluster C are of two kinds only: losses on incoming steps, where the node $i_t$ belongs to the inner border of C, and losses on outgoing steps, where $i_t$ belongs to the outer border of C. As for the binary case, with each such step we can thus associate a node in the inner and the outer border of C, since incoming and outgoing steps alternate for each cluster. The exception is when a cluster is exhausted, which, at first glance, seems to require adding an extra term as big as $\ell^{\mathcal P}_{\max}$ times the size $|\mathcal P|$ of the partition (this term could have a significant impact for certain graphs). However, as explained in the proof below, $\ell^{\mathcal P}_{\max}$ can be replaced by the potentially much smaller term $\ell^{\mathcal P}_{\max} - \bar\ell_{\mathcal P}$. In fact, in certain cases this extra term disappears, and the final bound we obtain is just (5).

Proof of Theorem 3. Fix an arbitrary regular partition $\mathcal P$ of V and index by $1, \ldots, |\mathcal P|$ the clusters in it. We abuse the notation and use $\mathcal P$ also to denote the set of cluster indices. We crudely upper bound the total loss incurred during intra-cluster lossy steps by $\sum_{C\in\mathcal P} \varepsilon(C)$. Hence, in the rest of the proof we focus on bounding the total loss incurred during inter-cluster lossy steps only. We say that step t is a lossy step if $\ell(q_t, i_t) > 0$, and we distinguish between intra-cluster lossy steps (when $q_t$ and $i_t$ belong to the same cluster) and inter-cluster lossy steps (when $q_t$ and $i_t$ belong to different clusters). We define incoming and outgoing (regular and irregular) inter-cluster lossy steps for a given cluster $C_i$ (and the corresponding sets $M_{\to i}$, $M^{reg}_{i\to}$ and $M^{irr}_{i\to}$) as in the binary case proof, as well
as the injective mapping $\mu_i$. In the binary case we bounded $|M^{reg}_{i\to}|$ by $|M_{\to i}| - I_i$. In a similar fashion, we now bound $\sum_{t\in M^{reg}_{i\to}} \ell_t$ by $\ell(C_i)\big(|M_{\to i}| - I_i\big)$, where we set for brevity $\ell_t = \ell(q_t, i_t)$. We can write
$$\sum_{i\in\mathcal P}\ \sum_{t\in M^{reg}_{i\to}\cup M^{irr}_{i\to}} \ell_t \le \sum_{i\in\mathcal P}\left(\ell(C_i)\big(|M_{\to i}| - I_i\big) + \ell^{\mathcal P}_{\max}|M^{irr}_{i\to}|\right)$$
$$\le \sum_{i\in\mathcal P} \ell(C_i)|M_{\to i}| + \sum_{j\in\mathcal P : I_j = 1}\left(\ell^{\mathcal P}_{\max} - \ell(C_j)\right)$$
$$\le \sum_{i\in\mathcal P} \ell(C_i)|M_{\to i}| + |\mathcal P|\left(\ell^{\mathcal P}_{\max} - \bar\ell_{\mathcal P}\right),$$
where the second inequality follows from $\sum_{i\in\mathcal P} I_i \ge \sum_{i\in\mathcal P} |M^{irr}_{i\to}|$ (as for the regular partition considered in the binary case). The proof is concluded after defining the two injective mappings $\nu_1$ and $\nu_2$ as in the binary case, and bounding again $|M_{\to i}|$ through $\delta(C_i)$.
5 Computational Complexity
In this section we briefly describe an efficient implementation of cga and discuss some improvements for the special case of binary labels. This implementation shows that cga is especially suited to large scale applications. Recall that the path length assignment $\lambda$ is a parameter of the algorithm and satisfies (1). In order to develop a consistent argument about cga's time and space requirements, we need to make assumptions on the time it takes to compute this function. If we are given the distance between any pair of nodes i and j, and the loss $\ell(j, j')$ for any $j'$ adjacent to j, we assume we are able to compute in constant time the length of the shortest path $i, \ldots, j, j'$. This assumption is easily seen to hold for many natural path length assignments $\lambda$ over graphs, for instance $\lambda(s_1, \ldots, s_d) = \max_k \ell(s_{k-1}, s_k)$ and $\lambda(s_1, \ldots, s_d) = \sum_k \ell(s_{k-1}, s_k)$ —note that both fulfill (1).

Because of the above assumption on the path length $\lambda$, in the general case of real labels cga can be implemented using the well-known Dijkstra algorithm for single-source shortest paths (see, e.g., [7, Ch. 21]). After all nodes in $V_{t-1}$ and all edges incident to $i_t$ have been revealed, cga computes the distance between $i_t$ and any other node in $V_{t-1}$ by invoking Dijkstra's algorithm on the subgraph $G_t$, so that cga can easily find node $q_{t+1}$. If Dijkstra's algorithm is implemented with Fibonacci heaps [7, Ch. 25], the total time required for predicting all |V| labels is³ $O(|V||E| + |V|^2 \log|V|)$. On the other hand, the space complexity is always linear in the size of G. A naive end-to-end sketch along these lines is given below.
³ In practice, the actual running time is often far less than $O(|V||E| + |V|^2\log|V|)$, since at each time step t Dijkstra's algorithm can be stopped as soon as the node of $\partial V_{t-1}$ nearest to $i_t$ in $G_t$ has been found.
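Here is a compact, self-contained Python sketch of such a naive implementation (our own names; it uses a binary heap rather than Fibonacci heaps, the max-loss path length, and a deterministic stand-in adversary). The early break when a frontier node is popped plays the role of the early stopping mentioned in the footnote.

```python
import heapq

def cga_run(G, labels, start, loss=lambda a, b: abs(a - b)):
    """Naive cga: explore G (dict node -> set of neighbors) from `start`,
    predicting each new node's label with the label of q_t chosen by rule (2).
    Path length is the max of per-edge losses. Returns the cumulative loss."""
    visited, known, total, last = {start}, {start: labels[start]}, 0.0, start
    while len(visited) < len(G):
        # Early-stopping Dijkstra from the last node, over the observed subgraph.
        dist, heap, q = {last: 0.0}, [(0.0, last)], None
        while heap:
            d, u = heapq.heappop(heap)
            if d > dist.get(u, float("inf")):
                continue                # stale heap entry
            if G[u] - visited:          # u is in the inner border of V_{t-1}
                q = u
                break
            for v in G[u] & visited:    # only edges among visited nodes are revealed
                nd = max(d, loss(known[u], known[v]))
                if nd < dist.get(v, float("inf")):
                    dist[v] = nd
                    heapq.heappush(heap, (nd, v))
        i = min(G[q] - visited)         # adversary's choice (here: deterministic)
        y_hat = known[q]                # cga's prediction: label of q_t
        known[i] = labels[i]            # true label revealed
        total += loss(y_hat, labels[i])
        visited.add(i)
        last = i
    return total
```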
We now sketch the binary case. The additional assumption $\lambda(s_1, \ldots, s_d) = \max_k \ell(s_{k-1}, s_k)$ allows us to exploit the simple structure of regular partitions. Coarsely speaking, we maintain information about the current inner border and clusters, and organize this information in a balanced tree, connecting the nodes lying in the same cluster through specially designed lists. In order to describe this implementation, it is important to observe that, since the graph is revealed incrementally, it might be the case that a single cluster C in G at time t happens to be split into several disconnected parts in $G_t$. We call sub-cluster each maximal set of nodes that are part of the same uniformly labeled and connected subgraph of $G_t$. The main data structures we use (further details are omitted due to space limitations) for organizing the nodes observed so far by the algorithm combine the following:
– A self-balancing binary search tree T containing the labeled nodes in $V_t$. We will refer to nodes in $V_t$ and to nodes in T interchangeably.
– Given a sub-cluster C, all nodes in $C \cap \partial V_t$ are connected via a special list called the border sub-cluster list. The remaining nodes in C are connected through a list called the internal sub-cluster list.
– All nodes in each sub-cluster $C \subseteq V_t$ are linked to a special time-varying set called the sub-cluster record. This record enables access to the first and last element of both the border and the internal sub-cluster list of C. The sub-cluster record also contains the size of C.

The above data structures are intended to support the following main operations, which are executed in the following order at each time step t, just after the algorithm has selected $q_t$: (1) insertion of $i_t$; when $i_t$ is chosen by the adversary, cga also receives the list $N(i_t)$ of all nodes in $V_{t-1}$ adjacent to $i_t$; (2) merging of sub-clusters required after the disclosure of $y_t$; (3) update of border and internal sub-cluster lists (since some nodes in $\partial V_{t-1}$ are not in $\partial V_t$); (4) choice of $q_{t+1}$. The merging operation can be implemented as union-by-rank in standard union-find data structures (e.g., [7, Ch. 22]); a minimal sketch of this step is given after this paragraph. The overall running time of these merging operations over |V| nodes is smaller than $O(|V|\log|V|)$. In fact, the dominating cost in the time complexity is the cost of reaching, at each time t, the nodes of $V_{t-1}$ adjacent to $i_t$. Each of these neighbors of $i_t$ can be bijectively associated with an edge of E, the height of the tree T being at most logarithmic in |V|. Hence the overall running time for predicting |V| labels is $O(|E|\log|V| + |V|\log|V|) = O(|E|\log|V|)$, which is the best one can hope for (an obvious lower bound is |E|) up to a logarithmic factor. As for space complexity, it is important to stress that at every step t the algorithm first stores and then "throws away" the received node list $N(i_t)$ (in the worst case, the length of $N(i_t)$ is linear in |V|). The space complexity is therefore O(|V|). This optimal use of space is one of the most important practical strengths of cga, since the algorithm never needs to store the whole graph seen so far.
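The merging step alone can be sketched with a standard union-find structure; the Python class below (our own minimal sketch, with path compression and union by rank; the border/internal list bookkeeping and the balanced tree T are omitted) merges the sub-cluster of a newly revealed node with those of its equally labeled, already-visited neighbors.

```python
class SubClusters:
    """Union-find over observed nodes; one set per uniformly labeled,
    connected sub-cluster (binary-label case)."""
    def __init__(self):
        self.parent, self.rank, self.size = {}, {}, {}

    def add(self, v):
        self.parent[v], self.rank[v], self.size[v] = v, 0, 1

    def find(self, v):
        while self.parent[v] != v:
            self.parent[v] = self.parent[self.parent[v]]  # path compression
            v = self.parent[v]
        return v

    def union(self, u, v):                                # union by rank
        ru, rv = self.find(u), self.find(v)
        if ru == rv:
            return
        if self.rank[ru] < self.rank[rv]:
            ru, rv = rv, ru
        self.parent[rv] = ru
        self.size[ru] += self.size[rv]
        if self.rank[ru] == self.rank[rv]:
            self.rank[ru] += 1

# When a new node i_t with label y_t is revealed, merge it with every
# previously seen neighbor carrying the same label:
#   for j in N(i_t):
#       if labels[j] == y_t:
#           subclusters.union(i_t, j)
```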
6 Conclusions and Ongoing Research
We have presented a first step towards the study of problems related to learning (labeled) graph exploration strategies. This is a significant departure from more
standard approaches assuming prior knowledge of the underlying graph structure (e.g., [2, 3, 6, 9, 10, 11, 12, 13, 14, 17] and references therein). We are currently investigating to what extent our approach can be extended to weighted graphs. In order to exploit the benefits of edge weights, our protocol in Sect. 2 could be modified to let cga observe the weights of all edges incident to the current node. Whenever intra-cluster edges are heavier than inter-cluster ones, our algorithm can take advantage of the additional weight information. This calls for an analysis able to capture the interaction between node labels and edge weights.

Acknowledgments. We would like to thank the ALT 2009 reviewers for their comments, which greatly improved the presentation of this paper. This work was supported in part by the PASCAL2 Network of Excellence under EC grant 216886. This publication only reflects the authors' views.
References

[1] Albers, S., Henzinger, M.: Exploring unknown environments. SIAM Journal on Computing 29(4), 1164–1188 (2000)
[2] Blum, A., Chawla, S.: Learning from labeled and unlabeled data using graph mincuts. In: Proc. 18th ICML. Morgan Kaufmann, San Francisco (2001)
[3] Blum, A., Lafferty, J., Rwebangira, M., Reddy, R.: Semi-supervised learning using randomized mincuts. In: Proc. 21st ICML. ACM Press, New York (2004)
[4] Bryant, D., Berry, V.: A structured family of clustering and tree construction methods. Advances in Applied Mathematics 27, 705–732 (2001)
[5] Balcan, N., Blum, A., Vempala, S.: A discriminative framework for clustering via similarity functions. In: Proc. 40th STOC. ACM Press, New York (2008)
[6] Cesa-Bianchi, N., Gentile, C., Vitale, F.: Fast and optimal prediction of a labeled tree. In: Proc. 22nd COLT. Omnipress (2009)
[7] Cormen, T.H., Leiserson, C.E., Rivest, R.L.: Introduction to Algorithms. MIT Press, Cambridge (1990)
[8] Deng, X., Papadimitriou, C.H.: Exploring an unknown graph. In: Proc. 31st FOCS, pp. 355–361. IEEE Press, Los Alamitos (1990)
[9] Hanneke, S.: An analysis of graph cut size for transductive learning. In: Proc. 23rd ICML, pp. 393–399. ACM Press, New York (2006)
[10] Herbster, M., Pontil, M.: Prediction on a graph with the Perceptron. In: NIPS, vol. 19, pp. 577–584. MIT Press, Cambridge (2007)
[11] Herbster, M.: Exploiting cluster-structure to predict the labeling of a graph. In: Freund, Y., Györfi, L., Turán, G., Zeugmann, T. (eds.) ALT 2008. LNCS (LNAI), vol. 5254, pp. 54–69. Springer, Heidelberg (2008)
[12] Herbster, M., Lever, G., Pontil, M.: Online prediction on large diameter graphs. In: NIPS, vol. 22. MIT Press, Cambridge (2009)
[13] Herbster, M., Pontil, M., Rojas-Galeano, S.: Fast prediction on a tree. In: NIPS, vol. 22. MIT Press, Cambridge (2009)
[14] Herbster, M., Lever, G.: Predicting the labelling of a graph via minimum p-seminorm interpolation. In: Proc. 22nd COLT. Omnipress (2009)
[15] Joachims, T.: Transductive learning via spectral graph partitioning. In: Proc. 20th ICML, pp. 305–312. AAAI Press, Menlo Park (2003)
[16] Kondor, I., Lafferty, J.: Diffusion kernels on graphs and other discrete input spaces. In: Proc. 19th ICML, pp. 315–322. Morgan Kaufmann, San Francisco (2002)
[17] Pelckmans, J., Shawe-Taylor, J., Suykens, J., De Moor, B.: Margin based transductive graph cuts using linear programming. In: Proc. 11th AISTAT. JMLR Proceedings Series, pp. 360–367 (2007)
[18] Remy, J., Souza, A., Steger, A.: On an online spanning tree problem in randomly weighted graphs. Combinatorics, Probability and Computing 16, 127–144 (2007)
[19] Smola, A., Kondor, I.: Kernels and regularization on graphs. In: Schölkopf, B., Warmuth, M.K. (eds.) COLT/Kernel 2003. LNCS (LNAI), vol. 2777, pp. 144–158. Springer, Heidelberg (2003)
[20] Yang, W.S., Dia, J.B.: Discovering cohesive subgroups from social networks for targeted advertising. Expert Systems with Applications 34, 2029–2038 (2008)
Completing Networks Using Observed Data⋆

Tatsuya Akutsu¹, Takeyuki Tamura¹, and Katsuhisa Horimoto²

¹ Bioinformatics Center, Institute for Chemical Research, Kyoto University, Gokasho, Uji, Kyoto 611-0011, Japan
{takutsu,tamura}@kuicr.kyoto-u.ac.jp
² Computational Biology Research Center, 2-42 Aomi, Koto-ku, Tokyo 135-0064, Japan
[email protected]
Abstract. This paper studies problems of completing a given Boolean network (Boolean circuit) so that the input/output behavior is consistent with given examples, where we only consider acyclic networks. These problems arise in the study of inference of signaling networks using reporter proteins. We prove that these problems are NP-complete in general and a basic version remains NP-complete even for tree structured networks. On the other hand, we show that these problems can be solved in polynomial time for partial k-trees of bounded (constant) indegree if a logarithmic number of examples are given.
1 Introduction
Inference of biological networks, which include genetic networks, protein-protein interaction networks and signaling networks, is an important topic in bioinformatics and computational systems biology. For the inference of genetic networks, extensive studies have been done over the past decade. The objective of this problem is, given a series of gene expression profiles (a series of states of all genes under various environments and/or time steps), to infer a function along with input genes that regulates each gene, where a set of such functions constitutes a genetic network. In the inference of genetic networks, it is assumed that the states of all genes are observable under each environment and/or each time step, though there may be some noise. This assumption is reasonable because we can observe the expression levels of all genes (or almost all genes) by using such technologies as DNA microarrays and DNA chips.

However, this assumption is not reasonable when we want to infer signaling networks (i.e., signaling pathways). In this case, we need to observe activity levels or quantities of proteins. Unfortunately, it is quite difficult to observe such data, especially in living organisms. Reporter proteins (or reporter genes) are usually employed, each of which is associated with one or a few kinds of proteins [16]. However, both designing reporter proteins and introducing reporter proteins into cells are hard tasks. In particular, introducing multiple types of reporter
⋆ This work is partially supported by the Cell Array Project from NEDO, Japan, and by a Grant-in-Aid 'Systems Genomics' from MEXT, Japan.
proteins is quite hard. Therefore, in the analysis of signaling networks we can only assume that the activity levels of one or a few kinds of proteins under various environments are observed.

While it is almost impossible to infer the network from such limited information alone, we can utilize knowledge from the literature and databases. Thus, it is reasonable to assume that we have a preliminary network model of the target signaling pathway in which some parts are unclear or invalid. Using observed data on the activity levels of a single or a few types of proteins under various environments, it may be possible to modify a preliminary network model so that it is consistent with the observed data. Among the many ways to modify the network model, it is reasonable to make the minimum modification, following the principle of Occam's razor. This motivates us to study network completion problems.

In this paper, we assume a Boolean network model [12] as a model of biological networks because it is a fundamental model, a lot of theoretical and practical studies have been done on it [12], and it has also been applied to the analysis of signaling networks [10]. We assume that the network topology is given (i.e., the set of input nodes to each node is known) and that Boolean functions are already assigned to a subset of the nodes. We also assume that the set of nodes is divided into external nodes, internal nodes and output nodes, where only the activity levels of external and output nodes can be observed. Output nodes correspond to proteins whose activity levels are observed via reporter proteins, where we mainly consider the case in which there exists only one output node because it is very difficult to introduce multiple reporter proteins. External nodes correspond to proteins whose activity levels are controlled by stimuli given from outside the cell (e.g., the environment); these nodes can also be regarded as input nodes to the network. Furthermore, we assume that the network is acyclic because the state of the output node may not be determined uniquely if there exist cycles. Therefore, we can assume that the state of the output node is determined (through internal nodes) from the states of the external nodes.

Then, a basic version of the network completion problem is to determine Boolean functions for unassigned nodes so that the resulting network is consistent with a given set of examples (i.e., a series of external and output states). We also consider variants of the problem in which Boolean functions are assigned to all nodes but a minimum number of modifications (e.g., modification of Boolean functions, deletion of edges) are allowed. We show that these problems are NP-complete and that the basic version remains NP-complete even for tree structured networks. On the other hand, we show that these problems can be solved in polynomial time for partial k-trees of bounded (constant) indegree if a logarithmic number of examples are given.

There exist several related studies. As mentioned before, a lot of studies have been done on the inference of genetic networks from gene expression profiles. However, most such studies are based on statistical or heuristic approaches and only a few studies have been done from the viewpoint of computational and/or sample complexity. Akutsu et al. proposed a strategy to identify a genetic network under a Boolean model using disruption and overexpression of multiple genes [2]. They analyzed combinatorial and computational complexities and showed that it
is possible to identify a network in polynomial time using O(n^{2D}) experiments, where n is the number of nodes (i.e., the number of genes) and D is the maximum indegree (i.e., fan-in) of the network. They also analyzed the average-case sample complexity of identifying a Boolean network when random gene expression patterns are given and showed that O(log n) patterns are enough if D is bounded by a constant [1]. Ideker et al. also studied a similar Boolean model and gave more practical strategies for acyclic networks using information theoretic criteria [11]. Mochizuki used a Boolean model and developed a method to estimate an upper bound on the number of steady states from the network topology and observed data without inferring Boolean functions [14]. In these studies, it is assumed that the states of all nodes are observable.

Angluin et al. considered another model (called learning by value injection queries) in which the state of only one node (i.e., the output node) is observable while the states of an arbitrary subset of nodes can be specified [4]. They showed that this problem is NP-hard in general and needs an exponential number of queries in the worst case. They also showed that this problem can be solved in polynomial time using a polynomial number of queries if the networks are in class NC1 or AC0. This framework was further extended to analog circuits [5] and probabilistic circuits [6]. This framework is close to ours, but differs in two respects: value injection is possible for arbitrary nodes in their model whereas value injection is not allowed in our framework (because it is not biologically plausible to perform value injection queries on many genes/proteins); and the network topology is not given in their framework whereas it is given in ours. Of course, a lot of studies have been done on exact and approximate learning of Boolean functions [13]. However, the results are not directly applicable to our problems because we assume that the network topology is given and that the Boolean function assigned to each node is arbitrary but of small indegree.
2 Problem Definitions
In this paper, we only consider acyclic Boolean networks, as in [4, 5, 6]. Though the states of the nodes in a usual Boolean network are updated synchronously, we need not consider time steps because the states of all nodes in an acyclic Boolean network are determined uniquely from the states of the external nodes; thus an acyclic Boolean network is equivalent to an acyclic Boolean circuit. As a model of signaling networks, we define a Boolean network with external, internal and output nodes as follows. A Boolean network G(V, F) consists of a set V = {v_1, ..., v_n} of nodes and a list F = (f_1, ..., f_n) of Boolean functions, where each node takes a Boolean value (i.e., 0 or 1), and a Boolean function f_i(v_{i1}, ..., v_{il}) with inputs from specified nodes v_{i1}, ..., v_{il} is assigned to each internal and output node v_i. We use x̄ to denote the negation of x, and use ∧, ∨ and ⊕ to denote AND, OR and XOR, respectively. We use IN(v_i) to denote the set of input nodes v_{i1}, ..., v_{il} of v_i. We allow that some v_{ij} are not relevant (i.e., these v_{ij} do not directly affect the state of v_i). With each G(V, F) we associate a directed graph G(V, E) defined by E = {(v_j, v_i) | v_j ∈ IN(v_i)}. We use deg(v_i) to denote the indegree of v_i (i.e., |IN(v_i)| = deg(v_i)). In this paper, we assume
that G(V, E) is acyclic. We also assume that there exists only one output node, where some of the results can be extended to multiple output nodes. We assume w.l.o.g. that v_1, ..., v_h are external nodes (whose indegrees are 0) and v_n is the output node. Each node takes either 0 or 1 and the state of node v_i is denoted by v̂_i. For an internal or output node v_i, v̂_i is determined by v̂_i = f_i(v̂_{i1}, ..., v̂_{il_i}).

We have assumed so far that all f_i are known. However, f_i may not be known for some nodes v_i whereas IN(v_i) is known. Such a node is called an incomplete node. A Boolean network is called incomplete if it contains an incomplete node; otherwise it is called complete. An (h + 1)-dimensional 0-1 vector e is called an example, where the first h entries correspond to the external nodes and the last entry corresponds to the output node. An example e is called positive if e_{h+1} = 1; otherwise it is called negative. A complete Boolean network G(V, F) is consistent with e if v̂_n = e_{h+1} holds under the condition that v̂_i = e_i holds for i = 1, ..., h. We define a basic version of the network completion problem as follows (see also Fig. 1).

Definition 1. BNCMPL-1
Instance: an incomplete Boolean network G(V, F) and a set of examples {e^1, ..., e^m}.
Question: is there an assignment of Boolean functions f_i to the incomplete nodes so that the resulting network G(V, F) is consistent with all examples?

An assignment satisfying the above condition is called a completion. In the above, the set of nodes to which Boolean functions are to be assigned is specified. However, existing knowledge about the target network may contain mistakes. In such a case, it might be useful to modify the Boolean functions of a minimum number of nodes while keeping the network structure. Therefore, we define a variant of the network completion problem as follows.

Definition 2. BNCMPL-2
Instance: a complete Boolean network G(V, F), a set of examples {e^1, ..., e^m}, and a positive integer L.
Question: is there an assignment of Boolean functions f_i to at most L nodes so that the resulting network G(V, F) is consistent with all examples?

In this definition, we allow the algorithm to override the complete nodes (i.e., other Boolean functions can be assigned to nodes for which Boolean functions are already assigned). As a variant of BNCMPL-2, we can consider the problem of minimizing the number L of nodes to which another Boolean function should be assigned. This variant can be solved by solving BNCMPL-2 for L = 0 to n. Note that deletion of an edge can be regarded as a modification of a Boolean function and thus can be handled within BNCMPL-2, because we allow that some nodes in IN(v_i) are not relevant¹. In this paper, we assume in most cases that the maximum indegree is bounded by a constant D. This assumption is reasonable because it is quite hard in general to learn Boolean functions with many inputs, and O(2^n) bits are required to represent a Boolean function if an arbitrary Boolean function is allowed.
¹ All the results in this paper are valid even if all nodes in IN(v_i) must be relevant.
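To make these definitions concrete, the following Python sketch (our own; the dictionary representation is an assumption, not the paper's notation) evaluates a complete acyclic Boolean network and checks consistency with a set of examples in the sense defined above.

```python
def evaluate(network, external_values):
    """network: dict mapping each non-external node to (inputs, f),
    where f is a Boolean function of the input states; the network is
    assumed acyclic. External nodes are exactly the keys of
    external_values. Returns the state of every node."""
    state = dict(external_values)
    def value(v):
        if v not in state:               # memoized recursion; terminates
            inputs, f = network[v]       # because G(V, E) is acyclic
            state[v] = f(*(value(u) for u in inputs))
        return state[v]
    for v in network:
        value(v)
    return state

def consistent(network, output_node, externals, examples):
    """An example is a 0-1 vector: its first h entries give the states
    of the external nodes (in the order of `externals`), and its last
    entry gives the required output."""
    for e in examples:
        state = evaluate(network, dict(zip(externals, e[:-1])))
        if state[output_node] != e[-1]:
            return False
    return True
```

For instance, with externals = ['v1', 'v2'] and network = {'v3': (('v1', 'v2'), lambda x, y: x & y)}, the example (1, 1, 1) is consistent while (1, 0, 1) is not.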
3 Hardness Results
First, we show that BNCMPL-1 is NP-complete even if only one positive example is given.

Proposition 1. BNCMPL-1 is NP-complete even if one positive example is given and D = 2.

Proof. Since it is obvious that BNCMPL-1 is in NP, we show that it is NP-hard by means of a polynomial time reduction from 3-SAT [9] (see also Fig. 1). Let c_1, ..., c_M be clauses over Boolean variables x_1, ..., x_N. Then, 3-SAT is the problem of deciding whether there is an assignment of 0-1 values to x_1, ..., x_N that satisfies all the clauses (i.e., the values of all clauses are 1). We construct an incomplete network G(V, F) as follows², where we first assume that nodes with large indegree are allowed. Let V = {v_1, ..., v_{2N+M+1}} and let {v_1, ..., v_N} be the set of external nodes. Let clause c_i be defined as c_i = g_i(x_{i_1}, x_{i_2}, x_{i_3}). To each node v_{2N+i} (i = 1, ..., M), we assign the Boolean function v_{2N+i} = g_i(v_{N+i_1}, v_{N+i_2}, v_{N+i_3}). To the output node v_{2N+M+1}, we assign the Boolean function defined by v_{2N+M+1} = v_{2N+1} ∧ v_{2N+2} ∧ ... ∧ v_{2N+M}. For each i = 1, ..., N, let v_{N+i} be an incomplete node such that IN(v_{N+i}) = {v_i}. Therefore, either v_{N+i} = v_i or v_{N+i} = v̄_i is assigned to v_{N+i}³. Finally, we let e = (1, 1, 1, ..., 1). Then, it is straightforward to see that there exists a completed network G(V, F) if and only if there exists a satisfying assignment for c_1, ..., c_M. It is also seen that the reduction can be done in polynomial time.
Fig. 1. Reduction from the 3-SAT instance {x_1 ∨ x_2 ∨ x_3, x_1 ∨ x_3 ∨ x_4, x_2 ∨ x_3 ∨ x_4} to BNCMPL-1. v_{4+i} = v_i (resp. v_{4+i} = v̄_i) corresponds to x_i = 1 (resp. x_i = 0) for i = 1, ..., 4. [Figure omitted; it shows the output node v_12 on top of internal nodes v_9–v_11, incomplete nodes v_5–v_8, and external nodes v_1–v_4.]
² The construction can be simplified if we use internal nodes with degree 0.
³ Since we allow non-relevant input nodes, it is also possible that v_{N+i} = 0 or v_{N+i} = 1 is assigned. All the results in this paper are valid even if such assignments are taken into account.
Furthermore, we can modify the construction for the case of D = 2 by encoding each Boolean function assigned to v_{2N+i} for i = 1, ..., M + 1 using Boolean functions of arity 2 along with additional nodes.

Though the above result is rather straightforward, we can strengthen the proposition for tree structured networks.

Theorem 1. BNCMPL-1 is NP-complete even if the network has a tree structure and D = 2.

Proof. We show a polynomial time reduction from 3-SAT to this special case. Unlike the proof of Prop. 1, we use examples to encode the clauses. Let c_1, ..., c_M be clauses over x_1, ..., x_N in an instance of 3-SAT. From this instance, we construct G(V, F) as follows (see Fig. 2). Let V = {v_1, ..., v_{6N+1}}. For convenience, we let y_i = v_i, z_i = v_{N+i}, w_i = v_{2N+i}, p_i = v_{3N+i}, q_i = v_{4N+i}, and r_i = v_{5N+i}, where y_i, z_i, and w_i are external nodes. To the output node v_{6N+1}, we assign v_{6N+1} = r_1 ∨ r_2 ∨ ... ∨ r_N. To q_i and r_i, we assign q_i = p_i ⊕ z_i and r_i = q_i ∧ w_i, respectively. Each p_i is an incomplete node with only one input y_i. Clearly, the resulting network has a tree structure. Next, we create M examples, all of which are positive. For each clause c_j = l_{j1} ∨ l_{j2} ∨ l_{j3}, we create an example e^j such that
– for i = 1, ..., N, e^j_i = e^j_{2N+i} = 1 if x_i appears in c_j as a positive or negative literal, and e^j_i = e^j_{2N+i} = 0 otherwise,
– for i = 1, ..., N, e^j_{N+i} = 1 iff x_i appears in c_j as a negative literal,
– e^j_{3N+1} = 1.
Then, we show below that there exists a completed network G(V, F) if and only if there exists a satisfying assignment for c_1, ..., c_M. Assume that there exists a satisfying assignment (b_1, ..., b_N). For each i = 1, ..., N, we assign p_i = y_i to p_i if b_i = 1, and p_i = ȳ_i otherwise. Then, the state of r_i for
Fig. 2. Reduction from a 3-SAT instance to BNCMPL-1 for trees. For each variable x_i, a subnetwork as shown in this figure is constructed. [Figure omitted; it shows an OR output node fed by r_i = q_i ∧ w_i (AND), q_i = p_i ⊕ z_i (XOR), the incomplete node p_i with input y_i, and the external nodes y_i, z_i, w_i.]
example e^j is equal to the state of the literal corresponding to x_i if x_i appears in clause c_j, and it is 0 otherwise. Therefore, we can see that the state of v_{6N+1} is 1 for all examples. Conversely, assume that there exists a required completion. For each i = 1, ..., N, we let b_i = 1 if p_i = y_i is assigned to p_i, and b_i = 0 otherwise. Then, we can see that all clauses are satisfied. The above reduction can clearly be done in polynomial time. As in the proof of Prop. 1, we can encode the output node using nodes of indegree 2.

In the above, we assumed that negation nodes can be used. The following theorem states that BNCMPL-1 remains NP-complete even if only AND/OR nodes are allowed, where the use of 3-Coloring was inspired by [15].

Theorem 2. BNCMPL-1 is NP-complete even if only AND/OR nodes of D = 2 are allowed.

Proof. We show a polynomial time reduction from 3-Coloring [9]. 3-Coloring is, given an undirected graph G_0(V_0, E_0) with N vertices, to decide whether or not there exists a mapping χ from the set of vertices of G_0 to {1, 2, 3} such that χ(x_i) ≠ χ(x_j) holds for all {x_i, x_j} ∈ E_0. From G_0(V_0, E_0), we construct an incomplete network G(V, F) consisting of AND/OR nodes as follows (see also Fig. 3). First, we create the following nodes:
– c^1, c^2, c^3 and the output node o,
– y_i, z_i, p_i, q_i, c^1_i, c^2_i, c^3_i, w^{12}_i, w^{13}_i, w^{23}_i, r^{12}_i, r^{13}_i, r^{23}_i for i = 1, ..., N,
– s^{pq}_{i,j} for all (p, q) ∈ {(1, 2), (1, 3), (2, 1), (2, 3), (3, 1), (3, 2)} and for all {i, j} ∈ E_0 where i < j,
where c^1, c^2, c^3, the y_i's, z_i's and w^{pq}_i's are external nodes, and the c^p_i's are incomplete nodes. For each c^p_i, we let IN(c^p_i) = {c^p, y_i}. To each non-external node, a Boolean function is assigned by p_i = z_i ∧ q_i, q_i = c^1_i ∨ c^2_i ∨ c^3_i, r^{pq}_i = w^{pq}_i ∧ c^p_i ∧ c^q_i, s^{pq}_{i,j} = c^p_i ∧ c^q_j ∧ y_i ∧ y_j, and o = (∨_i p_i) ∨ (∨_{i,p,q} r^{pq}_i) ∨ (∨_{i,j,p,q} s^{pq}_{i,j}).
Fig. 3. Reduction from 3-Coloring to BNCMPL-1. Parts (i), (ii) and (iii) put constraints that at least one color is assigned to each node, that two colors cannot be assigned to the same node, and that different colors must be assigned to neighboring nodes, respectively. [Figure omitted; its three panels show the gadgets rooted at o built from q_i, p_i, r^{12}_i and s^{12}_{i,j}, together with the incomplete nodes c^p_i and the example values of the external nodes c^1, c^2, c^3, y_i, z_i, w^{12}_i in cases (i)–(iii).]
For each example e, the entry corresponding to an external/output node v is denoted by e(v). The examples are then given as follows. (i) For each i ∈ {1, ..., N}, we create e such that e(z_i) = e(y_i) = e(o) = 1, and e(v) = 0 for every other v. (ii) For each i ∈ {1, ..., N}, we create e such that e(w^{pq}_i) = e(y_i) = 1, and e(v) = 0 for every other v. (iii) For each {i, j} ∈ E_0 where i < j, we create e such that e(y_i) = e(y_j) = e(o) = 1, and e(v) = 0 for every other v. Then, we can show that there exists a valid 3-coloring of G_0(V_0, E_0) if and only if there is a required completion of G(V, F). Furthermore, this reduction can be done in polynomial time. As in the proof of Prop. 1, we can encode each node with indegree more than 2 using nodes of indegree 2.

We can also prove that BNCMPL-2 is NP-complete, where the proof is a bit involved because the Boolean functions assigned to any subset of nodes of cardinality L can be modified.

Theorem 3. BNCMPL-2 is NP-complete.

Proof. BNCMPL-2 is clearly in NP. In order to prove NP-hardness, we replace each T(r_i) in the proof of Thm. 1 by a large (but polynomial size) acyclic subnetwork G_i with an output node r'_i (see also Fig. 4), where T(v) denotes the subtree of a tree T induced by v and its descendants. Let N' = 2N + 1. In order to construct G_i, we make N' copies of T(r_i), where y_i, z_i, w_i and p_i are shared by all copies, and p_i = y_i is initially assigned to each p_i (recall that there exists no incomplete node in BNCMPL-2). Let r^j_i be the jth copy of r_i. Each G_i consists of N + 1 copies of an identical subnetwork: C^1_i, C^2_i, ..., C^{N+1}_i. The subnetwork C has N' inputs and N' outputs. The value of each output is 1 if at least N + 1 of the inputs are 1, and 0 otherwise. Therefore, each C^k_i has N' copies of a majority circuit. Since a majority circuit can be realized by an acyclic network using a polynomial number of logic gates with fan-in 2 (using half adders), the size of each C^j_i is polynomial in N. The inputs of C^1_i are the r^j_i's. The inputs of C^{j+1}_i are the outputs of C^j_i. The first output (it can be arbitrary) of C^{N+1}_i is an input to the output node o, which is the disjunction of all its inputs. Since the size of each C^j_i is polynomial, the whole network G(V, F) has polynomial size and can be constructed in polynomial time. In addition to the same examples as in Thm. 1, one negative example e^{M+1} is given, which is defined by e^{M+1}_i = 0 for all i = 1, ..., 3N + 1.

Hereafter, we show that there exists a required completion for G(V, F), {e^1, ..., e^{M+1}}, and L = N iff there exists a satisfying assignment for the 3-SAT instance. Assume that (b_1, ..., b_N) is a satisfying assignment. For each i such that b_i = 0, we replace the function p_i = y_i assigned to p_i with p_i = ȳ_i. It is to be noted that we need to replace at most N Boolean functions. By these modifications, we obtain a completed network G(V, F). Hereafter, v̂(e) denotes the state of node v in G(V, F) when example e is given. Let c_i = l_{i1} ∨ l_{i2} ∨ l_{i3} where i ∈ {1, ..., M}.
Fig. 4. Reduction from 3-SAT to BNCMPL-2. [Figure omitted; it shows the output node o (an OR) fed by the outputs r'_1, ..., r'_N of the stacked subnetworks C^1_i, ..., C^{N+1}_i of each G_i, the copies r^1_i, ..., r^{N'}_i and q^1_i, ..., q^{N'}_i below them, and the shared nodes p_i, y_i, z_i, w_i at the bottom.]
Since c_i is satisfied by (b_1, ..., b_N), one of l_{i1}, l_{i2} and l_{i3} must be 1 for each i. We assume w.l.o.g. that l_{i1} takes the value 1. Then, we can see that r̂^j_{i_1}(e^i) = 1 holds for all j = 1, ..., N', from which r̂'_{i_1}(e^i) = 1 and ô(e^i) = 1 follow. Furthermore, we can see that ô(e^{M+1}) = 0. Therefore, there exists a required completion.

Conversely, assume that there exists a required completion G(V, F). Then, we create an assignment (b_1, ..., b_N) by letting b_i = 0 iff p_i = ȳ_i is assigned to p_i. If no nodes other than the p_i's are changed⁴, we can see that (b_1, ..., b_N) is a satisfying assignment, as in the proof of Thm. 1. Therefore, we consider the case that some nodes other than the p_i's are changed. Since at most N nodes are changed, we can see that at least N + 1 outputs of each C^j_i take the value d_i(e^j) defined by

d_i(e^j) = 1 if (ŵ_i(e^j) = 1) ∧ ((p̂_i(e^j) = 1 ∧ ẑ_i(e^j) = 0) ∨ (p̂_i(e^j) = 0 ∧ ẑ_i(e^j) = 1)),
d_i(e^j) = 0 otherwise.
We can also see that for each i at least one C^j_i remains unchanged. If C^{N+1}_i is unchanged, we can see that r̂'_i(e^j) = d_i(e^j) holds for all j. If C^{N+1}_i is changed, there must exist an unchanged C^j_i for some j < N + 1. Then, we can see that for each i ∈ {1, ..., N}, one of the following holds: r̂'_i(e^j) = d_i(e^j) for all j; r̂'_i(e^j) = ¬d_i(e^j) for all j; r̂'_i(e^j) = 0 for all j; r̂'_i(e^j) = 1 for all j.
⁴ We say that a node is changed if the assigned Boolean function is replaced.
Therefore, ô(e^j) can be represented by a Boolean function of d_1(e^j), ..., d_N(e^j). Here we note that for each e^j (j < M + 1), r̂'_i(e^j) ≠ r̂'_i(e^{M+1}) holds for at least 1 and at most 3 of the r_i's (among the N r_i's). Furthermore, we can see that such r_i's are included in {r_{j_1}, r_{j_2}, r_{j_3}}, where c_j = l_{j1} ∨ l_{j2} ∨ l_{j3}. Since e^j_{3N+1} ≠ e^{M+1}_{3N+1} holds, r̂'_i(e^j) ≠ r̂'_i(e^{M+1}) holds for at least one i ∈ {j_1, j_2, j_3}. For such i, d_i(e^j) ≠ d_i(e^{M+1}) holds. Since d_i(e^{M+1}) = 0, we have d_i(e^j) = 1. Although the output node may be changed, ô(e^j) = d_1(e^j) ∨ ... ∨ d_N(e^j) always holds for any j in the resulting network G(V, F). Hence, p̂_i(e^j) = 1 holds if ẑ_i(e^j) = 0; otherwise (i.e., ẑ_i(e^j) = 1), p̂_i(e^j) = 0 holds. If ẑ_i(e^j) = 0 holds, x_i appears in c_j positively. Since p̂_i(e^j) = 1 holds in this case, b_i = 1 holds and thus c_j is satisfied. Similarly, if ẑ_i(e^j) = 1 holds, we can see that b_i = 0 holds and c_j is satisfied. Therefore, there exists a satisfying assignment for 3-SAT. The above reduction can clearly be done in polynomial time. We can encode the output node using nodes of indegree two⁵.

⁵ It should be noted that the theorem still holds if nodes encoding the output node are changed by completion.
4 Algorithms for Tree-Like Networks
In Section 3, we showed that BNCMPL-1 is NP-complete even for tree structured networks of bounded indegree (i.e., the maximum indegree is bounded by a constant). However, we can show that BNCMPL-1 and BNCMPL-2 can be solved in polynomial time for tree structured and tree-like networks of bounded (constant) indegree if the number of examples is small (i.e., O(log n)). Considering the case of a small number of examples is meaningful because only a small amount of data is usually available in the analysis of signaling networks.

4.1 Algorithms for Tree-Structured Networks
In this subsection, we present an algorithm for tree-structured networks, which will be extended to partial k-trees in Section 4.2. The algorithm is based on dynamic programming and is similar to that in [3], with some additional ideas. We give the algorithm as part of the proof of the following theorem.

Theorem 4. BNCMPL-1 is solved in polynomial time if the network structure is a rooted tree of bounded indegree and the number of examples is O(log n).

Proof. We assume for simplicity that all non-external nodes are of indegree 2; the proof and the algorithm can be extended to any tree of bounded constant indegree D while keeping the polynomial time complexity. Let F_v be an assignment of functions to the unassigned nodes in T(v). Then, v̂(e^j, F_v) denotes the state of v under assignment F_v when an example e^j is given. Let a be a 0-1 vector of size m (recall that m is the number of examples),
where the ith coordinate of a corresponds to the ith example. We define a dynamic programming table S[v, a] by

S[v, a] = 1 if there exists F_v such that v̂(e^j, F_v) = a_j for all j = 1, ..., m; S[v, a] = 0 otherwise.

It is to be noted that for each node v, we examine all possible combinations of 0-1 values against all m examples. Therefore, the size (i.e., the number of entries) of S is n · 2^m and thus is polynomial if m = O(log n). The table S[v, a] can be computed as follows. Recall that an external node v_i corresponds to the ith entry of each example. Thus, for an external node v_i, S[v_i, a] is computed by

S[v_i, a] = 1 if a_j = e^j_i holds for all j = 1, ..., m; S[v_i, a] = 0 otherwise.

For a non-external and non-incomplete node v, let f_v be the Boolean function assigned to v in G(V, F). For a non-external node v, let v^L and v^R be the left and right children of v, respectively. Then, S[v, a] is computed by

S[v, a] = 1 if there exists (f, a^L, a^R) such that S[v^L, a^L] = 1, S[v^R, a^R] = 1, and f(a^L_j, a^R_j) = a_j for all j = 1, ..., m; S[v, a] = 0 otherwise,

where f is an arbitrary Boolean function of arity 2 if v is an incomplete node, and f = f_v otherwise. Since the number of possible Boolean functions f is 2^{2^2} = 16 (2^{2^D} for the case of maximum indegree D), S[v, a] can be computed in O(m · 2^m · 2^m) = O(m · 2^{2m}) time per entry. Since there are n · 2^m entries, the total time for constructing the dynamic programming table is O(mn · 2^{3m}). Once this table is constructed, we finally check whether or not S[v_n, a] = 1 holds for the vector a such that a_j = e^j_{h+1} holds for all j = 1, ..., m. It is straightforward to see that this algorithm correctly works in O(mn · 2^{3m}) time.
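The dynamic program above translates directly into code. The following Python sketch is our own illustration under the indegree-2 assumption: Boolean functions are given as 4-entry truth tables indexed by the pair of input bits, and for each node we compute the set of achievable response vectors, which is exactly the set {a : S[v, a] = 1}.

```python
from itertools import product

def bncmpl1_tree(root, children, fixed_f, external_index, examples):
    """Dynamic program of Theorem 4 (a sketch; all names are ours).
    - children[v]: (left, right) for each non-external node v
    - fixed_f[v]: truth table (4-tuple) for complete nodes, None if incomplete
    - external_index[v]: coordinate of external node v within each example
    - examples: list of (h+1)-dimensional 0-1 vectors"""
    m = len(examples)
    all_fs = list(product((0, 1), repeat=4))   # the 16 arity-2 functions

    def apply(table, x, y):
        return table[2 * x + y]                # truth table lookup

    def S(v):
        """Set of achievable response vectors at node v; |S(v)| <= 2^m."""
        if v in external_index:
            return {tuple(e[external_index[v]] for e in examples)}
        left, right = children[v]
        fs = all_fs if fixed_f.get(v) is None else [fixed_f[v]]
        out = set()
        for aL in S(left):
            for aR in S(right):
                for f in fs:
                    out.add(tuple(apply(f, aL[j], aR[j]) for j in range(m)))
        return out

    target = tuple(e[-1] for e in examples)    # required output states
    return target in S(root)
```

Each node is visited once (a tree node has a single parent), and the set sizes are bounded by 2^m, matching the O(mn · 2^{3m}) analysis.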
For a non-external node v, S[v, a, l] is computed by S[v, a, l] = ⎧ 1, if there is (aL , aR , lL , lR ) such that S[v L , aL , lL ] = 1 and S[v R , aR , lR ] ⎪ ⎪ ⎪ R ⎪ ⎨ = 1, fv (aL j , aj ) = aj for all j = 1, . . . , m, and l = lL + lR , 1, if there is (f, aL , aR , lL , lR ) such that S[v L , aL , lL ] = 1 and S[v R , aR , lR ] ⎪ R ⎪ = 1, f (aL ⎪ j , aj ) = aj for all j = 1, . . . , m, and l = lL + lR + 1, ⎪ ⎩ 0, otherwise. It is straight-forward to see that this algorithm correctly works in O(mn2 L3 ·23m ) time. It is also possible to minimize the number of errors (i.e., to minimize |{j|ejh+1 = vˆn (ej )}) for BNCMPL-1 (resp. BNCMPL-2) by examining all S[vn , a] (resp. S[vn , a, l]) if m = O(log n). 4.2
Algorithms for Partial k-Trees
We can extend the above-mentioned algorithms to partial k-trees. A partial k-tree is a graph with treewidth at most k, where the treewidth is defined via tree decomposition [8]. A tree decomposition of a graph G(V, E) is a pair ⟨T(V_T, E_T), (B_t)_{t∈V_T}⟩, where T(V_T, E_T) is a rooted tree and (B_t)_{t∈V_T} is a family of subsets of V such that (see also Fig. 5):
– for every v ∈ V, B^{-1}(v) = {t ∈ V_T | v ∈ B_t} is nonempty and connected in T;
– for every edge {u, v} ∈ E, there exists t ∈ V_T such that u, v ∈ B_t.
The width of the decomposition is defined as max_{t∈V_T}(|B_t| − 1), and the treewidth is the minimum of the widths among all tree decompositions of G. We present an algorithm for BNCMPL-1 on partial k-trees as the main part of the proof of the following theorem, where we assume that k is a constant.
Fig. 5. Example of a tree decomposition with treewidth 2. [Figure omitted; it shows a graph G(V, E) and a corresponding decomposition tree T(V_T, E_T) with bags A–E.]
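The two defining properties can be checked mechanically. The following Python sketch (ours, with illustrative names) verifies them for a candidate decomposition and computes its width.

```python
from collections import defaultdict, deque

def is_tree_decomposition(bags, tree_edges, graph_edges):
    """Check the two defining properties stated above. `bags` maps each
    tree node t to the set B_t, `tree_edges` lists the edges of T, and
    `graph_edges` lists the edges of G."""
    nbr = defaultdict(set)
    for s, t in tree_edges:
        nbr[s].add(t)
        nbr[t].add(s)
    # second property: every edge of G lies inside some bag
    if not all(any({u, v} <= b for b in bags.values())
               for u, v in graph_edges):
        return False
    # first property: for every vertex v, B^{-1}(v) is nonempty and
    # connected in T (checked by BFS restricted to bags containing v)
    for v in {x for e in graph_edges for x in e}:
        holding = {t for t, b in bags.items() if v in b}
        if not holding:
            return False
        start = next(iter(holding))
        seen, queue = {start}, deque([start])
        while queue:
            t = queue.popleft()
            for s in nbr[t] & holding:
                if s not in seen:
                    seen.add(s)
                    queue.append(s)
        if seen != holding:
            return False
    return True

def width(bags):
    """Width of the decomposition: max_t (|B_t| - 1)."""
    return max(len(b) for b in bags.values()) - 1
```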
Theorem 6. BNCMPL-1 is solved in polynomial time if the network structure is a partial k-tree of bounded indegree and the number of examples is O(log n).

Proof. For simplicity, we assume that all non-external nodes are of indegree 2. For each non-external node v_i, a tuple ⟨f, a, a^L, a^R⟩ is called a secondary assignment if f(a^L_j, a^R_j) = a_j holds for all j = 1, ..., m, where f = f_i if v_i is a complete node (otherwise f is arbitrary)⁶. For each external node v_i, a is called a secondary assignment if a_j = e^j_i holds for all j = 1, ..., m. Let A_v = ⟨f, a, a^L, a^R⟩ and A_u = ⟨g, b, b^L, b^R⟩ (or A_u = b if u is an external node) be secondary assignments for v and u, respectively. We say that A_v is consistent with A_u if u is the first input node of v and a^L = b, u is the second input node of v and a^R = b, or u is not an input node of v (see Fig. 6).

Once the concept of secondary assignment is defined, we can apply the standard dynamic programming technique for partial k-trees [8]. Suppose that G is a partial k-tree when regarded as an undirected graph. We can compute a tree decomposition T(V_T, E_T) of width k using Bodlaender's algorithm [7, 8]. For t ∈ V_T, B(t) denotes the set of nodes of G defined by B(t) = ∪_{t'∈des(t)} B_{t'}, where des(t) is the set consisting of t and its descendants in T. For a set of nodes U = {v_{i_1}, ..., v_{i_{|U|}}} in G, let A(U) = ⟨A_{i_1}, ..., A_{i_{|U|}}⟩ be a tuple of secondary assignments, where A_{i_j} is a secondary assignment for v_{i_j}. We define the set of consistent tuples Ass(U) by

Ass(U) = {A(U) | A_{i_j} is consistent with A_{i_{j'}} for all j ≠ j'}.

We also define Ass(t) for t ∈ V_T by Ass(t) = Ass(B_t). Furthermore, we define the set of extensible and consistent tuples ExtAss(t) by

ExtAss(t) = {A(B_t) | A(B_t) is a sub-tuple of some A(B(t)) ∈ Ass(B(t))}.

That is, ExtAss(t) is the set of consistent tuples for B_t each of which can be extended to a consistent tuple for B(t).

⁶ We use the term secondary assignment in order to distinguish it from the assignments defined in Section 2.
For t ∈ VT , B(t) denotes the set of nodes in G defined by B(t) = t ∈des(t) Bt , where des(t) is the set of t and its descendants in T . For a set of nodes U = {vi1 , . . . , vi|U | } in G, let A(U ) = Ai1 , . . . , Ai|U | be a tuple of secondary assignments, where Aij is a secondary assignment for vij . We define a set of consistent tuples Ass(U ) by Ass(U ) = {A(U ) | Aij is consistent with Aij for all j = j }. We also define Ass(t) for t ∈ VT by Ass(t) = Ass(Bt ). Furthermore, we define a set of extensible and consistent tuples ExtAss(t) by ExtAss(t) = {A(Bt ) | A(Bt ) is a sub-tuple of A(B(t)) ∈ Ass(B(t))}. That is, ExtAss(t) is the set of consistent tuples for Bt each of which can be extensible to a consistent tuple for B(t). v A1 =
vL
A2 =
,
1 0 1 1
,
0 0 1 1
1 0 0 1
,
1 0 1 1
,
1 1 0 1
,
vR
,
1 0 1 0
A3 =
,
0 1 1 0
,
1 0 1 0
,
1 1 0 0
Fig. 6. All of A1 , A2 , and A3 are consistent secondary assignments. A1 and A2 are consistent, whereas A1 and A3 are not consistent. 6
For each node t ∈ V_T, Ass(t) can be computed in O(km · 2^{3(k+1)m}) time because the number of possible secondary assignments for each node of G is O(2^{3m}), thus the number of possible tuples is O(2^{3(k+1)m}), and O(km) time is enough to check the consistency of a tuple, where we assume that the maximum indegree is bounded by 2. If t is a leaf in T, we let ExtAss(t) := Ass(t). Otherwise, let t_1, ..., t_{g_t} be the children of t in T and assume that the ExtAss(t_i)'s have already been computed. For two tuples A(B_t) for B_t and A(B_{t_i}) for B_{t_i}, A(B_t) and A(B_{t_i}) are said to be compatible if the same secondary assignments are assigned to v for each v ∈ B_t ∩ B_{t_i}. Then, we can compute ExtAss(t) by

ExtAss(t) := {A(B_t) | A(B_t) ∈ Ass(t) is compatible with some A(B_{t_i}) ∈ ExtAss(t_i) for all i = 1, ..., g_t}.

Then, it is straightforward to see that BNCMPL-1 has a required completion iff ExtAss(r) ≠ ∅, where r is the root of T. For the example of Fig. 5, we compute ExtAss(C) from Ass(D) and Ass(E), ExtAss(B) from ExtAss(C), and ExtAss(A) from ExtAss(B). Note that the output node can be located outside B_r (e.g., it is located not in A but in D in Fig. 5).

Clearly, ExtAss(t) can be computed in O(2^{6(k+1)m} · km g_t) time per t. Since Σ_t g_t = O(n), the total computation time is O((2^{6(k+1)m} · km + q(k)) · n), where O(q(k) · n) is the time complexity of Bodlaender's algorithm, which works in linear time for a constant k. If m = O(log n), this time complexity is polynomial. Furthermore, we can extend the algorithm and the analysis to the case of maximum indegree bounded by a constant D. We can also extend this result to BNCMPL-2, where details are omitted.

Corollary 1. BNCMPL-2 is solved in polynomial time if the network structure is a partial k-tree of bounded indegree and the number of examples is O(log n).
5 Concluding Remarks
In this paper, we have studied problems of completing networks from example data. We have shown that the problems are NP-complete in general but can be solved in polynomial time for partial k-trees of bounded indegree if a logarithmic number of examples are given. Extending the model and algorithms to networks with cycles is important future work because real biological networks contain cycles. For that purpose, it might be helpful to use feedback vertex sets, because a network becomes acyclic if the vertices in a feedback vertex set are removed. Other future work includes extending BNCMPL-2 to handle insertions of edges, analysis of PAC-type learning models [13] as well as probabilistic extensions, and the development of practical algorithms.
Acknowledgment. We would like to thank Atsushi Mochizuki, Ryoko Morioka and Shigeru Saito for helpful discussions.
References

1. Akutsu, T., Miyano, S., Kuhara, S.: Identification of genetic networks from a small number of gene expression patterns under the Boolean network model. In: Proc. Pacific Symposium on Biocomputing 1999, pp. 17–28 (1999)
2. Akutsu, T., Kuhara, S., Maruyama, O., Miyano, S.: Identification of genetic networks by strategic gene disruptions and gene overexpressions under a Boolean model. Theoretical Computer Science 298, 235–251 (2003)
3. Akutsu, T., Hayashida, M., Ching, W.-K., Ng, M.K.: Control of Boolean networks: Hardness results and algorithms for tree structured networks. Journal of Theoretical Biology 244, 670–679 (2007)
4. Angluin, D., Aspnes, J., Chen, J., Wu, Y.: Learning a circuit by injecting values. In: Proc. 38th Annual ACM Symposium on Theory of Computing, pp. 584–593 (2006)
5. Angluin, D., Aspnes, J., Chen, J., Reyzin, L.: Learning large-alphabet and analog circuits with value injection queries. Machine Learning 72, 113–138 (2008)
6. Angluin, D., Aspnes, J., Chen, J., Eisenstat, D., Reyzin, L.: Learning acyclic probabilistic circuits using test paths. In: Proc. 21st Annual Conference on Learning Theory, pp. 169–180 (2008)
7. Bodlaender, H.L.: A linear-time algorithm for finding tree-decompositions of small treewidth. SIAM Journal on Computing 25, 1305–1317 (1996)
8. Flum, J., Grohe, M.: Parameterized Complexity Theory. Springer, Berlin (2006)
9. Garey, M.R., Johnson, D.S.: Computers and Intractability: A Guide to the Theory of NP-Completeness. W.H. Freeman and Co., New York (1979)
10. Gupta, S., Bisht, S.S., Kukreti, R., Jain, S., Brahmachari, S.K.: Boolean network analysis of a neurotransmitter signaling pathway. Journal of Theoretical Biology 244, 463–469 (2007)
11. Ideker, T.E., Thorsson, V., Karp, R.M.: Discovery of regulatory interactions through perturbation: inference and experimental design. In: Proc. Pacific Symposium on Biocomputing 2000, pp. 302–313 (2000)
12. Kauffman, S.A.: The Origins of Order: Self-organization and Selection in Evolution. Oxford Univ. Press, NY (1993)
13. Kearns, M.J., Vazirani, U.V.: An Introduction to Computational Learning Theory. MIT Press, Cambridge (1994)
14. Mochizuki, A.: Structure of regulatory networks and diversity of gene expression patterns. Journal of Theoretical Biology 250, 307–321 (2008)
15. Pitt, L., Valiant, L.G.: Computational limitations on learning from examples. Journal of the ACM 35, 965–984 (1988)
16. Tokumoto, Y., Horimoto, K., Miyake, J.: TRAIL inhibited the cyclic AMP responsible element mediated gene expression. Biochemical and Biophysical Research Communications 381, 533–536 (2009)
Average-Case Active Learning with Costs

Andrew Guillory¹,⋆ and Jeff Bilmes²

¹ Computer Science and Engineering, University of Washington
[email protected]
² Electrical Engineering, University of Washington
[email protected]
Abstract. We analyze the expected cost of a greedy active learning algorithm. Our analysis extends previous work to a more general setting in which different queries have different costs. Moreover, queries may have more than two possible responses and the distribution over hypotheses may be non uniform. Specific applications include active learning with label costs, active learning for multiclass and partial label queries, and batch mode active learning. We also discuss an approximate version of interest when there are very many queries.
1 Motivation
We first motivate the problem by describing it informally. Imagine two people are playing a variation of twenty questions. Player 1 selects an object from a finite set, and it is up to player 2 to identify the selected object by asking questions chosen from a finite set. We assume for every object and every question the answer is unambiguous: each question maps each object to a single answer. Furthermore, each question has associated with it a cost, and the goal of player 2 is to identify the selected object using a sequence of questions with minimal cost. There is no restriction that the questions are yes or no questions. Presumably, complicated, more specific questions have greater costs. It doesn’t violate the rules to include a single question enumerating all the objects (Is the object a dog or a cat or an apple or...), but for the game to be interesting it should be possible to identify the object using a sequence of less costly questions. With player 1 the human expert and player 2 the learning algorithm, we can think of active learning as a game of twenty questions. The set of objects is the hypothesis class, the selected object is the optimal hypothesis with respect to a training set, and the questions available to player 2 are label queries for data points in the finite sized training set. Assuming the data set is separable, label queries are unambiguous questions (i.e. each question has an unambiguous answer). By restricting the hypothesis class to be a set of possible labelings of
⋆ This material is based upon work supported by the National Science Foundation under grant IIS-0535100 and by an ONR MURI grant N000140510388.
the training set (i.e., the effective hypothesis class for some other, possibly infinite, hypothesis class), we can also ensure there is a unique zero-error hypothesis. If we set all question costs to 1, we recover the traditional active learning problem of identifying the target hypothesis using a minimal number of labels. However, this framework is also general enough to cover a variety of active learning scenarios outside of traditional binary classification.
– Active learning with label costs. If different data points are more or less costly to label, we can model these differences using non-uniform label costs. For example, if a longer document takes longer to label than a shorter document, we can make costs proportional to document length. The goal is then to identify the optimal hypothesis as quickly as possible, as opposed to using as few labels as possible. This notion of label cost is different from the often studied notion of misclassification cost. Label cost refers to the cost of acquiring a label at training time, whereas misclassification cost refers to the cost of incorrectly predicting a label at test time.
– Active learning for multiclass and partial label queries. We can directly ask for the label of a point (Is the label of this point "a", "b", or "c"?), or we can ask less specific questions about the label (Is the label of this point "a" or some other label?). We can also mix these question types, presumably making less specific questions less costly. These kinds of partial label queries are particularly important when examples have structured labels. In a parsing problem, a partial label query could ask for the portion of a parse tree corresponding to a small phrase in a long sentence.
– Batch mode active learning. Questions can also be queries for multiple labels. In the extreme case, there can be a question corresponding to every subset of possible single data point questions. Batch label queries only help the algorithm reduce total label cost if the cost of querying for a batch of labels is in some cases less than the sum of the corresponding individual label costs. This is the case if there is a constant additive cost overhead associated with asking a question, or if we want to minimize time spent labeling and there are multiple labelers who can label examples in parallel.
Beyond these specific examples, this setting applies to any active learning problem for which different user interactions have different costs and are unambiguous as we have defined. For example, we can ask questions concerning the percentage of positive and negative examples according to the optimal classifier (Does the optimal classifier label more than half of the data set positive?). This abstract setting also has applications outside of machine learning.
– Information Retrieval. We can think of a question asking strategy as an index into the set of objects which can then be used for search. If we make the cost of a question the expected computational cost of computing the answer for a given object, then a question asking strategy with low cost corresponds to an index with fast search time. For example, if objects correspond to points in ℝ^n and questions correspond to axis-aligned hyperplanes, a question asking strategy is a kd-tree.
– Compression. A question asking strategy produces a unique sequence of responses for each object. If we make the cost of a question the log of the number of possible responses to that question, then a question asking strategy with low cost corresponds to a code book for the set of objects with small code length [5].
Interpreted in this way, active learning, information retrieval, and compression can be thought of as variations of the same problem in which we minimize interaction cost, computation cost, and code length, respectively. In this work we consider this general problem for average-case cost. The object is selected at random and the goal is to minimize the expected cost of identifying the selected object. The distribution from which the object is drawn is known but may not be uniform. Previous work [11, 6, 1, 3, 4] has shown that simple greedy algorithms are approximately optimal in certain more restrictive settings. We extend these results to our more general setting.
2 Preliminaries
We first review the main result of Dasgupta [6], which our first bound extends. We assume we have a finite set of objects (for example, hypotheses) H with |H| = n. A randomly chosen h* ∈ H is our target object, with a known positive π(h) defining the distribution over H from which h* is drawn. We assume min_h π(h) > 0 and |H| > 1. We also assume there is a finite set of questions q_1, q_2, ..., q_m, each of which has a positive cost c_1, c_2, ..., c_m. Each question q_i maps each object to a response from a finite set of answers A ≜ ∪_{h,i} {q_i(h)}, and asking q_i reveals q_i(h*), eliminating from consideration all objects h for which q_i(h) ≠ q_i(h*). An active learning algorithm continues asking questions until h* has been identified (i.e., we have eliminated all but one of the elements from H). We assume this is possible for any element of H. The goal of the learning algorithm is to identify h* with questions incurring as little cost as possible. Our result bounds the expected cost of identifying h*. We assume that the distribution π, the hypothesis class H, the questions q_i, and the costs c_i are known.

Any deterministic question asking strategy (e.g., a deterministic active learning algorithm taking in this known information) produces a decision tree in which internal nodes are questions and the leaves are elements of H. The cost of a query tree T with respect to a distribution π, C(T, π), is defined to be the expected cost of identifying h* when h* is chosen according to π. We can write C(T, π) as C(T, π) = Σ_{h∈H} π(h) c_T(h), where c_T(h) is the cost of identifying h as the target object: c_T(h) is simply the sum of the costs of the questions along the path from the root of T to h. We define π_S to be π restricted and normalized w.r.t. S: for s ∈ S, π_S(s) = π(s)/π(S), and for s ∉ S, π_S(s) = 0. Tree cost decomposes nicely.

Lemma 1. For any tree T and any S = ∪_i S^i with S^i ∩ S^j = ∅ for all i ≠ j and S ≠ ∅,

C(T, π_S) = Σ_i π_S(S^i) C(T, π_{S^i}).
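As a concrete reading of this definition, the following Python sketch (ours; the nested-tuple representation of a query tree is an assumption) computes C(T, π) by accumulating question costs along each root-to-leaf path.

```python
def tree_cost(tree, pi):
    """Expected cost C(T, pi) of a query tree. A leaf is a hypothesis h
    (any non-tuple value); an internal node is a pair (cost, children),
    where `children` maps each possible answer to a subtree. Returns
    sum_h pi(h) * c_T(h)."""
    def walk(node, path_cost):
        if not isinstance(node, tuple):            # leaf: hypothesis h
            return pi[node] * path_cost
        cost, children = node
        return sum(walk(child, path_cost + cost)
                   for child in children.values())
    return walk(tree, 0.0)
```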
Algorithm 1. Cost Sensitive Greedy Algorithm
1: S ⇐ H
2: repeat
3:   i = argmax_i Δ_i(S, π_S)/c_i
4:   S ⇐ {s ∈ S : q_i(s) = q_i(h*)}
5: until |S| = 1
We define the version space to be the subset of H consistent with the answers we have received so far. Questions eliminate elements from the version space. For a question q_i and a particular version space S ⊆ H, we define S^j ≜ {s ∈ S : q_i(s) = j}. With this notation the dependence on q_i is suppressed but understood from context. As shorthand, for a distribution π we define π(S) = Σ_{s∈S} π(s). On average, asking question q_i shrinks the absolute mass of S with respect to a distribution π by

Δ_i(S, π) ≜ Σ_{j∈A} (π(S^j)/π(S)) (Σ_{k≠j} π(S^k)) = π(S) − Σ_{j∈A} π(S^j)²/π(S).

We call this quantity the shrinkage of q_i with respect to (S, π). We note Δ_i(S, π) is only defined if π(S) > 0. If q_i has cost c_i, we call Δ_i(S, π)/c_i the shrinkage-cost ratio of q_i with respect to (S, π). In previous work [6, 1, 3], the greedy algorithm analyzed is the algorithm that at each step chooses the question q_i maximizing the shrinkage with respect to the current version space, Δ_i(S, π_S). In our generalized setting, we define the cost sensitive greedy algorithm to be the active learning algorithm which at each step asks the question with the largest shrinkage-cost ratio Δ_i(S, π_S)/c_i, where S is the current version space. We call the tree generated by this method the greedy query tree. See Algorithm 1. Adler and Heeringa [1] also analyzed a cost-sensitive method for the restricted case of questions with two responses and uniform π, and our method is equivalent to theirs in this case. The main result of Dasgupta [6] is that, on average, with unit costs and yes/no questions, the greedy strategy is not much worse than any other strategy. We repeat this result here.
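Algorithm 1 together with the definition of shrinkage translates directly into code. The following Python sketch is our own illustration: questions are modelled as functions mapping an object to its answer, and the oracle stands in for the responses q_i(h*). Computing shrinkage with unnormalized masses is harmless, since Δ_i(S, π_S) = Δ_i(S, π)/π(S) and π(S) is the same for every question.

```python
from collections import defaultdict

def shrinkage(q, S, pi):
    """Delta_q(S, pi) = pi(S) - sum_j pi(S^j)^2 / pi(S), where S^j is
    the subset of S answering j to question q."""
    mass = defaultdict(float)
    for h in S:
        mass[q(h)] += pi[h]
    total = sum(mass.values())                 # pi(S)
    return total - sum(m * m for m in mass.values()) / total

def greedy_identify(H, pi, questions, costs, oracle):
    """Algorithm 1: repeatedly ask the question with the largest
    shrinkage-cost ratio w.r.t. the current version space S, where
    oracle(q) returns q(h*). Returns the identified object and the
    total cost paid."""
    S = set(H)
    paid = 0.0
    while len(S) > 1:
        best = max(range(len(questions)),
                   key=lambda i: shrinkage(questions[i], S, pi) / costs[i])
        answer = oracle(questions[best])
        paid += costs[best]
        S = {h for h in S if questions[best](h) == answer}
    return S.pop(), paid
```

The loop terminates under the paper's assumption that h* can always be identified, i.e., some question always splits a non-singleton version space.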
where C ∗ = minT C(T, π). For a uniform, π, the log term becomes ln |H|, so the approximation factor grows with the log of the number of objects. In the non uniform case, the greedy algorithm can do significantly worse. However, Kosaraju et al. [11] and Chakaravarthy et al. [3] show a simple rounding method can be used to remove dependence on π . We first give an extension to Theorem 1 to our more general setting. We then
show how to remove the dependence on π using a similar rounding method. Interestingly, in our setting this rounding method introduces a dependence on the costs, so neither bound is strictly better, although together they generalize all previous results.
3 Cost Independent Bound
Theorem 2. For any π the greedy query tree T^g has cost at most

C(T^g, π) ≤ 12 C* ln(1/min_{h∈H} π(h)),

where C* ≜ min_T C(T, π).

What is perhaps surprising about this bound is that the quality of approximation does not depend on the costs themselves. The proof follows part of the strategy used by Dasgupta [6]. The general approach is to show that if the average cost of some question tree is low, then there must be at least one question with high shrinkage-cost ratio. We then use this to form the basis of an inductive argument. However, this simple argument fails when only a few objects have high probability mass. We start by showing the shrinkage of q_i monotonically decreases as we eliminate elements from S.
π(S j )(π(S) − π(S j ) − π(a)) (π(S k ) − π(a))(π(S) − π(S k )) + π(S) − π(a) π(S) − π(a) j∈A
We show that this is term by term less than or equal to π(S j )(π(S) − π(S j )) π(S k )(π(S) − π(S k )) + Δi (S, π) = π(S) π(S) j∈A
For the first term π(S k )(π(S) − π(S k )) (π(S k ) − π(a))(π(S) − π(S k )) ≤ π(S) − π(a) π(S) because π(S) ≥ π(S k ) and π(a) ≥ 0. For any other term in the summation, π(S j )(π(S) − π(S j )) π(S j )(π(S) − π(S j ) − π(a))) ≤ π(S) − π(a) π(S) because π(S) − π(S j ) ≥ π(a) ≥ 0 and π(S) > π(a). Obviously, the same result holds when we consider shrinkage-cost ratios.
Corollary 1. If T ⊆ S ⊆ H and T ≠ ∅, then for any i and π, Δ_i(T, π)/c_i ≤ Δ_i(S, π)/c_i.

We define the collision probability of a distribution v over Z to be

CP(v) ≜ Σ_{z∈Z} v(z)².

This is exactly the probability that two samples from v will be the same, and quantifies the extent to which mass is concentrated on only a few points (similar to inverse entropy). If no question has a large shrinkage-cost ratio and the collision probability is low, then the expected cost of any query tree must be high.

Lemma 3 (extension of Lemma 7 of [6] to non-binary queries and non-uniform costs). For any set S and distribution v over S, if Δ_i(S, v)/c_i < Δ/c for all i, then for any R ⊆ S with R ≠ ∅ and any query tree T whose leaves include R,

C(T, v_R) ≥ (c/Δ) v(R)(1 − CP(v_R)).

Proof. We prove the lemma by induction on |R|. For |R| = 1, CP(v_R) = 1 and the right hand side of the inequality is zero. For |R| > 1, we lower bound the cost of any query tree on R. At its root, any query tree chooses some q_i with cost c_i that divides the version space into R^j for j ∈ A. Using the inductive hypothesis we can then write the cost of a tree as

C(T, v_R) ≥ c_i + Σ_{j∈A} v_R(R^j) (c/Δ)(v(R^j)(1 − CP(v_{R^j})))
        = c_i + (c/Δ) v(R) Σ_{j∈A} (v_R(R^j)² − v_R(R^j)² CP(v_{R^j}))
        = c_i + (c/Δ) v(R)(1 − 1 + Σ_{j∈A} v_R(R^j)² − CP(v_R)).

Here we used

Σ_{j∈A} v_R(R^j)² CP(v_{R^j}) = Σ_{j∈A} v_R(R^j)² Σ_{r∈R^j} v_{R^j}(r)² = Σ_{r∈R} v_R(r)² = CP(v_R).

We now note v(R)(1 − Σ_{j∈A} v_R(R^j)²) = v(R) − Σ_{j∈A} v(R^j)²/v(R) = Δ_i(R, v), so

C(T, v_R) ≥ c_i + (c/Δ) v(R)(1 − CP(v_R)) − (c/Δ) Δ_i(R, v)
        = (c/Δ) v(R)(1 − CP(v_R)) + (Δ c_i − Δ_i(R, v) c)/Δ.

Using Corollary 1, Δ_i(R, v)/c_i ≤ Δ_i(S, v)/c_i ≤ Δ/c, so Δ c_i − Δ_i(R, v) c ≥ 0 and therefore

C(T, v_R) ≥ (c/Δ) v(R)(1 − CP(v_R)),

which completes the induction.
This lower bound on the cost of a tree translates into a lower bound on the shrinkage-cost ratio of the question chosen by the greedy tree.

Corollary 2. Extension of Corollary 8 of [6] to non-binary queries and non-uniform costs. For any S ⊆ H with S ≠ ∅ and query tree T whose leaves contain S, there must be a question q_i with

Δ_i(S, π_S)/c_i ≥ (1 − CP(π_S))/C(T, π_S).

Proof. Suppose this is not the case. Then there is some Δ/c < (1 − CP(π_S))/C(T, π_S) such that ∀i Δ_i(S, π_S)/c_i ≤ Δ/c. By Lemma 3 (with v = π_S, R = S),

C(T, π_S) ≥ π_S(S) (c/Δ)(1 − CP(π_S)) > π_S(S) C(T, π_S) = C(T, π_S),

which is a contradiction.
A special case which poses some difficulty for the main proof is when for some S ⊆ H we have CP(π_S) > 1/2. First note that if CP(π_S) > 1/2, one object h_0 has more than half the mass of S. In the lemma below, we use R = S \ {h_0}. Also let δ_i be the relative mass of the hypotheses in R that are distinct from h_0 w.r.t. question q_i:

δ_i = π_R({r ∈ R : q_i(h_0) ≠ q_i(r)}).

In other words, when question q_i is asked, R is divided into a set of hypotheses that agree with h_0 (these have relative mass 1 − δ_i) and a set of hypotheses that disagree with h_0 (these have relative mass δ_i). Dasgupta [6] also treats this as a special case. However, in the more general setting treated here the situation is more subtle. For yes or no questions, the question chosen by the greedy query tree is also the question that removes the most mass from R. In our setting this is not necessarily the case; the left of Figure 1 shows a counterexample. However, we can show the fraction of mass removed from R by the greedy query tree is at least half the fraction removed by any other question. Furthermore, to handle costs, we must instead consider the fraction of mass removed from R per unit cost.
Fig. 1. Left: Counterexample showing that when a single hypothesis h_0 contains more than half the mass, the query with maximum shrinkage is not necessarily the query that separates the most mass from h_0. Right: Notation for this case.
In this lemma we use π_{h_0} to denote the distribution which puts all mass on h_0. The cost of identifying h_0 in a tree T* is then C*(h_0) = C(T*, π_{h_0}).

Lemma 4. Consider any S ⊆ H and π with CP(π_S) > 1/2 and π(h_0) > 1/2. Let C*(h_0) = C(T*, π_{h_0}) for any T* whose leaves contain S. Some question q_i has δ_i/c_i ≥ 1/C*(h_0).

Proof. There is always a set of questions indexed by the set I with total cost Σ_{i∈I} c_i ≤ C*(h_0) that distinguish h_0 from R within S. In particular, the set of questions used to identify h_0 in T* satisfies this. Since the set identifies h_0, Σ_{i∈I} δ_i ≥ 1, which implies

Σ_{i∈I} (c_i/C*(h_0)) (δ_i/c_i) ≥ 1/C*(h_0).

Because c_i/C*(h_0) ∈ (0, 1] and Σ_{i∈I} c_i/C*(h_0) ≤ 1, there must be a q_i such that δ_i/c_i ≥ 1/C*(h_0).
Having shown that some query always reduces the relative mass of R by 1/C*(h_0) per unit cost, we now show that the greedy query tree reduces the mass of R by at least half as much per unit cost.

Lemma 5. Consider any π and S ⊆ H with CP(π_S) > 1/2, π(h_0) > 1/2, and a corresponding subtree T^g_S in the greedy tree. Let C*(h_0) = C(T*, π_{h_0}) for any T* whose leaves contain S. The question q_i chosen by T^g_S has δ_i/c_i ≥ 1/(2C*(h_0)).

Proof. We prove this by showing that the fraction removed from R per unit cost by the greedy query tree's question is at least half that of any other question. Combining this with Lemma 4, we get the desired result. We can write the shrinkage of q_i in terms of δ_i. Here let A′ = A \ {q_i(h_0)}. Since π(S^{q_i(h_0)}) = π(h_0) + (1 − δ_i)π(R) and π(S) − π(S^{q_i(h_0)}) = δ_i π(R), we have that

Δ_i(S, π_S) = (π_S(h_0) + (1 − δ_i)π_S(R)) δ_i π_S(R) + Σ_{j∈A′} π_S(S^j)(π_S(S) − π_S(S^j)).

We use Σ_{j∈A′} π_S(S^j) = δ_i π_S(R). We can then upper bound the shrinkage using π_S(S) − π_S(S^j) ≤ 1:

Δ_i(S, π_S) ≤ (π_S(h_0) + (1 − δ_i)π_S(R)) δ_i π_S(R) + δ_i π_S(R) ≤ 2 δ_i π_S(R),

and lower bound the shrinkage using π_S(h_0) > 1/2 and π_S(S) − π_S(S^j) ≥ π_S(h_0) + (1 − δ_i)π_S(R) for any j ∈ A′:

Δ_i(S, π_S) ≥ 2(π_S(h_0) + (1 − δ_i)π_S(R)) δ_i π_S(R) ≥ δ_i π_S(R).
Let q_i be any question and q_j be the question chosen by the greedy tree, giving Δ_j(S, π_S)/c_j ≥ Δ_i(S, π_S)/c_i. Using the upper and lower bounds we derived, we then know 2δ_j π_S(R)/c_j ≥ δ_i π_S(R)/c_i and can conclude 2δ_j/c_j ≥ δ_i/c_i. Combining this with Lemma 4, δ_j/c_j ≥ 1/(2C*(h_0)).

The main theorem immediately follows from the next theorem.

Theorem 3. If T* is any query tree for π and T^g is the greedy query tree for π, then for any S ⊆ H corresponding to the subtree T^g_S of T^g,

C(T^g_S, π_S) ≤ 12 C(T*, π_S) ln (π(S) / min_{h∈S} π(h)).
Proof. In this proof we use C*(S) as a shorthand for C(T*, π_S). Also, we use min(S) for min_{s∈S} π(s). We proceed with induction on |S|. For |S| = 1, C(T^g_S, π_S) is zero and the claim holds. For |S| > 1, we consider two cases.

Case one: CP(π_S) ≤ 1/2. At the root of T^g_S, the greedy query tree chooses some q_i with cost c_i that reduces the version space to S^j when q_i(h*) = j. Let π(S^+) = max{π(S^j) : j ∈ A}. Using the inductive hypothesis,

C(T^g_S, π_S) = c_i + Σ_{j∈A} π_S(S^j) C(T_{S^j}, π_{S^j})
≤ c_i + Σ_{j∈A} 12 π_S(S^j) C*(S^j) ln (π(S^j)/min(S^j))
≤ c_i + 12 (Σ_{j∈A} π_S(S^j) C*(S^j)) ln (π(S^+)/min(S)).

Now using Lemma 1, π(S^+) = π(S) π_S(S^+), and then ln(1 − x) ≤ −x,

C(T^g_S, π_S) ≤ c_i + 12 C*(S) ln (π(S) π_S(S^+)/min(S))
≤ c_i + 12 C*(S) ln (π(S)/min(S)) − 12 C*(S)(1 − π_S(S^+)).

π_S(S^+) ≥ Σ_{j∈A} π_S(S^j)² because this sum is an expectation and ∀j π_S(S^+) ≥ π_S(S^j). From this follows

C(T^g_S, π_S) ≤ c_i + 12 C*(S) ln (π(S)/min(S)) − 12 C*(S)(1 − Σ_{j∈A} π_S(S^j)²)
= c_i + 12 C*(S) ln (π(S)/min(S)) − 12 C*(S) c_i (1 − Σ_{j∈A} π_S(S^j)²)/c_i.

(1 − Σ_{j∈A} π_S(S^j)²) is Δ_i(S, π_S), so by Corollary 2 and using CP(π_S) ≤ 1/2,

C(T^g_S, π_S) ≤ c_i + 12 C*(S) ln (π(S)/min(S)) − 12 C*(S) c_i (1 − CP(π_S))/C*(S)
= c_i + 12 C*(S) ln (π(S)/min(S)) − 12(1 − CP(π_S)) c_i
≤ 12 C*(S) ln (π(S)/min(S)),
which completes this case.

Case two: CP(π_S) > 1/2. The hypothesis with more than half the mass, h_0, lies at some depth D in the greedy tree T^g_S. Counting the root of T^g_S as depth 0, D ≥ 1. At depth d > 0, let q_0, q_1, ..., q_{d−1} be the questions asked so far, c_0, c_1, ..., c_{d−1} be the costs of these questions, and C_d = Σ_{i=0}^{d−1} c_i be the total cost incurred. At the root, C_0 = 0. At depth d < D, we define R_d to be the set of objects other than h_0 that are still in the version space along the path to h_0: R_0 = S \ {h_0} and, for d > 0, R_d = R_{d−1} \ {h : q_{d−1}(h) ≠ q_{d−1}(h_0)}. In other words, R_d is R_{d−1} with the objects that disagree with h_0 on q_{d−1} removed. All of the objects in R_d have the same response as h_0 for q_0, q_1, ..., q_{d−1}. The right of Figure 1 shows this case. We first bound the mass remaining in R_d as a function of the label cost incurred so far. For d > 0, using Lemma 5,

π(R_d) ≤ π(R_0) Π_{i=0}^{d−1} (1 − c_i/(2C*(h_0))) ≤ π(R_0) e^{−C_d/(2C*(h_0))}.
Using this bound, we can bound C_D, the cost of identifying h_0 (i.e., C(T^g_S, π_{h_0})). First note that π(R_{D−1}) ≥ min(R_0), since at least one object is left in R_{D−1}. Combining this with the upper bound on the mass of R_d, we have, if D − 1 > 0,

C_{D−1} ≤ 2C*(h_0) ln (π(R_0)/min(R_0)).

This clearly also holds if D − 1 = 0, since C_0 = 0. We now only need to bound the cost of the final question (the question asked at level D − 1). If the final question had cost greater than 2C*(h_0), then by Lemma 5, this question would reduce the mass of the set containing h_0 to less than π(h_0). This is a contradiction, so the final question must have cost no greater than 2C*(h_0). Therefore

C_D ≤ 2C*(h_0) ln (π(R_0)/min(R_0)) + 2C*(h_0).
We use A_{d−1} = A \ {q_{d−1}(h_0)}. Let S_d^j be the set of objects s removed from R_{d−1} by the question at depth d − 1 with q_{d−1}(s) = j, that is, R_{d−1} = R_d ∪ ⋃_{j∈A_{d−1}} S_d^j. Let S_d = ⋃_{j∈A_{d−1}} S_d^j. The right of Figure 1 illustrates this notation. A useful variation of Lemma 1 we use in the following is that for S = S¹ ∪ S² and S¹ ∩ S² = ∅, π(S)C*(S) = π(S¹)C*(S¹) + π(S²)C*(S²). We can write

π(S)C(T^g_S, π_S) = π(h_0)C_D + Σ_{d=1}^{D} Σ_{j∈A_{d−1}} π(S_d^j)(C_d + C(T_{S_d^j}, π_{S_d^j}))    (a)
≤ π(h_0)C_D + Σ_{d=1}^{D} π(S_d)C_d + Σ_{d=1}^{D} Σ_{j∈A_{d−1}} 12 π(S_d^j) C*(S_d^j) ln (π(S_d^j)/min(S_d^j))    (b)
≤ π(h_0)C_D + π(R_0)C_D + 12 π(R_0) C*(R_0) ln (π(R_0)/min(R_0))    (c)
≤ 2π(h_0)C_D + 12 π(R_0) C*(R_0) ln (π(R_0)/min(R_0)).    (d)

Here (a) decomposes the total cost into the cost of identifying h_0 and the cost of each branch leaving the path to h_0; for each of these branches the total cost is the cost incurred so far plus the cost of the tree rooted at that branch. (b) uses the inductive hypothesis, (c) uses ∀i,j S_i ∩ S_j = ∅ and ⋃_d S_d = R_0, and (d) uses π(R_0) < π(h_0). Continuing,

π(S)C(T^g_S, π_S) ≤ 4π(h_0)C*(h_0)(ln (π(R_0)/min(R_0)) + 1) + 12 π(R_0) C*(R_0) ln (π(R_0)/min(R_0))    (a)
≤ 4π(h_0)C*(h_0)(ln (π(S)/min(S)) + 1) + 12 π(R_0) C*(R_0) ln (π(S)/min(S)),    (b)

where (a) uses our bound on C_D and (b) uses R_0 ⊂ S. Finally,

π(S)C(T^g_S, π_S) ≤ 12 π(h_0)C*(h_0) ln (π(S)/min(S)) + 12 π(R_0) C*(R_0) ln (π(S)/min(S))
= 12 π(S) C*(S) ln (π(S)/min(S)),

where we use π(S) > 2 min(S) and therefore ln (π(S)/min(S)) > ln 2 > 0.5. Dividing both sides by π(S) gives the desired result.
4 Distribution Independent Bound
We now show the dependence on π can be removed using a variation of the rounding trick used by Kosaraju et al. [11] and Chakaravarthy et al. [3]. The intuition behind this trick is that we can round up small values of π to obtain a distribution π′ in which ln(1/min_{h∈H} π′(h)) = O(ln n), while ensuring that for any tree T, C(T, π)/C(T, π′) is bounded above and below by a constant. Here n = |H|. When the greedy algorithm is applied to this rounded distribution, the resulting tree gives an O(log n) approximation to the optimal tree for the original
distribution. In our cost sensitive setting, the intuition remains the same, but the introduction of costs changes the result. Let c_max = max_i c_i and c_min = min_i c_i. In this discussion, we consider irreducible query trees, which we define to be query trees which contain only questions with non-zero shrinkage. Greedy query trees will always have this property, as will optimal query trees. This property lets us assume any path from the root to a leaf has at most n nodes, with cost at most c_max n, because at least one hypothesis is eliminated by each question. Define π′ to be the distribution obtained from π by adding c_min/(c_max n³) mass to any hypothesis h for which π(h) < c_min/(c_max n³), and subtracting the corresponding mass from a single hypothesis h′ for which π(h′) ≥ 1/n (there must be at least one such hypothesis). By construction, we have that min_i π′(h_i) ≥ c_min/(c_max n³). We can also bound the amount by which the cost of a tree changes as a result of rounding.

Lemma 6. For any irreducible query tree T and π,

(1/2) C(T, π) ≤ C(T, π′) ≤ (3/2) C(T, π).

Proof. For the first inequality, let h′ be the hypothesis we subtract mass from when rounding. The cost to identify h′, c_T(h′), is at most c_max n. Since we subtract at most c_min/(c_max n²) mass and c_T(h′) ≤ c_max n, we then have

C(T, π′) ≥ C(T, π) − (c_min/(c_max n²)) c_T(h′) ≥ C(T, π) − c_min/n ≥ (1/2) C(T, π).

The last step uses C(T, π) > c_min and n > 2. For the second inequality, we add at most c_min/(c_max n³) mass to each hypothesis, and Σ_h c_T(h) < c_max n², so

C(T, π′) ≤ C(T, π) + (c_min/(c_max n³)) Σ_{h∈H} c_T(h) ≤ C(T, π) + c_min/n ≤ (3/2) C(T, π).

The last step again uses C(T, π) > c_min and n > 2.
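The rounding just described is straightforward to implement; the following sketch (ours, for illustration only) mirrors the construction of π′ above, with pi a dictionary of hypothesis masses summing to one.

def round_distribution(pi, c_min, c_max):
    # construct pi' from pi as in the rounding argument above
    n = len(pi)
    eps = c_min / (c_max * n ** 3)
    rounded = dict(pi)
    added = 0.0
    for h, p in pi.items():
        if p < eps:
            rounded[h] = p + eps      # add c_min/(c_max n^3) mass
            added += eps
    # subtract the added mass from one hypothesis with pi(h) >= 1/n;
    # such a hypothesis must exist since the masses sum to one
    big = next(h for h, p in pi.items() if p >= 1.0 / n)
    rounded[big] -= added
    return rounded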
We can finally give a bound for the greedy algorithm applied to π′, in terms of n and c_max/c_min.

Theorem 4. For any π, the greedy query tree T^g for π′ has cost at most

C(T^g, π) ≤ O(C* ln(n c_max/c_min)),

where C* = min_T C(T, π).

Proof. Let T′ be an optimal tree for π′ and T* be an optimal tree for π. Using Theorem 2, min_i π′(h_i) ≥ c_min/(c_max n³), and Lemma 6,

C(T^g, π) ≤ 2 C(T^g, π′) ≤ 72 C(T′, π′) ln(n c_max/c_min) ≤ 72 C(T*, π′) ln(n c_max/c_min) ≤ 108 C(T*, π) ln(n c_max/c_min).
5 ε-Approximate Algorithm
Some of the non-traditional active learning scenarios involve a large number of possible questions. For example, in the batch active learning scenario we describe, there may be a question corresponding to every subset of single data point questions. In these scenarios, it may not be possible to exactly find the question with largest shrinkage-cost ratio. It is not hard to extend our analysis to a strategy that at each step finds a question q_i with

Δ_i(S, π_S)/c_i ≥ (1 − ε) max_j Δ_j(S, π_S)/c_j

for ε ∈ [0, 1). One can show ε > 0 only introduces a 1/(1 − ε) factor into the bound. Kosaraju et al. [11] report a similar extension to their result.
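Purely to fix notation (the interesting cases are precisely those where exhaustive enumeration of questions is infeasible and the ratios are approximated), the selection rule can be sketched as follows; ratios is a hypothetical mapping from question indices to their shrinkage-cost ratios.

def eps_approximate_choice(ratios, eps):
    # return an index i with ratios[i] >= (1 - eps) * max_j ratios[j]
    best = max(ratios.values())
    return next(i for i, r in ratios.items() if r >= (1 - eps) * best)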
6 Related Work
Table 1 summarizes previous results analyzing greedy approaches to this problem. A number of these results were derived independently in different contexts. Our work gives the first approximation result for the general setting in which there are more than two possible responses to questions, non-uniform question costs, and a non-uniform distribution over objects. We give bounds for two algorithms, one with performance independent of the query costs and one with performance independent of the distribution over objects. Together these two bounds match all previous bounds for less general settings. We also note that Kosaraju et al. [11] only mention an extension to non-binary queries (Remark 1), and our work is the first to give a full proof of an O(log n) bound for the case of non-binary queries and non-uniform distributions over objects. Our work and the work we extend are examples of exact active learning: we seek to exactly identify a target hypothesis from a finite set using a sequence of queries. Other work considers active learning where it suffices to identify, with high probability, a hypothesis close to the target hypothesis [7, 2]. The exact and approximate problems can sometimes be related [10].

Table 1. Summary of approximation ratios achieved by related work. Here n is the number of objects, k is the number of possible responses, c_i are the question costs, and π is the distribution over objects.
                            k > 2   Non-uniform c_i   Non-uniform π   Result
Kosaraju et al. [11]          Y           N                 Y         O(log n)
Dasgupta [6]                  N           N                 Y         O(log(1/min_h π(h)))
Adler and Heeringa [1]        N           Y                 N         O(log n)
Chakaravarthy et al. [3]      Y           N                 Y         O(log k log n)
Chakaravarthy et al. [4]      Y           N                 N         O(log n)
This paper                    Y           Y                 Y         O(log(1/min_h π(h)))
This paper                    Y           Y                 Y         O(log(n max_i c_i / min_i c_i))
Most theoretical work in active learning assumes unit costs and simple label queries. One exception is Hanneke [9], who also considers a general learning framework in which queries are arbitrary and have known costs associated with them. In fact, the setting used by Hanneke [9] is more general in that questions are allowed to have more than one valid answer for each hypothesis. Hanneke [9] gives worst-case upper and lower bounds in terms of a quantity called the General Identification Cost and related quantities. There are interesting parallels between our average-case analysis and this worst-case result. Practical work incorporating costs in active learning [12, 8] has also considered methods that maximize a benefit-cost ratio similar in spirit to the method used here. However, Settles et al. [12] suggest this strategy may not be sufficient for practical cost savings.
7 Open Problems
Chakaravarthy et al. [3] show it is NP-hard to approximate the optimal query tree within a factor of Ω(log n) for binary queries and non-uniform π. This hardness result is with respect to the number of objects. Some open questions remain. For the more general setting with non-uniform query costs, is there an algorithm with an approximation ratio independent of both π and c_i? The simple rounding technique we use seems to require dependence on c_i, but a more advanced method could avoid this dependence. Also, can the Ω(log n) hardness result be extended to the more restrictive case of uniform π? It would also be interesting to extend our analysis to allow questions to have more than one valid answer for each hypothesis. This would allow queries which ask for a positively labeled example from a set of examples. Such an extension appears non-trivial, as a straightforward extension assuming the given answer is randomly chosen from the set of valid answers produces a tree in which the mass of hypotheses is split across multiple branches, affecting the approximation. Much work also remains in the analysis of other active learning settings with general queries and costs. Of particular practical interest are extensions to agnostic algorithms that converge to the correct hypothesis under no assumptions [7, 2]. Extensions to treat label costs, partial label queries, and batch mode active learning are all of interest, and these learning algorithms could potentially be extended to treat these three subproblems at once using a similar setting. For some of these algorithms, even without modification we can guarantee the method does no worse than passive learning with respect to label cost. In particular, Dasgupta et al. [7] and Beygelzimer et al. [2] both give algorithms that iterate through T examples, at each step requesting a label with probability p_t. These algorithms are shown to not do much worse (in terms of generalization error) than the passive algorithm which requests every label. Because the algorithm queries for labels for a subset of T i.i.d. examples, the label cost of the algorithm is also no worse than the passive algorithm requesting T random labels. It remains an open problem, however, to show these algorithms can do better than passive learning in terms of label cost (most likely this will require modifications to the algorithm or additional assumptions).
References

[1] Adler, M., Heeringa, B.: Approximating optimal binary decision trees. In: Goel, A., Jansen, K., Rolim, J.D.P., Rubinfeld, R. (eds.) APPROX and RANDOM 2008. LNCS, vol. 5171, pp. 1–9. Springer, Heidelberg (2008)
[2] Beygelzimer, A., Dasgupta, S., Langford, J.: Importance weighted active learning. In: ICML (2009)
[3] Chakaravarthy, V.T., Pandit, V., Roy, S., Awasthi, P., Mohania, M.: Decision trees for entity identification: approximation algorithms and hardness results. In: PODS (2007)
[4] Chakaravarthy, V.T., Pandit, V., Roy, S., Sabharwal, Y.: Approximating decision trees with multiway branches. In: ICALP (2009)
[5] Cover, T.M., Thomas, J.A.: Elements of Information Theory, 2nd edn. Wiley-Interscience, Hoboken (2006)
[6] Dasgupta, S.: Analysis of a greedy active learning strategy. In: NIPS (2004)
[7] Dasgupta, S., Hsu, D., Monteleoni, C.: A general agnostic active learning algorithm. In: NIPS (2007)
[8] Haertel, R., Seppi, K.D., Ringger, E.K., Carroll, J.L.: Return on investment for active learning. In: NIPS Workshop on Cost-Sensitive Learning (2008)
[9] Hanneke, S.: The cost complexity of interactive learning (unpublished, 2006), http://www.cs.cmu.edu/~shanneke/docs/2006/cost-complexity-working-notes.pdf
[10] Hanneke, S.: Teaching dimension and the complexity of active learning. In: Bshouty, N.H., Gentile, C. (eds.) COLT 2007. LNCS (LNAI), vol. 4539, pp. 66–81. Springer, Heidelberg (2007)
[11] Kosaraju, S.R., Przytycka, T.M., Borgstrom, R.: On an optimal split tree problem. In: Dehne, F., Gupta, A., Sack, J.-R., Tamassia, R. (eds.) WADS 1999. LNCS, vol. 1663, pp. 157–168. Springer, Heidelberg (1999)
[12] Settles, B., Craven, M., Friedland, L.: Active learning with real annotation costs. In: NIPS Workshop on Cost-Sensitive Learning (2008)
Canonical Horn Representations and Query Learning

Marta Arias¹ and José L. Balcázar²

¹ LARCA Research Group, Departament LSI, Universitat Politècnica de Catalunya, Spain
[email protected]
² Departamento de Matemáticas, Estadística y Computación, Universidad de Cantabria, Spain
[email protected]
Abstract. We describe an alternative construction of an existing canonical representation for definite Horn theories, the Guigues-Duquenne basis (or GD basis), which minimizes a natural notion of implicational size. We extend the canonical representation to general Horn by providing a reduction from general to definite Horn CNF. We show how this representation relates to two topics in query learning theory: first, we show that a well-known algorithm by Angluin, Frazier and Pitt that learns Horn CNF always outputs the GD basis independently of the counterexamples it receives; second, we build strong polynomial certificates for Horn CNF directly from the GD basis.
1 Introduction
The present paper is the result of an attempt to better understand the classic algorithm by Angluin, Frazier, and Pitt [2] that learns propositional Horn formulas. A number of intriguing questions remain open regarding this algorithm; in particular, we were puzzled by the following one: along a run of the algorithm, queries made by the algorithm depend heavily upon the counterexamples selected as answers to the previous queries. It is therefore natural to expect the outcome of the algorithm to depend on the answers received along the run. However, attempts at providing an example of such behavior consistently fail. In this paper we prove that such attempts must in fact fail: we describe a canonical representation of Horn functions in terms of implications, and show that the algorithm of [2] always outputs this particular representation. It turns out that this canonical representation is well-known in the field of Formal Concepts, and bears the name of the authors that, to the best of our knowledge, first described it: the Guigues-Duquenne basis or GD basis [7, 12]. In addition, the GD basis has the important quality of being of minimum size. The GD basis is defined for definite Horn formulas only. We extend the notion of GD basis to general Horn formulas by means of a reduction from general to
Work partially supported by MICINN projects SESAAME-BAR (TIN2008-06582C03-01) and FORMALISM (TIN2007-66523).
definite Horn formulas. This reduction allows us to lift the characterization of the output of AFP as the generalized GD basis. Furthermore, the generalized GD representation provides the basis for building strong polynomial certificates with p(m, n) = m and q(m, n) = (m+1 choose 2) + m + 1 = (m+2 choose 2) for the class of general Horn formulas, extending a similar construction from [4] which applied only to definite Horn. Some of the technical lemmas and theorems in this paper are based on previous results of [12, 7]; we credit this fact appropriately throughout this presentation. As a general overview, we have adopted the following: the "bullet" operator (•) of Section 3.1 is directly taken from [12], the "star" operator (*) is standard in the field of study of Closure Spaces, and the GD basis comes from the Formal Concept Analysis literature. We consider that our contribution here is threefold: first, to understand, translate, and interpret the results from these other fields; second, to recognize the connection of these results to our own; third, to draw new insights into our topic of study thanks to the fruitful combination of our own intuitions and knowledge and the adoption of these outside results. Due to the space limit, a number of proofs, mostly of simple lemmas, have been omitted or just sketched. A longer version containing all proofs is available from the authors' webpages.
2 Preliminaries
We work within the standard framework in logic, where one is given an indexable set X of propositional variables of cardinality n, Boolean functions are subsets of the Boolean hypercube {0, 1}^n, and these functions are represented by logical formulas over the variable set in the standard way. Assignments are partially ordered bitwise according to 0 ≤ 1 (the usual partial order of the hypercube); the notation is x ≤ y. Readers not familiar with standard definitions of assignment, assignment satisfaction or formula entailment (|=), literal, term, clause, etc. should consult a standard textbook, e.g., [6]. A particularity of our work is that we identify assignments x ∈ {0, 1}^n with variable subsets α ⊆ X in the standard way, by connecting the variable subsets with the bits that are set to 1 in the assignments. We denote this explicitly when necessary with the functions x = BITS(α) and α = ONES(x). Therefore, x |= α iff α ⊆ ONES(x) iff BITS(α) ≤ x. We are only concerned with Horn functions, and their representations using conjunctive normal form (CNF). A Horn CNF formula is a conjunction of Horn clauses. A clause is a disjunction of literals. A clause is definite Horn if it contains exactly one positive literal, and it is negative if all its literals are negative. A clause is Horn if it is either definite Horn or negative. Horn clauses are generally viewed as implications where the negative literals form the antecedent of the implication (a positive term), and the singleton consisting of the positive literal, if it exists, forms the consequent of the clause. Note that both can be empty; if the consequent is empty, then we are dealing with a negative Horn clause. Furthermore, we allow our representations of Horn
CNF to deviate slightly from the standard in that we represent clauses sharing the same antecedent together in one implication. Namely, an implication α → β, where both α and β are possibly empty sets of propositional variables, is to be interpreted as the conjunction of definite Horn clauses α → b, for b ∈ β, if β ≠ ∅, and as the negative clause with antecedent α if β = ∅.¹ A semantically equivalent interpretation is to see both sets of variables α and β as positive terms; the Horn formula in its standard form is obtained by distributivity on the variables of β. Note that x |= ∅ for any assignment x when ∅ appears as an antecedent; however, this is not the case with respect to the right-hand sides of non-definite Horn clauses since, there, by convention, β = ∅ stands for the unsatisfiable. We refer to our generalized notion of conjunction of clauses sharing the antecedent as implication; the term clause retains its classical meaning (namely, a disjunction of literals). Notice that an implication may not be a clause, e.g. (a → bc) corresponds in classical notation to the formula ¬a ∨ (b ∧ c). Thus, (a → bc), (ab → c) and (ab → ∅) are Horn implications but only the latter two are Horn clauses. Furthermore, we often use sets to denote conjunctions, as we do with positive terms, also at other levels: a generic (implicational) CNF ⋀_i (α_i → β_i) is often denoted in this text by {(α_i → β_i)}_i. Parentheses are mostly optional and generally used for ease of reading. Clearly, an assignment x ∈ {0, 1}^n satisfies the implication α → β, denoted x |= α → β, if it either fails the antecedent or satisfies the consequent, that is, x ⊭ α or x |= β respectively, where now we are interpreting both α and β as positive terms. A Horn function admits several syntactically different Horn CNF representations; in this case, we say that these representations are equivalent. Such representations are also known as theories or bases for the Boolean function they represent. The size of a Horn function is the minimum number of clauses that a Horn CNF representing it must have. The implication size of a Horn function is defined analogously, but allowing formulas to have implications instead of clauses. Clearly, every clause is an implication, and thus the implication size of a given Horn function is always at most its standard size as measured in the number of clauses. Not all Boolean functions are Horn. The following semantic characterization is a well-known classic result, proved in the context of propositional Horn logic e.g. in [10]:

Theorem 1. A Boolean function admits a Horn CNF basis if and only if the set of assignments that satisfy it is closed under bit-wise intersection.

An implication in a Horn CNF H is redundant if it can be removed from H without changing the Horn function represented. A Horn CNF is irredundant if it does not contain any redundant implication.
¹ Notice that this differs from an alternative, older interpretation [11], nowadays obsolete, in which α → β represents the clause (¬x1 ∨ . . . ∨ ¬xk ∨ y1 ∨ . . . ∨ yk), where α = {x1, . . . , xk} and β = {y1, . . . , yk}. Though identical in syntax, the semantics are different; in particular, ours can only represent a conjunction of definite Horn clauses whereas the other represents a general possibly non-Horn clause.
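As a small illustration of these conventions (ours; in particular, encoding the empty consequent of a negative clause as None is an assumption of the sketch, to distinguish it from the trivially satisfiable empty antecedent), implications and their satisfaction can be modelled as follows.

def satisfies(x, implication):
    # x |= (alpha -> beta): x fails the antecedent or satisfies the consequent;
    # x, alpha, beta are frozensets of variables (the ONES form of assignments),
    # and beta = None encodes a negative clause, which x cannot satisfy once
    # it satisfies alpha
    alpha, beta = implication
    if not alpha <= x:              # antecedent not satisfied by x
        return True
    return beta is not None and beta <= x

# Example: the implication (a -> bc) is, classically, -a v (b ^ c).
imp = (frozenset("a"), frozenset("bc"))
assert satisfies(frozenset("abc"), imp) and not satisfies(frozenset("ab"), imp)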
Notice that an irredundant H may still contain other sorts of redundancies, such as consequents larger than strictly necessary. Such redundancies are not contemplated in this paper.

Forward chaining. We describe the well-known method of forward chaining for definite Horn functions [6]. Notice that it directly extends to our compressed representation where consequents of clauses can contain more than one variable. Given a definite Horn CNF H = {α_i → β_i}_i and a subset of propositional variables α, we construct chains of subsets of propositional variables α = α^(0) ⊂ α^(1) ⊂ ··· ⊂ α^(k) = α*. Each α^(i) with i > 0 is obtained from its predecessor α^(i−1) in the following way: if BITS(α^(i−1)) satisfies all implications in H, then the process can stop with α^(i−1) = α*. If, on the other hand, BITS(α^(i−1)) violates some implication α_j → β_j ∈ H, then α^(i) is set to α^(i−1) ∪ β_j. Similarly, one can construct an increasing chain of assignments x = x^(0) < x^(1) < ··· < x^(k) = x* using our bijection α^(i) = ONES(x^(i)) and x^(i) = BITS(α^(i)) for all i. See [6] as a general reference for the following well-known results. Theorem 3 in particular refers to the fact that the forward chaining procedure is a sound and complete deduction method for definite Horn CNF.

Theorem 2. The objects x* and α* are well-defined and computed by the forward chaining procedure regardless of the order in which implications in H are chosen. Moreover, x* and α* depend only on the underlying function being represented, and not on the particular choice of representation H; and for each x^(i) and α^(i) along the way, we have that (x^(i))* = x* and (α^(i))* = α*.

Theorem 3. Let h be a definite Horn function, and let α be an arbitrary variable subset. Then h |= α → b if and only if b ∈ α*.

Closure operator and equivalence classes. It is easy to see that the operator * is extensive (that is, x ≤ x* and α ⊆ α*), monotonic (if x ≤ y then x* ≤ y*, and if α ⊆ β then α* ⊆ β*) and idempotent (x** = x*, and α** = α*) for all assignments x, y and variable sets α, β; that is, * is a closure operator [4]. Thus, we refer to x* as the closure of x w.r.t. a definite Horn function. It should always be clear from the text with respect to which definite Horn function we are taking the closure, hence it is omitted from the notation used. An assignment x is said to be closed iff x* = x, and similarly for variable sets. Furthermore, it is not hard to see that closed elements are always positive (by construction via the forward chaining procedure, they must satisfy all implications), and assignments that are not closed are always negative (similarly, they must violate some implication). That is: x |= H if and only if x* = x. This closure operator induces a partition over the set of assignments {0, 1}^n in the following straightforward way: two assignments x and y belong to the same class if x* = y*. This notion of equivalence class carries over as expected to the power set of propositional variables: the subsets α and β belong to the same class if α* = β*. It is worth noting that each equivalence class consists of a possibly empty set of assignments that are not closed and a single closed set, its representative.
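The forward chaining procedure is easy to state as code; here is a direct sketch (ours, for illustration), with H given as a list of (antecedent, consequent) pairs of frozensets.

def closure(alpha, H):
    # compute alpha* by forward chaining over the definite Horn CNF H
    closed = set(alpha)
    changed = True
    while changed:
        changed = False
        for ante, cons in H:
            if ante <= closed and not cons <= closed:
                closed |= cons      # fire a violated implication
                changed = True
    return frozenset(closed)

# Example: under H = {a -> b, b -> c}, the closure of {a} is {a, b, c}.
H = [(frozenset("a"), frozenset("b")), (frozenset("b"), frozenset("c"))]
assert closure(frozenset("a"), H) == frozenset("abc")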
3 The Guigues-Duquenne Basis for Definite Horn
In this section we characterize and show how to build a canonical basis for definite Horn functions that is of minimum implication size. Our construction is based on the notion of saturation, a notion that has been used already in the context of Horn functions and seems very natural [4, 5]. It turns out that this canonical form is, in essence, the Guigues-Duquenne basis (the GD basis) which was introduced in [7]. Here, we introduce it in a form that is, to our knowledge, novel, although it is relatively close to the approach of [12]. We begin by defining saturation and then prove several interesting properties that serve as the basis for our work.

Definition 1. Let B = {α_i → β_i}_i be a basis for some definite Horn function.
– We say that B is left-saturated if the following 2 conditions hold:
  1. BITS(α_i) ⊭ α_i → β_i, for all i;
  2. BITS(α_i) |= α_j → β_j, for all i ≠ j.
  Alternatively, it can be more succinctly described by the following equivalence: a basis {α_i → β_i}_i is left-saturated if i ≠ j ⇔ BITS(α_i) |= α_j → β_j.
– We say that B is right-saturated if for all i, β_i = α_i*. Accordingly, we denote right-saturated bases with {α_i → α_i*}_i.
– We say that a basis B is saturated iff it is left- and right-saturated.

Example 1. Let H = {a → b, b → c, ad → e}.
– H is not left-saturated: for example, the antecedent of ad → e is such that BITS(ad) ⊭ a → b. One can already see that by including b in the antecedent of the third clause, one avoids this particular violation.
– H is not right-saturated because a* = abc and, for example, the implication a → b is missing ac from its right-hand side.
– The equivalent H′ = {a → abc, b → bc, abcd → abcde} is saturated.

Lemma 1. Let B = {α_i → β_i}_i be a basis for some definite Horn function h.
1. If B is left-saturated then B is irredundant.
2. If B is irredundant, then BITS(α_i) ⊭ α_i → β_i for all i.
3. If B is saturated, then BITS(α_i) ⊭ h and BITS(α_i*) |= h hold for all i.
4. If B is saturated, then α_i ⊆ α_j ⇒ α_i* ⊂ α_j*, for all i ≠ j.
Lemma 2. Let B = {α_i → α_i*}_i be an irredundant, right-saturated basis. Then, B is left-saturated if and only if the following implication is true for all i ≠ j: α_i ⊂ α_j ⇒ α_i* ⊆ α_j.

The following lemma is a variant of a result of [12] translated into our notation. We include the proof that is, in fact, missing from [12].

Lemma 3. Let B = {α_i → α_i*}_i be a saturated basis for a definite Horn function. Then for all i and β it holds that (β ⊆ α_i and β* ⊂ α_i*) ⇒ β* ⊆ α_i.
Proof. Let us assume that the conditions of the implication are true, namely, that β ⊆ α_i and β* ⊂ α_i*. We proceed by cases: if β is closed, then β* = β and the implication is trivially true, since β ⊆ α_i clearly implies β* ⊆ α_i when β* = β. Otherwise, β is not closed. Let β = β^(0) ⊂ β^(1) ⊂ ··· ⊂ β^(k) = β* be the series of elements constructed by the forward chaining procedure described in Section 2. We argue that if β^(l) ⊆ α_i and β^(l) ⊂ β*, then β^(l+1) ⊆ α_i as well. By repeatedly applying this fact to all the elements along the chain, we arrive at the desired conclusion, namely, β* ⊆ α_i.

Let β^(l) be such that β^(l) ⊆ α_i and β^(l) ⊂ β*. Thus β^(l) violates some implication (α_k → α_k*) ∈ B. Our forward chaining procedure assigns β^(l+1) to β^(l) ∪ α_k*. The following inclusions hold: α_k ⊆ β^(l) because BITS(β^(l)) ⊭ α_k → α_k*, and β^(l) ⊆ α_i by assumption; hence α_k ⊆ α_i. Using Lemma 2, and noticing the fact that, actually, α_k ⊂ α_i since β^(l) ⊂ α_i (otherwise we could not have β* ⊂ α_i*), we conclude that α_k* ⊆ α_i. We have that α_k* ⊆ α_i and β^(l) ⊆ α_i, so that β^(l+1) = β^(l) ∪ α_k* ⊆ α_i as required.

The next result characterizes our version of the canonical basis based on the notion of saturation. The proof does rely heavily on Lemma 3, which is adapted from a result from [12]. The connection to saturation and our proof technique are indeed novel.

Theorem 4. Definite Horn functions have a unique saturated basis.

Proof. Let B1 and B2 be two equivalent and saturated bases. Let a → a* be an arbitrary implication in B1. We show that a → a* ∈ B2 as well. By symmetry, this implies that B1 = B2. By Lemma 1(2), we have that BITS(a) ⊭ B1, and thus BITS(a) must violate some implication b → b* ∈ B2; hence it must hold that b ⊆ a but b* ⊈ a. The rest of the proof is concerned with showing that assuming b ⊂ a leads to a contradiction. If so, then b = a and thus a → a* ∈ B2 as well, as desired.

Let us assume then that b ⊂ a, so that, by monotonicity, b* ⊆ a*. If b* ⊂ a*, then we can use Lemma 3 with α_i = a and β = b and conclude that b* ⊆ a, contradicting the fact that BITS(a) ⊭ (b → b*). Thus, it must be that b* = a*. Now, consider b → a* ∈ B2. Clearly b is negative (otherwise, b = b*, and then b → b* is redundant) and thus it must violate some implication c → c* ∈ B1, namely, c ⊆ b but c* ⊈ b. If c = b, then we have a → a* ∈ B1 and c → c* ∈ B1 with c ⊂ a and c* = b* = a*, contradicting the fact that B1 is irredundant. Thus, c ⊂ b, and so c* ⊆ b*. If c* ⊂ b* then we use Lemma 3 as before, but with α_i = b and β = c, and we conclude that c* ⊆ b. Again, this means that BITS(b) |= c → c*, contradicting the fact that b violates this implication. So the only remaining case is c* = b*, but this means that we have the implications a → a* ∈ B1 and c → c* ∈ B1 with c ⊂ a but a* = c*, which again makes B1 redundant.

3.1 Constructing the GD Basis
So far, our definition of saturation only tests whether a given basis is actually saturated; we study now a saturation process to obtain the GD basis. New definitions are needed. Let H be any Horn CNF, and α any variable subset. Let
H(α) be those clauses of H whose antecedents fall in the same equivalence class as α, namely, H(α) = {α_i → β_i | α_i → β_i ∈ H and α* = α_i*}. Given a Horn function H and a variable subset α, we introduce a new operator •: α^• is the closure of α with respect to the subset of clauses H \ H(α). That is, in order to compute α^• one does forward chaining starting with α, but one is not allowed to use the clauses in H(α). This operator has been used in the literature before in related contexts, for example in [12].

Example 2. Let H = {a → b, a → c, c → d}. Then, (ac)* = abcd but (ac)^• = acd, since H(ac) = {a → b, a → c} and we are only allowed to use the clause c → d when computing (ac)^•.

Computing the GD basis of a definite Horn H. First, saturate every clause C = α → β in H by replacing it with the implication α^• → α*. Then, remove possibly redundant implications, namely: (1) remove implications s.t. α^• = α*, (2) remove duplicates, and (3) remove subsumed implications, i.e., implications α^• → α* for which there is another implication β^• → β* s.t. α* = β* but β^• ⊂ α^•. Let us denote with GD(H) the implicational definite Horn CNF obtained by applying this procedure to input H. Note that this algorithm is designed to work when given a definite Horn CNF both in implicational or standard form. The procedure can be computed in quadratic time, since finding the closures of the antecedent and consequent of each clause can be done in linear time w.r.t. the size of the initial Horn CNF H.

Example 3. Let H = {a → b, a → c, ad → e, ab → e}. We compute the closures of the antecedents: a* = abce, (ad)* = abcde, and (ab)* = abce. Therefore, H(a) = {a → b, a → c, ab → e}, H(ad) = {ad → e}, and H(ab) = H(a). Thus, a^• = a, (ad)^• = abcde, and (ab)^• = ab. After saturation of every clause in H, we obtain H′ = {a → abce, a → abce, abcde → abcde, ab → abce}. It becomes clear that the third implication was, in fact, redundant. Also, the fourth implication is subsumed by the first two (after right-saturation), and we can group the first and second implications together into a single one. Hence, GD(H) = {a → abce}.

In the remainder of this section we show that the given algorithm computes the unique saturated representation of its input. First, we need a simple lemma:

Lemma 4. Let H be any basis for a definite Horn CNF over variables X = {x_1, . . . , x_n}. For any α, β, γ ⊆ X, the following statements hold:
1. α ⊆ α^• ⊆ α*;
2. If H |= β → γ and β ⊆ α^•, but β* ⊂ α*, then γ ⊆ α^•.

Lemma 5. The algorithm computing GD(H) outputs the GD basis of H for any definite Horn formula H.

Proof. Let H be the input to the algorithm, and let H′ be its output. We show that H′ must be saturated. Let α → β be an arbitrary implication in the output H′. Because of the initial saturation process, we can refer to this implication
as α^• → α*. Clearly, (α^•)* = α*, and H′ is right-saturated. It is only left to show that H′ is left-saturated. By Lemma 4, it must be that α^• ⊆ α*, but the removal of implications of type (1) guarantees that α^• ⊂ α*; thus we have that BITS(α^•) ⊭ α^• → α*, and Condition 1 of left-saturation is satisfied. Now let β^• → β* be any other implication in H′. We need to show that BITS(α^•) |= β^• → β*. Assume by way of contradiction that this is not so, and BITS(α^•) |= β^• but BITS(α^•) ⊭ β*. That is, β^• ⊆ α^• but β* ⊈ α^•. If β^• = α^•, then β* = α*, contradicting the fact that both implications have survived type (2) of removal of implications in the algorithm. Thus, β^• ⊂ α^•, and therefore β* ⊆ α* must hold as well. It cannot be that β* = α*, because we would have that α^• → α* is subsumed by β^• → β* and thus removed from the output H′ during removal of implications of type (3) (and it is not). Thus, it can only be that β^• ⊂ α^• and β* ⊂ α*. But if β* ⊂ α*, Lemma 4 and the fact that H |= β^• → β* (notice that saturating clauses does not change the logical value of the resulting formula) guarantee that β* ⊆ α^•, contradicting our assumption that β* ⊈ α^•. It follows that H′ is saturated as required.

It is clear that GD(H) has at most as many implications as H. Thus, if H is of minimum size, then so is GD(H). This, together with the fact that the GD basis is unique, implies:

Theorem 5. [7] The GD basis of a definite Horn function is of minimum implicational size.
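For illustration, the GD(H) procedure above can be realized directly as follows (a sketch of ours, reusing the closure function from the forward chaining sketch in Section 2); on the H of Example 3 it returns the single implication a → abce.

def bullet(alpha, H):
    # alpha^bullet: closure of alpha w.r.t. H minus the clauses of H(alpha)
    star = closure(alpha, H)
    rest = [(a, b) for (a, b) in H if closure(a, H) != star]
    return closure(alpha, rest)

def gd_basis(H):
    # saturate every clause; building a set also removes duplicates (step 2)
    saturated = {(bullet(a, H), closure(a, H)) for (a, _) in H}
    basis = []
    for (a, b) in saturated:
        if a == b:                                             # step (1)
            continue
        if any(b2 == b and a2 < a for (a2, b2) in saturated):  # step (3)
            continue
        basis.append((a, b))
    return basis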
4 The Guigues-Duquenne Basis in Query Learning
The classic query learning algorithm by Angluin, Frazier, and Pitt [2] is able to learn Horn CNF with membership and equivalence queries. It was proved in [2] that the outcome of the algorithm is always equivalent to the target concept. However, the following questions remain open: (1) which Horn CNF, among the many equivalent candidates, is output? And (2) does this output depend on the specific counterexamples given to the equivalence queries? Indeed, each query depends on the counterexamples received so far, and intuitively the final outcome should depend on that as well. Our main result from this section is that, contrary to our first intuition, the output is always the same Horn CNF: namely, the GD basis of the target Horn function. This section assumes that the target is definite Horn; further sections in the paper lift the "definite" constraint.

4.1 The AFP Algorithm for Definite Horn CNF
We recall some aspects of the learning algorithm as described in [4], which bears only slight, inessential differences with the original in [2]. The algorithm maintains a set P of all the positive examples seen so far. The fact that the target is definite Horn allows us to initialize P with the positive example 1^n. The algorithm maintains also a sequence N = (x_1, . . . , x_t) of representative negative
examples (these become the antecedents of the clauses in the hypotheses). The argument of an equivalence query is prepared from the list N = (x_1, . . . , x_t) of negative examples combined with the set P of positive examples. The query corresponds to the following intuitive bias: everything is assumed positive unless some (negative) x_i ∈ N suggests otherwise, and everything that some x_i suggests negative is assumed negative unless some positive example y ∈ P suggests otherwise. This is exactly the intuition in the hypothesis constructed by the AFP algorithm. For the set of positive examples P, denote P_x = {y ∈ P | x ≤ y}. The hypothesis to be queried, given the set P and the list N = (x_1, . . . , x_t), is denoted H(N, P) and is defined as

H(N, P) = {ONES(x_i) → ONES(⋀ P_{x_i}) | x_i ∈ N},

where ⋀ P_{x_i} denotes the bitwise intersection of the examples in P_{x_i}. A positive counterexample is treated just by adding it to P. A negative counterexample y is used either to refine some x_i into a smaller negative example, or to add x_{t+1} to the list. Specifically, let

i := min({j | MQ(x_j ∧ y) is negative, and x_j ∧ y < x_j} ∪ {t + 1}),

and then refine x_i into x_i ∧ y, in case i ≤ t, or else make x_{t+1} = y, subsequently increasing t. The value of i is found through membership queries on all the x_j ∧ y for which x_j ∧ y < x_j holds.
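A sketch (ours) of assembling H(N, P), with assignments modelled as frozensets of variables in their ONES form, so that the bitwise ∧ of assignments becomes set intersection; the full query loop is the pseudocode of Figure 1 below.

def hypothesis(N, P, all_vars):
    # H(N, P) = { ONES(x_i) -> ONES(/\ P_{x_i}) : x_i in N }
    H = []
    for x in N:
        P_x = [y for y in P if x <= y]   # positive examples above x_i
        meet = frozenset(all_vars)       # P always contains 1^n
        for y in P_x:
            meet &= y                    # bitwise intersection
        H.append((x, meet))
    return H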
N ← () /* empty list */ P ← {1n } /* top element */ t←0 while EQ(H(N, P )) = (“no”, y) do if y |= H(N, P ) then add y to P else find the first i such that xi ∧ y < xi , and xi ∧ y is negative if found then xi ← xi ∧ y else t ← t + 1; xt ← y return H(N, P )
/* y is the counterexample */ /* N = (x1 , . . . , xt ) */ /* that is, xi ≤ y */ /* use membership query */ /* replace xi by xi ∧ y in N */ /* append y to end of N */
Fig. 1. The AFP learning algorithm for definite Horn CNF
The AFP algorithm is described in Figure 1. In order to prove that its output is indeed the GD basis, we need the following lemmas from [4]:

Lemma 6 (Lemma 2 from [4]). Along the run of the AFP algorithm, at the point of issuing the equivalence query, for every x_i and x_j in N with i < j there exists a positive example z such that x_i ∧ x_j ≤ z ≤ x_j.

Lemma 7 (Variant of Lemma 1 from [4]). Along the run of the AFP algorithm, at the point of issuing the equivalence query, for every x_i and x_j in N with i < j and x_i ≤ x_j, it holds that ⋀ P_{x_i} ≤ x_j.
Proof. At the time x_j is created, we know it is a negative counterexample to the current query, for which it must therefore be positive. That query includes the implication ONES(x_i) → ONES(⋀ P_{x_i}), and x_j must satisfy it; then x_i ≤ x_j implies ⋀ P_{x_i} ≤ x_j. From that point on, further positive examples may enlarge P_{x_i} and thus reduce ⋀ P_{x_i}, keeping the inequality. Further negative examples y may reduce x_i, again possibly enlarging P_{x_i} and keeping the inequality; or they may reduce x_j into x_j ∧ y. If x_i ≰ x_j ∧ y anymore, then there is nothing left to prove. Finally, if x_i ≤ x_j ∧ y, then x_i ≤ y, and y is again a negative counterexample that must satisfy the implication ONES(x_i) → ONES(⋀ P_{x_i}) as before, so that ⋀ P_{x_i} ≤ x_j ∧ y also for the new value of x_j.

Our key lemma for our next main result is:

Lemma 8. All hypotheses H(N, P) output by the AFP learning algorithm in equivalence queries are saturated.

Proof. Recall that H(N, P) = {ONES(x_i) → ONES(⋀ P_{x_i}) | x_i ∈ N}, where P_{x_i} = {y ∈ P | x_i ≤ y}. Let α_i = ONES(x_i) and β_i = ONES(⋀ P_{x_i}) for all i, so that H(N, P) = {α_i → β_i | 1 ≤ i ≤ t}.

First we show that H(N, P) is left-saturated. To see that x_i ⊭ α_i → β_i it suffices to note that x_i < ⋀ P_{x_i}, since x_i is negative but ⋀ P_{x_i} is positive by Theorem 1, being an intersection of positive examples; thus, these two assignments must be different. Now we show that x_i |= α_j → β_j, for all i ≠ j. If x_i ⊭ α_j, then clearly x_i |= α_j → β_j. Otherwise, x_i |= α_j and therefore x_j ≤ x_i. If i < j, then by Lemma 6 we have that x_i ∧ x_j ≤ z ≤ x_j for some positive z. Then, x_i ∧ x_j = x_j ≤ z ≤ x_j, so that x_j = z, contradicting the fact that x_j is negative whereas z is positive. Otherwise, j < i. We apply Lemma 7: it must hold that ⋀ P_{x_j} ≤ x_i. Thus, in this case, x_i |= α_j → β_j as well, because x_i |= β_j = ONES(⋀ P_{x_j}).

It is only left to show that H(N, P) is right-saturated. Clearly, H(N, P) is consistent with N and P, that is, x ⊭ H(N, P) for all x ∈ N and y |= H(N, P) for all y ∈ P. Take any x ∈ N contributing the implication ONES(x) → ONES(⋀ P_x) to H(N, P). We show that it is right-saturated, i.e., ⋀ P_x = x*, where the closure is taken with respect to H(N, P). We note first that H(N, P) |= ONES(x) → (ONES(x))* since the closure is taken w.r.t. implications in H(N, P). By the construction of H(N, P), all examples y ∈ P_x must satisfy it, hence they must satisfy the implication ONES(x) → (ONES(x))* as well. Therefore, since y |= ONES(x) we must have that y |= (ONES(x))*, or equivalently, that x* ≤ y. This is true for every such y in P_x, and thus x* ≤ ⋀ P_x. On the other hand, it is obvious that ⋀ P_x ≤ x*, since the implication ONES(x) → ONES(⋀ P_x) of H(N, P) guarantees that all the variables in ⋀ P_x are included in the forward chaining process in the final x*. So we have x* ≤ ⋀ P_x ≤ x* as required.

Putting Theorem 4 and Lemma 8 together, we obtain:

Theorem 6. AFP, run on a definite Horn target, always outputs the GD basis of the target concept.
5 A Canonical Basis for General Horn
Naturally, we wish to extend the notion of saturation and GD basis to general Horn functions. We do this via a prediction-with-membership reduction [3] from general Horn to definite Horn, and use the corresponding intuitions to define a GD basis for general Horn. We use this reduction to generalize our AFP algorithm to general Horn CNF, and as a consequence one obtains that the generalized AFP always outputs a saturated version of the target function. Indeed, for the generalized AFP it is also the case that the output depends only on the target, and not on the counterexamples received along the run. Finally, we construct strong polynomial certificates for general Horn functions directly in terms of the generalized GD basis, thus generalizing our earlier result of [4].

5.1 Reducing General Horn CNF to Definite Horn CNF
In this section we describe the intuition of the representation mapping, which we use in the next section to obtain a canonical basis for general Horn functions. For any general Horn CNF H over n propositional variables, e.g. X = {x_i | 1 ≤ i ≤ n}, we construct a definite Horn H′ over the set of n + 1 propositional variables X′ = X ∪ {f}, where f is a new "dummy" variable; in essence, f represents the false (that is, empty) consequent of the negative clauses in H. The relationship between the assignments for H and H′ is as follows: for assignments of n + 1 variables xb, where x assigns to the variables in X and b is the truth value assigned to f, x0 |= H′ if and only if x |= H, whereas x1 |= H′ if and only if x = 1^n.

Define the implication C_f as f → X′. Let H_d be the set of definite Horn clauses in H, and H_n = H \ H_d the negative ones. Define the mapping g as

g(H) = H_d ∪ {¬C → X′ | C ∈ H_n} ∪ {C_f}.

That is, g(H) includes the definite clauses of H, the special implication C_f, and the clauses C that are negative are made definite by forcing all the positive literals, including f, into them (recall that, for a negative clause C, the negation ¬C is a positive term, which serves as the antecedent). Clearly, the resulting g(H) is definite Horn. Observe that the new implication C_f is saturated and the ones coming from H_n are right-saturated. Observe also that g is injective: given g(H), we recover H by removing the implication C_f, and by removing all positive literals from any implications containing f. Clearly, g⁻¹(g(H)) = H, since g⁻¹ removes all that g adds.
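A sketch (ours) of g and its inverse, with a general Horn CNF given as (antecedent, consequent) pairs of frozensets, consequent None marking negative clauses, and 'f' the fresh variable; applied to the H of Example 4 below, g produces exactly the clauses listed there.

def g(H, X):
    # map a general Horn CNF over X to a definite Horn CNF over X + {f}
    Xp = frozenset(X) | {"f"}
    out = [(a, b) for (a, b) in H if b is not None]   # definite part H_d
    out += [(a, Xp) for (a, b) in H if b is None]     # negative clauses, made definite
    out.append((frozenset("f"), Xp))                  # the implication C_f
    return out

def g_inverse(Hp):
    # recover H: drop C_f and strip positive literals of clauses containing f
    out = []
    for a, b in Hp:
        if a == frozenset("f"):
            continue
        out.append((a, None) if "f" in b else (a, b))
    return out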
5.2 Constructing a GD-like Basis for General Horn CNF
The notion of left-saturation translates directly into general Horn CNF:

Definition 2. Let B = {α_i → β_i}_i be a basis for some general Horn function. Notice that now β_i can possibly be empty (it is empty for the negative clauses). Then, B is left-saturated if the following two conditions hold:
1. BITS(α_i) ⊭ α_i → β_i, for all i;
2. BITS(α_i) |= α_j → β_j, for all i ≠ j.

For a definite Horn CNF H, right-saturating a clause α → β essentially means that we add to its consequent everything that is implied by its antecedent, namely α*. This can no longer be done in the case of general Horn CNF, since we need to take special care of the negative clauses: if β = ∅, we cannot set β to α* without changing the underlying Boolean function being represented. The closure x* of an assignment x is defined as the closure with respect to all definite clauses in the general Horn CNF. It is useful to continue to partition assignments x in the Boolean hypercube according to their closures x*; however, in the general Horn case, we distinguish a new class (the negative class) of closed assignments that are actually negative; that is, it is possible now that x* = x but x ⊭ H. These assignments are exactly those that satisfy all definite clauses of H but violate negative ones. Based on this, the negative clauses (those with antecedent α such that BITS(α*) ⊭ B) should be left unmodified, and the definite clauses (those whose antecedents α are such that BITS(α*) |= B) should be right-saturated. Thus, the definition is:

Definition 3. Let B = {α_i → β_i}_i be a basis for some general Horn function. Then, B is right-saturated if, for all i, β_i = ∅ if BITS(α_i*) ⊭ B, and β_i = α_i* otherwise.

As for the definite case, "saturated" means that the general Horn CNF in question is both left- and right-saturated. We must see that this is the "correct" definition in some sense:

Lemma 9. A basis H is saturated iff H = g⁻¹(GD(g(H))).

Proof. First let us note that the expression g⁻¹(GD(g(H))) is well-defined. We can always invert g on GD(g(H)), since saturating g(H) does not modify C_f (already saturated) and it does not touch the positive literals of implications containing f, since these are right-saturated. Therefore, we can invert it, since the parts added by g are left untouched by the construction of GD(g(H)).

We prove first that if H is saturated then H = g⁻¹(GD(g(H))). Assume, then, that H is saturated but H ≠ g⁻¹(GD(g(H))). Applying g, which is injective, this can only happen if GD(g(H)) ≠ g(H), namely, g(H), as a definite Horn CNF, differs from its own GD basis and, hence, it is not saturated: it must be because some implication other than C_f is not saturated, since this last one is saturated by construction. Also the ones containing f in their consequents are right-saturated, so no change happens in the right-hand sides of these implications when saturating g(H). This means that when saturating we must add a literal different from f to the right-hand side of an implication not containing f, or to the left-hand side of an implication. In both cases, this means that the original H could not be saturated either, contradicting our assumption.

It is only left to show that an H such that H = g⁻¹(GD(g(H))) is indeed saturated. By way of contradiction, assume that H is not saturated but H = g⁻¹(GD(g(H))). Applying g to both sides, we must have that g(H) = GD(g(H)), so that g(H) is actually saturated. Notice that the only difference between H
and g(H) is in the implication Cf and the right-hand-sides of negative clauses in H; g(H) being left-saturated means that so must be H since the left-handsides of H and g(H) coincide exactly (ignoring Cf naturally). Therefore, H is left-saturated as well. It must be that H is not right-saturated, that is, it is either missing some variable in some non-empty consequent, or some clause that should be negative is not. In the first case, then g(H) is missing it, too, and it cannot be saturated. In the second case, then there is a redundant clause in H contradicting the fact that H is left-saturated (see Lemma 1(1)). In both cases we arrive at a contradiction, thus the lemma follows. This gives us a way to compute the saturation (that is, the GD basis) of a given general Horn CNF: Theorem 7. General Horn functions have a unique saturated basis. This basis, which we denote GD(H), can be computed by GD(H) = g −1 (GD(g(H))). Proof. If H is saturated then H = g −1 (GD(g(H))). The uniqueness of such an H follows from the following facts: first, g(H) and g(H ) are equivalent whenever H and H are equivalent; second, GD(g(H)) is unique for the function represented by H (Theorem 4) and third, g −1 is univocally defined since g is injective. Example 4. Let H be the general Horn CNF {a → b, a → c, abc → ∅}. Then, – g(H) = {a → b, a → c, abc → abcf , f → abcf }; – GD(g(H)) = {a → abcf , f → abcf }; – GD(H) = g −1 (GD(g(H))) = {a → ∅}. Similarly to the case of definite Horn functions, GD(H) does not increase the number of new implications, and therefore if H is of minimum size, GD(H) must be of minimum size as well. This, together with the uniqueness of saturated representation implies that: Theorem 8. The GD basis of a general Horn function is of minimum implicational size. 5.3
5.3 The AFP Algorithm for General Horn CNF
We study now the AFP algorithm operating on general Horn CNF, by following a detour: we obtain it via reduction to the definite case. We consider, therefore, an algorithm that, for a general Horn target function H, simulates the version of the AFP algorithm from Figure 1 on its definite transformation g(H), where g is the representation transformation from Section 5.1. It has to simulate the membership and equivalence oracles for definite Horn CNF that the underlying algorithm expects, by using the oracles that it has for general Horn. Initially, we set P = {1^{n+1}} and N = {0^n 1}, since we know that g(H) is definite and must contain the implication f → X ∪ {f} by construction. In essence, the positive assignment 1^{n+1} |= Cf and the negative assignment 0^n 1 ⊭ Cf guarantee that
the implication Cf is included in every hypothesis H(N, P) that the simulation outputs as an equivalence query. In order to deal with the queries, we use two transformations: we must map examples over the n + 1 variables, asked as membership queries, into examples over the original example space over n variables, although in some cases we are able to answer the query directly, as we shall see. Upon being asked x0 as a membership query for g(H), we pass on to H the membership query about x. Membership queries of the form x1 are always answered negatively, except for 1^{n+1}, which is answered positively (in fact the query 1^{n+1} never arises anyway, because that example is in P from the beginning). Conversely, n-bit counterexamples x from the equivalence query with H are transformed into x0. The equivalence queries themselves are transformed according to g⁻¹. It is readily checked that all equivalence queries belong indeed to the image set of g, since Cf ∈ H(N, P). Altogether, these functions constitute a prediction-with-membership (pwm) reduction from general Horn to definite Horn, in the sense of [3]. It is interesting to note that if we unfold the simulation, we end up with the original algorithm by Angluin, Frazier and Pitt [2] (obviously, with no explicit reference to our "dummy" variable f). Therefore, the outcome of AFP on a general Horn target H is univocally determined by the outcome of AFP on the corresponding definite Horn function g(H); combining this fact with Theorems 6 and 7 leads to:

Theorem 9. The AFP algorithm always outputs the GD basis of the target concept.
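To make the query translation concrete, the following Python sketch shows one way the simulated membership oracle for g(H) could be implemented on top of the membership oracle for H. The bit-tuple representation of assignments and all names are our own illustration, not part of the paper's formal development.

def make_definite_membership(membership_general, n):
    """Simulated membership oracle for g(H) over n+1 variables,
    built from the oracle for the general Horn target H over n
    variables (a sketch of the query translation in the text)."""
    def membership_definite(x_ext):
        x, f_bit = x_ext[:n], x_ext[n]
        if f_bit == 1:
            # queries of the form x1 are negative, except 1^{n+1}
            return x_ext == (1,) * (n + 1)
        # queries of the form x0 are passed on to H as x
        return membership_general(x)
    return membership_definite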
5.4 Certificates for General Horn CNF
The certificate dimension of a given concept class is closely related to its learnability in the model of learning from membership and equivalence queries [1,8,9]. Informally, a certificate for a class C of concepts of size at most m is a set of (labeled) assignments that proves that concepts consistent with it must be outside C. The polynomial q(m, n) used below quantifies the cardinality of the certificate set in terms of m, the size of the class, and n, the number of variables in the class. The polynomial p(m, n) quantifies the expansion in size allowed in the hypotheses. In this paper, p(m, n) = m and thus we construct strong certificates. In [4] we show how to build strong certificates for definite Horn CNF. Here, we extend this to general Horn CNF, and describe the certificates directly in terms of the generalized GD basis. Due to space limitations, we only sketch the proof.

Theorem 10. The class of general Horn CNF has strong polynomial certificates with p(m, n) = m and q(m, n) = (m+1 choose 2) + m + 1 = (m+2 choose 2).

Proof (Sketch). The argumentation follows, essentially, the same steps as the analogous proof in [4] because, by Lemma 9, the GD basis in the general case is saturated, and therefore all required facts carry over to the general case. Let f be a Boolean function that cannot be represented with m Horn implications.
If f is not Horn, then three assignments x, y, x ∧ y such that x |= f, y |= f but x ∧ y ⊭ f suffice. Otherwise, f is a general Horn CNF of implicational size strictly greater than m. Assume that f contains at least m + 1 non-redundant and possibly negative implications {αi → βi}. We define the certificate for f:

Qf = {x_i⋆, x_i | 1 ≤ i ≤ m + 1, x_i = BITS(αi), βi ≠ ∅}
   ∪ {x_i⋆ | 1 ≤ i ≤ m + 1, x_i = BITS(αi), βi = ∅}
   ∪ {x_i⋆ ∧ x_j⋆ | 1 ≤ i < j ≤ m + 1}
It is illustrative to note the relation between this set of certificates for f and its GD basis: the assignments x_i and x_i⋆ correspond exactly to the left and right-hand sides of the (saturated) definite implications in GD(f). For negative clauses, only the (saturated) left-hand side of the implication, x_i⋆, matters.
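The closures x_i⋆ appearing in Qf can be computed by standard forward chaining over the definite clauses. A minimal sketch, assuming clauses are given as pairs (antecedent, consequent) of sets of variable indices (a representation chosen purely for illustration):

def closure(x, definite_clauses):
    """Close the set x of true variables under the definite clauses."""
    s = set(x)
    changed = True
    while changed:
        changed = False
        for alpha, beta in definite_clauses:
            if alpha <= s and not beta <= s:
                s |= beta
                changed = True
    return s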
References

1. Angluin, D.: Queries revisited. Theoretical Computer Science 313, 175–194 (2004)
2. Angluin, D., Frazier, M., Pitt, L.: Learning conjunctions of Horn clauses. Machine Learning 9, 147–164 (1992)
3. Angluin, D., Kharitonov, M.: When won't membership queries help? Journal of Computer and System Sciences 50(2), 336–355 (1995)
4. Arias, M., Balcázar, J.L.: Query learning and certificates in lattices. In: Freund, Y., Györfi, L., Turán, G., Zeugmann, T. (eds.) ALT 2008. LNCS (LNAI), vol. 5254, pp. 303–315. Springer, Heidelberg (2008)
5. Arias, M., Feigelson, A., Khardon, R., Servedio, R.A.: Polynomial certificates for propositional classes. Inf. Comput. 204(5), 816–834 (2006)
6. Chang, C.-L., Lee, R.C.-T.: Symbolic Logic and Mechanical Theorem Proving. Academic Press, Inc., Orlando (1973)
7. Guigues, J.L., Duquenne, V.: Familles minimales d'implications informatives résultant d'un tableau de données binaires. Math. Sci. Hum. 95, 5–18 (1986)
8. Hegedüs, T.: On generalized teaching dimensions and the query complexity of learning. In: Proceedings of the Conference on Computational Learning Theory, pp. 108–117. ACM Press, New York (1995)
9. Hellerstein, L., Pillaipakkamnatt, K., Raghavan, V., Wilkins, D.: How many queries are needed to learn? Journal of the ACM 43(5), 840–862 (1996)
10. Khardon, R., Roth, D.: Reasoning with models. Artificial Intelligence 87(1-2), 187–213 (1996)
11. Wang, H.: Toward mechanical mathematics. IBM Journal for Research and Development 4, 2–22 (1960)
12. Wild, M.: A theory of finite closure spaces based on implications. Advances in Mathematics 108, 118–139 (1994)
Learning Finite Automata Using Label Queries

Dana Angluin¹, Leonor Becerra-Bonache¹,², Adrian Horia Dediu²,³, and Lev Reyzin¹

¹ Department of Computer Science, Yale University, 51 Prospect Street, New Haven, CT, USA
{dana.angluin,leonor.becerra-bonache,lev.reyzin}@yale.edu
² Research Group on Mathematical Linguistics, Rovira i Virgili University, Avinguda Catalunya, 35, 43002, Tarragona, Spain
³ "Politehnica" University of Bucharest, Splaiul Independentei 313, 060042, Bucharest, Romania
[email protected]
Abstract. We consider the problem of learning a finite automaton M of n states with input alphabet X and output alphabet Y when a teacher has helpfully or randomly labeled the states of M using labels from a set L. The learner has access to label queries; a label query with input string w returns both the output and the label of the state reached by w. Because different automata may have the same output behavior, we consider the case in which the teacher may "unfold" M to an output equivalent machine M′ and label the states of M′ for the learner. We give lower and upper bounds on the number of label queries to learn the output behavior of M in these different scenarios. We also briefly consider the case of randomly labeled automata with randomly chosen transition functions.
1 Introduction
The problem of learning the behavior of a finite automaton has been considered in several domains, including language learning and environment learning by robots. Many interesting questions remain about the kinds of information that permit efficient learning of finite automata. One basic result is that finite automata are not learnable using a polynomial number of membership queries. Consider a "password machine", that is, an acceptor with (n + 2) states that accepts exactly one binary string of length n; the learner may query 2^n − 1 strings before finding the one that is accepted. In this case, the learner gets no partial information from the unsuccessful queries. However, Freund et al. [5] show that regardless of the topology of the underlying automaton, if its states are randomly labeled with 0 or 1, then a robot taking
Supported by a Marie Curie International Fellowship within the 6th European Community Framework Programme. This material is based upon work supported under a National Science Foundation Graduate Research Fellowship.
a random walk on the automaton can learn to predict the labels while making only a polynomial number of errors of prediction. Random labels on the states provide a rich source of information that can be used to distinguish otherwise difficult-to-distinguish pairs of states. In a different setting, Becerra-Bonache, Dediu and Tîrnăucă [3] introduced correction queries to model a kind of correction provided by a teacher to a learner when the learner's utterance is not grammatically correct. In their model, a correction query with a string w gives the learner not only membership information about w, but also, if w is not accepted, either the minimum continuation of w that is accepted, or the information that no continuation of w is accepted. In certain cases, corrections may provide a substantial amount of partial information for the learner. For example, for a password machine, a prefix of the password will be answered with the rest of the password. We may think of correction queries as labeling each state q of the automaton with the string rq that is the response to any correction query w that arrives at q. In both of these cases, labels on states may facilitate the learning of finite automata: randomly chosen labels in the work of Freund et al. and meaningfully chosen labels in the work of Becerra-Bonache, Dediu and Tîrnăucă. In this paper we explore the general idea of adding labels to the states of an automaton to make it easier to learn. That is, we allow a teacher to prepare an automaton M for learning by adding labels to its states (either carefully or randomly chosen). When the learner queries a string, the learner receives not only the original output of M for that string, but also the label attached to that state by the teacher. In an extension of this idea, we also allow the teacher to "unfold" the machine M to produce copies of a state that may then be given different labels. These ideas are also relevant to automata testing [7] – labeling and unfolding automata can make their structure easier to verify. Depending on how labels are assigned, learning may or may not become easier. If each state is assigned a unique label, the learning task becomes easy because the learner knows which state the machine reaches on any given query. However, if the labels are all the same, they give no additional information and learning may require an exponential number of queries (as in the case of membership queries). Hence we focus on questions of the following sort. Given an automaton, how can a teacher use a limited set of labels to make the learning problem easier? If random labels are sprinkled on the states of an automaton, how much does that help the learner? How few labels can we use and still make the learning problem tractable? Other questions concern the structure of the automaton itself. For example, we may consider changing the structure of the automaton before labeling it. We also consider the problem of learning randomly labeled automata with random structure.
2 Preliminaries
We consider finite automata with output, defined as follows. A finite automaton M has a finite set Q of states, an initial state q0 ∈ Q, a finite alphabet X of input
symbols, a finite alphabet Y of output symbols, an output function γ mapping Q to Y and a transition function τ mapping Q × X to Q. We extend τ to map Q × X* to Q in the usual way. A finite acceptor is a finite automaton with output alphabet Y = {0, 1}; if γ(q) = 1 then q is an accepting state, otherwise q is a rejecting state. In this paper we assume that there are at least two input symbols and at least two output symbols, that is, |X| ≥ 2 and |Y| ≥ 2. For any string w ∈ X*, we define M(w) to be γ(τ(q0, w)), that is, the output of the state reached from q0 on input w. Two finite automata M1 and M2 are output-equivalent if they have the same input alphabet X and the same output alphabet Y and for every string w ∈ X*, M1(w) = M2(w). If M is a finite automaton with output, then an output query with string w ∈ X* returns the symbol M(w). This generalizes the concept of a membership query for an acceptor. That is, if M is an acceptor, an output query with w returns 1 if w is accepted by M and 0 if w is rejected by M. We note that Angluin's polynomial time algorithm to learn finite acceptors using membership queries and equivalence queries generalizes in a straightforward way to learn finite automata with output using output queries and equivalence queries [2]. If q1 and q2 are states of a finite automaton with output, then q1 and q2 are distinguishable if there exists a distinguishing string for them, namely, a string w such that γ(τ(q1, w)) ≠ γ(τ(q2, w)), that is, w leads from q1 and q2 to two states with different output symbols. If M is minimized, every pair of its states is distinguishable, and M has at most one sink state. If d is a nonnegative integer, the d-signature tree of a state q is the finite function mapping each input string z of length at most d to the output symbol γ(τ(q, z)). We picture the d-signature tree of a state as a rooted tree of depth d in which each internal node has |X| children labeled with the elements of X, and each node is labeled with the symbol from Y that is the output of the state reached from q on the input string z that leads from the root to this node. The d-signature tree of a state gives the output behavior in a local neighborhood of the automaton reachable from that state. For any finite automaton M with output, we may consider its transition graph, which is a finite directed graph (possibly with multiple edges and self-loops) defined as follows. The vertices are the states of M and there is an edge from q to q′ for each transition τ(q, a) = q′. Properties of the transition graph are applied to M; that is, M is strongly connected if its transition graph is strongly connected. Similarly, the out-degree of M is |X| for every node, and the in-degree of M is the maximum number of edges entering any node of its transition graph. For a positive integer k, we define an automaton M to be k-concentrating if there is some set Q′ of at most k states of M such that every state of M can reach at least one state in Q′. Every strongly connected automaton is 1-concentrating.
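As an illustration of the last definitions, the following sketch computes the d-signature tree of a state; it assumes (purely for illustration) an automaton represented by dictionaries tau and gamma over an alphabet X of single symbols.

from itertools import product

def signature_tree(X, tau, gamma, q, d):
    """Map each input string z with |z| <= d to the output symbol
    gamma(tau(q, z)) of the state reached from q on z."""
    sig = {}
    for length in range(d + 1):
        for z in product(X, repeat=length):
            state = q
            for a in z:
                state = tau[(state, a)]
            sig[z] = gamma[state]
    return sig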
2.1 Labelings
If M is a finite automaton with output, then a labeling of M is a function ℓ mapping Q to a set L of labels, the label alphabet. We use M and ℓ to construct
a new automaton Mℓ by changing the output function to γℓ(q) = (γ(q), ℓ(q)). That is, the new output for a state is a pair incorporating the output symbol for the state and the label attached to the state. For the scenario of learning with labels, we assume that the learner has access to output queries for Mℓ for some labeling ℓ of the hidden automaton M. For the scenario of learning with unfolding and labels, we assume that the learner has access to output queries for M1ℓ for some labeling ℓ of some automaton M1 that is output-equivalent to M. In these two scenarios, the queries will be referred to as label queries. The goal of the learner in either scenario is to use label queries to find a finite automaton M′ output-equivalent to M. Thus, the learner must discover the output behavior of the hidden automaton, but not necessarily its topology or labeling. We assume the learner is given both X and |Q|.
3 Learning with Labels
First, we show a lower bound on the number of label queries required to learn a hidden automaton M with n states and an arbitrary labeling ℓ.

Proposition 1. Let L be a finite label alphabet. Learning a hidden automaton with n states and a labeling using symbols from L requires

Ω( |X| n log(n) / (1 + log(|L|)) )

label queries in the worst case.

Proof. Recall that we have assumed that |X| and |Y| are both at least 2; we consider |Y| = 2. Domaratzki, Kisman and Shallit [4] have shown that there are at least (|X| − o(1)) n 2^{n−1} n^{(|X|−1)n} distinct languages accepted by acceptors with n states. Because each label query returns one of at most 2·|L| values, an information theoretic argument gives the claimed lower bound on the number of label queries.

As a corollary, when |X| and |L| are constants, we have a lower bound of Ω(n log(n)) label queries.
3.1 Labels Carefully Chosen
In this section, we examine the case where the teacher is given a limit on the number of different labels he may use, and he is able to label the states after examining the automaton. Moreover, the learning algorithm may take advantage of knowing the labeling strategy of the teacher. In this setting the problem takes on an aspect of coding, and indicates the maximum extent to which labeling may facilitate efficient learning. We begin with a simple proposition.

Proposition 2. An automaton with n states, helpfully labeled using n different labels, can be learned using |X|n label queries.

Proof. The teacher assigns a unique integer label between 1 and n to each state. The learner asks a label query with the empty string to determine the output and label of the start state, and then explores the transitions from the start state by querying each a ∈ X. After querying an input string w, the label indicates whether this state has been visited before. If the state is new, the learner explores all the transitions from it by querying wa for each a ∈ X. Thus, after querying at most |X|n strings, the learner knows the structure and outputs of the entire automaton.
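This exploration is easy to express in code. A sketch, assuming a function label_query(w) that returns the (output, label) pair of the state reached on input string w (an interface of our own choosing):

from collections import deque

def learn_uniquely_labeled(X, label_query):
    """Breadth-first exploration using unique state labels
    (Proposition 2); at most |X|n + 1 label queries."""
    out0, lab0 = label_query("")
    outputs = {lab0: out0}          # label -> output symbol
    trans = {}                      # (label, input symbol) -> label
    frontier = deque([("", lab0)])
    while frontier:
        w, lab = frontier.popleft()
        for a in X:
            out, lab2 = label_query(w + a)
            trans[(lab, a)] = lab2
            if lab2 not in outputs:  # unique labels detect revisits
                outputs[lab2] = out
                frontier.append((w + a, lab2))
    return outputs, trans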
The lower bound shows that this is asymptotically optimal if the label set L has n elements. We next consider limiting the teacher to a constant number of different labels: a polynomial number of label queries suffices in this case.

Theorem 1. For each automaton with n states, there is a helpful labeling using 2^{|X|} different labels such that the automaton can be learned using O(|X|n^2) label queries.

Proof. Given an automaton M of n states, the teacher chooses an outward-directed spanning tree T rooted at q0 of the transition graph of the automaton, and labels the states of M to communicate T to the learner as follows. The label of state q is the subset of X corresponding to the edges of T from q to other nodes. The label of q directs the learner to q's children. Using at most n label queries and the structure of T, the learner can create a set S of n input strings such that for each state q of M, there is one string w ∈ S such that τ(q0, w) = q. In [1], Angluin gives an algorithm for learning a regular language using membership queries, given a live complete sample for the language. A live complete sample for a language L is a set of strings P that contains, for every state q (other than the dead state) of the minimal acceptor for L, a string leading from the start state to q. Given a live complete sample P, a learner can find the regular language using O(k|P|n) membership queries, where k is the size of the input alphabet. A straightforward generalization of this algorithm to automata with output shows that the set S and O(|X|n^2) output queries can be used to find an automaton output equivalent to M.

However, the number of queries, O(n^2), does not meet the Ω(n log n) lower bound, and the number of different labels is large. For a restricted class of automata, there is a helpful labeling with fewer labels that permits learning with an asymptotically optimal O(n log n) label queries. To appreciate the generality of Theorem 2, we note once more that every strongly connected automaton is 1-concentrating, and as we will see in Lemma 1, automata with a small input alphabet can be unfolded to have small in-degree.

Theorem 2. Let k and c be positive integers. Any automaton in the class of c-concentrating automata with in-degree at most k can be helpfully labeled with at most (3k|X| + c) labels so that it can be learned using O(|X|n log(n)) label queries.
Proof. We give the construction for 1-concentrating automata and indicate how to generalize it at the end of the proof. Given a 1-concentrating automaton M, the teacher chooses as the root a node reachable from all other nodes in the transition graph of M. The depth of a node is the length of the shortest path from that node to the root. The teacher then chooses a spanning tree T directed inward to the root by choosing a parent for each non-root node. (One way to do this is to let the parent of a node q be the first node reached along a shortest path from q to the root.) The teacher assigns, as part of the label for each node q, an element a ∈ X such that τ(q, a) is the parent of q.

The teacher now adds more information to the labels of the nodes, which we call color, using the colors yellow, red, green, and blue. The root is the unique node colored yellow. Let t = log n; t bits are enough to give a unique identifier to every node of the graph. Each node at depth a multiple of (t + 1) is colored red. For each red node v we choose a unique identifier of t bits (c1, c2, . . . , ct), encoded as green and blue labels. Now consider the maximal subtree rooted at v containing no red nodes. For each level i from 1 to the depth of the subtree, all the nodes at level i of the subtree are colored with ci (which is either blue or green). The teacher has (so far) used 3|X| + 1 labels – a direction and one of three colors per non-root node, and a unique identifier for the root.

Given this labeling, the learner can start from any state and reach a localization state whose identifier is known, as follows. The learner uses the parent component of the labels to go up the tree until it passes one red node and arrives at a second red node, or arrives at the root (whichever comes first), keeping track of the labels seen. If the learner reaches the root, it knows where it is. Otherwise, the learner interprets the labels seen between the first and second red node encountered as an identifier for the node v reached. This involves observing at most (2t + 2) labels. Thus, even if the in-degree is not bounded, a 1-concentrating automaton can be labeled so that with O(log(n)) label queries the learner can reach a uniquely identified localizing state.

If each node of the tree T also has in-degree bounded by k, another component of the label for each non-root node identifies which of the k possible predecessors of its parent it is (numbered arbitrarily from 1 to at most k). If the learner collects these values on the path from u to its localization node v, then we have an identifier for u with respect to v. Thus it takes O(log(n)) label queries to learn any node's identifier. If the node has not been encountered before, its |X| transitions must be explored, as in Proposition 2. This gives us a learning algorithm using O(|X|n log(n)) label queries. The labeling uses at most 3k|X| + 1 different labels.

If the automaton is c-concentrating for some c > 1, then the teacher selects a set of at most c nodes such that every node can reach at least one of them, constructs a forest of at most c inward-directed disjoint spanning trees, and proceeds as above. This increases the number of unique identifiers for the roots from 1 to c.

An open question is whether an arbitrary finite automaton with n states can be helpfully labeled with O(1) labels in such a way that it can be learned using O(|X|n log n) label queries.
3.2 Labels Randomly Chosen
In this section we turn from labels carefully chosen by the teacher to an independent uniform random choice of labels for the states from a label alphabet L. With nonzero probability the labeling may be completely uninformative, so results in this scenario incorporate a confidence parameter δ > 0 that is an input to the learner. The goal of the learner is to learn an automaton that is output equivalent to the hidden automaton M with probability at least (1 − δ), where this probability is taken over the labelings of M. Results on random labelings can be used in the careful labeling scenario: the teacher generates a number of random labelings until one is found that has the desired properties.

We first review the learning scenario considered by Freund et al. [5]. There is a finite automaton over the input alphabet X = {0, 1} and output alphabet {+, −}, where the transition function and start state of the automaton are arbitrary, but the output symbol for each state is chosen independently and uniformly from {+, −}. The learner moves from state to state in the target automaton according to a random walk (the next input symbol is chosen independently and uniformly from {0, 1}) and, after learning what the next input symbol will be, attempts to predict the output (+ or −) of the next state. After the prediction, the learner is told the correct output and the process repeats with the next input symbol in the random walk. If the learner's prediction was incorrect, this counts as a prediction mistake. In the first scenario they consider, the learner may reset the machine to the initial state by predicting ? instead of + or −; this counts as a default mistake. In this model, the learner is completely passive, dependent upon the random walk process to disclose useful information about the behavior of the underlying automaton. For this setting they prove the following.

Theorem 3 (Freund et al. [5]). There exists a learning algorithm that takes n and δ as input, runs in time polynomial in n and 1/δ, and with probability at least (1 − δ) makes no prediction mistakes and an expected O((n^5/δ^2) log(n/δ)) default mistakes.

The main idea is to use the d-signature tree of a state as the identifier for the state, where d ≥ 2 log(n^2/δ). For this setting, there are at least n^4/δ^2 strings in a signature tree of depth d. The following theorem of Trakhtenbrot and Barzdin' [8] establishes that signature trees of this depth are sufficient.

Theorem 4 (Trakhtenbrot and Barzdin' [8]). For any natural number d and for any finite automaton with n states and randomly chosen outputs from Y, the probability that for some pair of distinguishable states the shortest distinguishing string is of length greater than d is less than n^2 (1/|Y|)^{d/2}.

We may apply these ideas to prove the following.
Theorem 5. For any positive integer s, any finite automaton with n states, over the input alphabet X and output alphabet Y, with its states randomly labeled with labels from a label alphabet L with |L| = |X|^s, can be learned using

O( |X| · n^{1+4/s} / δ^{2/s} )

label queries, with probability at least (1 − δ) (with respect to the choice of labeling).

Proof. Assume that the learning algorithm is given n, a bound on the number of states of the hidden automaton, and the confidence parameter δ > 0. It calculates a bound d = d(n, δ) (described below) and proceeds as follows, starting with the empty input string. To explore the input string w, the learning algorithm calculates the d-signature tree (in the labeled automaton) of the state reached by w by making label queries on wz for all input strings z of length at most d. This requires O(|X|^d) queries. If this signature tree has not been encountered before, then the algorithm explores the transitions wa for all a ∈ X. Assuming that the labeling is "good", that is, that all distinguishable pairs of states have a distinguishing string in the labeled automaton of length at most d, this correctly learns the output behavior of the hidden automaton using O(|X|^{d+1} n) label queries.

To apply Theorem 4, we assume that the hidden automaton M is an arbitrary finite automaton with output with at most n states, input alphabet X and output alphabet Y. The labels randomly chosen from L then play the role of the random outputs in Theorem 4. There is a somewhat subtle issue: states distinguishable in M by their outputs may not be distinguishable in the labeled automaton by their labels alone. Fortunately, Freund et al. [5] have shown us how to address this point. In the first case, if two states of M are distinguishable by their outputs in M by a string of length at most d, then their d-signature trees (in the labeled automaton) will differ. Otherwise, if the shortest distinguishing string for the two states (using just outputs) is of length at least d + 1, then, generalizing the argument for Theorem 2 in [5] from two labels to an arbitrary label alphabet L, the probability that this pair of states is not distinguished by the random labeling by a string of length at most d is bounded above by (1/|L|)^{(d+1)/2}. Summing over all pairs of states gives the required bound. Thus, choosing

d ≥ (2 / log|L|) · log(2n^2/δ)

suffices to ensure that the labeling is "good" with probability at least (1 − δ). If we use more labels, the signature trees need not be so deep and the algorithm does not need to make as many queries to determine them. In particular, if |L| = |X|^s, then the bound of O(|X|^{d+1} n) on the number of label queries used by the algorithm becomes

O( |X| · n^{1+4/s} / δ^{2/s} ),

completing the proof.
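For concreteness, the depth bound chosen in the proof can be computed as follows; this is only a sketch of the arithmetic above (logarithms base 2, rounded up).

import math

def signature_depth(n, delta, label_alphabet_size):
    """Smallest integer d with d >= (2/log|L|) * log(2n^2/delta)."""
    return math.ceil(2 * math.log2(2 * n * n / delta)
                     / math.log2(label_alphabet_size))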
Corollary 1. Any finite automaton with n states can be learned using O(|X| n^{1+ε}) label queries with probability at least 1/2, when it is randomly labeled with |L| = f(|X|, ε) labels.

Proof. With δ = 1/2, a choice of |L| ≥ |X|^{4/ε} suffices.

We remark that this implies that there exists a careful labeling with O(|X|^4) labels that achieves learnability with O(|X|n^2) label queries, substantially improving on the size of the label set used in Theorem 1. An open question is whether a random labeling with O(1) labels enables efficient learning of an arbitrary n-state automaton with O(n log n) queries with high probability.
4 Unfolding Finite Automata
We now consider giving more power to the teacher. Because many automata have the same output behavior, we ask what happens if a teacher can change the underlying machine (without changing its output behavior) before placing labels on it. In Sections 3.1 and 3.2, the teacher had to label the machine given to him. Now we will examine what happens when a teacher can unfold an automaton before putting labels on it. That is, given M, the teacher chooses another automaton M′ with the same output behavior as M and labels the states of M′ for the learner.
4.1 Unfolding and Then Labeling
We first remark that unfolding an automaton M from n to O(n log n) states allows a careful labeling with just 2 labels to encode a description of the machine.

Proposition 3. Any finite automaton with n states can be unfolded to have N = O(|X|n log(n) + n log(|Y|)) states and carefully labeled with 2 labels, in such a way that it can be learned using N label queries.

Proof. The total number of automata with output having n states, input alphabet X and output alphabet Y is at most n^{|X|n+1} |Y|^n. Thus, N = O(|X|n log(n) + n log(|Y|)) bits suffice to specify any one of these machines. The teacher chooses a ∈ X and unfolds the target automaton M as follows. The strings a^i for i = 0, 1, . . . , N − 1 each send the learner to a newly created state, each of which acts (with respect to transitions on other input symbols and output behavior) just like its counterpart in the original machine. The remaining states are unchanged. The unfolded automaton is output equivalent to M. The teacher then specifies M by labeling these N new states with the bits of the specification of M. The learner simply asks a sequence of N queries on strings of the form a^i to receive the encoding of the hidden machine.
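The learner's side of this scheme is a single pass along the a-path. A sketch, with label_query returning an (output, label) pair as before and the labels being the bits of the encoding:

def read_encoding(label_query, a, N):
    """Collect the N specification bits planted on the states
    reached by a^0, a^1, ..., a^{N-1} (Proposition 3)."""
    return [label_query(a * i)[1] for i in range(N)]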
This method does not work if we restrict the unfolding to O(|X|n) states, but we show that this much unfolding is sufficient to reduce the in-degree of the automaton to O(|X|).

Lemma 1. Let M be an arbitrary automaton of n states. There is an automaton M′ with the same output behavior as M, with at most (|X| + 1)n states, whose in-degree is bounded by 2|X| + 1.

Proof. Given M, we repeat the following process until it terminates. While there is some state q with in-degree greater than 2|X| + 1, split q into two copies, dividing the incoming edges as evenly as possible between the two copies, and duplicating all |X| outgoing edges for the second copy of q. It is clear that each step of this process preserves the output behavior of M. To see that it terminates, for each node q let f(q) be the maximum of 0 and d_in(q) − (|X| + 1), where d_in(q) is the in-degree of q. Consider the potential function Φ that is the sum of f(q) over all nodes q in the transition graph. Φ is initially at most |X|n − (|X| + 1), and each step reduces it by at least 1 = (|X| + 1) − |X|. Thus, the process terminates after no more than |X|n steps, producing an output-equivalent automaton M′ with no more than (|X| + 1)n states and in-degree at most 2|X| + 1.

In particular, an automaton with a sink state of high in-degree will be unfolded by this process to have multiple copies of the sink state. Using this idea for degree reduction, the teacher may use linear unfolding and helpful labeling to enable a strongly connected automaton to be learned with O(n log n) label queries.

Corollary 2. For any strongly connected automaton M of n states, there is an unfolding M′ of M with at most (|X| + 1)n states and a careful labeling of M′ using O(|X|^2) labels that allows the behavior of M to be learned using O(|X|^2 n log n) label queries.

Proof. Given a strongly connected automaton M with n states, the teacher uses the method of Lemma 1 to produce an output equivalent machine M′ with at most (|X| + 1)n states and in-degree bounded by 2|X| + 1. This unfolding may not preserve the property of being strongly connected, but there is at least one state q that has at most (|X| + 1) copies in the unfolded machine M′. Because M is strongly connected, every state of M′ must be able to reach at least one of the copies of q, so M′ is (|X| + 1)-concentrating. Applying the method of Theorem 2, the teacher can use 3(2|X| + 1)|X| + (|X| + 1) labels to label M′ so that it can be learned with O(|X|^2 n log n) label queries.

We now consider uniform random labelings of the states when the teacher is allowed to choose the unfolding of the machine.

Theorem 6. Any automaton with n states can be unfolded to have O(n log(n/δ)) states and randomly labeled with 2 labels, such that with probability at least (1 − δ), it can be learned using O(|X|n(log(n/δ))^2) queries.
Proof. Given n and δ, let t = log(n^2/δ). The teacher chooses a ∈ X and unfolds the target machine M to construct the machine M′ as follows. M′ has nt states (q, i), where q is a state of M and 0 ≤ i ≤ (t − 1). The start state is (q0, 0), where q0 is the start state of M. The output symbol for (q, i) is γ(τ(q, a^i)), where γ is the output function and τ the transition function of M. For 0 ≤ i < (t − 1), the a transition from (q, i) is to (q, i + 1). The a transition from (q, t − 1) is to (q′, 0), where q′ = τ(q, a^t). For all other input symbols b with b ≠ a, the b transition from (q, i) is to (q′, 0), where q′ = τ(q, a^i b).

To see that M′ is an unfolding of M, that is, M′ is output equivalent to M, we show that each state (q, i) of M′ is output equivalent to the state τ(q, a^i) of M. By construction, these two states have the same output. If i < (t − 1) then the a transition from (q, i) is to (q, i + 1), which has the same output symbol as τ(q, a^{i+1}). The a transition from (q, t − 1) is to (q′, 0), where q′ = τ(q, a^t), which has the same output symbol as τ(τ(q, a^{t−1}), a). If b ≠ a is an input symbol, then the b transition from (q, i) is to (q′, 0), where q′ = τ(q, a^i b), which has the same output symbol as τ(τ(q, a^i), b).

Suppose M′ is randomly labeled with two labels. For each state q of M, define its label identifier in M′ to be the sequence of labels of (q, i) for i = 0, 1, . . . , (t − 1). For two distinct states q1 and q2 of M, the probability that their label identifiers in M′ are equal is (1/2)^t, which is at most δ/n^2. Thus, the probability that there exist two distinct states q1 and q2 with the same label identifier in M′ is at most δ.

Given n and δ, the learning algorithm takes advantage of the known unfolding strategy to construct states (j, i) for 0 ≤ j ≤ n − 1 and 0 ≤ i ≤ (t − 1), with a transitions from (j, i) to (j, i + 1) for i < (t − 1). It starts with the empty input string and uses the following exploration strategy. Given an input string w that is known to arrive at some (q, 0) in M′, the learning algorithm makes label queries on wa^i for i = 0, 1, . . . , (t − 1) to determine the label identifier of q in M′. If this label identifier has not been seen before, the learner uses the next unused (j, 0) to represent q and records the outputs and labels for the states (j, i) for i = 0, 1, . . . , (t − 1). It must also explore all unknown transitions from the states (j, i). If distinct states of M receive distinct label identifiers in M′, the learner learns a finite automaton output equivalent to M using O(|X|nt^2) label queries.
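The unfolding used in this proof is mechanical enough to state as code. The following sketch constructs M′ from dictionary representations of tau and gamma (an encoding of our own choosing):

def walk(tau, q, a, k):
    """State reached from q on the string a^k."""
    for _ in range(k):
        q = tau[(q, a)]
    return q

def unfold(states, X, tau, gamma, a, t):
    """States (q, i), 0 <= i <= t-1, where (q, i) simulates
    tau(q, a^i), as in the proof of Theorem 6."""
    gamma2, tau2 = {}, {}
    for q in states:
        for i in range(t):
            qi = walk(tau, q, a, i)
            gamma2[(q, i)] = gamma[qi]
            for b in X:
                if b == a and i < t - 1:
                    tau2[((q, i), b)] = (q, i + 1)
                elif b == a:                      # i == t - 1
                    tau2[((q, i), b)] = (walk(tau, q, a, t), 0)
                else:                             # b != a
                    tau2[((q, i), b)] = (tau[(qi, b)], 0)
    return gamma2, tau2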
5 Automata with Random Structure
We may also ask whether randomly labeled finite automata are hard to learn “on average”. We consider automata with randomly chosen transition functions and random labels. The model of random structure that we consider is as follows. Let the states be qi for i = 0, 1, . . . , (n − 1), where q0 is the start state. For each state qi and input symbol a ∈ X, choose j uniformly at random from 0, 1, . . . , (n − 1) and let τ (qi , a) = qj .
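Sampling such a random automaton is straightforward; a sketch:

import random

def random_labeled_automaton(n, X, L):
    """Uniformly random transition function and labeling, as in
    the model above; state 0 plays the role of the start state q0."""
    tau = {(i, a): random.randrange(n) for i in range(n) for a in X}
    labels = [random.choice(list(L)) for _ in range(n)]
    return tau, labels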
Theorem 7. A finite automaton with n states, a random transition function and a random labeling can be learned using O(n log(n)) label queries, with high probability. The probability is over the choice of transition function and labeling.

Proof. This was first proved by Korshunov in [6]; here we give a simpler proof. Korshunov showed that the signature trees only need to be of depth asymptotically equal to log_{|X|}(log_{|L|}(n)) for the nodes to have unique signatures with high probability. We use a method similar to signature trees, but simpler to analyze. Instead of comparing signature trees for two states to tell whether or not they are distinct, we compare the labels along at most four sets of transitions, which we call signature paths – like a signature tree consisting only of four paths. Lemmas 2 and 3 show that, given X and n, there are at most four signature paths, each of length 3 log(n), such that for a random finite automaton of n states with input alphabet X and for any pair s1 and s2 of different states, the probability is O(log^6(n)/n^3) that s1 and s2 are distinguishable but not distinguished by any of the strings in the four signature paths. By the union bound, the probability that there exist two distinguishable states that are not distinguished by at least one of the strings in the four signature paths is at most

O( (n choose 2) · log^6(n)/n^3 ) = o(1).

Hence, by running at most four signature paths, each of length 3 log(n), per newly reached state, we obtain unique identifiers for the states. Then for each of the n states, we can find their |X| transitions, and learn the machine, as in Proposition 2.

We now turn to the two lemmas used in the proof of Theorem 7. We first consider the case |X| > 2. If a, b, c ∈ X and ℓ is a nonnegative integer, let D_ℓ(a, b, c) denote the set of all strings a^i, b^i, and c^i such that 0 ≤ i ≤ ℓ.

Lemma 2. Let s1 and s2 be two different states in a random automaton with |X| > 2. Let a, b, c ∈ X and ℓ = 3 log(n). The probability that s1 and s2 are distinguishable, but not by any string in D_ℓ(a, b, c), is O(log^6(n)/n^3).
Proof. We analyze the three (attempted) paths from the two states s1 and s2, which we will call π^1_{s1}, π^2_{s1}, π^3_{s1} and π^1_{s2}, π^2_{s2}, π^3_{s2}, respectively. Each path will have length 3 log(n). We regard each of the π's as the set of nodes reached by its respective set of transitions. We first look at the probability that the following event does not happen: that both |π^1_{s1}| > 3 log(n) and |π^1_{s2}| > 3 log(n), and that π^1_{s1} ∩ π^1_{s2} = ∅; that is, the probability that both of these strings succeed in reaching 3 log(n) different states, and that they share no states in common. We call the event that two sets of states π1 and π2 have no states in common, and both have size at least l, S(π1, π2, l) (success), and the failure event F(π1, π2, l) = ¬S(π1, π2, l). So,

P(F(π^1_{s1}, π^1_{s2}, 3 log(n))) ≤ Σ_{i=1}^{3 log(n)} (i + |π^1_{s1}|)/n + Σ_{i=1}^{3 log(n)} (i + |π^1_{s2}|)/n
  ≤ 2 Σ_{i=1}^{3 log(n)} (i + 3 log(n))/n = O(log^2(n)/n).

Now we look at the probability of F(π^2_{s1}, π^2_{s2}, 3 log(n)) given that we failed on the first pair of paths, that is, given F(π^1_{s1}, π^1_{s2}, 3 log(n)). With l = 3 log(n),

P(F(π^2_{s1}, π^2_{s2}, l) | F(π^1_{s1}, π^1_{s2}, l))
  ≤ Σ_{i=1}^{3 log(n)} (i + |π^2_{s1}| + |π^1_{s1}| + |π^1_{s2}|)/n + Σ_{i=1}^{3 log(n)} (i + |π^2_{s2}| + |π^1_{s1}| + |π^1_{s2}|)/n
  ≤ 2 Σ_{i=1}^{3 log(n)} (i + 9 log(n))/n = O(log^2(n)/n).

Now, we compute the probability of F(π^3_{s1}, π^3_{s2}, 3 log(n)) given failures on the previous two pairs of paths. With l = 3 log(n),

P(F(π^3_{s1}, π^3_{s2}, l) | F(π^1_{s1}, π^1_{s2}, l), F(π^2_{s1}, π^2_{s2}, l)) ≤ 2 Σ_{i=1}^{3 log(n)} (i + 15 log(n))/n = O(log^2(n)/n).

Last, we compute the probability that none of these pairs of paths made it to length l = 3 log(n), that is, P(failure) = P(F(π^1_{s1}, π^1_{s2}, l), F(π^2_{s1}, π^2_{s2}, l), F(π^3_{s1}, π^3_{s2}, l)):

P(failure) = P(F(π^1_{s1}, π^1_{s2}, l)) · P(F(π^2_{s1}, π^2_{s2}, l) | F(π^1_{s1}, π^1_{s2}, l)) · P(F(π^3_{s1}, π^3_{s2}, l) | F(π^1_{s1}, π^1_{s2}, l), F(π^2_{s1}, π^2_{s2}, l))
  = O(log^2(n)/n) · O(log^2(n)/n) · O(log^2(n)/n) = O(log^6(n)/n^3).

Thus, given two distinct states with corresponding nonoverlapping signature paths of length 3 log(n), the probability that all of the randomly chosen labels along the paths will be the same is 2^{−3 lg(n)} = 1/n^3 = O(log^6(n)/n^3), which is the probability that no string in D_ℓ(a, b, c) distinguishes s1 from s2.
When |X| = 2, we do not have enough alphabet symbols to construct three completely independent paths as in the proof of Lemma 2, but four paths suffice. If a, b ∈ X and ℓ is a nonnegative integer, let D_ℓ(a, b) denote the set of all strings a^i, b^i, ab^i and ba^i such that 0 ≤ i ≤ ℓ.

Lemma 3. Let s1 and s2 be two different states in a random automaton with |X| = 2. Let a, b ∈ X and ℓ = 3 log(n). The probability that s1 and s2 are distinguishable, but not by any string in D_ℓ(a, b), is O(log^6(n)/n^3).
The proof of Lemma 3 is a case analysis using reasoning similar to that of Lemma 2; we include an outline. If s1 and s2 are assigned different labels, then they are distinguished by the empty string, so assume that they are assigned the same label. If we consider τ(s1, a) and τ(s2, a), there are four cases, as follows.

(1) We have τ(s1, a) ≠ τ(s2, a) and neither one is s1 or s2. In this case, an argument analogous to that in Lemma 2 shows that the probability that the paths a^i, ab^i and b^i fail to produce a distinguishing string for s1 and s2 is bounded by O(log^6(n)/n^3).

(2) Exactly one of τ(s1, a) and τ(s2, a) is in the set {s1, s2}. This happens with probability O(1/n), and in this case we can show that the probability that the paths a^i and b^i do not produce a distinguishing string for s1 and s2 is bounded by O(log^4(n)/n^2), for a total failure probability of O(log^4(n)/n^3) for this case.

(3) Both of τ(s1, a) and τ(s2, a) are in the set {s1, s2}. This happens with probability O(1/n^2), and in this case we can show that the probability that the path b^i does not produce a distinguishing string for s1 and s2 is bounded by O(log^2(n)/n), for a total failure probability of O(log^2(n)/n^3) for this case.

(4) Neither of τ(s1, a) and τ(s2, a) is in the set {s1, s2}, but τ(s1, a) = τ(s2, a). This happens with probability O(1/n), and we proceed to analyze four parallel subcases for τ(s1, b) and τ(s2, b).

(4a) We have τ(s1, b) ≠ τ(s2, b) and neither of them is in the set {s1, s2}. We can show that the probability that the paths b^i and ba^i do not produce a distinguishing string for s1 and s2 is bounded by O(log^4(n)/n^2), for a failure probability of O(log^4(n)/n^3) in this subcase, because the probability of case (4) is O(1/n).

(4b) Exactly one of τ(s1, b) and τ(s2, b) is in the set {s1, s2}. In this subcase, we can show that the probability that the path b^i fails to produce a distinguishing string for s1 and s2 is bounded by O(log^2(n)/n), for a total failure probability in this subcase of O(log^2(n)/n^3), because the probability of case (4) is O(1/n) and the probability that one of τ(s1, b) and τ(s2, b) is in {s1, s2} is O(1/n).

(4c) Both of τ(s1, b) and τ(s2, b) are in {s1, s2}. The probability of this happening is O(1/n^2), for a total probability of this subcase of O(1/n^3), because the probability of case (4) is O(1/n).

(4d) We have τ(s1, b) = τ(s2, b). Then, because we are in case (4), τ(s1, a) = τ(s2, a) and the labels assigned to s1 and s2 are equal, so the states s1 and s2 are equivalent and therefore indistinguishable.
Acknowledgments. We would like to thank the anonymous referees for helpful comments.
References

1. Angluin, D.: A note on the number of queries needed to identify regular languages. Information and Control 51(1), 76–87 (1981)
2. Angluin, D.: Queries and concept learning. Machine Learning 2(4), 319–342 (1987)
3. Becerra-Bonache, L., Dediu, A.H., Tîrnăucă, C.: Learning DFA from correction and equivalence queries. In: Sakakibara, Y., Kobayashi, S., Sato, K., Nishino, T., Tomita, E. (eds.) ICGI 2006. LNCS (LNAI), vol. 4201, pp. 281–292. Springer, Heidelberg (2006)
4. Domaratzki, M., Kisman, D., Shallit, J.: On the number of distinct languages accepted by finite automata with n states. Journal of Automata, Languages and Combinatorics 7(4) (2002)
5. Freund, Y., Kearns, M.J., Ron, D., Rubinfeld, R., Schapire, R.E., Sellie, L.: Efficient learning of typical finite automata from random walks. Information and Computation 138(1), 23–48 (1997)
6. Korshunov, A.: The degree of distinguishability of automata. Diskret. Analiz. 10(36), 39–59 (1967)
7. Lee, D., Yannakakis, M.: Testing finite-state machines: State identification and verification. IEEE Trans. Computers 43(3), 306–320 (1994)
8. Trakhtenbrot, B.A., Barzdin', Y.M.: Finite Automata: Behavior and Synthesis. North-Holland, Amsterdam (1973)
Characterizing Statistical Query Learning: Simplified Notions and Proofs

Balázs Szörényi¹,²

¹ Fakultät für Mathematik, Ruhr-Universität Bochum, D-44780 Bochum, Germany
² Hungarian Academy of Sciences and University of Szeged, Research Group on Artificial Intelligence, H-6720 Szeged
[email protected]
Abstract. The Statistical Query model was introduced in [6] to handle noise in the well-known PAC model. In this model the learner gains information about the target concept by asking for various statistics about it. Characterizing the number of queries required for learning a given concept class under a fixed distribution was already considered in [3] for weak learning; then in [8] strong learnability was also characterized. However, the proofs for these results in [3,10,8] (and for strong learnability even the characterization itself) are rather complex; our main goal is to present a simple approach that works for both problems. Additionally, we strengthen the result on strong learnability by showing that a class is learnable with polynomially many queries iff all consistent algorithms use polynomially many queries, and by showing that proper and improper learning are basically equivalent. As an example, we apply our results to conjunctions under the uniform distribution.
1 Introduction
The Statistical Query model (called the SQ model for short) was introduced by Kearns [6] as an approach to handle noise in the well-known PAC model. The general idea is that—instead of using random examples as in the PAC model—the learner gains information about the unknown function by asking various statistics (called queries) over the distribution of labeled examples. As shown by Kearns [6], any learning algorithm in the SQ model can be transformed into a PAC algorithm without much loss in efficiency. It is even more interesting that the resulting algorithm is robust to noise. He has also shown that many efficient PAC algorithms can be converted to efficient SQ algorithms. Despite the power of the model that is apparent from the above results, it is still weaker than the PAC model. Indeed, already in [6] it was shown that the class of parities, which is PAC-learnable, cannot be efficiently learned in the SQ model under the uniform distribution. The proof used an information theoretic
This work was supported in part by the Deutsche Forschungsgemeinschaft Grant SI 498/8-1, and the NKTH grant of the National Technology Programme 2008 (project codename AALAMSRK NTP OM-00192/2008) of the Hungarian government.
argument, which was generalized later by Blum et al. in [3] to characterize weak learnability of a concept class (where the goal is to do slightly better than random guessing) in the SQ model for the distribution dependent case (i.e., when the underlying distribution is fixed in advance and is known by the learner). The characterization is based on the so-called SQ dimension of the class, which is, roughly, the maximal size of an "almost orthogonal" system in the class. However, the proof in [3] is rather long and complex. Subsequently Yang gave an alternative, elegant proof of basically the same result [10]. In this paper we present yet another, but much shorter proof, thereby significantly simplifying on both existing proofs.

Strong learnability (i.e., when the goal is to approximate the target concept with arbitrary accuracy) of a concept class in the distribution dependent case was first characterized by Köbler and Lindner [7] in terms of a general framework for protocols, called the general dimension. Independently, Simon in [8] gave another characterization of strong learnability that was based on the SQ dimension (more precisely, on the SQ dimension of the class after some translation), and is more of an algebraic flavor. However, both the characterization and the proof in [8] are rather complex; as we shall show in this paper, our simple approach that is successful in characterizing weak learnability can also be applied to strong learnability, thereby giving an alternative, simple characterization of this problem as well, which might also have the potential to be easier to apply and calculate for concrete concept classes. Recently Feldman has also obtained a simple characterization of strong SQ learnability of a similar flavor [5]; however, the two papers focus on different perspectives: Feldman is interested in applications to agnostic learning and evolvability, whereas our main interest is to find a really simple proof and a unified view of weak and strong learnability.

Additionally, our approach also reveals that in the distribution dependent case query-efficient learnability is possible if and only if all consistent learning algorithms learn the given concept class query-efficiently.¹ As far as we know, this was not known before. We also show that in the distribution dependent case proper learning (i.e., when the queries of the learner are restricted to use functions from the given concept class) is as strong as improper learning, but we would like to point out that this can be easily deduced already from the characterization result of Simon (see Observation 5). Finally, we show that in the distribution independent case (i.e., when the learner doesn't know anything about the underlying distribution) proper and improper learning can differ significantly, and we contrast this with the above mentioned result on their equivalence in the distribution dependent case.

Equivalent models. Ben-David et al. have introduced an equivalent model, called learning by distances [2], and have also given upper and lower bounds on the minimal number of queries required for learning. However, their upper bound
¹ Query-efficiency means that the number of queries used by the learner is bounded by some polynomial of the various parameters. When query-efficiency is in focus, usually no restrictions are set on the running time.
is exponential in their lower bound (see also our discussion on the topic in Sect. 7), and the paper does not reveal the relation of the model to noise-tolerant PAC learning (which gave the importance of the SQ model). In [11] Yang has introduced the honest SQ model, using stronger queries and less adversarial settings than the ones used in the SQ model. In [9] it is shown how to apply the results and methods of this paper to prove a somewhat surprising result: the equivalence of the honest and the "pure" SQ model.

Organization of the paper. Section 2 contains the formal introduction of the SQ model and also some basic definitions. In Sect. 3 we present our alternative proof for characterizing weak learnability with the SQ dimension, in Sect. 4 we discuss the relation of strong and weak learnability, and then in Sect. 5 we characterize strong learnability. In Sect. 6 we analyze the relation of our strong SQ dimension to the ones of Simon and Feldman. In Sect. 7, as an example, we compute our dimension notion for conjunctions under the uniform distribution. Finally, in Sect. 8 we contrast the result on the equivalence of proper and improper learning in the distribution dependent case with the fact that they occasionally differ significantly in the distribution independent case.
2 Preliminaries
A concept is a mapping from some domain to {−1, 1}. A concept class is a set of concepts with the same domain. A Boolean concept over n variables is a concept of the form {−1, 1}^n → {−1, 1}. A family of concept classes is an infinite set {F_n}_{n=1}^∞ such that each F_n is a concept class. The class of all concepts over some domain X is denoted C(X). The correlation of two functions f, g : X → R under some distribution D over X is defined as ⟨f, g⟩_D = E[f(ξ)g(ξ)], where ξ is a random variable with distribution D. The norm of f under D is ‖f‖_D := √⟨f, f⟩_D. f is said to be a γ-approximation of g if ⟨f, g⟩_D ≥ γ.

In the Statistical Query model a learner (or learning algorithm) L can make queries of the form (h, τ), where τ is a positive constant called the tolerance, and h is chosen from some concept class H called the query class. Each such query is answered with some c satisfying |c − ⟨f*, h⟩_D| ≤ τ, where f* is some fixed concept, called the target concept, that is unknown to the learner, and where D is some distribution over the input domain of f*. (Here the learner is supposed to be familiar with D.) The learner succeeds when he finds some function f ∈ H having correlation at least γ with f* for some constant γ > 0 fixed ahead of the learning process. Parameter γ is called the accuracy. Let q^{D,L}_{F,H}(τ, γ) denote the smallest integer q such that L always succeeds in the above setting using at most q queries when the target concept belongs to F. Finally, SLC^D_{F,H}(τ, γ) (the statistical learning complexity) is defined to be the minimum value of q^{D,L}_{F,H}(τ, γ) over all possible learning algorithms L. We would like to emphasize that in this paper we are interested only in the number of queries used during the learning process (i.e., the information complexity of learning), and do not consider the running time.
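To fix ideas, here is a minimal sketch of the correlation and of one valid oracle answer; the dictionary representation of D and the deliberate perturbation are our own illustration (a real oracle may answer adversarially anywhere within the tolerance).

def correlation(f, g, D):
    """<f, g>_D = E[f(x) g(x)] for x drawn from D, where D maps
    each domain point to its probability."""
    return sum(p * f(x) * g(x) for x, p in D.items())

def sq_oracle_answer(target, h, tau, D):
    """Any c with |c - <target, h>_D| <= tau is a legal answer;
    we return one such perturbed value."""
    return correlation(target, h, D) + tau / 2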
Note that originally in [6] the SQ model allowed much more general queries, but in [4] Bshouty and Feldman have shown that the two models are equivalent.² We also consider the following variants of the above described learning model. The learning is called proper when F = H, and improper when F ⊊ H. Also, in general, a query (h, τ) is proper if h ∈ F, otherwise it is improper. The learner is a consistent learner if |⟨h_i, h_j⟩_D − c_i| ≤ τ_i for i < j, where (h_i, τ_i) is the i-th query of the learner and c_i is the answer for it. Finally, note that in the above definition the learner is supposed to be familiar with the underlying distribution, but the model can also be defined for the case when this is not true. We are mainly interested in the former case (except for Sect. 8), but when we want to explicitly refer to one case or the other, we shall call the former the distribution dependent and the latter the distribution independent case. For simplicity, when it causes no confusion, we omit D from notations like SLC^D_{F,H}(τ, γ) and ⟨f, g⟩_D, and simply use SLC_{F,H}(τ, γ) and ⟨f, g⟩ instead.

Definition 1. We say that a family {F_n}_{n=1}^∞ of concept classes is weakly learnable in the SQ model with a family {H_n}_{n=1}^∞ of query classes if there exist some γ(n) > 0 and τ(n) > 0 such that 1/γ(n), 1/τ(n) and SLC_{F_n,H_n}(τ(n), γ(n)) are polynomially bounded in n. {F_n}_{n=1}^∞ is strongly learnable in the SQ model with queries from {H_n}_{n=1}^∞ if there exists some τ(n, ε) > 0 such that 1/τ(n, ε) and SLC_{F_n,H_n}(τ(n, ε), 1 − ε) are polynomially bounded in n and 1/ε.

The following observation, which we shall apply several times later, is basically the reason for the equivalence of proper and improper learning in the distribution dependent model.

Observation 1. Let f, g and h be arbitrary concepts. If ⟨f, h⟩ ≥ 1 − ε and ⟨g, h⟩ ≥ 1 − ε, then ⟨f, g⟩ = (1/2)⟨f + g, f + g⟩ − 1 ≥ ⟨f + g, h⟩ − 1 ≥ 1 − 2ε.

Although this paper mainly considers concepts and concept classes, we would like to point out that all the results remain valid for classes of functions with norm bounded by 1 (which might be tempting to use, for example, in query classes)—albeit in some cases, when the proof applies Observation 1, the constants get slightly worse.³ The reason for this is the following proposition, which is the generalization of Observation 1 to these functions.
3
Actually they have shown how to simulate an arbitrary statistical query using two statistical queries that are independent of the target function and two correlation queries. However, when running time is not considered and the underlying distribution is known, one can omit the two former queries and just compute them directly. The choice of 1 as an upper bound for the query function is arbitrary, one can use any other constant instead. (But note that smaller constants would exclude all concepts.) However, unbounded queries should not be allowed, because they make the learning problem trivial. Indeed, for example when the target concept is Boolean over n variables, and one uses tolerance 1/2 and with the function that na query with n i(xi +1)/2 evaluates x ∈ {−1, 1}n to , then the value of the target i=1 (1/) · 2 · 2 concept on inputs with probability at least /2n can be reconstructed from the answer to this query, meanwhile the sum of the probabilities of the rest of the inputs is less than .
Proposition 1. When $f, g, h : \{-1,1\}^n \to \mathbb{R}$ have norm at most 1, and $\langle f,h\rangle \ge 1-\epsilon$ and $\langle g,h\rangle \ge 1-\epsilon$, then $\langle f,g\rangle \ge 1-6\epsilon$.

Proof. First of all, by Cauchy-Schwarz, $\|f\| \ge \langle f,h\rangle \ge 1-\epsilon$, and similarly $\|g\| \ge 1-\epsilon$. Using this,
$$2(1-2\epsilon) - 2\langle f,g\rangle \le \|f\|^2 + \|g\|^2 - 2\langle f,g\rangle = \|f-g\|^2 \le \big(\|f-h\| + \|g-h\|\big)^2 \le 2\|f-h\|^2 + 2\|g-h\|^2 \le 2\big(2 - 2\langle f,h\rangle\big) + 2\big(2 - 2\langle g,h\rangle\big) \le 8\epsilon,$$
implying $1-6\epsilon \le \langle f,g\rangle$.
Finally, for an integer $d$, let $[d]$ denote the set $\{1,\dots,d\}$.
3  Characterizing Weak Learnability
According to the definition, weak learnability is possible if and only if there exists some polynomial $p(n)$ such that $\mathrm{SLC}_{\mathcal{F}_n,\mathcal{H}_n}(1/p(n),\, 3/p(n)) \le p(n)$ (simply define $p(n)$ to be a polynomial that upper bounds $1/\tau(n)$, $3/\gamma(n)$ and $\mathrm{SLC}_{\mathcal{F}_n,\mathcal{H}_n}(\tau(n),\gamma(n))$). This way the task of weak learning is basically to find functions $h_{n,1},\dots,h_{n,p(n)} \in \mathcal{H}_n$ such that every $f \in \mathcal{F}_n$ has correlation at least $3/p(n)$ with at least one of $h_{n,1},\dots,h_{n,p(n)}$. Thus $p(n)$ (and this way SLC itself) can be considered as a kind of covering number. Bshouty and Feldman in [4] make this property explicit in their characterization of weak learnability. On the other hand, the notion of SQ dimension introduced by Blum et al. [3] is rather a packing number in nature:

Definition 2. The SQ dimension (or weak SQ dimension) of a class of real valued functions $\mathcal{F}$ over some domain $X$ and under distribution $D$ over $X$, denoted $\mathrm{SQDim}^D_{\mathcal{F}}$, is the biggest integer $d$ such that $\mathcal{F}$ contains some distinct functions $f_1,\dots,f_d$ with pairwise correlations between $-1/d$ and $1/d$.

(Note that SQDim is defined not only for concept classes but also for more general classes; Definition 4 will really make use of this generality.) For simplicity, as mentioned, we use $\mathrm{SQDim}_{\mathcal{F}}$ instead of $\mathrm{SQDim}^D_{\mathcal{F}}$ when this leads to no confusion. The nice feature of the characterization result in [3] is that it binds the two different types of notions. One direction, namely that $\mathrm{SQDim}_{\mathcal{F}}$ queries are enough for weakly learning concept class $\mathcal{F}$ (properly!), is easy: if $\{f_1,\dots,f_d\}$ is a maximal subset of $\mathcal{F}$ fulfilling $|\langle f_i,f_j\rangle| \le 1/d$ for $1 \le i < j \le d$, then (due to the maximality) at least one of them has correlation at least $1/d$ with the target concept, thus the learner simply needs to query $f_1,\dots,f_d$ with tolerance $1/(3d)$ in order to find a $1/(3d)$-approximation of it. However
the proof in [3] for the other direction was rather long and complex. Subsequently Yang in [10] gave another, elegant proof for this direction, based on the eigenvalues of the correlation matrix of the concept class. (The correlation matrix of the concept class $\mathcal{F} = \{f_1,\dots,f_s\}$ is the $s \times s$ matrix $C$ such that $C_{i,j} = \langle f_i,f_j\rangle$.) Here we show that basically the same result can be derived using a very simple argument, thus significantly simplifying both of the above mentioned proofs. The proof in some sense follows the same line of thought they use, but avoids the machinery applied in them.

Theorem 2. Let $\mathcal{F}$ be a concept class and let $d := \mathrm{SQDim}_{\mathcal{F}}$. Then any learning algorithm that uses tolerance parameter lower bounded by $\tau > 0$ requires in the worst case at least $(d\tau^2 - 1)/2$ queries for learning $\mathcal{F}$ with accuracy at least $\tau$. In particular, when $\tau = 1/\sqrt[3]{d}$, this means $(\sqrt[3]{d} - 1)/2$ queries.

Proof. Assume that $f_1,\dots,f_d \in \mathcal{F}$ fulfill $|\langle f_i,f_j\rangle| \le 1/d$ for distinct $i,j \in [d]$. We show an (adversary) answering strategy that ensures that only a small number of these functions are eliminated by each query. Let $h$ be an arbitrary query function used by the learner (having thus norm at most 1) and let $A := \{i \in [d] : \langle f_i,h\rangle \ge \tau\}$. Then, by the Cauchy-Schwarz inequality,
$$\Big\langle h, \sum_{i\in A} f_i\Big\rangle^2 \le \Big\|\sum_{i\in A} f_i\Big\|^2 = \sum_{i,j\in A}\langle f_i,f_j\rangle \le |A|\Big(1 + \frac{|A|-1}{d}\Big) \le |A| + \frac{|A|^2}{d},$$
meanwhile, by the choice of $A$, it holds that $\langle h, \sum_{i\in A} f_i\rangle \ge |A|\tau$, and the two together imply that $1/|A| + 1/d \ge \tau^2$, or equivalently, that $|A| \le d/(d\tau^2 - 1)$. A similar argument shows that at most $d/(d\tau^2 - 1)$ of the $f_i$ functions have correlation at most $-\tau$ with $h$. Thus at most $2d/(d\tau^2 - 1)$ of the functions will be inconsistent with the answer if the adversary returns 0 to this query. This, in turn, implies the desired lower bound $(d\tau^2 - 1)/2$ on the learning complexity.
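The adversary in this proof is easy to simulate. The following is a small Python sketch of my own (not from the paper): concepts are explicit $\pm 1$ vectors over a toy domain with an explicit distribution, every query is answered with 0, and only candidates whose correlation with the query exceeds the tolerance in absolute value are eliminated.

```python
import itertools, math

def correlation(f, g, probs):
    # <f,g>_D = E[f(x) g(x)] for +/-1 vectors indexed by domain points
    return sum(p * fi * gi for p, fi, gi in zip(probs, f, g))

def adversary_answer(candidates, h, tau, probs):
    # Answer the query h with 0; a candidate f survives iff |<f,h>| <= tau,
    # i.e. iff 0 is still a legal tau-tolerant answer when f is the target.
    survivors = [f for f in candidates if abs(correlation(f, h, probs)) <= tau]
    return 0, survivors

# demo: pairwise orthogonal concepts (parities of pairs of bits) on {1,-1}^4
n = 4
domain = list(itertools.product([1, -1], repeat=n))
probs = [1.0 / len(domain)] * len(domain)
parity = lambda s: [math.prod(x[i] for i in s) for x in domain]
candidates = [parity(s) for s in itertools.combinations(range(n), 2)]
answer, candidates = adversary_answer(candidates, candidates[0], 0.3, probs)
print(answer, len(candidates))   # prints: 0 5  (only the queried parity dies)
```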
It is also worth mentioning that this result is quite tight in the improper case, when the learner can use arbitrary functions of norm 1 in the queries. Indeed, if the concept class itself is $\{f_1,\dots,f_d\}$, then defining $g_i := \sum_{j=i\cdot d^{2/3}+1}^{(i+1)\cdot d^{2/3}} f_j$ for $i = 0,1,\dots,d^{1/3}-1$ (assuming for simplicity that $\sqrt[3]{d}$ is an integer), at least one of the functions $h_i = g_i/\|g_i\|$, $i = 0,1,\dots,d^{1/3}-1$, will have correlation at least
$$\frac{1 - d^{2/3}\,(1/d)}{\sqrt{d^{2/3} + d^{2/3}\, d^{2/3}\,(1/d)}} \tag{1}$$
with the target function. Note that (1) is asymptotically equal to $1/\sqrt[3]{d}$.
4  Weak and Strong Learning
Aslam and Decatur [1] apply the boosting techniques from the PAC model to SQ learning and show how to use (efficiently) a weak learning algorithm to achieve
strong learnability. Their primary concern is the distribution independent case, but their result (combined with results for weak learning) also has the following consequence in the distribution dependent case:
$$\max_D \mathrm{SLC}^D_{\mathcal{F},\mathcal{H}}\Big(\frac{1}{3d},\; 1-\epsilon\Big) = O\Big(d^5 \log^2 \frac{1}{\epsilon}\Big)$$
when $\mathcal{H} \supseteq \mathcal{F}$, and where $d = \max_D \mathrm{SQDim}^D_{\mathcal{F}}$. However, this does not imply any result on fixed distributions in general. Indeed, when the support of a distribution consists of only a single input, then one query is enough both in the weak and in the strong setting, for any concept class. Thus the gap between the upper bound in the above equation and the number of queries required for strong learning under some given (known) distribution can be as big as possible: exponential versus constant.

What is more, we cannot expect to bound the strong SQ dimension under some distribution $D$ using the weak SQ dimension under the same distribution. Indeed, consider for example the uniform distribution and the concept class $\mathcal{F}_n$ consisting of all the functions of the form $v_1 \vee f$, where $f$ is any parity function over the variables $v_2,\dots,v_n$. Then $|\mathcal{F}_n| = 2^{n-1}$, and any two distinct elements $(v_1 \vee f), (v_1 \vee f') \in \mathcal{F}_n$ have correlation $1/2$:
$$\langle v_1 \vee f,\; v_1 \vee f'\rangle = 2\,\mathrm{P}\big[(v_1 \vee f) = (v_1 \vee f')\big] - 1 = 2\Big(\frac{1}{2} + \frac{1}{2}\,\mathrm{P}[f = f']\Big) - 1 = \frac{1}{2}$$
(as the parity functions are uncorrelated under the uniform distribution), and so by Theorem 4 strong learning of $\mathcal{F}_n$ requires a superpolynomial number of queries, meanwhile weak learning requires none. (Yang [11] has also shown a similar result for another concept class, but the argument there is more complicated.)
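The correlation claim is easy to verify by brute force. A minimal sketch (my own illustration; "true" is encoded as $-1$ and $n = 4$):

```python
import itertools, math

n = 4
cube = list(itertools.product([1, -1], repeat=n))       # -1 encodes "true"

def parity(subset, x):
    return math.prod(x[i] for i in subset)

def v1_or(subset, x):       # v_1 OR parity(subset): -1 iff at least one is true
    return -1 if (x[0] == -1 or parity(subset, x) == -1) else 1

def corr(g1, g2):
    return sum(g1(x) * g2(x) for x in cube) / len(cube)

f, fp = (1,), (2, 3)        # two distinct parities over v_2..v_n (0-indexed)
print(corr(lambda x: v1_or(f, x), lambda x: v1_or(fp, x)))   # -> 0.5
```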
5  Characterizing Strong Learnability
In this section we give a complete characterization of strong learnability. More precisely, we define a dimension notion that is a generalization of the weak SQ dimension SQDim from Sect. 3, and show that it is closely related to the learning complexity.

Definition 3. For a concept class $\mathcal{F}$ let $d_0(\mathcal{F},\gamma)$ denote the largest $d$ such that some $f_1,\dots,f_d \in \mathcal{F}$ fulfill
• $|\langle f_i,f_j\rangle| \le \gamma$ for $1 \le i < j \le d$, and
• $|\langle f_i,f_j\rangle - \langle f_k,f_\ell\rangle| \le 1/d$ for all $1 \le i < j \le d$ and $1 \le k < \ell \le d$.

Actually, this dimension notion is a kind of combination of the strong SQ dimensions of Simon [8] (see also Sect. 6) and Yang [10].
Theorem 3. Let $\mathcal{F}$ be a concept class and let $d := d_0(\mathcal{F}, 1-\epsilon/2)$. Then any consistent algorithm that uses tolerance $\tau \le \min\{1/(4d+4),\, \epsilon/4\}$ requires at most $d/\tau$ queries to learn $\mathcal{F}$ with accuracy $1-\epsilon$. Specifically, setting $\tau = \min\{1/(4d+4),\, \epsilon/4\}$, the algorithm finds a $(1-\epsilon)$-approximation of the target concept after $4d \cdot \max\{d+1,\, 1/\epsilon\}$ queries, implying $\mathrm{SLC}_{\mathcal{F},\mathcal{F}}(\tau, 1-\epsilon) \le 4d \cdot \max\{d+1,\, 1/\epsilon\}$.

Proof. Assume that some consistent algorithm used tolerance as above, queried $h_1,\dots,h_q$ in this order, and got the answers $c_1,\dots,c_q$ in this order. Suppose that for some $1 \le i_1 < i_2 < \dots < i_\ell \le q$ and some $c \in [-1,1]$ it holds that $c_{i_j} \in [c-\tau,\, c+\tau]$ for $j = 1,\dots,\ell$. The algorithm is consistent, thus $\langle h_{i_j}, h_{i_k}\rangle \in [c_{i_j}-\tau,\, c_{i_j}+\tau]$ for $1 \le j < k \le \ell$, and consequently $\langle h_{i_j}, h_{i_k}\rangle \in [c-2\tau,\, c+2\tau] \subseteq [c - 1/(2d+2),\, c + 1/(2d+2)]$ for $1 \le j < k \le \ell$. Also note that $|\langle h_i, h_j\rangle| \le |c_i| + \tau \le 1-\epsilon/2$ for $1 \le i < j \le q$, since $c_1,\dots,c_q$ have absolute value less than $1 - 3\epsilon/4$ (as otherwise the algorithm would have successfully terminated). The two together imply, however, that $\ell \le d_0(\mathcal{F}, 1-\epsilon/2)$. As this was true for any $c$, it follows that $q \le d_0(\mathcal{F}, 1-\epsilon/2)\cdot(2/(2\tau)) = d/\tau$.
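To make the role of the consistency requirement concrete, here is a small sketch of a generic consistent proper learner (my own illustration, not the paper's exact procedure; `corr` is an assumed helper returning exact correlations between indexed concepts, and an exact answer is a legal response for any tolerance):

```python
def consistent_proper_learner(concepts, target, eps, tau, corr):
    # Query only concepts consistent with all previous (query, answer) pairs;
    # stop as soon as an answer certifies a (1 - eps)-approximation.
    history = []                                  # pairs (queried index, answer)
    remaining = list(range(len(concepts)))
    while remaining:
        pool = [j for j in remaining
                if all(abs(corr(j, i) - c) <= tau for i, c in history)]
        if not pool:                              # cannot happen while the
            return None                           # target itself is remaining
        j = pool[0]
        answer = corr(j, target)                  # exact oracle: tolerance 0
        if answer >= 1 - eps:
            return j
        history.append((j, answer))
        remaining.remove(j)
    return None
```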
The proof for the other direction has the same structure as the proof of Theorem 2, with some necessary modifications.

Theorem 4. Let $\mathcal{F} \subseteq C(X)$ be any concept class for some domain $X$, and assume $d := d_0(\mathcal{F}, 1-2\epsilon) \ge 3$. Then, if the tolerance $\tau$ is bigger than $\sqrt{3/d}$, we have $\mathrm{SLC}_{\mathcal{F},C(X)}(\tau, 1-\epsilon) \ge d\tau^2/3$. In particular, $\mathrm{SLC}_{\mathcal{F},C(X)}(1/\sqrt[3]{d},\, 1-\epsilon) \ge \sqrt[3]{d}/3$.

Proof. Assume $3/(2\tau^2) \le d/2$ and let $d' := \lceil 3/(2\tau^2)\rceil$. By the choice of $d$ there exist $f_1,\dots,f_d \in \mathcal{F}$ and $c \in (-1+2\epsilon,\, 1-2\epsilon)$ satisfying $|\langle f_i,f_j\rangle - c| \le 1/(2d)$ for all $1 \le i < j \le d$. We show an (adversary) answering strategy that ensures that only a small number of the $f_i$ functions are eliminated by each query. Let $h \in C(X)$ be an arbitrary query function used by the learner, and assume for simplicity that $\langle f_1,h\rangle \ge \langle f_2,h\rangle \ge \dots \ge \langle f_d,h\rangle$. Define $\alpha := \langle f_{d'},h\rangle$, $\beta := \langle f_{d-d'+1},h\rangle$, $A := [d']$ and $B := \{d-d'+1,\, d-d'+2,\, \dots,\, d\}$. Then $1-\epsilon \ge \alpha \ge \beta \ge -1+\epsilon$ whenever $d' \ge 3$ (recall Observation 1 and that $d' \le d/2$ by our assumption on $\tau$); furthermore $A$ and $B$ are disjoint sets of cardinality $d'$. First note that
$$\Big\|\frac{1}{d'}\sum_{i\in A} f_i - \frac{1}{d'}\sum_{i\in B} f_i\Big\|^2 = \frac{1}{(d')^2}\Big(\sum_{i\in A}\|f_i\|^2 + \sum_{i\in B}\|f_i\|^2 + \sum_{\substack{i,j\in A\\ i\ne j}}\langle f_i,f_j\rangle + \sum_{\substack{i,j\in B\\ i\ne j}}\langle f_i,f_j\rangle - 2\sum_{i\in A}\sum_{j\in B}\langle f_i,f_j\rangle\Big)$$
$$\le \frac{1}{(d')^2}\Big(2d' + d'(d'-1)\Big(c + \frac{1}{2d}\Big) + d'(d'-1)\Big(c + \frac{1}{2d}\Big) - 2(d')^2\Big(c - \frac{1}{2d}\Big)\Big) \le \frac{4}{d'} + \frac{2}{d},$$
and so, by the Cauchy-Schwarz inequality,
$$\Big\langle h,\; \frac{1}{d'}\sum_{i\in A} f_i - \frac{1}{d'}\sum_{i\in B} f_i\Big\rangle \le \Big\|\frac{1}{d'}\sum_{i\in A} f_i - \frac{1}{d'}\sum_{i\in B} f_i\Big\| \le \sqrt{\frac{4}{d'} + \frac{2}{d}} \le \sqrt{\frac{6}{d'}}.$$
On the other hand, by the definition of $A$ and $B$ it also holds that
$$\Big\langle h,\; \frac{1}{d'}\sum_{i\in A} f_i - \frac{1}{d'}\sum_{i\in B} f_i\Big\rangle = \frac{1}{d'}\sum_{i\in A}\langle h,f_i\rangle - \frac{1}{d'}\sum_{j\in B}\langle h,f_j\rangle \ge \alpha - \beta,$$
and so $\alpha - \beta \le \sqrt{6/d'} \le 2\tau$. Thus, answering the learner's query with $(\alpha+\beta)/2$, all but at most $2d'-2$ functions will be consistent with the answer. This, in turn, implies the desired lower bound $d/(2d'-2) \ge d\tau^2/3$ on the learning complexity.
The main result of this section is the following corollary of the two theorems above:

Corollary 1. The following statements are equivalent for any family $\{\mathcal{F}_n\}_{n=1}^{\infty}$ of concept classes under an arbitrary (fixed) distribution:
– $d_0(\mathcal{F}_n, 1-\epsilon)$ is polynomially bounded in $n$ and $1/\epsilon$,
– $\{\mathcal{F}_n\}_{n=1}^{\infty}$ is strongly learnable by some (possibly improper) algorithm,
– $\{\mathcal{F}_n\}_{n=1}^{\infty}$ is strongly learnable by all consistent learning algorithms.
6  Other Dimension Notions for Strong Learnability
In this section we consider the relation of $d_0$ to the strong SQ dimensions of Simon [8] and Feldman [5]. For this let us first introduce $\mathrm{SQDim}^*$ from [8].

Definition 4 ([8]). Given some concept class $\mathcal{F}$, a subclass $\mathcal{F}'$ of it is $(\gamma,\mathcal{H})$-trivial for some query class $\mathcal{H}$ and constant $0 < \gamma < 1$, if some function $h \in \mathcal{H}$ has correlation of at least $\gamma$ with at least half of the functions in $\mathcal{F}'$. The remaining subclasses of $\mathcal{F}$ are said to be $(\gamma,\mathcal{H})$-nontrivial. The strong SQ dimension associated with concept class $\mathcal{F}$ and query class $\mathcal{H}$ is the function $\mathrm{SQDim}^*_{\mathcal{F},\mathcal{H}}(\gamma) := \sup_{\mathcal{F}'} \mathrm{SQDim}_{\mathcal{F}' - B_{\mathcal{F}'}}$, where $\mathcal{F}'$ ranges over all $(\gamma,\mathcal{H})$-nontrivial subclasses of $\mathcal{F}$, and where $B_{\mathcal{F}'} = (1/|\mathcal{F}'|)\sum_{f\in\mathcal{F}'} f$.

As it turns out below, it doesn't really matter which query class is used, as long as it contains the concept class itself.

Observation 5. When $\mathcal{F} \subseteq \mathcal{H}$, then any $(1-\epsilon,\mathcal{F})$-trivial subset of $\mathcal{F}$ is also $(1-\epsilon,\mathcal{H})$-trivial; meanwhile, by Observation 1, it also holds that any $(1-\epsilon/2,\mathcal{H})$-trivial subset of $\mathcal{F}$ is also $(1-\epsilon,\mathcal{F})$-trivial. Thus
$$\mathrm{SQDim}^*_{\mathcal{F},\mathcal{H}}(1-\epsilon) \le \mathrm{SQDim}^*_{\mathcal{F},\mathcal{F}}(1-\epsilon) \le \mathrm{SQDim}^*_{\mathcal{F},\mathcal{H}}\Big(1-\frac{\epsilon}{2}\Big).$$
We shall need the following equation later:
$$\langle f,g\rangle = \langle f-B,\; g-B\rangle + \langle f,B\rangle + \langle g,B\rangle - \|B\|^2. \tag{2}$$
Theorem 6. For any concept classes $\mathcal{F}$ and $\mathcal{H}$ satisfying $\mathcal{F} \subseteq \mathcal{H}$ it holds that
$$\max\{32/\epsilon^2,\; 9\,d_0^2(\mathcal{F},\, 1-\epsilon^2/32)\} \ge \mathrm{SQDim}^*_{\mathcal{F},\mathcal{H}}(1-\epsilon).$$

Proof. According to Observation 5, it is enough to show that the statement of the theorem holds for $\mathcal{H} = \mathcal{F}$. Let $\mathcal{F}'$ be a $(1-\epsilon,\mathcal{F})$-nontrivial subset of $\mathcal{F}$, and let $\mathcal{F}_0$ be a subset of $\mathcal{F}'$ such that $\mathrm{SQDim}_{\mathcal{F}_0 - B_{\mathcal{F}'}} = |\mathcal{F}_0|$. Assume furthermore that $d := |\mathcal{F}_0| \ge 32/\epsilon^2$. Consider the correlation of $B_{\mathcal{F}'}$ with all the functions in $\mathcal{F}_0$. Obviously there exist some $c \in [-1,1]$ and some $d' \ge \sqrt{d}/3$ such that for some distinct functions $f_1,\dots,f_{d'} \in \mathcal{F}_0$ it holds that $\langle f_j, B_{\mathcal{F}'}\rangle \in [c - 1/\sqrt{9d},\; c + 1/\sqrt{9d}]$ for $j = 1,\dots,d'$. Then for arbitrary indices $i,j,k,\ell \in [d']$ fulfilling $i \ne j$ and $k \ne \ell$ it holds (using (2)) that
$$|\langle f_i,f_j\rangle - \langle f_k,f_\ell\rangle| = \big|(\langle f_i - B_{\mathcal{F}'},\, f_j - B_{\mathcal{F}'}\rangle - \langle f_k - B_{\mathcal{F}'},\, f_\ell - B_{\mathcal{F}'}\rangle) + (\langle f_i, B_{\mathcal{F}'}\rangle - \langle f_k, B_{\mathcal{F}'}\rangle) + (\langle f_j, B_{\mathcal{F}'}\rangle - \langle f_\ell, B_{\mathcal{F}'}\rangle)\big| \le \frac{2}{d} + \frac{2\cdot 2}{\sqrt{9d}} \le \frac{3}{\sqrt{d}} \tag{3}$$
using that $d \ge 32$. To prove the theorem it thus suffices to show that the correlation of any two distinct elements of $\mathcal{F}_0$ has absolute value at most $1 - \epsilon^2/32$. (Note that we cannot apply Observation 1 (or Proposition 1) directly to bound $\langle f_i,f_j\rangle$, because nontriviality only guarantees that none of the $f_i$ functions have high correlation with at least half of $\mathcal{F}'$, which doesn't prevent them from having really high correlation with some smaller portion of $\mathcal{F}'$. It thus has to be shown that no such set contains another $f_i$.)

To upper bound $\langle f_i,f_j\rangle$ for some $1 \le i < j \le d'$, first note that using (2) with $f = f_i$, $g = f_j$ and $B = B_{\mathcal{F}'}$, and then applying the Cauchy-Schwarz inequality,
$$\langle f_i,f_j\rangle \le \frac{1}{d} + \|B_{\mathcal{F}'}\|\,\big(2 - \|B_{\mathcal{F}'}\|\big). \tag{4}$$
Also note that the $(1-\epsilon,\mathcal{F})$-nontriviality of $\mathcal{F}'$ implies that
$$\|B_{\mathcal{F}'}\|^2 = \frac{1}{|\mathcal{F}'|^2}\sum_{g,f\in\mathcal{F}'}\langle g,f\rangle \le \frac{1}{|\mathcal{F}'|^2}\cdot|\mathcal{F}'|\Big(\frac{|\mathcal{F}'|}{2}(1-\epsilon) + \frac{|\mathcal{F}'|}{2}\Big) = 1 - \frac{\epsilon}{2},$$
and therefore $\|B_{\mathcal{F}'}\| \le \sqrt{1-\epsilon/2} \le 1-\epsilon/4$. Combining this with (4), and noting that $x(2-x)$ is monotone increasing on $(0,1)$, we get that
$$\langle f_i,f_j\rangle \le \frac{1}{d} + \Big(1-\frac{\epsilon}{4}\Big)\Big(1+\frac{\epsilon}{4}\Big) = 1 + \frac{1}{d} - \frac{\epsilon^2}{16}.$$
Thus, since $d \ge 32/\epsilon^2$, we have $\langle f_i,f_j\rangle \le 1 - \epsilon^2/32$. Finally, let us give a lower bound for the pairwise correlations. If one pair had correlation less than $-1 + 1/32$, then, according to (3), all other pairs would have correlation at most $-1 + 1/32 + 3/\sqrt{d}$, implying
$$0 \le \Big\|\sum_{i=1}^{d'} f_i\Big\|^2 = \sum_{i=1}^{d'}\|f_i\|^2 + 2\sum_{1\le i<j\le d'}\langle f_i,f_j\rangle \le d' + d'(d'-1)\Big(-1 + \frac{1}{32} + \frac{3}{\sqrt{d}}\Big),$$
which would lead to a contradiction, as $d \ge 32$. Consequently $\langle f_i,f_j\rangle \ge -1 + \epsilon^2/32$ for $1 \le i < j \le d'$.
Theorem 7. Let $\mathcal{F}$ and $\mathcal{H}$ be concept classes satisfying $\mathcal{F} \subseteq \mathcal{H}$. Then
$$d_0(\mathcal{F}, 1-\epsilon) \le \max\{2,\; 2\cdot\mathrm{SQDim}^*_{\mathcal{F},\mathcal{F}}(1-\epsilon/2)\} \le \max\{2,\; 2\cdot\mathrm{SQDim}^*_{\mathcal{F},\mathcal{H}}(1-\epsilon/4)\}.$$

Proof. The second inequality follows from Observation 5. To prove the first inequality, let $\mathcal{F}' := \{f_1,\dots,f_d\} \subseteq \mathcal{F}$ be such that $|\langle f_i,f_j\rangle| < 1-\epsilon$ and $|\langle f_i,f_j\rangle - \langle f_k,f_\ell\rangle| < 1/d$ for $1 \le i < j \le d$ and $1 \le k < \ell \le d$. Then
$$|\langle f_i - B_{\mathcal{F}'},\; f_j - B_{\mathcal{F}'}\rangle| = \Big|\langle f_i,f_j\rangle + \frac{1}{d^2}\sum_{k,\ell=1}^{d}\langle f_k,f_\ell\rangle - \frac{1}{d}\sum_{k=1}^{d}\big(\langle f_i,f_k\rangle + \langle f_j,f_k\rangle\big)\Big| \le \frac{1}{d}\sum_{k=1}^{d}\big|\langle f_i,f_j\rangle - \langle f_i,f_k\rangle\big| + \frac{1}{d^2}\sum_{k,\ell=1}^{d}\big|\langle f_k,f_\ell\rangle - \langle f_j,f_k\rangle\big| \le \frac{2}{d}.$$
Furthermore, by Observation 1, $\mathcal{F}'$ is $(1-\epsilon/2,\mathcal{F})$-nontrivial.
The dimension notion introduced in [5] is a kind of simplified version of $\mathrm{SQDim}^*$:

Definition 5 ([5]). For a concept class $\mathcal{F}$ over domain $X$ let $\mathrm{SSQ\text{-}DIM}(\mathcal{F},\epsilon) := \max_h \mathrm{SQDim}_{\{f \in (\mathcal{F}-h)\,:\, \|f\|^2 \ge \epsilon\}}$, where $h$ ranges over all mappings from $X$ to $[-1,1]$.

Furthermore, the proofs of the two theorems above can be easily modified to show:

Theorem 8. For any concept class $\mathcal{F}$ it holds that $\max\{32,\; 2/\epsilon,\; 9\,d_0^2(\mathcal{F},\, 1-\epsilon/2)\} \ge \mathrm{SSQ\text{-}DIM}(\mathcal{F},\epsilon)$ and $\max\{2,\; 2\cdot\mathrm{SSQ\text{-}DIM}(\mathcal{F},\, \epsilon^2/16)\} \ge d_0(\mathcal{F}, 1-\epsilon)$.
7  $d_0$ for Conjunctions under the Uniform Distribution
In this section, as an example, we compute the exact value of $d_0$ for the class of conjunctions under the uniform distribution, up to a constant factor. (Note however that this class is efficiently learnable in the Statistical Query model even distribution independently [6], so $d_0$ is obviously polynomial in $n$ and in $1/\epsilon$.) First of all, let us compute the correlation of two conjunctions $t$ and $t'$ that have lengths $\ell$ and $\ell'$ respectively, and share exactly $s$ literals (as usual, $-1$ is interpreted as "true" and $1$ as "false"):
$$\langle t,t'\rangle = \mathrm{E}[t\cdot t'] = 1 - 2\,\mathrm{P}[t \ne t'] = 1 - 2\big(\mathrm{P}[t=-1] + \mathrm{P}[t'=-1] - 2\,\mathrm{P}[t=t'=-1]\big) = \begin{cases} 1 - 2/2^{\ell} - 2/2^{\ell'} & \text{if } t \text{ and } t' \text{ conflict,} \\ 1 - 2/2^{\ell} - 2/2^{\ell'} + 4/2^{\ell+\ell'-s} & \text{otherwise.} \end{cases} \tag{5}$$
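Equation (5) can be checked mechanically. A brief sketch (my own illustration; terms are dictionaries mapping variable indices to required values):

```python
import itertools
from fractions import Fraction

def term_value(literals, x):
    # conjunction with -1 = "true": -1 iff every required literal is satisfied;
    # literals maps a variable index to the required value (-1 or +1)
    return -1 if all(x[i] == v for i, v in literals.items()) else 1

def corr(t1, t2, n):
    cube = itertools.product([1, -1], repeat=n)
    return Fraction(sum(term_value(t1, x) * term_value(t2, x) for x in cube), 2 ** n)

# two non-conflicting terms of lengths l = 2 and l' = 3 sharing s = 1 literal
t1 = {0: -1, 1: -1}
t2 = {0: -1, 2: -1, 3: -1}
l, lp, s = 2, 3, 1
predicted = 1 - Fraction(2, 2**l) - Fraction(2, 2**lp) + Fraction(4, 2**(l + lp - s))
print(corr(t1, t2, 4) == predicted)      # -> True
```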
Next we prove a technical lemma we shall need later. Here we apply the convention that for $x \in \{0,1\}^n$ the number of 1s in $x$ is denoted $|x|$, and that for $x,y \in \{0,1\}^n$, $x \vee y$ (resp. $x \wedge y$) is the vector of length $n$ with 1 in those components that are 1 in at least one of $x$ and $y$ (resp. in both $x$ and $y$), and 0 everywhere else. For conjunctions we use similar notations; that is, $|t|$ denotes the number of literals appearing in term $t$, and $t \wedge t'$ denotes the term obtained by joining the literals appearing in terms $t$ and $t'$.

Lemma 1. If for some $H \subseteq \{0,1\}^n$ and for some integer $c$ it holds that $|x \vee y| = c$ for arbitrary distinct $x,y \in H$, then $|H| \le n+1$.

Proof. For $x \in H$ let $x^c$ denote the vector obtained by flipping the bits in $x$. Then by De Morgan $x^c \wedge y^c = (x \vee y)^c$, and thus $|x^c \wedge y^c| = n-c$ for arbitrary distinct $x,y \in H$. Construct the $n \times |H|$ matrix $X$ such that its columns are the vectors $x^c$, $x \in H$, in an arbitrary order, and let $C$ be the $|H| \times |H|$ matrix having $n-c$ in each entry. First of all note that $X^{\top}X - C$ is a diagonal matrix. If it contains some zero element in the diagonal, then $|x^c| = n-c$ for some $x \in H$, implying that every other $y^c$, $y \in H$, has 1 everywhere where $x^c$ does, and that each such $y^c$ must have 1 at some unique position where the others have 0. This immediately implies $|H| \le n+1$. Otherwise, when $X^{\top}X - C$ is a nonsingular diagonal matrix,
$$|H| = \mathrm{rank}\big(X^{\top}X - C\big) \le \mathrm{rank}\big(X^{\top}X\big) + 1 = \mathrm{rank}(X) + 1 \le \min\{n, |H|\} + 1,$$
implying the statement of the claim.
Proposition 2. Let $\mathcal{F}_n$ be the set of conjunctions over the variables $v_1,\dots,v_n$. Then under the uniform distribution $d_0(\mathcal{F}_n, 1-\epsilon) \le 1 + \max\{2n+2,\; 8/\epsilon^2\}$.

Proof. Let $t_1,\dots,t_d$ be terms satisfying $|\langle t_i,t_j\rangle| \le 1-\epsilon$ and $|\langle t_i,t_j\rangle - \langle t_k,t_\ell\rangle| \le 1/d$ for $i,j,k,\ell \in [d]$ fulfilling $i \ne j$ and $k \ne \ell$. Assume for simplicity that $t_d$
is the longest term among them. Then by (5) it holds that $1-\epsilon \ge \langle t_i,t_d\rangle \ge 1 - 4\,\mathrm{P}[t_i=-1]$, implying
$$\mathrm{P}[t_i=-1] = 2^{-|t_i|} \ge \frac{\epsilon}{4}, \tag{6}$$
and thus
$$\mathrm{P}[t_i=t_j=-1] = \begin{cases} 0 & \text{if } t_i \text{ and } t_j \text{ conflict,} \\ 2^{-|t_i \wedge t_j|} \ge (\epsilon/4)^2 & \text{otherwise,} \end{cases} \tag{7}$$
for distinct $i,j \in [d-1]$. Let us assume that $1/d < \epsilon^2/8$. If for some $I \subseteq [d-1]$ it holds that all $t_i$, $i \in I$, have the same length, then for any indices $i,j,k,\ell \in I$ fulfilling $i \ne j$ and $k \ne \ell$,
$$\frac{\epsilon^2}{32} > \frac{1}{4d} \ge \frac{1}{4}\,\big|\langle t_i,t_j\rangle - \langle t_k,t_\ell\rangle\big| \overset{(5)}{=} \big|\mathrm{P}[t_i=t_j=-1] - \mathrm{P}[t_k=t_\ell=-1]\big|.$$
Note that it cannot happen that $t_i$ and $t_j$ conflict with each other but $t_k$ and $t_\ell$ do not (or vice versa), since by (7) that would mean that the right hand side is at least $\epsilon^2/16$, resulting in a contradiction. So either every two of the $t_i$ with $i \in I$ conflict, or there is no conflicting pair among the terms with index in $I$. The former case implies that $\{t_i = -1\}_{i\in I}$ are pairwise contradicting events, and so by (6), $1 \ge \sum_{i\in I}\mathrm{P}[t_i=-1] \ge |I|\cdot(\epsilon/4)$, giving the bound $|I| \le 4/\epsilon$. In the latter case, since by (7) both $2^{-|t_i \wedge t_j|}$ and $2^{-|t_k \wedge t_\ell|}$ are at least $\epsilon^2/16$, we have that $2^{-|t_i \wedge t_j|} > (1/2)\,2^{-|t_k \wedge t_\ell|}$ and $2^{-|t_k \wedge t_\ell|} > (1/2)\,2^{-|t_i \wedge t_j|}$. This, however, implies that $|t_i \wedge t_j| = |t_k \wedge t_\ell|$, and so, by Lemma 1 (applied to the set $H \subseteq \{0,1\}^n$ consisting of the vectors that represent some $t_i$ with $i \in I$ by having 1 in position $j$ iff $t_i$ contains variable $v_j$), $I$ has cardinality at most $n+1$.

We have just seen that the sum of the number of terms of minimal length and the number of terms of length one more is at most $\max\{2n+2,\; 8/\epsilon\}$. However, there cannot be distinct indices $i,j,k \in [d-1]$ fulfilling $|t_i|+2 \le |t_j|, |t_k|$, as otherwise
$$\frac{\epsilon^2}{8} > \frac{1}{d} \ge \big|\langle t_i,t_j\rangle - \langle t_j,t_k\rangle\big| = \big|2\,\mathrm{P}[t_i=-1] - 4\,\mathrm{P}[t_i=t_j=-1] - 2\,\mathrm{P}[t_k=-1] + 4\,\mathrm{P}[t_k=t_j=-1]\big| \ge \frac{1}{2}\,\mathrm{P}[t_i=-1] \overset{(6)}{\ge} \frac{\epsilon}{8},$$
a contradiction.
Note that this bound is sharp up to a constant factor according to the example below and that the terms consisting of one unnegated variable form an orthogonal
system of cardinality $n$. It also immediately follows that these results remain tight even if we restrict $\mathcal{F}_n$ to be the set of monotone conjunctions over $v_1,\dots,v_n$.
Fn () share s < variables, then under the uniform distribution | t1 , t2 | = 1 − 4/2 + 4/22−s ≤ 1 − 2. If additionally t3 , t4 ∈ Fn () share s < variables, (5) then | t1 , t2 − t3 , t4 | = 4/22−s − 4/22−s = 2 4 2s − 2s . Now we choose = (n) := c log n for some c > 1 (and thus = (n) = 1/nc ) and s = s(n) := log log n, and prove that d0 (Fn , 1 − ) = Ω(2 ) = Ω(n2c ) by showing that one can find an I ⊆ Fn () of cardinality Ω(n2c ) that contains no two distinct conjunctions sharing more than s variables. Such an I can simply be obtained using the greedy method, since when n− ≥ 2(−s) then for any t ∈ Fn () there n− ≤ 2 n− in Fn () that share at least are exactly −s i=0 i i −s conjunctions n s variables with t, thus (noting that |Fn ()| = ) I can always be expanded by some term when it has cardinality less than n 1 ns n− ∼ s 2 2 −s (using Stirling’s formula).
8  Proper vs. Improper Learning in the Distribution Independent Case
In the distribution dependent case (i.e., when the learner knows the underlying distribution) proper and improper learning are basically the same (recall Corollary 1). In this section we contrast this result by showing that in the distribution independent case proper and improper learning can differ significantly.

Consider for example the class of singletons: $\mathcal{F}_n := \{f_x : x \in \{-1,1\}^n\}$, where $f_x$ evaluates to $-1$ on $x$, and evaluates to $1$ on every other input. Since $\mathcal{F}_n$ is a subset of the conjunctions, which were shown by Kearns in [6] to be efficiently learnable in the Statistical Query model, $\mathcal{F}_n$ can be learned using polynomially many improper queries. Let us now define for each $x,y \in \{-1,1\}^n$ a distribution $D_{x,y}$, which assigns probability $1/2$ to both $x$ and $y$, and probability $0$ to every other input. The key observation is that in the case of proper learning each query must be one of the $f_x$ functions. But then, as long as there are at least two of them that have not yet been queried, the adversary can just return 0 as the answer. Finally, when
only two singletons, say $f_x$ and $f_y$, are unqueried, the adversary chooses one of them as the target concept, and declares that the underlying distribution is $D_{x,y}$. This way the answers of the adversary remain consistent (no matter how small the tolerance parameter of the learner was), and, at the same time, force the learner to ask at least $2^n - 1$ queries, even for weakly learning the class. (This does not contradict the result of Aslam and Decatur [1] mentioned in Section 4, since their boosting algorithm uses improper queries.)

It might also be worth mentioning that for the singletons $\mathrm{SQDim}^D_{\mathcal{F}_n} \le 5$ under any distribution $D$: denoting by $p_x$ the probability assigned to input $x \in \{-1,1\}^n$, $1/6 \ge \langle f_x,f_y\rangle_D = 1 - 2p_x - 2p_y$ implies that at least one of $p_x$ and $p_y$ is $5/24$ or greater, and thus if six functions from $\mathcal{F}_n$ had pairwise correlations at most $1/6$ in absolute value, then at least five distinct inputs would have probability $5/24$ or greater, a contradiction.

This result shows that the number of proper queries required for weakly learning some concept class can differ significantly in the distribution dependent and in the distribution independent case: in some cases it is constant versus exponential.
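The argument can be phrased as a tiny adversary class. A sketch of my own (inputs are identified with integers in $[0, 2^n)$; the `reveal` step corresponds to fixing the target and the distribution $D_{x,y}$):

```python
class ProperAdversary:
    """Answers 0 to every proper query f_x while more than two singletons
    remain unqueried; under the eventual D_{x,y} any previously queried f_z
    has correlation exactly 0 with the target, so 0 was always legal."""

    def __init__(self, n):
        self.unqueried = set(range(2 ** n))   # inputs x, identified with f_x

    def answer(self, x):
        self.unqueried.discard(x)
        if len(self.unqueried) >= 2:
            return 0                          # still at least two candidates
        raise RuntimeError("learner already spent nearly 2^n queries")

    def reveal(self):
        x, y = sorted(self.unqueried)[:2]     # commit: target f_x, D_{x,y}
        return x, (x, y)
```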
Acknowledgements. I would like to thank Hans Ulrich Simon for suggesting that I work on this topic. I am also thankful to him, Thorsten Doliwa and Michael Kallweit for the motivating discussions on the problem.

References

1. Aslam, J.A., Decatur, S.E.: General bounds on statistical query learning and PAC learning with noise via hypothesis boosting. Inf. Comput. 141(2), 85–118 (1998)
2. Ben-David, S., Itai, A., Kushilevitz, E.: Learning by distances. Inform. Comput. 117(2), 240–250 (1995)
3. Blum, A., Furst, M., Jackson, J., Kearns, M., Mansour, Y., Rudich, S.: Weakly learning DNF and characterizing statistical query learning using Fourier analysis. In: Proc. of 26th ACM Symposium on Theory of Computing (1994)
4. Bshouty, N.H., Feldman, V.: On using extended statistical queries to avoid membership queries. Journal of Machine Learning Research 2, 359–395 (2002)
5. Feldman, V.: A complete characterization of statistical query learning with applications to evolvability. In: FOCS 2009 (to appear, 2009)
6. Kearns, M.: Efficient noise-tolerant learning from statistical queries. J. ACM 45(6), 983–1006 (1998)
7. Köbler, J., Lindner, W.: The complexity of learning concept classes with polynomial general dimension. Theor. Comput. Sci. 350(1), 49–62 (2006)
8. Simon, H.U.: A characterization of strong learnability in the statistical query model. In: Thomas, W., Weil, P. (eds.) STACS 2007. LNCS, vol. 4393, pp. 393–404. Springer, Heidelberg (2007)
9. Szörényi, B.: Honest queries do not help in the statistical query model (manuscript)
10. Yang, K.: New lower bounds for statistical query learning. J. Comput. Syst. Sci. 70(4), 485–509 (2005); In: Kivinen, J., Sloan, R.H. (eds.) COLT 2002. LNCS (LNAI), vol. 2375, pp. 229–243. Springer, Heidelberg (2002)
11. Yang, K.: On learning correlated boolean functions using statistical queries. In: Abe, N., Khardon, R., Zeugmann, T. (eds.) ALT 2001. LNCS (LNAI), vol. 2225, pp. 59–76. Springer, Heidelberg (2001)
An Algebraic Perspective on Boolean Function Learning

Ricard Gavaldà¹ and Denis Thérien²

¹ Department of Software (LSI), U. Politècnica de Catalunya, Barcelona, Spain
[email protected]
² School of Computer Science, McGill University, Montréal, Québec, Canada
[email protected]
Abstract. In order to systematize existing results, we propose to analyze the learnability of boolean functions computed by an algebraically defined model, programs over monoids. The expressiveness of the model, hence its learning complexity, depends on the algebraic structure of the chosen monoid. We identify three classes of monoids that can be identified, respectively, from Membership queries alone, Equivalence queries alone, and both types of queries. The algorithms for the first class are new to our knowledge, while those for the other two are combinations or particular cases of known algorithms. Learnability of these three classes captures many previous learning results. Moreover, by using nontrivial taxonomies of monoids, we can argue that using the same techniques to learn larger classes of boolean functions seems to require proving new circuit lower bounds or proving learnability of DNF formulas.
1  Introduction
In his foundational paper [Val84], Valiant introduced the (nowadays called) PAC-learning model, and showed that conjunctions of literals, monotone DNF formulas, and k-DNF formulas were learnable in the PAC model. Shortly after, Angluin proposed the (nowadays called) Exact learning from queries model, proved that Deterministic Finite Automata are learnable in this model [Ang87], and showed how to recast Valiant's three learning results in the exact model [Ang88]. Valiant's and Angluin's initial successes were followed by a flurry of PAC or Exact learning results, many of them concerning (as in Valiant's paper) the learnability of Boolean functions, others investigating learnability in larger domains. For the case of Boolean functions, however, progress both in the pure (distribution-free, polynomial-time) PAC model and in the exact learning model has slowed down considerably in the last decade. Certainly, one reason for this slowdown is the admission that these two models do not capture realistically many Machine Learning scenarios. So a lot of the effort has shifted to investigating variations of the original models that accommodate these features (noise tolerance, agnostic learning, attribute efficiency, distribution-specific learning, subexponential time, ...), and important advances have been made here.
But another undeniable reason for the slowdown is the fact that it is difficult to find new learnable classes, either by extending current techniques to larger classes or by finding totally different techniques. Many existing techniques seem to be blocked by the frustrating problem of learning DNF, or by our lack of knowledge of basic questions on boolean circuit complexity, such as the power of modular or threshold circuits.

In this paper, we use algebraic tools for organizing many existing results on Boolean function learning, and for pointing out possible limitations of existing techniques. We adopt the program over a monoid as our computing model for Boolean functions [Bar89, BST90]. We use existing, and very subtle, taxonomies of finite monoids to classify many existing results on Boolean function learning, both in the Exact and PAC learning models, into three distinct algorithmic paradigms. The rationale behind the approach is that the algebraic complexity of a monoid is related to the computational complexity of the Boolean functions it can compute, hence to their learning complexity. Furthermore, the existing taxonomies of monoids may help in detecting corners of learnability that have escaped attention so far because of lack of context, and also in indicating barriers for a particular learning technique. We provide some examples of both types of indications. Similar insights have led in the past to, e.g., the complete classification of the communication complexity of boolean functions and regular languages [TT05, CKK+07].

More precisely, we present three classes of monoids that are learnable in three different Exact learning settings:

Strategy 1. Groups for which lower bounds are known in the program model, all of which are solvable. Boolean functions computed over these groups can be identified from polynomially many Membership queries and, in some cases, in polynomial or quasipolynomial time. Membership learning in polynomial time is impossible for any monoid which is not a solvable group.

Strategy 2. Monoids built as wreath products of DA monoids and p-groups. These monoids compute boolean functions computed by decision lists whose nodes contain MODp gates fed by NC⁰ functions of the inputs. These are learnable from Equivalence queries alone, hence also PAC-learnable, using variants of the algorithms for learning decision lists and intersection-closed classes. The result can be extended to MODm gates (for nonprime m) with restrictions on their accepting sets. All monoids in this class are nonuniversal (they cannot compute all boolean functions); in fact this is the largest class known to contain only nonuniversal monoids. We argue that proving learnability of the most reasonable extensions of this class (either in the PAC or the Equivalence-query model) requires either new circuit lower bounds or learning DNF.

Strategy 3. Monoids in the variety named LGp ⓜ Com. Programs over these monoids are simulated by polynomially larger Multiplicity Automata (in the sequel, MA) over the field $\mathbb{F}_p$, and thus are learnable from Membership and Equivalence queries. Not all MA can be translated to programs over such monoids; but all classes of Boolean functions that, to our knowledge, were shown to be
learnable via the MA algorithm (except the full class of MA itself) are in fact captured by this class of monoids. We conjecture that this is the largest class of monoids that can be polynomially simulated by MA, hence that it defines the limit of what can be learned via the MA algorithm in our algebraic setting.

These three classes subsume a good number of the classes of Boolean functions that have been proved learnable in the literature, and we will detail them when presenting each of the strategies. Additionally, with the algebraic interpretation we can examine more systematically the possible extensions of these results, at least within our framework. By examining natural extensions of our three classes of monoids, we can argue that any substantial extension of two of our three monoid classes provably requires solving two notoriously hard problems: either proving learnability of DNF formulas or proving new lower bounds for classes of solvable groups. This may be an indication that substantial advances on the learnability of circuit-based classes similar to the ones we capture in our framework may require new techniques.

Admittedly, there is no reason why every class of boolean functions interesting from the learning point of view should be equivalent to programs computed over a class of monoids, and certainly our classification leaves out many important classes. Among them are classes explicitly defined in terms of threshold gates, or by read-k restrictions on the variables, or by monotonicity conditions. This is somehow unavoidable in our setting, since threshold gates have no natural analogue on finite monoids, and because multiple reads and variable negation are free in the program model. Similarly, the full classes of MA and DFA cannot be captured in our framework, since for example the notion of automaton size is critically sensitive to the order in which the inputs are read, while in the program model variables can always be renamed with no increase in size. Our taxonomy is somehow complementary to those in [HS07, She08] based on threshold functions. Some function classes are captured by both that approach and ours, while each one contains classes not captured by the other.
2  Background

2.1  Boolean Functions
We build circuits typically using AND, OR, and MOD gates. We use the generalized model of MODm gates that come equipped with an accepting set $A \subseteq [m]$, shown as a superindex; $[m]$ denotes the set $\{0,\dots,m-1\}$ throughout the paper. A $\mathrm{MOD}^A_m$ gate outputs 1 iff the sum of its inputs mod $m$ is in $A$. We simply write MODm gates to mean $\mathrm{MOD}^A_m$ gates with arbitrary $A$'s. For each $k$, $\mathrm{NC}^0_k$ is the set of boolean functions depending each on at most $k$ variables. We often compose classes of boolean functions. For two classes $C$ and $D$, $C \circ D$ denotes functions in $C$ with inputs replaced by functions in $D$. We denote with DL the class of functions computed by decision lists where each node contains one variable. Therefore, e.g., $\mathrm{DL} \circ \mathrm{NC}^0_k$ are decision lists whose nodes contain boolean functions depending on at most $k$ variables.
We will typically discuss families of boolean functions, namely sequences $\{f_n\}_{n\ge 0}$ where each $f_n$ is a function of arity $n$. Given a class $C$ of families of boolean functions, we use the term "boolean combinations of functions in $C$" to denote the set of families of functions that can be obtained by combining some fixed number (independent of $n$) of functions in $C$; in other words, the functions in $\bigcup_k(\mathrm{NC}^0_k \circ C)$.

We will use the computation model called Multiplicity Automata, MA for short. The following is one of several equivalent definitions; see e.g. [BV96, BBB+00] for more details. A multiplicity automaton over an alphabet $\Sigma$ and a field $F$ is a nondeterministic finite automaton over $\Sigma$ where we associate an element of $F$ to each transition. The value of the automaton on an input $x \in \Sigma^*$ is the sum over all accepting paths of the products of the elements along the path, where sum and product are taken in the field.
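Equivalently, the value of an MA can be computed by multiplying one weight matrix per letter. A minimal sketch of my own (over the reals rather than a finite field):

```python
import numpy as np

def ma_value(transition, start, accept, word):
    # transition[a] is the weighted transition matrix for letter a; summing
    # products over all accepting paths equals a product of these matrices
    # sandwiched between the start and accept vectors
    v = start
    for a in word:
        v = v @ transition[a]
    return v @ accept

# a 2-state MA over the reals that counts occurrences of the letter 1
T = {0: np.array([[1., 0.], [0., 1.]]),
     1: np.array([[1., 1.], [0., 1.]])}
start, accept = np.array([1., 0.]), np.array([0., 1.])
print(ma_value(T, start, accept, [1, 0, 1, 1]))      # -> 3.0
```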
2.2  Learning Theory
We assume familiarity with Valiant's PAC model and especially Angluin's model of Exact learning via queries. In the Exact model, we use Membership and Equivalence queries. As usual, we measure the resources used by a learning algorithm as a function of the arity of the target function (denoted with $n$) and the size of the target function within some representation language associated to the class of functions to learn (denoted with $s$). Longer explanations can be found in the extended version. We will use repeatedly the well-known Composition Theorem (see e.g. [KLPV87]), which states that if a class $C$ (with minor syntactical requirements) is learnable in polynomial time then $C \circ \mathrm{NC}^0_k$ is also learnable in polynomial time for every fixed $k$. The result is valid for both the Equivalence-query model and the PAC model, but the proof fails in the presence of Membership queries.
2.3  Monoids and Programs
Recall that a monoid is a set equipped with a binary operation that is associative and has an identity. All the monoids in this paper are finite; some of our statements about monoids might be different or fail for infinite monoids. A group is a monoid where each element has an inverse. A monoid is aperiodic if there is some number $t$ such that $a^{t+1} = a^t$ for every element $a$. Only the one-element monoid is both a group and aperiodic. A theorem by Krohn and Rhodes states that every monoid can be built from groups and aperiodic monoids by repeatedly applying the so-called wreath product. The wreath product of monoids $A$ and $B$ is denoted $A \wr B$. Solvable groups, in particular, are precisely those that can be built as iterated wreath products of Abelian groups. Definitions of solvable groups and of the wreath product can be found in most textbooks on group theory, and in the extended version of this paper.

A program over a monoid $M$ is a pair $(P, A)$, where $A \subseteq M$ is the accepting set and $P$ is an ordered list of instructions. An instruction is a triple $(i, a_i, b_i)$ whose semantics is as follows: read (boolean) variable $x_i$; if $x_i = 0$, emit element
$a_i \in M$, and emit element $b_i \in M$ if $x_i = 1$. A list of instructions $P$ defines a sequence of elements in $M$ on every assignment $w$ to the variables. We denote with $P(w)$ the product in $M$ of this sequence of elements. If $P(w) \in A$ we say that the program accepts $w$, and that it rejects $w$ otherwise; alternatively, we say that the program evaluates to 1 (resp. 0) on $w$. The length or size of the program is the number of instructions in $P$. Each program on $n$ variables thus computes a boolean function from $\{0,1\}^n$ to $\{0,1\}$. For a monoid $M$, $B(M)$ is the set of boolean functions recognized by programs over $M$. If $\mathcal{M}$ is a set of monoids, $B(\mathcal{M})$ is $\bigcup_{M\in\mathcal{M}} B(M)$.

A monoid $M$ is said to divide a monoid $N$ if $M$ is a homomorphic image of a submonoid of $N$. A set of monoids closed under direct product and division (i.e., taking submonoids and homomorphic images) is called a variety (technically, a pseudovariety, since we are dealing with finite monoids). The following varieties will appear in this paper:
– Com: All commutative monoids, i.e. those satisfying $xy = yx$.
– Ab: All Abelian groups. Recall that every finite Abelian group is a direct product of a number of groups of the form $\mathbb{Z}_{p_i^{\alpha_i}}$ for different primes $p_i$.
– Gp: All $p$-groups, that is, groups of cardinality a power of the prime $p$.
– Gnil: Nilpotent groups. For the purposes of this paper, a group is nilpotent iff it is the direct product of a number of groups, each of which is a $p_i$-group for possibly different $p_i$'s. All Abelian groups are nilpotent. For interpretation, it was shown in [PT88] that programs over nilpotent groups are equivalent in power to polynomials of constant degree over a ring of the form $(\mathbb{Z}_m)^k$, i.e., they compute the same set of boolean functions.
– G: The variety of all groups.
– DA: A variety of aperiodic monoids to be defined in Section 4.2. For interpretation, it was shown in [GT03] that programs over monoids in DA are equivalent in power to decision trees of bounded rank.
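For concreteness, here is a minimal sketch of my own of evaluating a program over a monoid given by its multiplication function, with the parity function over $\mathbb{Z}_2$ as an example:

```python
from itertools import product

def run_program(instructions, accept, mul, identity, w):
    # each instruction (i, a, b) emits a if w[i] == 0 and b if w[i] == 1;
    # the program accepts iff the product of emitted elements lands in accept
    acc = identity
    for i, a, b in instructions:
        acc = mul(acc, a if w[i] == 0 else b)
    return int(acc in accept)

# parity of 3 bits as a program over the group Z_2
mul2 = lambda x, y: (x + y) % 2
prog = [(i, 0, 1) for i in range(3)]         # emit 0 on x_i = 0, 1 on x_i = 1
for w in product([0, 1], repeat=3):
    assert run_program(prog, {1}, mul2, 0, w) == sum(w) % 2
```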
2.4  Learning Programs over Monoids: Generalities
Every monoid $M$ defines a set of boolean functions $B(M)$ with an associated notion of function size, namely the length of the shortest program over $M$. The general question we ask is thus: "given $M$ and a learning model, is $B(M)$ polynomial-time learnable in that learning model?". Polynomiality (or other bounds) is measured in the number of variables and in the size in $M$ of the target function, denoted with $s$ as already mentioned. For a set of monoids $\mathcal{M}$, we say for brevity "programs over $\mathcal{M}$ are learnable" or even "$\mathcal{M}$ is learnable" to mean "for every fixed $M \in \mathcal{M}$, $B(M)$ is learnable"; that is, there may be a different algorithm for each $M \in \mathcal{M}$, with a different running time. In other words, each algorithm works for a fixed $M$ that it "knows". Models where a single algorithm must work for a whole class of monoids are possible, but we do not pursue them in this paper. The following easy result is useful to compare the learning complexity of different monoids:
Fact 1. If $M$ divides $N$ and $B(N)$ is learnable (in any of the learning models in this paper), then $B(M)$ is also learnable.

In contrast, we do not know whether learnability is preserved under direct product (which is to say, by taking fixed-size boolean combinations of classes of the form $B(M)$): if it were, many of the open problems in this paper would be resolved, but we have no general argument or counterexample.
3  Learning from Small-Weight Assignments
The small-weight strategy applies to function classes with the following property.

Definition 1. For an assignment $a \in \{0,1\}^n$, the weight of $a$ is defined as the number of 1s it contains, and denoted $w(a)$. A representation class $C$ is $k$-narrowing if every two different functions $f, g \in C$ of the same arity differ on some assignment of weight at most $k$. ($k$ may actually be a function of some other parameters, such as the arity of $f$ and $g$ or their size in $C$.)

The following is essentially proved in [GTT06].

Theorem 2. If $C$ is $k$-narrowing, then $C$ can be identified with $n^{O(k)}$ Membership queries (and possibly unbounded time).
The algorithm witnessing this is simple: ask all assignments in $\{0,1\}^n$ of weight at most $k$, of which there are at most $n^{O(k)}$. Then find any function $f \in C$ consistent with all answers. By the narrowing property, that $f$ must be equivalent to the target.
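A direct rendering of this procedure (my own sketch; `oracle` answers Membership queries and `candidates` enumerates the class, so the final search, as the theorem allows, may take unbounded time):

```python
from itertools import combinations

def identify_by_small_weight(candidates, oracle, n, k):
    # query every assignment of weight at most k (n^{O(k)} Membership queries),
    # then return any candidate agreeing with all answers; for a k-narrowing
    # class this candidate is equivalent to the target
    low_weight = [tuple(1 if i in s else 0 for i in range(n))
                  for r in range(k + 1) for s in combinations(range(n), r)]
    answers = {a: oracle(a) for a in low_weight}
    for f in candidates:                     # this search may take long
        if all(f(a) == v for a, v in answers.items()):
            return f
    return None
```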
3.1  Groups with Lower Bounds
It was shown in [Bar89] and [GTT06], respectively, that nonsolvable groups and nongroups can compute any conjunction of variables and their negations by a polynomial-size program. Any class of functions with this property is not $n$-narrowing, and by a standard adversary argument it requires $2^n$ Membership queries to be identified. Therefore we have:

Fact 3. If $M$ is not a group, or if $M$ is a nonsolvable group, then $B(M)$ cannot be identified with a subexponential number of Membership queries.

Therefore, Membership learnability of classes of the form $B(M)$ is restricted, at most, to solvable groups. There are two maximal subclasses of solvable groups for which lower bounds on their computational power are known, and in both cases the lower bound is essentially a narrowing property.

Fact 4. 1. For every nilpotent group $M$ there is a constant $k$ such that $B(M)$ is $k$-narrowing [PT88]. Therefore $B(M)$ can be identified from $n^{O(k)}$ Membership queries (and possibly unbounded time).
2. For every group $G \in \mathrm{Gp} \wr \mathrm{Ab}$ there is a constant $c$ such that $B(G)$ is $(c\log s)$-narrowing [BST90]. Therefore, programs over $G$ of length $s$ can be identified from $n^{O(\log s)}$ Membership queries.

The next two theorems give specific, time-efficient versions of this strategy for Abelian groups and $\mathrm{Gp} \wr \mathrm{Ab}$ groups. These are, to our knowledge, new learning algorithms.

Theorem 5. For every Abelian group $G$, $B(G)$ is learnable from Membership queries in time $n^c$, for a constant $c = c(G)$.

Theorem 6. For every $G \in \mathrm{Gp} \wr \mathrm{Ab}$ with $p$ prime, $B(G)$ is learnable from Membership queries in $n^{c\log s}$ time, for a constant $c = c(G)$. (Recall that $s$ stands for the length of the shortest program computing the target function.)

Proofs are given as an Appendix in the extended version.
3.2  Interpretation in Circuit Terms
Let us now interpret these results in circuit terms. It is easy to see that programs over a fixed Abelian group are polynomially equivalent to boolean combinations of MODm gates, for some $m$ depending on the group. Theorem 5 then implies:

Corollary 1. For every $m$, fixed-size boolean combinations of MODm gates are learnable from Membership queries in time $n^c$, for $c = c(m)$.

Also, it is shown in [BST90] that programs over a fixed group in $\mathrm{Gp} \wr \mathrm{Ab}$ are polynomially equivalent to $\mathrm{MOD}_p \circ \mathrm{MOD}_m$ circuits. Such circuits were shown in [BBTV97] to be learnable from Membership and Equivalence queries in polynomial time, by showing that they have small Multiplicity Automata; a generalization of their construction is used in Section 5. Theorem 6 shows that Membership queries suffice, if quasipolynomial time is allowed:

Corollary 2. For every prime $p$ and every $m$, functions computed by $\mathrm{MOD}_p \circ \mathrm{MOD}_m$ circuits of size $s$ are learnable from Membership queries in time $n^{O(\log s)}$.

As an example, the 6-element permutation group on 3 points, $S_3$, can be described via the wreath product $\mathbb{Z}_3 \wr \mathbb{Z}_2$. Intuitively, each permutation can be described by a rotation and a flip, which interact when permutations are composed, so a direct product does not suffice. Programs over $S_3$ are polynomially equivalent to $\mathrm{MOD}_3 \circ \mathrm{MOD}_2$ circuits, and our result claims that they are learnable from $n^{c\log s}$ Membership queries for some $c$.
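The rotation-and-flip description is easy to make concrete. A sketch of my own of composition in these coordinates (the dihedral presentation of $S_3$ that the wreath decomposition refines):

```python
def s3_mul(g, h):
    # compose two S3 elements in (rotation, flip) coordinates: the flip part
    # of the first element twists how the second rotation is incorporated
    (r1, f1), (r2, f2) = g, h
    return ((r1 + (r2 if f1 == 0 else -r2)) % 3, (f1 + f2) % 2)

elems = [(r, f) for r in range(3) for f in range(2)]
assert all(s3_mul(g, (0, 0)) == g for g in elems)          # identity element
assert s3_mul((1, 0), (0, 1)) != s3_mul((0, 1), (1, 0))    # non-Abelian
```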
3.3  Open Questions on Groups and Related Work
While programs over nilpotent groups can be identified from polynomially many Membership queries, we have not resolved whether a time-efficient algorithm exists, even in the far more powerful PAC+Membership model. In other words,
we know that the values of such a program on all small-weight assignments are sufficient to identify it uniquely, but can these values be used to efficiently predict the value of the program on an arbitrary assignment?

In circuit terms, by results of [PT88], such programs can be shown to be polynomially equivalent to fixed-size boolean combinations of $\mathrm{MOD}_m \circ \mathrm{NC}^0$ circuits or, equivalently, to polynomials of constant degree over $\mathbb{Z}_m$. We are not even aware of algorithms learning a single $\mathrm{MOD}^A_m \circ \mathrm{NC}^0$ circuit for arbitrary sets $A$. When $m$ is prime, one can use Fermat's little theorem to make sure that the MODm gate receives only inputs summing to either 0 or 1, at the expense of increasing the arity of the $\mathrm{NC}^0$ part. Then, one can set up a set of linear equations where the unknowns are the coefficients of the target polynomial and each small-weight assignment provides an equation with constant term either 0 or 1. The solution of this system must be equivalent to the target function.

For solvable groups that are neither nilpotent nor in $\mathrm{Gp} \wr \mathrm{Ab}$, the situation is even worse in the sense that we do not have lower bounds on their computational power, i.e., we cannot show that they are weaker than $\mathrm{NC}^1$. Observe that any learning result would establish a separation from $\mathrm{NC}^1$, conditional on the cryptographic assumptions under which $\mathrm{NC}^1$ is nonlearnable. In another direction, while lower bounds do exist for $\mathrm{MOD}_p \circ \mathrm{MOD}_m$ circuits, we do not have them for $\mathrm{MOD}_p \circ \mathrm{MOD}_m \circ \mathrm{NC}^0$; linear lower bounds for some particular cases were given in [CGPT06].

Let us note that programs over Abelian groups (equivalently, boolean combinations of MODm gates) are particular cases of the multi-symmetric concepts studied in [BCJ93]. Multi-symmetric concepts are there shown to be learnable from Membership and Equivalence queries, while we showed that for these particular cases Membership queries suffice. XOR's of $k$-terms and depth-$k$ decision trees are special cases of $\mathrm{MOD}_m \circ \mathrm{NC}^0$ previously noticed to be learnable from Membership queries alone [BK].
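The system-solving idea sketched above can be illustrated in the simplest case $m = 2$, where a degree-$\le k$ polynomial over GF(2) is recovered from weight-$\le k$ queries by Möbius inversion (my own sketch; the general prime case additionally needs the Fermat preprocessing just described):

```python
from itertools import combinations

def learn_gf2_polynomial(oracle, n, k):
    # recover a degree-<=k multilinear polynomial over GF(2) from queries of
    # weight at most k: over GF(2), the coefficient of the monomial with
    # variable set S is the XOR of f over the indicators of all subsets of S
    def f(subset):
        return oracle(tuple(1 if i in subset else 0 for i in range(n)))
    coefficients = {}
    for r in range(k + 1):
        for s in combinations(range(n), r):
            c = 0
            for r2 in range(r + 1):
                for t in combinations(s, r2):
                    c ^= f(t)
            if c:
                coefficients[s] = 1
    return coefficients

target = lambda x: (x[0] & x[1]) ^ x[2]          # x0*x1 + x2 over GF(2)
print(learn_gf2_polynomial(target, 3, 2))        # -> {(2,): 1, (0, 1): 1}
```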
4  Learning Intersection-Closed Classes
In this section we observe that classes of the form $\mathrm{DL} \circ \mathrm{MOD}^A_m \circ \mathrm{NC}^0$ are learnable from Equivalence queries (for some particular combinations of $m$ and accepting sets $A$). The algorithm is actually the combination of two well-known algorithms (plus the composition theorem to deal with $\mathrm{NC}^0$): 1) the algorithm for learning submodules of a module in [HSW90] (though probably known before); 2) the algorithm in the companion paper extending it to nested differences of intersection-closed classes, also in [HSW90]. More interestingly, we show that the classes above have a natural algebraic interpretation, and use this interpretation to argue that they may be very close to a stopping barrier for a certain kind of learning.
4.1  The Learning Algorithm
Theorem 7. For every $m$ and $k$ the class $\mathrm{DL} \circ \mathrm{MOD}^{[m]-\{0\}}_m \circ \mathrm{NC}^0_k$ is learnable from Equivalence queries in time polynomial in $m$, $2^{2^k}$, and $n^k$.
Proof. (Sketch; details are given in the extended version.) By the composition theorem, it suffices to show that $\mathrm{DL} \circ \mathrm{MOD}^{[m]-\{0\}}_m$ is learnable from Equivalence queries, and we can even assume that the inputs to these $\mathrm{MOD}^{[m]-\{0\}}_m$ gates are variables (no negations, no constants). Now observe that this class of functions is the set of negations of nested differences of functions computed by $\mathrm{MOD}^{\{0\}}_m$ gates, where we identify a function with the set of assignments on which it evaluates to 1. Furthermore, the set represented by every $\mathrm{MOD}^{\{0\}}_m$ gate is a submodule (a set closed under addition) of $\mathbb{Z}^n_m$, and submodules are intersection-closed. Therefore, the algorithm in [HSW90] for learning nested differences of intersection-closed classes applies, and one can show it learns the class above with $n^{O(1)}$ queries.

Note that in Theorem 7 the running time does not depend on the length of the decision list that is being learned. In fact, as a byproduct of this proof one can see that the length of these decision lists can be limited to a polynomial in $m$ and $n^k$ without actually restricting the class of functions being computed. Intuitively, this is because there can be only that many linearly independent such MOD gates, and a node whose answer is determined by the previous ones in the decision list can be removed. Thus, for constant $m$ and $k$, this class can compute at most $2^{n^{O(1)}}$ $n$-ary boolean functions and is not universal.

Also, note that we claim this result for MODm gates having all but 0 as accepting elements. In the special case that $m$ is a prime $p$, we can deal with arbitrary accepting sets:

Theorem 8. For every prime $p$, every $k$, and arbitrary accepting sets $A$ (possibly distinct at every MOD gate) the class $\mathrm{DL} \circ \mathrm{MOD}^A_p \circ \mathrm{NC}^0_k$ is learnable from Equivalence queries in time $n^c$, where $c = c(p,k)$.
210
R. Gavald` a and D. Th´erien
read-restrictions, neither of which can be captured in our algebraic setting. It 0 is interesting that each of these classes, and in fact all of DL ◦ MODA p ◦ NCk , O(1)
functions. They are therefore not universal, i.e. they contain at most 2n cannot represent all boolean functions. This fact seems more significant after we observe (in the next section) that a computationally equivalent class of monoids is in fact the largest one known to contain only non-universal monoids. 4.2
Interpretation in Algebraic Terms
Classes closely related to those in the previous section have clear precise algebraic interpretations. They involve the class DA of monoids, of which we give here an operational definition. Formal definitions can be found e.g. in [Sch76, GT03, Tes03, TT04]. Let M be a monoid in DA. Then the product of elements m1 , . . . , mn in M can be determined by knowing the truth or falsehood of a fixed number of boolean conditions of the form “m1 . . . mn , as a string over M , admits a factorization of the form L0 a1 L1 a2 . . . ak Lk ”, where 1) the ai are elements of M , 2) each Li is a language such that x ∈ Li can be determined solely by the set of letters appearing in x, and 3) the expression L0 a1 L1 a2 . . . ak Lk is unambiguous, i.e., every string has at most one factorization in it. As mentioned already in the introduction, it was shown in [GT03] that programs over monoids in DA are equivalent in power to decision trees in bounded rank [EH89], where the required rank of the decision trees is related to the parameter k in its definition in the particular DA monoid. In particular, programs over a fixed DA monoid can be simulated both by CNF and DNF formulas of size nO(1) and by decision lists with bounded-length terms at the nodes, and can be learned in the PAC and Equivalence-query models [EH89, Riv87, Sim95]. We then have the following characterization: Theorem 9. 1. B(DA Gnil ) = m,k DL ◦ MODm ◦ NC0k = m,k DL ◦ ◦ NC0k . MOD{0} p 2. B(DA Gp ) = m,k DL ◦ MODp ◦ NC0k = m,k DL ◦ MOD{0} ◦ NC0k = p [m]−{0} ◦ NC0k . m,k DL ◦ MODp
The proof is omitted in this version. Intuitively, nilpotent groups provide the “group” behavior of MODm ◦ NC0 and decision lists are equivalent to DA ◦ NC0 . A key ingredient is the fact that a MODpα gate can be simulated by a MODp ◦ NC0 circuit; see e.g. [BT94] for a proof. The difference between parts (1) and (2) is again the possibility of using Fermat’s little theorem reduce gates to singleton accepting sets. From this theorem and Theorem 8, it follows that we can learn programs over DA Gp monoids from Equivalence queries, yet we do not know how to learn (to our knowledge) programs over DA Gnil in any model. This algebraic interpretation lets us explore this gap in learnability and, in particular, the limitation of the learning paradigm in the previous subsection. Since every p-group is nilpotent and it can be shown that DA Gnil monoids can only have nilpotent subgroups, we have
$$\mathrm{DA} \wr \mathrm{Gp} \;\subseteq\; \mathrm{DA} \wr \mathrm{Gnil} \;\subseteq\; \mathrm{DA} \wr \mathrm{G} \cap \mathrm{Mnil},$$
where Mnil is the class of monoids having only nilpotent subgroups. Yet, there is an important difference in what we know about $\mathrm{DA} \wr \mathrm{Gp}$ and $\mathrm{DA} \wr \mathrm{Gnil}$. Following [Tes03, TT04], a monoid $M$ is said to have the Polynomial Length Property (or PLP) if every program over $M$, regardless of its length, is equivalent to another one whose length is polynomial in $n$. Clearly, every monoid with the PLP is nonuniversal, and the converse is conjectured in [Tes03, TT04]. More specifically, the following was shown in [Tes03, TT04]:
– Every monoid not in $\mathrm{DA} \wr \mathrm{G} \cap \mathrm{Mnil}$ is universal.
– Every monoid in $\mathrm{DA} \wr \mathrm{Gp}$ has the PLP, hence is not universal.

The question of either PLP or universality is thus open for $\mathrm{DA} \wr \mathrm{Gnil}$, sitting between $\mathrm{DA} \wr \mathrm{Gp}$ and $\mathrm{DA} \wr \mathrm{G} \cap \mathrm{Mnil}$, so resolving its learnability may require new insights besides the intersection-closure/submodule-learning algorithm. Note that, contrary to what one could think, $\mathrm{DA} \wr \mathrm{Gnil}$ is not equal to $\mathrm{DA} \wr \mathrm{G} \cap \mathrm{Mnil}$: there are monoids that, in this context, can be built by using unsolvable groups and later using homomorphisms to leave only nilpotent groups, and that cannot be obtained starting from nilpotent groups alone. Current techniques seem insufficient (and may remain unable forever) to analyze even these traces of unsolvability.

Are there other extensions of $\mathrm{DA} \wr \mathrm{Gp}$ that we could investigate from the learning point of view? The obvious one is trying to extend the DA or Gp parts separately. For the DA part, it is known [Sch76, Tes03] that every aperiodic monoid not in DA is necessarily divided by one of two well-identified monoids, named $U$ and $BA_2$. Monoid $U$ is the syntactic monoid of the language $\{a,b\}^* aa \{a,b\}^*$, and programs over $U$ are equivalent in power, up to polynomials, to DNF formulas. Therefore, by Fact 1, extending DA in this direction implies learning at least DNF. Monoid $BA_2$ is the syntactic monoid of $(ab)^*$, and interestingly, although it is aperiodic, programs over it can be simulated (essentially) by OR gates fed by parity gates. In fact it is in $\mathrm{DA} \wr \mathrm{Gp}$ for every $p$, so we know it is learnable.

If we try to extend on the group part, we have already mentioned that the two classes of groups beyond Gp for which we have lower bounds are Gnil and $\mathrm{Gp} \wr \mathrm{Ab}$. We have already discussed the problems concerning $\mathrm{DA} \wr \mathrm{Gnil}$. As for $\mathrm{Gp} \wr \mathrm{Ab}$, these groups correspond to $\mathrm{MOD}_p \circ \mathrm{MOD}_m$ circuits, and we showed them to be learnable from Membership queries alone in the previous section. With Equivalence queries, however, learning $\mathrm{MOD}_p \circ \mathrm{MOD}_m$ would also imply learning $\mathrm{MOD}_p \circ \mathrm{MOD}_m \circ \mathrm{NC}^0$ and, as discussed in the previous section, this seems difficult because we cannot even prove now that these circuits cannot do $\mathrm{NC}^1$. In particular, even learning programs over $S_3$ (i.e. $\mathrm{MOD}_3 \circ \mathrm{MOD}_2$ circuits) from Equivalence queries alone seems unresolved now.
5 Learning as Multiplicity Automata
The learning algorithm for multiplicity automata [BV96, BBB+00] elegantly unified many previous results and also implied learnability of several new classes.
It has remained one of the "maximal" learning algorithms for boolean functions, in the sense that no later result has superseded it.

Theorem 10. [BV96, BBB+00] Let F be any finite field. Functions Σ∗ → F represented as Multiplicity Automata over F are learnable from Evaluation and Equivalence queries in time polynomial in the size of the MA and |Σ|.

We can use Multiplicity Automata to compute boolean functions as follows: we take Σ = {0, 1} and some accepting subset A ⊆ F, and the function evaluates to 1 on an input if the MA outputs an element in A, and to 0 otherwise. However, as essentially argued in [BBTV97], we can use Fermat's little theorem to turn an MA into one that always outputs either 0 or 1 (as field elements) with only polynomial blowup, and therefore we can omit the accepting subset. In this section we identify a class of monoids whose programs can be simulated by MA's, but not the other way round. Yet, this class can simulate most classes of boolean functions whose learnability was proved via the MA-learning algorithm. Note that it will be impossible to find a class of monoids that, in our setting, is precisely equivalent (up to polynomial blowup) to the whole class of MA. This is true for the simple reason that the complexity of a function measured as "shortest program length" cannot grow under renaming of input variables: it suffices to change the variable names in the instructions of the program. MA, on the other hand, read their input in the fixed order x1, . . . , xn, so renaming the input variables in a function can force an exponential growth in MA size. Consider as an example the function ⋀_{i=1}^{n} (x_{2i−1} = x_{2i}): clearly, it is computed by an MA of size O(n) that simply checks equality of appropriate pairs of adjacent letters in its input string. However, its permutation ⋀_{i=1}^{n} (x_i = x_{2n−i+1}) is the palindrome function, whose MA size is roughly 2^n over any field.

Our characterization uses the notion of Mal'tsev product of two monoids A and B, denoted A m B. We do not define the algebraic operation formally. We use instead the following property, specific to our case [Wei87]: Let M be a monoid in LGp m Com, i.e., the Mal'tsev product of a monoid in LGp by one in Com. Then the product in M of a string of elements m1 . . . mn can be determined from the truth of a fixed number of logical conditions of the following form: there are elements a1, . . . , ak in M, a number r ∈ [p], and commutative languages L0, . . . , Lk over M such that the number of factorizations of m1 . . . mn of the form L0 a1 L1 a2 L2 . . . L_{k−1} ak Lk is r modulo p.

Contrived as it seems, LGp m Com is a natural borderline in representation theory. Recent and deep work by Margolis et al. [AMV05, AMSV09] shows that semigroups in LGp m Com are exactly those that can be embedded into a semigroup of upper-triangular matrices over a field of characteristic p (and of any size). The main result in this section is:

Theorem 11. Let M be a monoid in LGp m Com. Suppose that M is defined as above by a boolean combination of at most ℓ conditions of length at most k using commutative languages whose monoid has size C. Then every program of length s over M is equivalent to an MA over Fp of size (s + C)^c, where c = c(p, ℓ, k).
Corollary 3. Programs over monoids in LGp m Com are polynomially simulated by MA's over Fp that are direct sums of constant-width MA's.

Proof (of Theorem 11, sketch). Fix a program (P, A) over M of length s. Let m1, . . . , ms be the sequence of elements in M produced by the instructions of P for a given input x1 . . . xn. The value of P on an input, hence whether it belongs to A, can be determined from the truth or falsehood of conditions as described above, each one given by a tuple of letters a1, . . . , ak and commutative languages L0, . . . , Lk. For each such condition, we build an MA to check it as follows: the MA is the direct sum of at most s^k MA's, one for each choice of the positions where the a1 . . . ak witnessing a factorization could appear. Each MA concurrently checks that each of the chosen positions contains the right ai (when the input variable producing the corresponding element mj is available) and concurrently checks whether the subword wi between ai and ai+1 is in the language Li. Crucially, since Li is in Com, membership of wi in Li can be computed by a fixed-width automaton, regardless of the order in which the variables producing wi are read. The automaton produces 0 if this check fails for some i, and 1 otherwise. It can be checked that the resulting automaton for each choice has size polynomial in s. For each condition L0 a1 L1 . . . ak Lk, counting the number of factorizations mod p amounts to taking the MA's built for all possible guesses and adding them over Fp. To conclude the proof, take all MA's resulting from the previous construction and raise them to the p-th power. That increases their size by a power of p, and by Fermat's little theorem they become 0/1-valued. The boolean combination of several conditions can then be expressed by (a fixed number of) sums and products in Fp, with polynomial blowup.

We next note several classes that were shown to be learnable by proving they are polynomially simulated by MA.

Theorem 12. The following classes of boolean functions are polynomially simulated by programs over LGp m Com, hence are learnable from Membership and Equivalence queries as MA:
– Polynomials over Fp (when viewed as computing boolean functions);
– Unambiguous DNF functions; these include decision trees and k-term DNF for constant k;
– Constant-degree, depth-three ΣΠΣ arithmetic circuits [KS06], when restricted to boolean functions.

An interesting case is that of O(log n)-term DNF. It was observed in [Kus97] that c log n-term DNF can be rewritten into DFA of size roughly n^c, hence can be learned from Membership and Equivalence queries by Angluin's algorithm [Ang87]. It is probably false that c log n-term DNF can be simulated by programs over a fixed monoid in LGp m Com. However, we note that for every c and n, c log n-term DNF is simulated by a monoid of size n^c that is
easily computed from c and n and commutative, hence in LGp m Com. (See the extended version for details.)

Finally, we conjecture that LGp m Com is the largest class of monoids that are polynomially simulated by MA, hence the largest class we can expect to learn from MA within our algebraic framework:

Conjecture 1. If a monoid M is not in LGp m Com, then programs over M are not polynomially simulated by MA's over Fp.

The proof of this conjecture should be within reach given the characterization given in [TT06] of the monoids that are not in LGp m Com: this happens iff the monoid is divided by either the monoids U or BA2 described before, or by a so-called Tq monoid, or by a monoid whose commutator subgroup is not a p-group. It would thus suffice to show that programs over these four kinds of monoids cannot always be polynomially simulated by MA over Fp.
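To make the multiplicity-automaton computations of this section concrete, here is a small Python sketch (our illustration, not part of the original development). An MA over Fp is given by an initial vector, one transition matrix per letter, and a final vector, and its value on a word is the corresponding matrix product; the toy automaton below is a constant-width instance of the adjacent-pairs equality function mentioned above, and the last line demonstrates the Fermat's-little-theorem trick of raising outputs to the power p − 1 so that every field element becomes 0 or 1.

```python
import numpy as np

def evaluate_ma(init, trans, final, word, p):
    """Evaluate a multiplicity automaton over F_p on `word`:
    the value is init . A_{w_1} ... A_{w_n} . final (mod p)."""
    v = init.copy()
    for letter in word:
        v = v @ trans[letter] % p
    return int(v @ final % p)

p = 5  # any prime field F_p

# Toy constant-width MA checking that adjacent pairs of letters agree,
# i.e. computing AND_{i=1..n} (x_{2i-1} = x_{2i}) on inputs of even length.
# States: 0 = between pairs (all checks passed so far),
#         1 = saw a '0' opening a pair, 2 = saw a '1', 3 = fail (absorbing).
A0 = np.zeros((4, 4), dtype=int); A1 = np.zeros((4, 4), dtype=int)
A0[0, 1] = 1; A0[1, 0] = 1; A0[2, 3] = 1; A0[3, 3] = 1  # reading letter 0
A1[0, 2] = 1; A1[2, 0] = 1; A1[1, 3] = 1; A1[3, 3] = 1  # reading letter 1
init = np.array([1, 0, 0, 0]); final = np.array([1, 0, 0, 0])
trans = {0: A0, 1: A1}

print(evaluate_ma(init, trans, final, [0, 0, 1, 1], p))  # 1: pairs (0,0),(1,1) agree
print(evaluate_ma(init, trans, final, [0, 1, 1, 1], p))  # 0: first pair disagrees

# Fermat's little theorem: t -> t^(p-1) mod p maps every field element to 0 or 1,
# so an arbitrary F_p-valued MA output can be made boolean.
print([pow(t, p - 1, p) for t in range(p)])  # [0, 1, 1, 1, 1]
```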
References

[AMSV09] Almeida, J., Margolis, S.W., Steinberg, B., Volkov, M.V.: Representation theory of finite semigroups, semigroup radicals and formal language theory. Trans. Amer. Math. Soc. 361, 1429–1461 (2009)
[AMV05] Almeida, J., Margolis, S.W., Volkov, M.V.: The pseudovariety of semigroups of triangular matrices over a finite field. RAIRO - Theoretical Informatics and Applications 39(1), 31–48 (2005)
[Ang87] Angluin, D.: Learning regular sets from queries and counterexamples. Information and Computation 75, 87–106 (1987)
[Ang88] Angluin, D.: Queries and concept learning. Machine Learning 2, 319–342 (1988)
[Bar89] Barrington, D.A.: Bounded-width polynomial-size branching programs recognize exactly those languages in NC1. Journal of Computer and System Sciences 38, 150–164 (1989)
[BBB+00] Beimel, A., Bergadano, F., Bshouty, N.H., Kushilevitz, E., Varricchio, S.: Learning functions represented as multiplicity automata. Journal of the ACM 47, 506–530 (2000)
[BBTV97] Bergadano, F., Bshouty, N.H., Tamon, C., Varricchio, S.: On learning branching programs and small depth circuits. In: Ben-David, S. (ed.) EuroCOLT 1997. LNCS, vol. 1208, pp. 150–161. Springer, Heidelberg (1997)
[BCJ93] Blum, A., Chalasani, P., Jackson, J.C.: On learning embedded symmetric concepts. In: COLT, pp. 337–346 (1993)
[BK] Bshouty, N.H., Kushilevitz, E.: Learning from membership queries / online learning. Course notes on N. Bshouty's homepage
[BST90] Mix Barrington, D.A., Straubing, H., Thérien, D.: Non-uniform automata over groups. Information and Computation 89, 109–132 (1990)
[BT94] Beigel, R., Tarui, J.: On ACC. Computational Complexity 4, 350–366 (1994)
[BV96] Bergadano, F., Varricchio, S.: Learning behaviors of automata from multiplicity and equivalence queries. SIAM Journal on Computing 25, 1268–1280 (1996)
[CGPT06] Chattopadhyay, A., Goyal, N., Pudlák, P., Thérien, D.: Lower bounds for circuits with MODm gates. In: FOCS, pp. 709–718 (2006)
[CKK+07] Chattopadhyay, A., Krebs, A., Koucký, M., Szegedy, M., Tesson, P., Thérien, D.: Languages with bounded multiparty communication complexity. In: Thomas, W., Weil, P. (eds.) STACS 2007. LNCS, vol. 4393, pp. 500–511. Springer, Heidelberg (2007)
[EH89] Ehrenfeucht, A., Haussler, D.: Learning decision trees from random examples. Information and Computation 82(3), 231–246 (1989)
[GT03] Gavaldà, R., Thérien, D.: Algebraic characterizations of small classes of boolean functions. In: Alt, H., Habib, M. (eds.) STACS 2003. LNCS, vol. 2607, pp. 331–342. Springer, Heidelberg (2003)
[GTT06] Gavaldà, R., Tesson, P., Thérien, D.: Learning expressions and programs over monoids. Inf. Comput. 204(2), 177–209 (2006)
[HS07] Hellerstein, L., Servedio, R.A.: On PAC learning algorithms for rich boolean function classes. Theor. Comput. Sci. 384(1), 66–76 (2007)
[HSW90] Helmbold, D.P., Sloan, R.H., Warmuth, M.K.: Learning nested differences of intersection-closed concept classes. Machine Learning 5, 165–196 (1990)
[KLPV87] Kearns, M.J., Li, M., Pitt, L., Valiant, L.G.: On the learnability of boolean formulae. In: STOC, pp. 285–295 (1987)
[KS06] Klivans, A.R., Shpilka, A.: Learning restricted models of arithmetic circuits. Theory of Computing 2(1), 185–206 (2006)
[Kus97] Kushilevitz, E.: A simple algorithm for learning O(log n)-term DNF. Inf. Process. Lett. 61(6), 289–292 (1997)
[PT88] Péladeau, P., Thérien, D.: Sur les langages reconnus par des groupes nilpotents. Compte-rendus de l'Académie des Sciences de Paris, 93–95 (1988); English translation available as ECCC-TR01-040, Electronic Colloquium on Computational Complexity (ECCC)
[Riv87] Rivest, R.L.: Learning decision lists. Machine Learning 2(3), 229–246 (1987)
[Sch76] Schützenberger, M.P.: Sur le produit de concaténation non ambigu. Semigroup Forum 13, 47–75 (1976)
[She08] Sherstov, A.A.: Communication lower bounds using dual polynomials. Bulletin of the EATCS 95, 59–93 (2008)
[Sim95] Simon, H.-U.: Learning decision lists and trees with equivalence-queries. In: Vitányi, P.M.B. (ed.) EuroCOLT 1995. LNCS, vol. 904, pp. 322–336. Springer, Heidelberg (1995)
[Tes03] Tesson, P.: Computational Complexity Questions Related to Finite Monoids and Semigroups. PhD thesis, School of Computer Science, McGill University (2003)
[TT04] Tesson, P., Thérien, D.: Monoids and computations. Intl. Journal of Algebra and Computation 14(5-6), 801–816 (2004)
[TT05] Tesson, P., Thérien, D.: Complete classifications for the communication complexity of regular languages. Theory Comput. Syst. 38(2), 135–159 (2005)
[TT06] Tesson, P., Thérien, D.: Bridges between algebraic automata theory and complexity. Bull. of the EATCS 88, 37–64 (2006)
[Val84] Valiant, L.G.: A theory of the learnable. Communications of the ACM 27, 1134–1142 (1984)
[Wei87] Weil, P.: Closure of varieties of languages under products with counter. J. of Comp. Syst. Sci. 2(3), 229–246 (1987)
Adaptive Estimation of the Optimal ROC Curve and a Bipartite Ranking Algorithm

Stéphan Clémençon1 and Nicolas Vayatis2

1 LTCI, Telecom Paristech (TSI) - UMR Institut Telecom/CNRS 5141
[email protected]
2 CMLA, ENS Cachan & UniverSud - UMR CNRS 8536
61, avenue du Président Wilson - 94235 Cachan cedex, France
[email protected]
Abstract. In this paper, we propose an adaptive algorithm for bipartite ranking and prove its statistical performance in a stronger sense than the AUC criterion. Our procedure builds on and significantly improves the RankOver algorithm proposed in [1]. The algorithm outputs a piecewise constant scoring rule which is obtained by overlaying a finite collection of classifiers. Here, each of these classifiers is the empirical solution of a specific minimum-volume set (MV-set) estimation problem. The major novelty arises from the fact that the levels of the MV-sets to recover are chosen adaptively from the data to adjust to the variability of the target curve. The ROC curve of the estimated scoring rule may be interpreted as an adaptive spline approximant of the optimal ROC curve. Error bounds for the estimate of the optimal ROC curve in terms of the L∞ -distance are also provided.
1 Introduction
For a few decades, ROC curves have been widely used as the gold standard for assessing performance in areas such as signal detection, medical diagnosis, and credit risk screening. More recently, ROC analysis has become an area of growing interest in Machine Learning. Various aspects are considered in this new approach, such as model evaluation, model selection, machine learning metrics for evaluating performance, model construction, multiclass ROC, geometry of the ROC space, confidence bands for ROC curves, improving performance of classifiers, connections between classifiers and rankers, and model manipulation (see for instance [2] and references therein). We focus here on the problem of bipartite ranking and the issue of ROC curve optimization. Previous work on bipartite ranking ([3], [4], [5]) considered the AUC criterion as the optimization target. However, this criterion is known to weight the errors uniformly, while ranking rules with similar AUC may behave very differently on a subset of the input space. In the paper, we focus on two problems: (i) the estimation of the optimal curve ROC∗, (ii) the construction of a consistent scoring rule whose ROC curve converges in supremum norm to the optimal ROC∗. In contrast to binary classification or AUC maximization, the classical empirical risk minimization approach
cannot be invoked here because of the function-like nature of the performance measure and the use of the supremum norm as a metric. The approach taken here follows the perspective sketched in [1], and further explored in [6]. In these two papers, ranking rules made of overlaying classifiers were considered and the RankOver algorithm was introduced. Dealing with a function-like optimization criterion such as the ROC curve requires performing both curve approximation and statistical estimation. In the RankOver algorithm, the approximation step is conducted with a piecewise linear approximation with fixed breakpoints on the false positive rate axis. The estimation part involves a collection of classification problems with mass constraint. In [6], we improved this step by using a modified minimum-volume set approach inspired from [7] to solve this collection of constrained classification problems. More precisely, our method can be understood as a statistical version of a simple finite element method with an explicit scheme: it produces an accurate spline estimate of the optimal curve in the ROC space, together with a scoring rule whose ROC curve mimics the behavior of the optimal one. In our previous work [1], [6], bounds on the generalization rate of this ranking algorithm were obtained under strong conditions on the regularity of the optimal ROC curve. Indeed, it was assumed that the optimal ROC curve was twice continuously differentiable and that its derivative was bounded in the neighborhood of the origin. The purpose of this paper is to relax these regularity conditions. In particular, we provide an adaptive algorithm which selects breakpoints for the approximation of the ROC curve by means of a data-driven scheme which takes into account the variability of the target curve. Hence, the partition of the false positive rate axis is chosen according to the local regularity of the optimal curve. The paper is structured as follows. In Section 2, notations are set out and important concepts of ROC analysis are briefly described. Section 3 is devoted to the presentation of the adaptive approximation of the optimal ROC curve with dyadic recursive partitioning. In Section 4, theoretical results related to empirical minimum-volume set (MV-set) estimation are recalled. The adaptive statistical method for estimating the optimal ROC curve and the related ranking algorithm are presented in Sections 5 and 6 respectively, together with the main results of the paper. Proofs are postponed to the Appendix.
2 Setup

2.1 Probabilistic Model
The probabilistic setup is the same as the one in standard binary classification. Here and throughout, (X, Y ) denotes a pair of random variables where Y ∈ {−1, +1} is a binary label and X models some observation for predicting Y , taking its values in a high-dimensional feature space X ⊂ Rd . The joint distribution of (X, Y ) is entirely determined by the pair (μ, η) where μ denotes the marginal distribution of X and the regression function η(x) = P{Y = +1 | X = x}, x ∈ X . We also introduce the theoretical proportion p = P{Y = +1}, as well as G and H, the conditional distributions of X given Y = +1 and Y = −1 respectively.
Throughout the paper, these probability measures are assumed to be absolutely continuous with respect to each other. Equipped with these notations, one may write η(x) = p(dG/dH)(x)/(1 − p + p(dG/dH)(x)) and μ = pG + (1 − p)H.

2.2 Bipartite Ranking and ROC Curves
We briefly recall the issue of the bipartite ranking task and describe the key notions related to this statistical learning problem. Based on the observation of i.i.d. examples Dn = {(Xi, Yi) : 1 ≤ i ≤ n}, the goal is to learn how to order all instances x ∈ X in such a way that instances X with Y = +1 appear on top of the list with the largest possible probability. Clearly, the simplest way of defining an order relationship on X is to transport the natural order on the real line to the feature space through a scoring rule s : X → R. The notion of ROC curve, which we recall below, provides a functional criterion for evaluating the performance of the ordering induced by such a function. We denote by F^{−1}(t) = inf{u ∈ R : F(u) ≥ t} the pseudo-inverse of any càd-làg increasing function F : R → R, and by S the set of all scoring functions, i.e. the space of real-valued measurable functions on X.

Definition 1. (ROC curve) Let s ∈ S. The ROC curve of the scoring function s(x) is the càd-làg curve given by

α ∈ [0, 1] → ROC(s, α) = 1 − Gs ◦ Hs^{−1}(1 − α),

where Gs and Hs denote the conditional distributions of s(X) given Y = +1 and given Y = −1 respectively. We denote by ROC∗ the ROC curve for s = η.

When Gs(du) and Hs(du) are both continuous distributions, the ROC curve of s(x) is nothing else than the PP-plot:

t → (P{s(X) ≥ t | Y = −1}, P{s(X) ≥ t | Y = +1}).   (1)
It is a well-known result in ROC analysis that increasing transforms of the regression function η(x) form the class S∗ of optimal scoring functions, in the sense that their ROC curve, namely ROC∗ = ROC(η, ·), dominates the ROC curve of any other scoring function s(x) everywhere: ∀α ∈ [0, 1[, ROC(s, α) ≤ ROC∗(α). The proof of this fact is based on a simple application of Neyman-Pearson's lemma in hypothesis testing: the likelihood statistic Φ(X) = (1 − p)η(X)/(p(1 − η(X))) yields a uniformly most powerful statistical test for discriminating between the composite hypotheses H0 : Y = −1 and H1 : Y = +1 (i.e. H0 : X ∼ H and H1 : X ∼ G). Therefore, the power of any other test s(X) is smaller than that of the test based on η(X) at the same level α. Recall also that, when continuous, the curve ROC∗ is concave. Refer to [8] for a detailed list of properties of ROC curves.
Remark 1. (Alternative convention.) Note that ROC curves may be alternatively defined through the representation given in formula (1). With this convention, jumps in the graph, due to possible degeneracy points of H∗ and G∗, are continuously connected by line segments, see [9] for instance.

Hence, a good scoring function is such that, for any level α ∈ (0, 1), the power ROC(s, α) of the test it defines is close to the optimal power ROC∗(α). The sup-norm

||ROC∗ − ROC(s, ·)||∞ = sup_{α∈(0,1)} {ROC∗(α) − ROC(s, α)}
provides a natural way of measuring the performance of a scoring rule s(x). The ROC curve of the scoring function s(x) can be straightforwardly estimated from the training dataset Dn by the stepwise function

α ∈ (0, 1) → ROĈ(s, α) = 1 − Ĝs ◦ Ĥs^{−1}(1 − α),

where

Ĥs(t) = (1/n−) ∑_{i: Yi=−1} I{s(Xi) ≤ t} and Ĝs(t) = (1/n+) ∑_{i: Yi=+1} I{s(Xi) ≤ t},

with n+ = ∑_{i=1}^{n} I{Yi = +1} = n − n−.
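For illustration, the stepwise estimate can be computed directly from its definition; the short Python sketch below (ours, with hypothetical names and toy data) evaluates ROĈ(s, ·) on a grid of levels, and then the empirical AUC via the Mann-Whitney statistic mentioned in the next paragraph.

```python
import numpy as np

def empirical_roc(scores, labels, alphas):
    """Stepwise estimate ROC_hat(s, alpha) = 1 - G_hat_s(H_hat_s^{-1}(1 - alpha)),
    with H_hat_s, G_hat_s the empirical cdfs of the scores on the negative
    and positive examples respectively."""
    neg = np.sort(scores[labels == -1])
    pos = scores[labels == +1]
    roc = []
    for a in alphas:
        # empirical quantile H_hat_s^{-1}(1 - a): smallest t with H_hat_s(t) >= 1 - a
        k = max(int(np.ceil((1 - a) * len(neg))) - 1, 0)
        t = neg[k]
        roc.append(np.mean(pos > t))      # 1 - G_hat_s(t)
    return np.array(roc)

rng = np.random.default_rng(0)
labels = rng.choice([-1, 1], size=1000)
scores = labels + rng.normal(size=1000)   # informative toy scores
print(empirical_roc(scores, labels, np.linspace(0.05, 0.95, 19)))

# Empirical AUC as the Mann-Whitney statistic: the fraction of
# (positive, negative) pairs that the score ranks correctly.
pos, neg = scores[labels == 1], scores[labels == -1]
print(np.mean(pos[:, None] > neg[None, :]))
```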
However, the target curve ROC∗ is unknown in practice and no empirical counterpart is directly available for the deviation ||ROC∗ − ROC(s, ·)||∞. For this reason, empirical risk minimization (ERM) strategies are generally based on the L1-distance, leading to the popular AUC criterion: minimizing ||ROC∗ − ROC(s, ·)||_{L1([0,1])} indeed boils down to maximizing

AUC(s) := ∫_{α=0}^{1} ROC(s, α) dα.
An empirical counterpart of the AUC may be built from the Mann-Whitney statistic, see [5] and the references therein. Beyond this computational advantage, it is noteworthy that two scoring functions may have the same AUC while their ROC curves present very different shapes. Since the L1-distance does not account for local properties of the ROC curve, we point out the importance of deriving strategies for ROC curve estimation and optimization whose convergence is validated in a stronger sense than the AUC. The goal of this paper is precisely to provide an adaptive procedure for estimating ROC∗ in sup norm under mild regularity conditions.

Regularity of the curve ROC∗. In the subsequent analysis, the following assumptions will be required.

A1 The conditional distributions G∗(dt) and H∗(dt) of the random variable η(X) are continuous.
A2 The cumulative distribution function H∗ is strictly increasing on the support of H∗(dt).

We recall that under these assumptions one may give an explicit expression for the derivative of ROC∗. For any α ∈ (0, 1), we denote by Q∗(α) the quantile of order (1 − α) of the conditional distribution of η(X) given Y = −1.

Lemma 1. [9]. Suppose that assumptions A1 − A2 are fulfilled. Let α ∈ (0, 1) be such that Q∗(α) < 1. Then, ROC∗ is differentiable at α and

dROC∗(α)/dα = ((1 − p)/p) · Q∗(α)/(1 − Q∗(α)).

In [6], a statistical procedure for estimating the curve ROC∗, mimicking a linear spline approximation scheme, has been proposed in a very restrictive setup, stipulating that ROC∗ is of class C2 with its first two derivatives bounded. As shown by the result above, boundedness of the derivative of ROC∗ means that Q∗(0) := lim_{α→0} Q∗(α) < 1, in other words that η(X) stays bounded away from 1, or equivalently that the likelihood ratio Φ(X) = (1 − p)η(X)/(p(1 − η(X))) remains bounded. It is the purpose of this paper to examine to what extent one may estimate ROC∗ under weaker assumptions (see assumption A5 below), including cases where it has a vertical tangent at the origin.

2.3 Ranking by Overlaying Classifiers
From the angle embraced in this paper, ranking amounts to recovering the decreasing collection of level sets of the regression function η(x):

{{x ∈ X | η(x) > u}, u ∈ [0, 1]},

without necessarily knowing the corresponding levels. Indeed, any scoring function of the form

s∗(x) = ∫_{0}^{1} I{η(x) > Q∗(α)} dν(α),   (2)

where ν(dα) is an arbitrary finite positive measure on [0, 1] with the same support as the distribution H∗, is optimal with respect to the ROC criterion. The next proposition also illustrates this view on the problem. We set the notations:

R∗_α = {x ∈ X | η(x) > Q∗(α)} and R_{s,α} = {x ∈ X | s(x) > Q(s(X), α)},
where Q(s(X), α) is the quantile of order (1 − α) of the conditional distribution of s(X) given Y = −1.

Proposition 1. [6]. Let s be a scoring function and α ∈ (0, 1) such that Q∗(α) < 1. Suppose additionally that the cdf Hs (respectively, H∗) is continuous at Q(s(X), α) (resp. at Q∗(α)). Then, we have:

ROC∗(α) − ROC(s, α) = E(|η(X) − Q∗(α)| I{X ∈ R∗_α Δ R_{s,α}}) / (p(1 − Q∗(α))),
where Δ denotes the symmetric difference between sets.
This result shows that the pointwise difference between the dominating ROC curve and the one related to a candidate scoring function s may be interpreted as the error made in recovering the specific level set R∗_α through R_{s,α}.
3 Adaptive Approximation
Here we focus on very simple approximants of ROC∗, taken as piecewise constant curves. Precisely, to any subdivision σ : α0 = 0 < α1 < . . . < αK < α_{K+1} = 1 of the unit interval, we associate the curve given by:

∀α ∈ (0, 1), E_σ(ROC∗)(α) = ∑_{k=0}^{K} I{α ∈ [αk, α_{k+1})} · ROC∗(αk).   (3)
We point out that the approximant E_σ(ROC∗)(α) is actually a ROC curve. It coincides indeed with ROC(s∗_σ, ·) where s∗_σ is the piecewise constant scoring function given by:

∀x ∈ X, s∗_σ(x) = ∑_{k=1}^{K+1} I{x ∈ R∗_{αk}},   (4)

which is obtained by "overlaying" the regression level sets R∗_{αk} = {x ∈ X : η(x) > Q∗(αk)}, 1 ≤ k ≤ K.
Adaptive approximation. In free knot spline approximation, it is well-known folklore that the approximation rate in supremum norm by piecewise constant functions with at most K pieces is of the order O(K^{−1}) if and only if the target function belongs to the space BV([0, 1])¹, see Chapter 12 in [10]. From a practical perspective however, in the absence of full knowledge of the target curve, it is a very challenging task to determine a grid of points {αk : 1 ≤ k ≤ K} that yields a nearly optimal approximant. In the case where the points of the mesh grid are fixed in advance independently of the curve f to approximate, say with uniform spacing, the rate of approximation is of optimal order O(K^{−1}) if and only if f belongs to the space Lip1([0, 1]) of absolutely continuous functions f such that f′ ∈ L∞([0, 1]). The latter condition is precisely the type of assumption we would like to avoid in the present work. We propose to use adaptive approximation schemes instead of fixed grids. In such procedures, the mesh grid is progressively refined by adding new breakpoints, as further information about the local variation of the target is gained: this way, one uses a coarse mesh where the target is smooth, and a finer mesh where it exhibits high degrees of variability. Given the properties of the target ROC∗ (a concave and nondecreasing curve connecting (0, 0) to (1, 1)), an ideal mesh grid should be finer and finer as one gets close to the origin.

Dyadic recursive partitioning. For computational reasons, here we shall restrict ourselves to a dyadic grid of points α_{j,k} = k2^{−j}, with j ∈ N and
¹ Recall that the space BV([0, 1]) of functions of bounded variation on (0, 1) is the space of absolutely continuous functions f : (0, 1) → R such that f′ ∈ L1([0, 1]).
k ∈ {0, . . . , 2^j − 1}, and to partitions of the unit interval [0, 1] produced by recursive dyadic partitioning: any dyadic interval I_{j,k} = [α_{j,k}, α_{j,k+1}) is possibly split into two halves, producing the two siblings I_{j+1,2k} and I_{j+1,2k+1}, depending on the (estimated) local properties of the target curve. The adaptive estimation algorithm described in the next section will then appear as a top-down search strategy through a tree structure T, on which the I_{j,k}'s are aligned. Precisely, we will consider approximants of the form:

∑_{I_{j,k} ∈ {terminal nodes}} ROC∗(α_{j,k}) · I{α ∈ [α_{j,k}, α_{j,k+1})},
where the sum is taken over all dyadic intervals corresponding to terminal nodes, determined by weights ω(·) fulfilling the two conditions:

(i) (Keep-or-kill) For any dyadic interval I ⊂ [0, 1), the weight ω(I) belongs to {0, 1}.
(ii) (Heredity) If ω(I) = 1, then for any dyadic interval I′ such that I ⊂ I′, we have ω(I′) = 1. If ω(I) = 0, then for any dyadic subinterval I′ ⊂ I, we have ω(I′) = 0.

Each collection ω of weights satisfying these two constraints is said to be admissible and determines the nodes of a subtree T_ω of the tree T representing the set of all dyadic intervals. A dyadic subinterval I will be said to be terminal when ω(I) = 1 and ω(I′) = 0 for any dyadic subinterval I′ ⊂ I: terminal subintervals correspond to the outer leaves of T_ω and form a partition P_ω of [0, 1). The algorithm described in the next section consists of selecting those intervals, i.e. the set ω. We denote by σ_ω the mesh grid made of the endpoints of the terminal subintervals selected by the collection of weights ω. Given two admissible sequences of weights ω1 and ω2, the mesh σ_{ω1} is said to be finer than σ_{ω2} when {I : ω2(I) = 0} ⊂ {I : ω1(I) = 0}.
4 Empirical MV-Set Estimation
Beyond the functional approximation facet of the problem, another key ingredient of the estimation procedure consists of estimating specific points

(α_{j,k}, ROC∗(α_{j,k})) = (H(R∗_{α_{j,k}}), G(R∗_{α_{j,k}}))

lying on the optimal ROC curve, in order to gain information both about its location in the ROC space and about the way it locally varies. Following in the footsteps of [6], a constructive approach to this problem lies in viewing X \ R∗_α as the solution of the following minimum-volume set (MV-set) estimation problem:

min_{W ∈ B(X)} G(W) subject to H(W) > 1 − α,

where the minimum is taken over the set B(X) of all measurable subsets W ⊂ X. Equivalently, this boils down to solving the constrained optimization problem:

sup_{R ∈ B(X)} G(R) subject to H(R) ≤ α.
From a statistical perspective, the search should be based on the empirical distributions:

Ĥ(dx) = (1/n−) ∑_{i=1}^{n} I{Yi = −1} · δ_{Xi}(dx) and Ĝ(dx) = (1/n+) ∑_{i=1}^{n} I{Yi = +1} · δ_{Xi}(dx),

where δx denotes the Dirac mass at x ∈ X. An empirical version of the optimization problem above is then

OP(α, φ):  sup_{R ∈ R} Ĝ(R) subject to Ĥ(R) ≤ α + φ,

where φ is a complexity penalty and R a class of measurable subsets of X. We denote by R̂_α a solution of this problem. The success of this program hinges upon the richness of the class R and the calibration of the tolerance parameter φ, as shown by the next result established in [6] (see also [11]) and involving the following technical assumptions.

A3 For all α ∈ (0, 1), we have R∗_α ∈ R.
A4 The set R is such that the Rademacher average

A_n = E[ sup_{R ∈ R} (1/n) |∑_{i=1}^{n} ε_i I{Xi ∈ R}| ]

is of order O(n^{−1/2}), where the ε_i's denote i.i.d. Rademacher random variables.

Note that assumption A4 is satisfied, for instance, when R is a VC class (see for instance [12] for the use of Rademacher averages in complexity control).

Theorem 1. [6]. Suppose that assumptions A1 − A4 are fulfilled and for all δ ∈ (0, 1), set

φ = φ(δ, n) = 2A_n + √(2 log(1/δ)/n).

Then, there exists a constant c < ∞ such that for all δ ∈ (0, 1), we have with probability at least 1 − δ: ∀n ∈ N∗, ∀α ∈ (0, 1),

Ĥ(R̂_α) ≤ α + 2φ(δ/2, n) and G(R̂_α) ≥ ROC∗(α) − 2φ(δ/2, n).

Remark 2. (Regularity vs. noise condition) Under the additional condition that the distribution of η(X), denoted by F∗ = pG∗ + (1 − p)H∗, has a bounded density f∗, the following extension of Tsybakov's noise condition ([13]) is fulfilled for any α ∈ (0, 1):

∀t ≥ 0, P{|η(X) − Q∗(α)| ≤ t} ≤ c · t^{a/(1−a)},   (5)

with a = 1/2 and c = sup_t f∗(t). Notice that this condition is incompatible with assumption A2 when a > 1/2. It has been shown in [6] (see Theorem 12 therein) that, under this assumption, the deviation ROC∗(α) − G(R̂_α) is then of order O(n^{−5/8}).
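To fix ideas, the following Python sketch (a toy illustration of ours, not the paper's procedure) solves the empirical problem OP(α, φ) by exhaustive search over a small finite class R of candidate sets represented as indicator functions; in practice R would be a richer class, e.g. a VC class, equipped with a dedicated solver.

```python
import numpy as np

def solve_OP(alpha, phi, X, Y, candidate_sets):
    """Empirical MV-set problem OP(alpha, phi):
    maximize G_hat(R) subject to H_hat(R) <= alpha + phi,
    over a finite class of sets given as indicator functions R(x) in {0, 1}."""
    pos, neg = X[Y == +1], X[Y == -1]
    best, best_G = None, -1.0
    for R in candidate_sets:
        H_hat = np.mean([R(x) for x in neg])   # empirical mass under H
        G_hat = np.mean([R(x) for x in pos])   # empirical mass under G
        if H_hat <= alpha + phi and G_hat > best_G:
            best, best_G = R, G_hat
    return best, best_G

# Toy class R: half-lines {x > t} on the real line.
thresholds = np.linspace(-3, 3, 61)
candidate_sets = [lambda x, t=t: float(x > t) for t in thresholds]

rng = np.random.default_rng(1)
Y = rng.choice([-1, 1], size=500)
X = Y + rng.normal(size=500)
R_hat, G_hat = solve_OP(alpha=0.1, phi=0.02, X=X, Y=Y, candidate_sets=candidate_sets)
print(G_hat)
```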
Adaptive estimation - Algorithm 1

(Input.) Target tolerance ε ∈ (0, 1). Volume tolerance φ > 0. Training data Dn = {(Xi, Yi) : 1 ≤ i ≤ n}. Class R of level set candidates.

1. (Initialization.) Set ω̂(I_{0,0}) = 0 and ω̂(I) = 1 for every dyadic interval I ≠ I_{0,0} = [0, 1). Take β̂_{0,0} = 0 and β̂_{0,1} = 1.
2. (Iterations.) For all j ≥ 0, for all k ∈ {0, . . . , 2^j − 1}: if ω̂(I_{j,k}) = 0, then
   (a) Compute Ê(I_{j,k}) = β̂_{j,k+1} − β̂_{j,k}.
   (b) If Ê(I_{j,k}) > ε, then
       i. set ω̂(I_{j+1,2k}) = ω̂(I_{j+1,2k+1}) = 0,
       ii. solve the problem OP(α_{j+1,2k+1}, φ) → solution R̂_{α_{j+1,2k+1}},
       iii. update: β̂_{j+1,2k} = β̂_{j,k}, β̂_{j+1,2k+1} = Ĝ(R̂_{α_{j+1,2k+1}}) and β̂_{j+1,2k+2} = β̂_{j,k+1}.
   (c) Else, leave the weights of the siblings I_{j+1,2k} and I_{j+1,2k+1} unchanged.
3. (Stopping rule.) The algorithm terminates as soon as the weights ω̂(·) of the nodes of the current level j are all equal to 1.

(Output.) Let σ̂ be the collection of dyadic levels α_{j,k} corresponding to the terminal nodes defined by ω̂. Compute the ROC∗ estimate:

ROĈ∗(α) = ∑_{α_{j,k} ∈ σ̂} Ĝ(R̂_{α_{j,k}}) · I{α ∈ I_{j,k}}.
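Assembled with the brute-force solver sketched above, Algorithm 1 can be rendered in a few lines of Python (our sketch: `solve_OP` is the hypothetical helper defined earlier, the dictionary `G_hat` plays the role of the β̂ bookkeeping, and a depth-first traversal replaces the level-by-level loop; the output is the same since each node is split based only on its own local error).

```python
def algorithm1(eps, phi, X, Y, candidate_sets, max_depth=10):
    """Adaptive dyadic estimation of ROC* (Algorithm 1, sketch).
    Returns (interval, value) pairs: the piecewise constant estimate takes
    value G_hat(R_hat_{alpha_{j,k}}) on each terminal dyadic interval I_{j,k}."""
    G_hat = {0.0: 0.0, 1.0: 1.0}          # beta-hat values at solved grid points
    active = [(0, 0)]                     # nodes (j, k) with omega_hat = 0
    terminal = []
    while active:
        j, k = active.pop()
        a_left, a_right = k * 2.0 ** -j, (k + 1) * 2.0 ** -j
        local_err = G_hat[a_right] - G_hat[a_left]        # E_hat(I_{j,k})
        if local_err > eps and j < max_depth:             # split the node
            mid = (a_left + a_right) / 2                  # alpha_{j+1, 2k+1}
            _, G_mid = solve_OP(mid, phi, X, Y, candidate_sets)
            G_hat[mid] = G_mid
            active += [(j + 1, 2 * k), (j + 1, 2 * k + 1)]
        else:                                             # terminal node
            terminal.append(((a_left, a_right), G_hat[a_left]))
    return sorted(terminal)
```

Note that the β̂ values at already-solved endpoints are reused; only the new midpoints trigger a call to the MV-set solver, and `max_depth` mirrors the stopping level j(n) of Remark 3 below.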
5 Adaptive Estimation of the Optimal ROC Curve
In this section we describe an adaptive algorithm for estimating the optimal curve ROC∗ by piecewise constants. It should be interpreted as a statistical version of the adaptive approximation scheme studied in [14]. We emphasize that the crucial difference with the approach developed in [6] is that, here, the mesh grid used for computing the ROC∗ estimate, i.e. the cardinality of the grid of points as well as their locations, is entirely learnt from the data. In this respect, we define the empirical local error estimate on the subinterval I = [α1, α2) ⊂ [0, 1) as

Ê(I) = Ĝ(R̂_{α2}) − Ĝ(R̂_{α1}).

The quantity Ê(I) is nonnegative (by construction, the mapping α ∈ (0, 1) → Ĝ(R̂_α) is nondecreasing with probability one) and should be viewed as an empirical counterpart of E(I) := ROC∗(α2) − ROC∗(α1), which provides a simple way of estimating the variability of the (nondecreasing) function ROC∗ on I. This measure is additive, as is its statistical version Ê(·):

E(I1 ∪ I2) = E(I1) + E(I2)
for any siblings I1 and I2 of the same subinterval. It controls the approximation rate of ROC∗ by a constant on any interval I ⊂ [0, 1) in the sense that:

inf_{c ∈ [0,1)} ||ROC∗(·) − c||_{L∞(I)} ≤ E(I).
The adaptive algorithm designated as 'Algorithm 1' is based on the following principle: a dyadic subinterval I will be part of the final partition of the false positive rate axis whenever the empirical local error has not met the tolerance on any of its ancestors J ⊃ I but meets the tolerance on I itself. We point out that, by construction, the sequence ω̂ produced by Algorithm 1 is admissible.

Remark 3. (On the stopping rule) One should notice that, as Ĥ(R) ∈ {k/n : k = 0, . . . , n} for any R ∈ R, the estimation algorithm necessarily stops before exceeding the level j = j(n) = log(n)/log(2): the empirical estimate ROĈ∗ has no more than 2^{j(n)} pieces.

We now establish a rate of convergence for Algorithm 1. The following assumption shall be required. It classically allows one to control the rate at which the derivative of ROC∗(α) may go to infinity as α tends to zero, see [15].
A5 The derivative ROC∗′ belongs to the space L log L of Borel functions f : (0, 1) → R such that:

||f||_{L log L} := ∫_{α=0}^{1} (1 + log |f(α)|) |f(α)| dα < ∞.
The next result provides a bound on the rate of the estimator produced by Algorithm 1.

Theorem 2. Let δ ∈ (0, 1). Suppose that assumptions A1 − A5 are fulfilled. Take ε = ε(δ, n) = 7φ(δ/2, n). Then, we have, with probability at least 1 − δ:

∀n ≥ 1, ||ROC∗ − ROĈ∗||∞ ≤ 16φ(δ/2, n).

Moreover, the number of terminal nodes in the output of Algorithm 1 satisfies:

#σ̂ ≤ κ ||ROC∗′||_{L log L} / φ(δ/2, n), for some constant κ < ∞.   (6)
Corollary 1. Let δ ∈ (0, 1). Suppose that assumptions A1 − A5 are fulfilled. Take ε and φ of the order of √(n^{−1} log(1/δ)). Then, there exists a constant c such that we have, with probability at least 1 − δ:

∀n ≥ 1, ||ROC∗ − ROĈ∗||∞ ≤ c/√n + √(2 log(1/δ)/n),

and the adaptive Algorithm 1 builds a partition σ̂ whose cardinality is at most of the order of √n.
Remark 4. (On the rate of convergence.) When assuming ROC∗ to be of class C1 on [0, 1] (which implies in particular that ROC∗′ is bounded in the neighborhood of 0), it may be shown that a piecewise constant estimate with rate O(n^{−1/2}) can be built using K = O(n^{1/2}) equispaced grid points, cf. [6]. It is remarkable that, with the adaptive scheme of Algorithm 1, comparable performance is achieved, while significantly relaxing the smoothness assumption on ROC∗.

Remark 5. (On lower bounds.) To our knowledge, no lower bound result related to the statistical estimation of ROC∗ in sup norm is currently available in the literature. Intuition indicates that the rate O(n^{−1/2}) is accurate, insofar as, in the absence of further assumptions, it is likely the best rate that can be obtained for the MV-set estimation problem, and consequently for the local estimation of ROC∗ at a given point α ∈ (0, 1).
6 Adaptive Ranking Algorithm
We now tackle the problem of building a scoring function s(x) whose ROC curve is asymptotically close to the empirical estimate ROĈ∗. In general, the latter is not a ROC curve: by construction, the sequence of the R̂_{α_{j,k}}, (j, k) ∈ σ̂, sorted by increasing order of magnitude of their level α_{j,k}, is not necessarily increasing, in contrast to the true level sets R∗_{α_{j,k}}. This induces an additional 'Monotonicity' step in Algorithm 2, before overlaying the estimated sets.
Adaptive RankOver - Algorithm 2

(Input.) Target tolerance ε ∈ (0, 1). Volume tolerance φ > 0. Training data Dn = {(Xi, Yi) : 1 ≤ i ≤ n}. Class R of level set candidates.

1. (Algorithm 1.) Run Algorithm 1, in order to get the regression level set estimates R̂_{α(1)}, . . . , R̂_{α(K̂)}, where K̂ = #σ̂ and 0 = α(1) < . . . < α(K̂) < 1.
2. (Monotonicity.) Form recursively the nondecreasing sequence of sets (R̃_{α(k)}) defined by: R̃_{α(1)} = R̂_{α(1)} and, for 1 ≤ k < K̂, R̃_{α(k+1)} = R̃_{α(k)} ∪ R̂_{α(k+1)}.

(Output.) Build the piecewise constant scoring function:

s̃(x) = ∑_{k=1}^{K̂} I{x ∈ R̃_{α(k)}}.
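With sets represented as indicator functions, as in the earlier sketches, the monotonization step and the overlaying of the estimated sets take only a few lines of Python (our sketch, with hypothetical helper names).

```python
def algorithm2(level_set_estimates):
    """Monotonization and overlaying (Algorithm 2, sketch).
    `level_set_estimates` lists the sets R_hat_{alpha(k)} as indicator
    functions x -> {0, 1}, sorted by increasing level alpha(k)."""
    def union(r1, r2):
        return lambda x: float(r1(x) > 0 or r2(x) > 0)
    stacked = []                          # the nondecreasing sets R_tilde_{alpha(k)}
    for r in level_set_estimates:
        # R_tilde_{alpha(k+1)} = R_tilde_{alpha(k)} union R_hat_{alpha(k+1)}
        stacked.append(r if not stacked else union(stacked[-1], r))
    # s_tilde(x) = sum_k I{x in R_tilde_{alpha(k)}}: a piecewise constant score
    return lambda x: sum(r(x) for r in stacked)
```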
Remark 6. (Top-down vs. bottom-up) Alternatively, a monotonous sequence of sets can be built from the collection {R̂_{α(k)}, 1 ≤ k ≤ K̂} the following way: set R̄_{α(K̂)} = R̂_{α(K̂)} and R̄_{α(k)} = R̄_{α(k+1)} ∩ R̂_{α(k)} for k = K̂ − 1, . . . , 1. A similar result to the one stated below can be established for s̄(x) = ∑_{k=1}^{K̂} I{x ∈ R̄_{α(k)}}.
The next theorem states the consistency of the estimated scoring function under the same complexity and regularity assumptions.

Theorem 3. Let δ ∈ (0, 1). Suppose that assumptions A1 − A5 are fulfilled. Take a target tolerance ε of the order of n^{−1/6}. Then, there exists a constant c = c(δ) > 0 such that we have with probability at least 1 − δ:

∀n ≥ 1, ||ROC(s̃, ·) − ROC∗||∞ ≤ c √(log n / n^{1/3}).

We observe that the rate of convergence of the order of n^{−1/6} obtained in Theorem 3 is much slower than the n^{−1/3} rate obtained in [6]. This is due to the fact that we relaxed the regularity assumptions on the optimal ROC curve and used the approximation space made of piecewise constant curves, while we used piecewise linear scoring curves before. We expect that, using nonlinear approximation techniques, the n^{−1/6}-rate can be significantly improved, but we leave this issue open for future work.
7 Conclusion
In this paper, we have seen how strong consistency of a piecewise constant estimate of the optimal ROC curve can be guaranteed under weak regularity assumptions. Additionally, our approach leads to a strongly consistent piecewise constant scoring rule in terms of ROC curve performance. Whereas the subdivision of the false positive rate axis used for building the ROC curve approximant had to be fixed in advance in the original RankOver approach proposed in [6], which was viewed as a severe restriction on its applicability, the essential novelty of the two algorithms presented here lies in their ability to adapt automatically to the variability of the (unknown) optimal ROC curve.
References

1. Clémençon, S., Vayatis, N.: Overlaying classifiers: a practical approach for optimal ranking. In: NIPS 2008: Proceedings of the 2008 conference on Advances in neural information processing systems, Vancouver, Canada, pp. 313–320 (2009)
2. Flach, P.: Tutorial on "The many faces of ROC analysis in machine learning". In: ICML 2004 (2004)
3. Freund, Y., Iyer, R.D., Schapire, R.E., Singer, Y.: An efficient boosting algorithm for combining preferences. Journal of Machine Learning Research 4, 933–969 (2003)
4. Agarwal, S., Graepel, T., Herbrich, R., Har-Peled, S., Roth, D.: Generalization bounds for the area under the ROC curve. Journal of Machine Learning Research 6, 393–425 (2005)
5. Clémençon, S., Lugosi, G., Vayatis, N.: Ranking and empirical risk minimization of U-statistics. The Annals of Statistics 36(2), 844–874 (2008)
6. Clémençon, S., Vayatis, N.: Overlaying classifiers: a practical approach to optimal scoring. To appear in Constructive Approximation (hal-00341246) (2009)
7. Scott, C., Nowak, R.: Learning minimum volume sets. Journal of Machine Learning Research 7, 665–704 (2006)
8. Clémençon, S., Vayatis, N.: Tree-structured ranking rules and approximation of the optimal ROC curve. Technical Report hal-00268068, HAL (2008)
9. Clémençon, S., Vayatis, N.: Tree-structured ranking rules and approximation of the optimal ROC curve. In: ALT 2008: Proceedings of the 2008 conference on Algorithmic Learning Theory (2008)
10. DeVore, R., Lorentz, G.: Constructive Approximation. Springer, Heidelberg (1993)
11. Scott, C., Nowak, R.: A Neyman-Pearson approach to statistical learning. IEEE Transactions on Information Theory 51(11), 3806–3819 (2005)
12. Boucheron, S., Bousquet, O., Lugosi, G.: Theory of Classification: A Survey of Some Recent Advances. ESAIM: Probability and Statistics 9, 323–375 (2005)
13. Tsybakov, A.: Optimal aggregation of classifiers in statistical learning. Annals of Statistics 32(1), 135–166 (2004)
14. DeVore, R.: A note on adaptive approximation. Approx. Theory Appl. 3, 74–78 (1987)
15. Bennett, C., Sharpley, R.: Interpolation of Operators. Academic Press, London (1988)
16. DeVore, R.: Nonlinear approximation. Acta Numerica, 51–150 (1998)
Appendix - Proofs

Proof of Theorem 2

We first prove a lemma which quantifies the uniform deviation of the empirical local errors from the true ones over all dyadic scales.

Lemma 2. (Uniform deviation) Suppose that assumptions A3 − A4 are satisfied. Let δ ∈ (0, 1). With probability at least 1 − δ, we have: ∀n ≥ 1,

sup_{j≥0, 0≤k<2^j} |Ê(I_{j,k}) − E(I_{j,k})| ≤ 6φ(δ/2, n).
Proof. In the first place, we observe that

(1/2) sup_{j≥0, 0≤k<2^j} |Ê(I_{j,k}) − E(I_{j,k})| ≤ max_{j≥0, 0≤k<2^j} |G(R̂(α_{j,k})) − ROC∗(α_{j,k})| + sup_{R∈R} |Ĝ(R) − G(R)|.

The proof then immediately follows from the complexity assumption A4, combined with Theorem 1.

It follows from Lemma 2 that, with probability at least 1 − δ: ∀j ≤ log2 n, ∀k ∈ {0, . . . , 2^j − 1},

Ê(I_{j,k}) ≤ E(I_{j,k}) + 6φ(δ/2, n) and Ê(I_{j,k}) ≥ E(I_{j,k}) − 6φ(δ/2, n).
We now introduce the notation σ_ε for partitions based on the optimal ROC curve at a target tolerance ε. Let ε > 0 and consider the piecewise constant approximant built from the same recursive strategy as the one implemented by Algorithm 1, except that it is based on the (theoretical) error estimate E(·): E_{σ_ε}(ROC∗), σ_ε denoting the associated mesh grid. Choosing ε = 7φ(δ/2, n), we obtain that, with probability larger than 1 − δ, the mesh grid σ̂ is finer than σ_{ε1}, where ε1 = ε1(δ, n) = ε + 6φ(δ/2, n) = 13φ(δ/2, n), but coarser than σ_{ε0}, with ε0 = ε0(δ, n) = ε − 6φ(δ/2, n) = φ(δ/2, n). We thus have

||E_{σ̂}(ROC∗) − ROC∗||∞ ≤ ||E_{σ_{ε1}}(ROC∗) − ROC∗||∞ ≤ ε1.

Now we use the following decomposition:

||ROĈ∗ − ROC∗||∞ ≤ ||ROC∗ − E_{σ̂}(ROC∗)||∞ + ||E_{σ̂}(ROC∗) − ROĈ∗||∞.

We have seen that the first term is bounded, with probability at least 1 − δ, by ε1. On the same event, we have:

||E_{σ̂}(ROC∗) − ROĈ∗||∞ ≤ max_{1≤k≤K̂} |G(R∗_{α(k)}) − Ĝ(R̂_{α(k)})|
≤ max_{1≤k≤K̂} |G(R∗_{α(k)}) − G(R̂_{α(k)})| + sup_{R∈R} |Ĝ(R) − G(R)|
≤ 3φ(δ/2, n),
where we have used Theorem 1 and a concentration inequality to derive the last inequality. We have thus proved the estimation error rate of 16φ(δ/2, n) for the output ROĈ∗ of Algorithm 1.

We now show the bound on the cardinality of the partition as a function of the target tolerance parameter. Let us denote by K_ε (= #σ_ε) the number of pieces forming this approximant. We have the following result.

Lemma 3. (Approximation rate) Suppose that assumptions A1, A2 and A5 are fulfilled. There exists a universal constant κ > 0 such that, for all ε > 0:

K_ε ≤ (κ/ε) ||ROC∗′||_{L log L}.
For a proof of this lemma, we refer to [14], and also to subsection 3.3 in [16] for more insights on adaptive approximation methods. It reveals that the number of pieces forming ROĈ∗, i.e. the cardinality of σ̂, is bounded by

#σ_{ε0(δ,n)} ≤ κ ||ROC∗′||_{L log L} / φ(δ/2, n).   (7)
In short, as regards nonlinear approximation of ROC∗, the performance of the mesh grid σ̂ selected empirically is comparable to that of the ideal subdivision σ_ε, which would be obtained if an oracle could supply us with perfect information about the local variability of ROC∗.
Proof of Theorem 3

The next lemma quantifies the loss arising from the transformation performed at step 2 of Algorithm 2.

Lemma 4. (Error stacking) Suppose that the assumptions of Theorem 3 are satisfied. Let δ ∈ (0, 1). With probability at least 1 − δ, we have: ∀k ∈ {1, . . . , K̂},

|H(R̃_{α(k)}) − α(k)| ≤ kφ(δ/2, n),   (8)

as well as

|G(R̃_{α(k)}) − ROC∗(α(k))| ≤ kφ(δ/2, n).   (9)
Proof. Notice that {α(k) : 1 ≤ k ≤ K̂} ⊂ {k2^{−j} : 0 ≤ j ≤ log2 n, 0 ≤ k < 2^j}, see Remark 3. Observe also that we have

H(R̃_{α(2)}) = H(R̂_{α(2)}) + H(R̂_{α(1)} \ R̂_{α(2)})

and, since R∗_{α(1)} ⊂ R∗_{α(2)}, one may write R̂_{α(1)} \ R̂_{α(2)} as

((R̂_{α(1)} \ R∗_{α(1)}) ∪ (R̂_{α(1)} ∩ R∗_{α(1)})) \ ((R̂_{α(2)} \ R∗_{α(2)}) ∪ (R̂_{α(2)} ∩ R∗_{α(2)})).
By additivity of the distribution H combined with Theorem 1, we obtain Equation (8) for k = 2. The general result is then established by induction. Equation (9) may be proved in a similar fashion.

The deviation ||ROC(s̃, ·) − ROC∗||∞ is bounded by:

||ROC∗ − E_{σ̂}(ROC∗)||∞ + ||ROC(s̃, ·) − E_{σ̂}(ROC∗)||∞.

The first term may be shown to be of order ε by reproducing the argument involved in the proof of Theorem 2, while the second term is bounded by

max_{k∈{1,...,K̂}} |G(R̃_{α(k)}) − E_{σ̂}(ROC∗)(H(R̃_{α(k)}))|.
The latter quantity may be bounded by:

max_{k∈{1,...,K̂}} |G(R̃_{α(k)}) − E_{σ̂}(ROC∗)(α(k))| + max_{k∈{1,...,K̂}} |E_{σ̂}(ROC∗)(H(R̃_{α(k)})) − E_{σ̂}(ROC∗)(α(k))|.
Adaptive Estimation of the Optimal ROC Curve
231
at least 1 − δ, the number of jumps is given by the product of the total number of jumps with the amplitude of the interval of false positive rate levels: · K
max
} k∈{1,...,K
α(k) ) − α(k)| ≤ C · K 2 φ(δ/2, n) ≤ (C/2 ) φ(δ/2, n) , |H(R
where we have used Lemma 4 and a union bound in the first inequality and Lemma 2 in the second. Given the assumption A4 , we are led to the calibration for of the order of n−1/6 since we need to balance, up to some constants, with a term of the order of −2 φ(δ/2, n).
Complexity versus Agreement for Many Views Co-regularization for Multi-view Semi-supervised Learning Odalric-Ambrym Maillard and Nicolas Vayatis 1
2
Sequential Learning Project, INRIA Lille - Nord Europe, France
[email protected] ENS Cachan & UniverSud - CMLA UMR CNRS 8536
[email protected]
Abstract. The paper considers the problem of semi-supervised multiview classification, where each view corresponds to a Reproducing Kernel Hilbert Space. An algorithm based on co-regularization methods with extra penalty terms reflecting smoothness and general agreement properties is proposed. We first provide explicit tight control on the Rademacher (L1 ) complexity of the corresponding class of learners for arbitrary many views, then give the asymptotic behavior of the bounds when the coregularization term increases, making explicit the relation between consistency of the views and reduction of the search space. Since many views involve many parameters, we third provide a parameter selection procedure, based on the stability approach with clustering and localization arguments. To this aim, we give an explicit bound on the variance (L2 diameter) of the class of functions. Finally we illustrate the algorithm through simulations on toy examples.
1
Introduction
In real-life applications for classification tasks, different representations of a same object may be available. Financial experts may use different sets of indicators to assess the current market regime, while in the context of active computer vision, several views of the same object are provided before rendering the decision. This problem is known as that of multi-view classification. After the early work of (Blum & Mitchell, 1998) on learning from both labeled and unlabeled data, this topic has been considered more recently by several authors (see for example (Sridharan & Kakade, 2008),(Weston et al., 2005),(Zhou et al., 2004)). In (Balcan & Blum, 2005), the authors propose a theoretical PAC-model for semi-supervised learning where multi-view learning appears as a special case. Due to the restriction over the search space (compatibility between different views), multi-view learning may provide good generalization results, and indeed this is the case in numerical experiments (e.g. (Belkin et al., 2005)). In (Rosenberg & Bartlett, 2007), these results are applied to a two-view learning problem
The first author is eligible for the E.M.Gold Award. The second author was partly supported by the ANR Project TAMIS.
R. Gavald` a et al. (Eds.): ALT 2009, LNAI 5809, pp. 232–246, 2009. c Springer-Verlag Berlin Heidelberg 2009
Complexity versus Agreement for Many Views
233
and explicit bounds on the Rademacher complexity of the class of predictors are computed. Various algorithms are introduced together with theoretical studies are provided in (Sindhwani et al., 2005). In the latter references, the central issue addressed was to explain how consistency between views affects the performance of classification procedures. Indeed, in multi-view learning, we consider individuals predictors based on separate views, and one intuitive idea is that (1) having a good final predictor is related to the agreement of individual predictors on a majority of labels. It is generally assumed that (2) each view is independent from the others conditionally on labeled data. Though this may be weakened (see (Balcan et al., 2005)), providing theoretical justification to the heuristics that conditional independence of the views allows for high-performance results (two compatible classifiers trained on independent views are unlikely to agree on a mislabeled item) has been the motivation for most of the works on this topic. Thus we build on the same heuristics (1) and (2). As we intend to exploit all the information available in the classification task, our setup will also take unlabeled data into consideration. In the present paper, we consider semi-supervised multi-view binary classification with many views. Allowing for more than two views brings up new questions. For instance, (i) how does the number of views V affects complexity measures? and (ii) how to choose the parameters when there are as many as O(V 2 ) of them ? For the first issue, we will focus on the Rademacher average and track down the dependency on V and other parameters in this formulation. As far as the second issue is concerned, various strategies can be invoked. In supervised classification, cross-validation (e.g. 10-fold) techniques are widely used due to their ease of implementation, but theory is unavailable in most cases (see (Celisse, 2008) and references therein for some recent developments). Another idea comes from recent work on clustering and makes use of the stability approach (see (Ben-David et al., 2006)). This relies on strongly theoretically founded results known as localization arguments (see (Koltchinskii, 2006)) which takes advantage of the so-called ‘small ball estimates” (see also (Li & Linde, 1999),(Berthet & Shi, 2001)). The stability approach has also been applied successfully to other learning problems (see e.g. (Sindhwani & Rosenberg, 2008)). This is the one we have chosen in order to perform the selection of parameters. Comparison of different selection procedures, although interesting by itself, is not the purpose of this paper. In the sequel, we introduce an algorithm which combines semi-supervised and multi-view ideas and generalizes over previous algorithms: it contains RLS, co-RLS and co-Laplacian (see (Rosenberg & Bartlett, 2007),(Sindhwani et al., 2005)) algorithms as special cases. We use the setup of Reproducing Kernel Hilbert Spaces (RKHS) and provide explicit upper and lower data-dependent bounds on the Rademacher complexity of the class of functions involved by this general algorithm. Our second contribution is to give a new parameter selection procedure, based on the work of (Koltchinskii, 2006), together with explicit stability bounds (L2 -diameter) on the localized class for the general algorithm, which has not been investigated so far.
234
O.-A. Maillard and N. Vayatis
The paper is organized as follows: Section 2 defines our framework and the objective function. Section 3 is devoted to the Rademacher complexity control with the first main theorem (section 3.2), and the asymptotic behavior of the bound. Section 4 presents our stability-based selection procedure and the second main theorem, on the L2 local diameter of our class of functions. In Section 5, we successfully apply the algorithm to some toy examples.
2
Setup for Multi-view Semi-supervised Learning
Our approach is based on penalized empirical risk minimization in RKHS. A compound penalty term will reflect both the complexity of the class of decision functions and the particular context of multi-view semi-supervised learning. This is an important improvement on the work of (Sindhwani et al., 2005) where penalties (and corresponding algorithms) are considered separately. The goal we pursue here is of unifying algorithms instead of comparing them. In this section, we provide the notations and definitions of the penalty terms involved. In the multi-view setup, an observation results from elements taken in a collection of representation spaces X (v) ,v = 1, . . . , V , where V is the number of views. We write x = (x(1) , ..., x(V ) ), where x(v) ∈ X (v) , for the resulting point living in the product space which accounts for the multiple views of the object. Learning with RKHS. Let {−1, 1} be the label set and a loss function, for instance losssquare (g(x), y) = (y − g(x))2 , with x a data point and y a label. As usual, we consider the label set Y = [−1, 1] instead. We consider n = u + l i.i.d. data points l of which are labeled, and u are unlabeled. We now define the loss of a multi-view classifier f thanks to the corresponding f (v) in each view: Definition 1 (Loss). For f = (f (1) , ..., f (V ) ) and a sample (xi , yi )i=1,...,l : l
Loss(f ) =
1 loss(f, xi , yi ), l i=1
where loss(f, x, y) may be for instance or losssquare ( V1 Vv=1 f (v) (x(v) ), y).
1 V
V
v=1
losssquare (f (v) (x(v) ), y)
We consider real-valued decision functions φ : x → V1 Vv=1 f (v) (x(v) ) where f (v) : X (v) → Y is a classifier. We assume that each predictor f (v) lives in an RKHS F (v) with kernel K (v) , associated representation function kv (., .), and norm ||.||v . Thanks to the representer theorem, we restrict only to functions (v) (v) f (v) ∈ Lv = span{kv (xi , .)}l+u . Let F be the product space of the views i=1 ⊂ F (v) F and L ⊂ F the product space of the spans. This complexity penalization leads to the following definition: Definition 2 (Complexity ). For f = (f (1) , ..., f (V ) ) ∈ F , we define: Complexity(f ) =
V v=1
λv ||f (v) ||2v
where λ ∈
V +
Semi-supervised regularization. In the sequel, we consider a batch of n = u + l i.i.d. data points (x_i)_{i=1..l, l+1..l+u}, where x_i is the representation of one object in all views: x_i = (x_i^{(1)}, . . . , x_i^{(V)}). This setup is in-between classification and clustering theory: the labeled part allows for an objective function (whereas in clustering there is no labeling, thus no objective truth), and the unlabeled part involves structure detection in the data. Using a graph-Laplacian is a natural choice to express the search for structure, as explained for instance in (Smola & Kondor, 2003; Ando & Zhang, 2007). The idea is to consider that the data points depict a manifold (see (Belkin et al., 2005)), for which the graph-Laplacian is a discretized Laplace-Beltrami differential operator. Assuming we have for each view v a similarity graph given by its adjacency matrix W^{(v)}, the (unnormalized) graph-Laplacian is L^{(v)} = D^{(v)} − W^{(v)}, where D^{(v)} is the diagonal matrix D^{(v)}_{i,i} = ∑_j W^{(v)}_{i,j}. Other interesting choices are the symmetric or random walk normalized graph-Laplacians. Since intuitively one wants each f^{(v)} ∈ F^{(v)} to be smooth w.r.t. the similarity structures in all views, we use the weighted average graph-Laplacian L = ∑_{v=1}^{V} α_v L^{(v)} with weights α summing to 1.

Definition 3 (Smoothness). For f = (f^{(1)}, . . . , f^{(V)}), we define:

Smoothness(f) = ∑_{v=1}^{V} γ_v f^{(v)T} L f^{(v)}, where

– γ = (γ_1, . . . , γ_V) ≥ 0, meaning that each component is nonnegative;
– L is defined based on the graph-Laplacians L^{(v)} corresponding to each view v: L = ∑_{v=1}^{V} α_v L^{(v)} with ∑_{v=1}^{V} α_v = 1;
– f^{(v)} is the vector (f^{(v)}(x_1^{(v)}), . . . , f^{(v)}(x_{l+u}^{(v)}))^T.
Multiple view co-regularization. In a multi-view approach, the need for compatibility between the f (v) is conveyed by a so-called Agreement term. We propose the following one which penalizes disagreement with a square loss and generalizes (Sindhwani et al., 2005) to our setting: Definition 4 (Agreement). For f = (f (1) , ..., f (V ) ), and symmetric positive definite matrices cL , cU ∈ V ×V , we define Agreement(f ) as the sum of: C L (f ) =
cL v1 ,v2
v1 =v2
and C U (f ) =
v1 =v2
cU v1 ,v2
l i=1
l+u
(v1 )
) − f (v2 ) (xi
(v1 )
) − f (v2 ) (xi
[f (v1 ) (xi [f (v1 ) (xi
(v2 )
)]2
(v2 )
)]2
i=l+1
Compound complexity penalties. We finally formulate the objective function in this setup as the result of loss minimization with a compound penalty:
236
O.-A. Maillard and N. Vayatis – Compute: [1]
f ∗ = argminf ∈F {Loss(f ) + Complexity(f ) +Smoothness(f ) + Agreement(f )}
– Output: φ =
V 1 ∗(v) f V v=1
We point out that there is a representer theorem for this setting. Indeed, for V any fixed f (2) , ..., f (V ) ∈ Πv=2 F (v) , f ∗(1) minimizes a function (1)
cf (2) ,...,f (V ) (f (x1 ), y1 , ..., f (x(1) n ), yn ) + gf (2) ,...,f (V ) (||f ||1 ) w.r.t. f . Thus the representer theorem tells us that f (1) ∈ L1 . Iterating the argument leads to f ∗ ∈ L. We also refer to (Sindhwani & Rosenberg, 2008) for an alternative construction where one single RKHS combines all the views. For specific choices of the parameters, we recover the former problems studied in previous papers: – when γ and C are 0 we have a Regularized Least Squares (RLS) in RKHS, – when only γ = 0 we have a Co-Regularized Least Squares (co-RLS) problem (see (Sindhwani et al., 2005)), – when Agreement is diagonal nonzero (i.e. cL and cU are diagonal), we have a co-Laplacian method (e.g. co-Laplacian RLS, co-Laplacian SVM, see (Sindhwani et al., 2005)) ; indeed, the f (v) are decoupled, and thus problem [1] amounts to solving for each v: f (v)∗ = argminf (v) ∈F (v) Loss(f (v) ) + λv ||f (v) ||2v + γv f(v)T Lf(v) .
3
Excess Risk Bound
This section is devoted to the control of the Rademacher complexity in our problem. We need the following assumption from (Rosenberg & Bartlett, 2007), which is l V satisfied for instance by the square loss (Loss(0, .., 0) = 1l i=1 V1 v=1 yi2 ≤ 1): Assumption A1: The loss functional satisfies Loss(0, ..., 0) ≤ 1 where (0, ..., 0) is the multi-predictor with constant output 0. 3.1
Preliminaries
One nice property is that under assumption (A1), the final predictor φ belongs to: V 1 (v) (v) (1) (V ) J = x→ f (x ) : (f , .., f ) ∈ H V v=1 with H being the class of multi-predictors f , with total penalty bounded by 1:
Complexity versus Agreement for Many Views
237
H = {f ∈ L : Complexity(f ) + Smoothness(f )+ Agreement(f ) ≤ 1} . Excess risk bounds involve the Rademacher complexity of the class G of learners. For a sample (x1 , ...xn ), it is defined as n 2 σi g(xi )| Rn (G) = σ sup | g∈G n i=1 where (σi )i≤n are Rademacher i.i.d. random variables (P(σi = 1) = P(σi = −1) = 12 ). The following proposition, adapted from (Rosenberg & Bartlett, 2007) makes use of this data-dependent complexity to derive an upper bound of the excess risk: Proposition 1 (Excess risk ). For any positive loss function L uniformly βLipschitz w.r.t its first variable and upper-bounded by 1, then conditionally on the unlabeled data, ∀δ ∈ (0, 1), with probability at least 1 − δ over the labeled points drawn i.i.d, for φ∗l the empirical minimizer of the objective function:
(L(φ∗l (X), Y )) −
inf
φ∈J
2
(L(φ(X), Y )) ≤ 4βRl (J ) + √
l
(2 + 3 ln(2/δ)/2)
The proof is an easy combination of classical generalization bounds with some arguments from (Rosenberg & Bartlett, 2007) and the following contraction principle: if h is β-Lipschitz and h(0) = 0, then Rn (h ◦ J ) ≤ 2βRn (J ) (see (Ledoux & Talagrand, 1991)), together with the symmetry of J . 3.2
Explicit Rademacher Complexity Bound
Block-wise notations. We use the following notations: for any n, In or I is the identity of n , 0u,l the zero matrix of u×l . For any given n1 , n2 , A(v) ∈ n1 ×n2 , A is the block-diagonal matrix with blocks A(v) , v = 1..V (of size n1 V × n2 V ), and similarly A the block-row matrix of size n1 V × n2 . To multiply block-wise ˜ be the blockeach block A(v) by the v-th component of a vector λ ∈ v , let λ diagonal matrix of size n1 V × n1 V with blocks λv In1 . Since we always multiply the v-th block with the v-th component, we drop the index. Data. With the following matrices, we between labeled and unlabeled
decompose (v) data: K (v) = (kv (x(v) i , xj ))1≤i,j≤n =
(v)
KL (v) KU
∈ Ên×n and Π =
Il 0u,l
∈ ÊnV ×l
Agreement. To compare between views pairwisely, we introduce a block-line defined δ ∈ nV (V −1)×nV , with blocks (0 . . . 0 In 0 . . . 0 −In 0 . . . 0) with identity matrices at position v1 and v2 = v1 . Let also the block-diagonal matrix Cv,w with nV (V −1)×nV (V −1) diagonal blocks (cL ) and (cU i=1..l v,w v,w )i=l+1..l+u , and then C ∈ the block-diagonal matrix with blocks Cv,w when v, w ∈ {1, .., V } Smoothness. Let LI be the diagonal block matrix with all V blocks equaled to L. Note that we would have introduced α ˜ L instead, if we have used each graph Laplacian and not the average Laplacian L in the smoothness term.
238
O.-A. Maillard and N. Vayatis
Thanks to the previous notations, we can now state our first main theorem, which shows an explicit upper and lower data-dependent bound for the Rademacher complexity of the class of functions. Theorem 1 (Rademacher complexity bound ). Under assumption (A1), then 2b 2 b ≤ Rl (J ) ≤ Vl 21/4 V l where
T
˜ −1 ΠK T ) − tr(J T (I + M )−1 J ) with b2 = tr(B λ L
−1 ˜ −1 – B = (I ∈ nV ×nV √ + λ −1γ˜ LTI K) T ˜ – J = Cδ λ B KL ∈ nV (V −1)×l √ √ ˜ −1 δ T C ∈ nV (V −1)×nV (V −1) – M = CδKB λ
Note that b is explicit as a difference of two terms. The first term only depends on unlabeled data when Smoothness is null, and contains no co-regularization term. The second term corresponds to the idea that there is a reduction in complexity of the space. Indeed, in section 3.3, we give some results about the behavior of b enforcing this idea. As pointed by (Sindhwani & Rosenberg, 2008), this term is connected to a specific norm induced by the parameters and data over the space. This Theorem generalizes previous results: for instance, if V = 2, γ = 0, and cL v,w = 0, we recover exactly the previous known bound of (Rosenberg & Bartlett, 2007) where our 2cU v,w corresponds to their λ and our λ is their (γF , γG ). 3.3
Asymptotics
Let θ = (α, λ, γ, C) be the parameters of the learning problem, where α appears in the graph-Laplacian, λ in the Complexity term, γ in the Smoothness term and C in the Agreement term. The number of parameters grows with O(V 2 ). We study how the previous Rademacher bound changes with these parameters. More agreement reduces space complexity. The second term appearing in the expression of b2 depends on the co-regularization (matrix) parameter C. To see how constrained is the space when using bigger penalization, we introduce Δ(C) = tr(J T (I + M )−1 J ), which can be written, provided that C −1 exists, as: Δ(C) = tr(J1T (C −1 + M1 )−1 J1 ) ˜−1 B T K T and M1 = δKB λ ˜ −1 δ T . where J1 = δ λ L Thus when the eigenvalues of C increases to +∞, Δ(C) tends to: T ˜−1 δ T (δKB λ ˜ −1 δ T )−1 δ λ ˜−1 B T K T ), Δ∞ = tr(KLT B λ L
˜ −1 Πl K T T ),and shows that b2 → 0 in this which can be rewritten Δ∞ = tr(B λ L case. That b decreases as the model gets more constraint is coherent with the intuition of multi-view learning. Similarly,b2 → 0 whenever ||γ||, or ||λ|| → ∞.
Complexity versus Agreement for Many Views
239
Unconstrained space. When the constraint on the space vanishes, we have a completely different behavior. Indeed, if C = 0 then Δ(C) = 0. When γ = 0, we refer to (Rosenberg & Bartlett, 2007). Finally, when λ = 0, b2 has the following expression (provided every term appearing in this expression is finite and defined): b2 = tr(ΠlT L−1 ˜ −1 Πl ) − tr(ΠlT L−1 ˜ −1 δ T (C −1 + δL−1 ˜ −1 δ T )−1 δL−1 ˜ −1 Πl ) I γ I γ I γ I γ Note that when both γ and λ tend to 0, the previous bound may tend to ∞ even in some simple case (which is coherent with the intuition). Note also that the dependency with V is hidden here in the trace.
4
Stability-Based Parameter Selection
The multi-view setting involves new questions, like the choice of the parameters since there are O(V 2 ) many of them. We now describe an automatic parameter selection procedure which will be theoretically sound. 4.1
Theoretical Selection Procedure n 1 Let Pn = n n i=1 δXi be the empirical measure, and P the true measure. Thus Pn f = n1 i=1 f (Xi ) and Pf = (f (X)). For a general class F of functions, and probability measure Q, we define FQ ( ) = {f ∈ F; Qf − inf Qf ≤ } and then introduce the true -optimal ball F ( ) = FP ( ), and the empirical -optimal ball Fn ( ) = FPn ( ), or balls around the Empirical Risk Minimizer (ERM) and True Risk Minimizer (TRM). For a general class F of functions, we now assume that we have T : F 2 → + such that ∀f, g ∈ F (f − g) ≤ T 2 (f, g), and then introduce the two objects: Δn ( ) = supf1 ,f2 ∈F () |Pn − P |(f1 − f2 ) and DF ( ) = supf,g∈F () T (f, g). We refer to the first one as a L1 , P -diameter and the second one as a L2 , P -diameter. Lemma 1 in (Koltchinskii, 2006) tells us that for large enough radii, the empirical and true quasi-optimal sets around the ERM and TRM are included in each other, or put differently, that true quasi-optimal sets can be estimated by empirical quasi-optimal sets: Lemma 1. (Koltchinskii) For any > 0, and any λ < 1, we set
2 2 log( −1 ) 2 Δn ( ) log( −1 ) + + [DF ( ) + 2Δn ( )], Bn ( , λ) = 2 λ λn λ n α ∈ [0, 1];
and rn ( , λ) = inf Set also = 2 +
ln(rn (,λ)) ln(λ)
∀r ≥ rn ( , λ)
sup j∈ ;1≥λj ≥α
Bn ( , λj ) ≤ λ
.
. Then, with probability larger than 1 − :
F (r) ⊂ Fn (3r/2) and Fn (r) ⊂ F(2r) .
240
O.-A. Maillard and N. Vayatis
In the general case, if the radii are too small, then such inclusions no longer hold, and the intersection may even be empty. For our problem, we will simply select the parameter θ inducing the larger range of quasi-optimal sets controlled around the ERM, which is a notion of stability. Thus, for a given radius of the true penalized ball, we want to minimize the critical radius rn w.r.t. θ. A side motivating intuition is that having good stability allows for easy discovery of the minimizer f ∗ . 4.2
Empirical Selection Procedure
We now propose an empirical version of this lemma. Fortunately, using an empirical estimation of the rn ( , λ) is possible thanks to the Theorem 3, page 18, in (Koltchinskii, 2006), leading to a full data-dependent quantity. Indeed, let 2 ˆ Fn ( ) = sup Δˆn ( ) = Rn (Fn ( )) and D f,g∈Fn () Tn (f, g), with Tn bounding the empirical variance n . The empirical versions of Bn ( , λ) and rn ( , λ) given by (Koltchinskii, 2006) are: j 3 ˆ rˆn ( , λ) = inf α ∈ [0, 1]; sup Bn ( , λ ) ≤ λ , where j∈ ;1≥λj ≥α
ˆn (c ) 2cΔ log( −1 ) log( −1 ) ˆ ˆ + 2DFn (c ) + Bn ( , λ) = λ λ2 n λn and c, c ≥ 1 are universal constants. We now propose to apply this result to semi-supervised multi-view classification. We identify the classes Fˆθ,n to be V 1 (v) (v) J (r) = x → f (x ); f ∈ H(r) , V v=1 ˆ ˆ ( ) for where H(r) = {f ; π ˆθ,l (f ) ≤ r}, and estimate Rn (Fˆθ,n ( )) and D Fn,θ each parameter θ. Note that the dependency w.r.t. θ = (α, λ, γ, C) is hidden in the definition. Thus we need to bound the Rademacher complexity of J (r) and its L2 , Pn -Diameter. An analysis of the proof of Theorem 1 shows that √ changing J = J (1) for J√(r) affects the Rademacher bound with a factor r, leading r to a bound 2b(θ) for the first term. Following the same analysis as for the lV L1 -diameter (or Rademacher complexity), the next theorem gives us the second bound we need: Theorem 2 (Empirical local L2 diameter ). Under assumption A1, then √ r ˆ J (r) ≤ 2d √ D lV ˜ −1 Π(K T )T , with where d2 is the largest eigenvalue of (B − J2T (I + M )−1 J2 )λ L √ ˜−1 B T J2 = Cδ λ
Complexity versus Agreement for Many Views
241
√ Note the dependency with l instead of the l for the Rademacher bound. Eventually, each θ leads to a radius rnθ ( , λ) ≥ rˆnθ ( , λ) defined likewise, using upper bounds of Theorem 1 and 2. For maximal stability, we propose to have the largest range of values for which Lemma 1 still holds, which boils down to minimizing this quantity with θ. This leads to the following selection procedure where each term is computable: – Fix a probability threshold with > 0 and λ < 1. – Compute r(θ, n, l, , λ), defined by: j 3 ˜ inf α ∈ [0, 1]; sup Bn,l (θ, , λ ) ≤ λ , j∈ ;1≥λj ≥α
˜n,l (, λ) is: where the term B √ √ 2cb(θ) c 4d(θ) c log(−1 ) log(−1 ) √ + + lV λ λ2 n λn lV – Output: θ∗ = argminθ∈Θ r(θ, n, l, , λ)
5
Experiments
We have performed some toy simulations to see the flexibility of this general algorithm and the results are promising. Based on only one or two labeled points, we can always recover perfect labeling of the data, even on the challenging crossmoons data set on which all classical algorithms (Co-Laplacian and Co-RLS) performs badly. For completeness, we first give hints how to solve the minimization problem. Recall that the solutions of [1] can be written f (v) (x(v) ) = l+u (v) (v) (v) (v) (v) (x , xi ) = Kx(v) α(v) . We first consider the case where the loss i=1 αi K function is differentiable. Theorem 3 (Solution in the differentiable case). Assuming that the loss function satisfies ∇α(v) Loss(f (v) ) = 2K (v) A(v) α(v) , then the solution of the problem 1 is given by the resolution of the linear system, where the α(v) are the unknown vectors. ∀v ∈ 1 . . . V : Y = [A(v) + λv I + γv LK (v) ]α(v) + 2
V
Cv,w (K (v) α(v) − K (w) α(w) )
w=1
where Yi = yi for 1 ≤ i ≤ l and Yi = 0 for l + 1 ≤ i ≤ l + u. The proof is a straightforward application of usual algebra and is omitted here. This system contains as a special case the linear system of (Sindhwani et al.,
242
O.-A. Maillard and N. Vayatis
2005). We can rewrite it as Sα = Y˜ where S is an appropriate matrix, α = (α(1)T , .., α(V )T )T and Y˜ = (Y T , .., Y T )T , but S a priori is not positive and may have a very large conditioning number. An important case of non-differentiable loss function (which is not covered by the previous Theorem) is the hinge loss used in SVM. How to use a classical SVM solver for our problem, is left aside in this paper. A complete derivation is given in(Belkin et al., 2005) when γ = 0. 5.1
Toy Examples
We have done some experiments on three toy examples (Figure 1), with only two views and two classes for simplicity. – The easy two moons-two lines data set, for which the data is linearly separable in the second view, and almost separated in the first. – The more complex two spirals-two clouds data set, with intricate spirals (to “force” the use of graph-Laplacian). Note that a human operator cannot separate the two classes without the information of the second view. – The challenging cross-two moons data set, which appears to fool the tested algorithms based on only one of the Smoothness or Agreement term. Since the less labeled object, the more heuristic the definition of the “true” classes, we refer here to human beings to say what are the true classes. Such a definition of truth is a real problem still unsolved in the clustering community and we do not pretend here to solve it. In the first two data sets, a human only needs one label object of each class to recover the classes. For the last one, because the cross yields ambiguity, a human operator needs two objects in each class. Thus, we use this number of labels. For each algorithm we use the quadratic loss, which is differentiable. The first one is the classical RLS, for which Smoothness and Agreement are set to 0. The second one is a co-RLS, with only Smoothness set to 0. Then we used
Two moons-two lines
Two spirals-two clouds
One cross-two moons
Fig. 1. Three toy data sets. Normal points for unlabeled points, circle for class number one and cross for class number two. From left to right: Two moons (above)- two lines (below), with one labeled object in each class. Two spirals-two clouds, with one labeled object in each class. One cross-two moons, with two labeled objects in each class.
Complexity versus Agreement for Many Views
243
a Laplacian-based algorithm (co-Laplacian), which outperform co-RLS on the tricky two spirals-two clouds data set, and finally an algorithm with none of the terms set to 0. Since all these algorithms are specialization of the general algorithm, with some parameters set to 0 to highlight some behaviors, we just tuned the parameters by hand trying to find the best results for each algorithm. Finally, note that the choice of the kernels for each view is important, and we used well-suited kernels for each problem (gaussian for clouds, linear for lines, . . . ). algo dataset 1 RLS 0.455 ± 0.035 co-RLS 0.146 ± 0.071 co-Laplacian 0.242 ± 0.040 general 0.011 ± 0.015
dataset 2 dataset 3 0.103 ± 0.024 0.379 ± 0.026 0.103 ± 0.024 0.467 ± 0.025 0.001 ± 0.004 0.510 ± 0.028 0.322 ± 0.067 0.042 ± 0.071
Empirical misclassification errors for the above algorithms (one set of parameters per dataset, some possibly put to zero when specified to each algorithm), averaged over 1000 runs.
6
Discussion and Conclusion
In this paper, we have combined different aspects of semi-supervised and multiview learning into one algorithm. Based on previous work, we have derived an explicit control for the L1 -diameter (Rademacher complexity) of the class of decision functions for this new algorithm. Besides, we have shown how considering the full multi-view learning problem may generate new questions. Combining stability ideas from the statistical and clustering community, we have proposed a new stability-based parameter selection procedure, which benefits from strong recent theoretical developments. For this procedure to be implementable, we have controlled the L2 -diameter of the class as well, which has not been investigated so far for similar settings.
References Ando, R.K., Zhang, T.: Learning on graph with laplacian regularization. In: Sch¨ olkopf, B., Platt, J., Hoffman, T. (eds.) Advances in neural information processing systems, vol. 19, pp. 25–32. MIT Press, Cambridge (2007) Balcan, M., Blum, A.: A PAC-style model for learning from labeled and unlabeled data. In: Auer, P., Meir, R. (eds.) COLT 2005. LNCS (LNAI), vol. 3559, pp. 111– 126. Springer, Heidelberg (2005) Balcan, M.F., Blum, A., Yang, K.: Co-training and expansion: Towards bridging theory and practice. In: Saul, L.K., Weiss, Y., Bottou, L. (eds.) Advances in neural information processing systems, vol. 17, pp. 89–96. MIT Press, Cambridge (2005) Belkin, M., Niyogi, P., Sindhwani, V.: On Manifold Regularization. In: AISTAT (2005)
244
O.-A. Maillard and N. Vayatis
Ben-David, S., von Luxburg, U., Pal, D.: A sober look at clustering stability. In: Lugosi, G., Simon, H.U. (eds.) COLT 2006. LNCS (LNAI), vol. 4005, pp. 5–19. Springer, Heidelberg (2006) Berthet, P., Shi, Z.: Small ball estimates for brownian motion under a weighted supnorm. Studia Sci. Math. Hung, 1–2, 275–289 (2001) Blum, A., Mitchell, T.: Combining labeled and unlabeled data with co-training. In: COLT 1998: Proceedings of the eleventh annual conference on Computational learning theory, pp. 92–100. ACM, New York (1998) Celisse, A.: Model selection via cross-validation in density estimation, regression, and change-points detection. Doctoral dissertation, Universite Paris Sud, Faculte des Sciences d’Orsay (2008) Golub, G.H., Van Loan, C.F.: Matrix computations. The Johns Hopkins University Press (1996) Koltchinskii, V.: Local rademacher complexities and oracle inequalities in risk minimization. The Annals of Statistics 34(6), 2593–2656 (2006) Ledoux, M., Talagrand, M.: Probability on banach spaces: Isoperimetry and processes. Springer, Berlin (1991) Li, W.V., Linde, W.: Approximation, metric entropy and small ball estimates for gaussian measures. Ann. Probab. 27, 1556–1578 (1999) Rosenberg, D., Bartlett, P.L.: The rademacher complexity of co-regularized kernel classes. In: Proceedings of the Eleventh ICAIS (2007) Sindhwani, V., Niyogi, P., Belkin, M.: A co-regularization approach to semi-supervised learning with multiple views. In: Workshop on Learning with Multiple Views, Proceedings of International Conference on Machine Learning (2005) Sindhwani, V., Rosenberg, D.S.: An rkhs for multi-view learning and manifold coregularization. In: ICML 2008: Proceedings of the 25th international conference on Machine learning, pp. 976–983. ACM, New York (2008) Smola, A.J., Kondor, R.: Kernels and regularization on graphs. In: Conference on Learning Theory and 7th Kernel Workshop, pp. 144–158 (2003) Sridharan, K., Kakade, S.M.: An information theoretic framework for multi-view learning. In: COLT, pp. 403–414. Omnipress (2008) Weston, J., Leslie, C., Ie, E., Zhou, D., Elisseeff, A., Noble, W.S.: Semi-supervised protein classification using cluster kernels. Bioinformatics 21, 3241–3247 (2005) Zhou, D., Bousquet, O., Lal, T.N., Weston, J., Sch¨ olkopf, B.: Learning with local and global consistency. In: Advances in Neural Information Processing Systems, vol. 16, pp. 321–328. MIT Press, Cambridge (2004)
Appendix - Proofs Sketch of Proof of Theorem 1 The proof of Theorem 1 follows the same line as (Rosenberg & Bartlett, 2007) and extends their result to the compound regularization penalty in the case of an arbitrary number of views. Since there is no novelty in the proof technique, we do not reproduce it here entirely. For completeness, we recall the main next steps: (i) use classical invariance properties of the kernel function to reformulate the optimization problem with an invertible matrix, (ii) apply Lemma 3 below to get the solution, (iii) eventually, rewrite it with the formulation involving the initial data by use of the Sherman-Morrison-Woodbury formula (Golub & Van Loan, 1996). We provide the key intermediate steps adapted to our setting.
Complexity versus Agreement for Many Views
245
Lemma 2. Under assumption (A1), the solution of the minimization problem [1] belongs to the set L ∩ H. Proof: Let Q be the functional to be minimized, decomposed as: Q(f ) = Loss(f )+ Π(f ). For the null multi-view predictor 0 ∈ F, we have Q(0) = Loss(0), thus under assumption (A1), inf Q ≤ 1. But since all terms of Q are non negative, the solution is in H. Finally, that f ∗ ∈ L by the representer theorem. First, we apply Lemma 2 to reduce the search space. Then, if f ∈ L ∩ H, thanks to the representer theorem, we can write its component in each view f (v) = (v) (v) (v) fα(v) = ni=1 αi kv (., xi ), where α(v) ∈ n . Thus, a matrix reformulation of f ∈ L ∩ H is: f ∈ {(fα(1) , ..., fα(V ) ) : where α ∈
αT N α ≤ 1}
nV ×1
, and the data-dependent N square matrix is1 : v ,v ˜ + γ˜ Diag K (1) LK (1) . . . K (V ) LK (V ) + KC1 2 , and N = λK v1 =v2
⎛
v1 ,v2 KC
0 .. .
⎞
⎛
0 .. .
⎞T
⎟ ⎟ ⎜ ⎜ ⎟ ⎟ ⎜ ⎜ ⎜ (v ) ⎟ ⎜ (v ) ⎟ ⎜K 1 ⎟ ⎜K 1 ⎟ ⎟ ⎟ ⎜ ⎜ ⎟ ⎟ ⎜ ⎜ = ⎜ ... ⎟ Cv1 ,v2 ⎜ ... ⎟ ⎟ ⎟ ⎜ ⎜ ⎜−K (v2 ) ⎟ ⎜−K (v2 ) ⎟ ⎟ ⎟ ⎜ ⎜ ⎜ . ⎟ ⎜ . ⎟ . . ⎝ . ⎠ ⎝ . ⎠ 0 0
.
Thus, the definition of the Rademacher complexity can be seen as the solution to an optimization problem under quadratic constraint. Indeed, since H is symmetrical: Rl (J ) =
2 lV
σ
sup α;αT N α≤1
αT KLT σ
with σ = (σ1 , . . . , σl )T ∈ l×1 . To apply Lemma 3, we need an invertible matrix. Let P, Σ such that P (v)T K (v) P (v) = Σ (v) is the diagonal matrix of non zero (v) eigenvalues of K (v) .We introduce α// , the projection of α(v) on the subspace
associated to the rows of K (v) . Since αT N α is left unchanged under this pro(v) jection, we rewrite it with a(v) such that P (v) a(v) = α// , ending up with the constraint aT T a ≤ 1 where T is now an invertible matrix. As mentioned, we use the following lemmas to conclude: 1
Diag(v1 ....vk ) is a shortcut notation for the square matrix with diagonal blocks v1 , ..., vk on the diagonal.
246
O.-A. Maillard and N. Vayatis
Lemma 3. If M is a symmetric positive definite matrix, then sup α:αT Mα≤1
v T α = ||M −1/2 v|| .
Lemma 4. (Sherman-Morrison-Woodbury formula) Provided that the inverses exist: (A + U U T )−1 = A−1 − A−1 U (I + U T A−1 U )−1 U T A−1 .
Sketch of Proof of Theorem 2 The proof essentially follows the same steps as Theorem 1, but uses Lemma 5 below to solve the minimization problem. 1/2 ˆ J (r) = sup , which is also By definition, we have D φ1 ,φ2 ∈J (r) l (φ1 − φ2 )
(Pl ((φ1 − φ2 )2 ))1/2
sup
φ1 ,φ2 ∈J (r)
l 4 sup ≤ φ(xi )2 l φ∈J (r) i=1
1/2 .
(v) (v) ˆ J (r)2 ≤ 42 αT Since f (v) (xi ) = Ki α(v) , where Ki is the ith row of K (v) , D V l nV ×nV Dα where D ∈ is the symmetrical matrix with block v, w equal to l (v)T (w) Ki . Applying Lemma 2 with (A1), φ ∈ J (r) is αT N α ≤ r. If we i=1 Ki introduce the same transformation as in Theorem 1, this is again aT T a ≤ r with now invertible T . Moreover T as the appropriate form A + U U T for Lemma 4. T Thus Lemma 5 first tells us that we want the highest eigenvalue of T −1 P DP .
Lemma 5. When M is symmetric positive definite and Q is symmetric positive semidefinite, the quadratic problem : sup a;aT Ma≤r
aT Qa
admits as solution λr, where λ is the highest eigenvalue of M −1 Q. Now, D = KeK T with e ∈ n×n being the projection matrix with diagonal blocks Il and 0u . Lemma 4 applies to T −1 , and since the eigenvalues of T T T −1 P DP and P T −1 P D are the same, we compute: T
T
˜ −1 Σ P A−1 P D = P P BP λ
−1
˜ −1 Π(K T )T P T KeK T = B λ L
Where A comes from Lemma 4. Similar computations yield the second term and allow to conclude the proof.
Error-Correcting Tournaments Alina Beygelzimer1, John Langford2, and Pradeep Ravikumar3 1
IBM Thomas J. Watson Research Center, Hawthorne, NY 10532, USA
[email protected] 2 Yahoo! Research, New York, NY 10018, USA
[email protected] 3 University of California, Berkeley, CA 94720, USA
[email protected]
Abstract. We present a family of pairwise tournaments reducing k-class classification to binary classification. These reductions are provably robust against a constant fraction of binary errors, and match the best possible computation and regret up to a constant.
1 Introduction We consider the classical problem of multiclass classification, where given an instance x ∈ X, the goal is to predict the most likely label y ∈ {1, . . . , k}, according to some unknown probability distribution. A common general approach to multiclass learning is to reduce a multiclass problem to a set of binary classification problems [2,6,10,11,14]. This black-box approach is composable with any binary learning algorithm (and thus bias), including online algorithms, Bayesian algorithms, and even humans. A key technique for analyzing reductions is regret analysis, which bounds the “regret” of the resulting multiclass classifier in terms of the average classification “regret” on the binary problems. Here regret is the difference between the incurred loss and the smallest achievable loss on the problem, i.e., excess loss due to suboptimal prediction. The most commonly applied reduction is one-against-all, which creates a binary classification problem for each of the k classes: The classifier for class i is trained to predict whether the label is i or not; predictions are done by evaluating each binary classifier and randomizing over those which predict “yes,” or randomly if all answers are “no”. This simple reduction is inconsistent, in the sense that given optimal (zero-regret) binary classifiers, the reduction may not yield an optimal multiclass classifier in the presence of noise. Optimizing squared loss of the binary predictions instead of the 0/1 loss √ makes the approach consistent, but the resulting multiclass regret may be as high as 2kr, where r is the average squared loss regret on the induced problems, which is upper bounded by the average binary classification regret via the Probing reduction [15]. The probabilistic error correcting output code approach (PECOC) [14] reduces kclass classification to learning O(k) regressors on the interval [0, 1], creating O(k) binary examples per multiclass example at both training and test time, with √ a test time computation of O(k 2 ). The resulting multiclass regret is bounded by 4 r, where r is the average squared loss regret of the regressors (which is upper bounded by the average R. Gavald`a et al. (Eds.): ALT 2009, LNAI 5809, pp. 247–262, 2009. c Springer-Verlag Berlin Heidelberg 2009
248
A. Beygelzimer, J. Langford, and P. Ravikumar
binary classification regret via the Probing reduction [15]). Thus PECOC removes the dependence on the number of classes k. When only a constant number of labels have non-zero probability given x, the complexity can be reduced to O(log k) examples per multiclass example and O(k log k) computation per example [13]. This leads to several questions: 1. Is there a consistent reduction from multiclass to binary classification that does not have a square root dependence [17]? For example, an average binary regret of just 0.01 may imply a PECOC multiclass regret of 0.4. 2. Is there a consistent reduction that requires just O(log k) computation, matching the information theoretic lower bound? The well known tree reduction (see [9]) distinguishes between the labels using a balanced binary tree, where each non-leaf nodes predicts “Is the correct multiclass label to the left or right?”. As shown in Section 2, this method is inconsistent. 3. Can the above be achieved with a reduction that only performs pairwise comparisons between classes? One fear associated with the PECOC approach is that it creates binary problems of the form “What is the probability that the label is in a given random subset of labels?,” which may be hard to solve. Although this fear is addressed by regret analysis (as the latter operates only on excess loss), and is overstated in some cases [8,13], it is still of some concern, especially with larger values of k. The error-correcting tournament family presented here answers all of these questions in the affirmative. It provides an exponentially faster in k method for multiclass prediction with the resulting multiclass regret bounded by 5.5r, where r is the average binary regret and every binary classifier logically compares two distinct class labels. The result is based on a basic observation that if a non-leaf node fails to predict its binary label, which may be unavoidable due to noise in the distribution, nodes between this node and the root should have no preference for class label prediction. Utilizing this observation, we construct a reduction, called the Filter Tree, with the property that it uses O(log k) binary examples and O(log k) computation at training and test time with a multiclass regret bounded by log k times the average binary regret. The decision process of a Filter Tree, viewed bottom up, can be viewed as a singleelimination tournament on a set of k players. Using c independent single-elimination tournaments is of no use as it does not affect the average regret of an adversary controlling the binary classifiers. Somewhat surprisingly, it is possible to have c = log k complete single-elimination tournaments between k players in O(log k) rounds with no player playing twice in the same round [5]. All error-correcting tournaments first pair labels in consecutive interfering single-elimination tournaments, followed by a final carefully weighted single-elimination tournament that decides among the log2 k winners of the first phase. As for the Filter Tree, test time evaluation can start at the root and proceed to a multiclass label with O(log k) computation. This construction is also useful for the problem of robust search, yielding the first algorithm which allows the adversary to err a constant fraction of the time in the “full lie” setting [16] where a comparator can missort any comparison. 
Previous work either applied to the “half lie” case where a comparator can fail to sort but can not actively missort [5,18] or to a “full lie” setting where an adversary has a fixed known bound on
Error-Correcting Tournaments
249
the number of lies [16] or a fixed budget on the fraction of errors so far [4,3]. Indeed, it might even appear impossible to have an algorithm robust to a constant fraction of full lie errors since an error can always be reserved for the last comparison. By repeating the last comparison O(log k) times we can defeat this strategy. The result here is also useful for the actual problem of tournament construction in games with real players. Our analysis does not assume that errors are i.i.d. [7], or have known noise distributions [1] or known outcome distributions given player skills [12]. Consequently, the tournaments we construct are robust against severe bias such as a biased referee or some forms of bribery and collusion. Furthermore, the tournaments we construct are shallow, requiring fewer rounds than m-elimination bracket tournaments, which do not satisfy the guarantee provided here. In an m-elimination bracket tournament, bracket i is a single-elimination tournament on all players except the winners of brackets 1, . . . , i − 1. After the bracket winners are determined, the player winning the last bracket m plays the winner of bracket m − 1 repeatedly until one player has suffered m losses (they start with m − 1 and m − 2 losses respectively). The winner moves on to pair against the winner of bracket m − 2, and the process continues until only one player remains. This method does not scale well to large m, as the final elimination m phase takes i=1 i − 1 = O(m2 ) rounds. Even for k = 8 and m = 3, our constructions have smaller maximum depth than bracketed 3-elimination. Paper overview. Section 2 shows that the simple divide-and-conquer tree approach is inconsistent, motivating the Filter Tree algorithm described in section 3 (which applies to more general cost sensitive multiclass problems). Section 3.1 proves that the algorithm has the best possible computational dependence, and gives two upper bounds on the regret of the returned (cost-sensitive) multiclass classifier. Section 4 presents the error-correcting tournament family parametrized by an integer m ≥ 1, which controls the tradeoff between maximizing robustness (m large) and minimizing depth (m small). Setting m = 1 gives the Filter Tree, while m = 4 ln k gives a (multiclass to binary) regret ratio of 5.5 with O(log k) depth. Setting m = ck gives regret ratio of 3 + O(1/c) with depth O(k). The results here provide a nearly free generalization of earlier work [5] in the robust search setting, to a more powerful adversary that can missort as well as fail to sort. Section 5 gives an algorithm independent lower bound of 2 on the regret ratio for large k. When the number of calls to a binary classifier is independent (or nearly independent) of the label predicted, we strengthen this lower bound to 3 for large k.
2 Inconsistency of Divide and Conquer Trees One standard approach for reducing multiclass learning to binary learning is to split the set of labels in half, then learn a binary classifier to distinguish between the subsets, and repeat recursively until each subset contains one label. Multiclass predictions are made by following a chain of classifications from the root down to the leaves. The following theorem shows that there exist multiclass problems such that even if we have an optimal classifier for the induced binary problem at each node, the tree reduction does not yield an optimal multiclass predictor.
250
A. Beygelzimer, J. Langford, and P. Ravikumar 1
2
3
1 vs 2 {winner of 1 vs 2}
4
5
3 vs 4
6
5 vs 6
vs {winner of 3 vs 4}
7
{winner of 5 vs 6} vs 7
.
Fig. 1. Filter Tree. Each node predicts whether the left or the right input label is more likely, conditioned on a given x ∈ X. The root node predicts the best label for x.
Notation. Let D be the underlying distribution over X×Y , where X is some observable feature space and Y = {1, . . . , k} is the label space. The error rate of a classifier f : X → Y on D is given by err(f, D) = Pr(x,y)∼D [f (x) = y]. The regret of f on D is defined as reg(f, D) = err(f, D) − minf ∗ err(f ∗ , D). The tree reduction transforms D into a distribution DT over binary labeled examples by drawing a multiclass example (x, y) from D, drawing a random non-leaf node i, and outputting instance x, i with label 1 if y is in the left subtree of node i, and 0 otherwise. A binary classifier f for this problem induces a multiclass classifier T (f ), via a chain of binary predictions starting from the root. Theorem 1. For all k ≥ 3, for all binary trees over the labels, there exists a multiclass distribution D such that reg(T (f ∗ ), D) > 0 for any f ∗ = arg min err(f, DT ). f
Proof. Find a node with one subset corresponding to two labels and the other subset corresponding to a single label. (If the tree is perfectly balanced, simply let D assign probability 0 to one of the labels.) Since we can freely rename labels without changing the underlying problem, let the first two labels be 1 and 2, and the third label be 3. Choose D with the property that D(y = 1 | x) = D(y = 2 | x) = 1/4 + 1/100, while D(y = 3 | x) = 1/2 − 2/100. Under this distribution, the fraction of examples for which label 1 or 2 is correct is 1/2 + 2/100, so any minimum error rate binary predictor must choose either label 1 or label 2. Each of these choices has an error rate of 3/4 − 1/100. The optimal multiclass predictor chooses label 3 and suffers an error rate of 1/2 + 2/100, implying that the regret of the tree classifier based on an optimal binary classifier is 1/4 − 3/100 > 0.
3 The Filter Tree Algorithm The Filter Tree algorithm is illustrated by Figure 1. It is equivalent to a single-elimination tournament on the set of labels structured as a binary tree T over the labels. In the first round, the labels are paired according to the lowest level of the tree, and a classifier is trained for each pair to predict which of the two labels is more likely. (The labels that don’t have a pair in a given round, win that round for free.) The winning labels from the
Error-Correcting Tournaments
251
Algorithm 1. Filter-Train (multiclass training set S, binary learner Learn) for each non-leaf node n in order from leaves to root do Set Sn = ∅ for each (x, y) ∈ S such that y ∈ Γ (Tn ) and all nodes u on the path n ; y predict yu given x do add (x, yn ) to Sn end Let cn = Learn(Sn ) end return c = {cn }
first round are in turn paired in the second round, and a classifier is trained to predict whether the winner of one pair is more likely than the winner of the other. The process of training classifiers to predict the best of a pair of winners from the previous round is repeated until the root classifier is trained. The setting above is akin to Boosting: At each round t, a booster creates an input distribution Dt and calls an oracle learning algorithm to obtain a classifier with some error t on Dt . The distribution Dt depends on the classifiers returned by the oracle in previous rounds. The accuracy of the final classifier is analyzed in terms of t ’s. Let Tn be the subtree of T rooted at node n. The set of leaves of a tree T is denoted by Γ (T ). Let yn be the bit specifying whether the multiclass label y is in the left subtree of n or not. The key trick in the training stage (Algorithm 1) is to form the right training set at each interior node. A training example for node n is formed conditioned on the predictions of classifiers in the round before it. Thus the learned classifiers from the first level of the tree are used to “filter” the distribution over examples reaching the second level of the tree. Given x and classifiers at each node, every edge in T is identified with a unique label. The optimal decision at any non-leaf node is to choose the input edge (label) that is more likely according to the true conditional probability. This can be done by using the outputs of classifiers in the round before it as a filter during the training process: For each observation, we set the label to 0 if the left parent’s output matches the multiclass label, 1 if the right parent’s output matches, and reject the example otherwise. The testing algorithm, Filter-Test, is very simple. Given a test example x ∈ X, we output the label y such that every classifier on the path from y to the root prefers y. Algorithm 2 extends this idea to the cost-sensitive multiclass case where each choice has a different associated cost. Formally, a cost-sensitive k-class classification problem is defined by a distribution D over X × [0, 1]k . The expected cost of a classifier f : X → {1, ..., k} of D is (f, D) = E(x,c)∼D cf (x) . Here c ∈ [0, 1]k gives the cost of each of the k choices for x. As in the multiclass case (which is a special case), the regret of f on D is defined as regc (f, D) = (f, D) − minf ∗ (f ∗ , D). The algorithm relies upon an importance weighted binary learning algorithm, which takes examples of the form (x, y, w), where x is a feature vector used for prediction, y is a binary label, and w ∈ [0, ∞) is the importance any classifier pays if it doesn’t predict y on x.
252
A. Beygelzimer, J. Langford, and P. Ravikumar
Algorithm 2. C-Filter-Train (cost-sensitive training set S, importance-weighted binary learner Learn) for each non-leaf node n in the order from leaves to root do Set Sn = ∅ for each example (x, c1 , ..., ck ) ∈ S do Let a and b be the two classes input to n Sn ← Sn ∪ {(x, arg min{ca , cb }, |ca − cb |)} end Let cn = Learn(Sn ) end return c = {cn }
3.1 Filter Tree Analysis Before doing the regret analysis, we note the computational characteristics of the Filter Tree. Since the algorithm is a reduction, we count the computational complexity in the reduction itself, assuming that the oracle calls take unit time. 1. Algorithm 1 requires O(log k) computation per multiclass example, by searching for the correct leaf in O(log k) time, then filtering back toward the root. This matches the information theoretic lower bound since simply reading one of k labels requires log2 k bits. 2. Algorithm 2 requires O(k) computation per cost sensitive example, because there are k − 1 nodes, each requiring constant computation per example. Since any method must read the k costs, this bound is tight. 3. The testing algorithm is the same for both multiclass and cost-sensitive variants, requiring O(log k) computation per example to descend a binary tree. Any method must write out labels of length log2 k bits. First, we define several concepts necessary to understand the analysis. Algorithm 2 transforms cost-sensitive multiclass examples into importance-weighted binary examples. This process implicitly transforms a distribution D over cost sensitive multiclass examples into a distribution DFT over importance-weighted binary examples. There are many induced problems, one for each call to the oracle Learn. To simplify the analysis, we use a standard transformation allowing us to consider only a single induced problem: We add the node index n as an additional feature into each importance weighted binary example, and then train based upon the union of all the training sets. The learning algorithm produces a single binary classifier c(x, n) for which we can redefine cn (x) as c(x, n). The induced distribution DFT can be defined by the following process: (1) draw a cost-sensitive example (x, c) from D, (2) pick a random node n, (3) create an importance-weighted sample according to the algorithm, except using x, n instead of x. The theorem is quantified over all classifiers, and thus it holds for the classifier returned by the algorithm. In practice, one can either call the oracle multiple times to learn a separate classifier for each node (as we do in our experiments), or use iterative techniques for dealing with the fact that the classifiers are dependent on other classifiers closer to the leaves.
Error-Correcting Tournaments
253
When reducing to importance-weighted classification, the theorem statement depends on importance weights. To remove the importances, we compose the reduction with the Costing reduction [19], which alters the underlying distribution using rejection sampling on the importance weights. This composition transforms DFT into a distribution D over binary examples. We use the folk theorem from [19] saying that for all binary classifiers f and all importance weighted binary distributions P , the importance weighted binary regret of f on P is upper bounded by E(x,y,w)∼P [w] times the binary regret of f on the induced binary distribution. The core theorem relates the regret of a binary classifier f to the regret of the induced cost sensitive classifier Filter-Test(f ). Theorem 2. For all binary classifiers f and all cost sensitive multiclass distributions D, regc (Filter-Test(f ), D) ≤ reg(f, D )E(x,c)∼D
w(n, x, c),
n∈T
where w(n, x, c) is the importance weight in Algorithm 2 (the difference in cost between the two labels that node n chooses between on x), and D is the induced distribution as defined above. Before proving the theorem, we state the corollary for multiclass classification. Corollary 1. For all binary classifiers f and all multiclass distributions D on k labels, for all Filter Trees of depth d, reg(Filter-Test(f ), D) ≤ d · reg(f, DFT ). (Since all importance weights are either 0 or 1, we don’t need to apply Costing.) The proof of the corollary given the theorem is simple since for any (x, y), the induced (x, c) has at most one node per level with induced importance weight 1; all other importance weights are 0. Therefore, n w(n, x, c) ≤ d. Theorem 3 provides an alternative bound for cost-sensitive classification. It is the first known bound giving a worst-case dependence of less than k. Theorem 3. For all binary classifiers f and all cost-sensitive k-class distributions D, regc(Filter-Test(f ), D) ≤ k reg(f, D )/2, where D is as defined above. The remainder of this section proves Theorems 2 and 3. Proof. (Theorem 2) It is sufficient to prove the claim for any x ∈ X because that implies that the result holds for all expectations over x. Conditioned on the value of x, each label y has a distribution over costs cy with an expected value Ec∼D|x [cy ]. The zero regret cost sensitive classifier predicts according to arg miny Ec∼D|x [cy ]. Suppose that Filter-Test(f ) predicts y on x, inducing cost sensitive regret regc (y , D|x) = Ec∼D|x [cy ] − miny Ec∼D|x [cy ]. First, we show that the sum over the binary problems of the importance weighted regret is at least regc(y , D|x), using induction starting at the leaves. The induction hypothesis is that the sum of the regrets of importance-weighted binary classifiers in any subtree bounds the regret of the subtree output.
254
A. Beygelzimer, J. Langford, and P. Ravikumar
For node n, each importance weighted binary decision between class a and class b has an importance weighted regret which is either 0 or rn = |Ec∼D|x [ca − cb ]| = |Ec∼D|x [ca ] − Ec∼D|x [cb ]|, depending on whether the prediction is correct or not. Assume without loss of generality that the predictor outputs class b. The regret of the subtree Tn rooted at n is given by rTn = Ec∼D|x [cb ] − miny∈Γ (Tn ) Ec∼D|x [cy ]. As a base case, the inductive hypothesis is trivially satisfied for trees with one label. Inductively, assume that n ∈L rn ≥ rL and n ∈R rn ≥ rR for the left subtree L of n (providing a) and the right subtree R (providing b). There are two possibilities. Either the minimizer comes from the leaves of L or the leaves of R. The second possibility is easy since we have rTn = Ec∼D|x [cb ] − min Ec∼D|x [cy ] = rR ≤ rn ≤ rn , y∈Γ (R)
n ∈R
n ∈Tn
which proves the induction. For the first possibility, we have rTn = Ec∼D|x [cb ] − min Ec∼D|x [cy ] y∈Γ (L)
= Ec∼D|x [cb ] − Ec∼D|x [ca ] + Ec∼D|x [ca ] − min Ec∼D|x [cy ] y∈Γ (L) = Ec∼D|x [cb ] − Ec∼D|x [ca ] + rL ≤ rn + rn ≤ rn , n ∈L
n ∈Tn
which completes the induction. The inductive hypothesis for the root is that regc (y , D|x) ≤ n∈T rn , implying regc (y , D|x) ≤ n∈T rn = (k − 1) · ri (f, DFT ), where ri is the importance weighted binary regret on the induced problem. Using the folk theorem from [19], we have ri (f, DFT ) = reg(f, D )E(x,y,w)∼DFT [w]. 1 E(x,c)∼D n∈T w(n, x, c). Plugging this in, we get The expected importance is k−1 the theorem. The proof of Theorem 3 makes use of the following inequality. Consider a Filter Tree T evaluated on a cost-sensitive multiclass instance with cost vector c ∈ [0, 1]k . Let ST be the sum of importances over all nodes in T , and IT be the sum of importances over the nodes where the class with the larger cost was selected for the next round. Let cT denote the cost of the winner chosen by T . Lemma 1. For any Filter Tree T on k labels, ST + cT ≤ IT + k2 . Proof. The inequality follows by induction, the result being clear when k = 2. Assume that the claim holds for the two subtrees, L and R, providing their respective inputs l and r to the root of T , and T outputs r without loss of generality. Using the inductive hypotheses for L and R, we get ST +cT = SL +SR +|cr −cl |+cr ≤ IL +IR + k2 −cl + |cr −cl |. If cr ≥ cl , we have IT = IL +IR +(cr −cl ), and ST +cT ≤ IT + k2 −cl ≤ IT + k k k 2 , as desired. If cr < cl , we have IT = IL + IR and ST + cT ≤ IT + 2 − cr ≤ IT + 2 , completing the proof. Proof. (Theorem 3) We will fix (x, c) ∈ X × [0, 1]k and take the expectation over the draw of (x, c) from D as the last step.
Error-Correcting Tournaments
255
Consider a Filter Tree T evaluated on (x, c) using a given binary classifier b. As before, let ST be the sum of importances over all nodes in T , and IT be the sum of importances over the nodes where b made a mistake. Recall that the regret of T on (x, c), denoted in the proof by regT , is the difference between the cost of the tree’s output and the smallest cost c∗ . The importance-weighted binary regret of b on (x, c) is simply IT /ST . Since the expected importance is upper bounded by 1, IT /ST also bounds the binary regret of b. The inequality we need to prove is regT ST ≤ k2 IT . The proof is by induction on k, the result being trivial if k = 2. Assume that the assertion holds for the two subtrees, L and R, providing their respective inputs l and r to the root of T . (The number of classes in L and R can be taken to be even, by splitting the odd class into two classes with the same cost as the split class, which has no effect on the quantities in the theorem statement.) Let the best cost c∗ be in the left subtree L. Suppose first that T chooses r and cr > cl . Let w = cr − cl . We have regL = cl − c∗ and regT = cr − c∗ = regL + w. The left hand side of the inequality is thus regT ST = (regL + w)(SR + SL + w) = w(regL +SR +SL +w)+regL (SL +SR ) ≤ w regL + IR + IL − cr − cl + w + k2 + regL IR + IL − cl − cr + k2 ≤ k2 w + IR (w + regL ) + IL (w + regL ) + k k regL 2 − cr − cl ≤ 2 w + IR (w + regL ) + IL w + regL + k2 − cr − cl ≤ k2 w + IR (w + regL ) + k2 IL ≤ k2 (w + IR + IL ) = k2 IT . The first inequality follows from lemma 1. The second and fourth follow from w(regL − cl − cr + w) ≤ 0. The third follows from regL ≤ IL . The fifth follows from regT ≤ k2 for k ≥ 2. The proofs for the remaining three cases (cT = cl < cr , cT = cl > cr , and cl > cr = cT ) use the same machinery as the proof above. Case 2. T outputs l, and cl < cr . In this case regT = regL = cl − c∗ . The left hand side can be rewritten as regT ST = regL (SR + SL + cr −cl ) = regL SL + regL (SR + cr − cl ) ≤ regL IL + IR − 2cl + k2 ≤ IR + regL IL − 2cl + k2 ≤ IR + IL regL −2cl + k2 ≤ IR + k2 IL ≤ k2 IT . The first inequality from the lemma, the second from regL ≤ 1, the third from regL ≤ IL , the fourth from −cL − c∗ < 0, and the fifth because IT = IL + IR . Case 3. T outputs l, and cl > cr . We have regT = regL = cl − c∗ . The left hand side can be written as |L| k − |L| regT ST = regL (SR + SL + cl − cr ) ≤ IL +regL IR + − cr + c l − c r 2 2 k k k ≤ IL + IR + (cl − 2cr ) ≤ (IL + IR + (cl − cr )) = IT , 2 2 2 The first inequality follows from the inductive hypothesis and the lemma, the second from regL < 1 and regL < IL , and the third from cr > 0 and k/2 > 1. Case 4. T outputs r, and cl > cr . Let w = cl −cr . We have regT = cr −c∗ = regL −w. The left hand side can be written as
256
A. Beygelzimer, J. Langford, and P. Ravikumar
regT ST = (regL − w)(SR + SL + w) = regL SL − wSL + (regL − w)(SR + w) |L| k − |L| |L| IL − w IL + − cl + (regL − w) IR + cl − 2cr + ≤ 2 2 2 |L| |L| k − |L| IL − w IL + − cl + (IL − w) + (regL − w) (IR + cl − 2cr ) ≤ 2 2 2 k k ≤ (IL + IR ) − w − w(IL − cl ) + (regL − w)(cl − 2cr ). 2 2
The first inequality follows from the inductive hypothesis and the lemma, the second from regL ≤ IL , and the third from regL ≤ k2 . The last three terms are upper bounded by −w − wregL + wcl + regL cl − 2cr regL − wcl + 2wcr ≤ −w − regL (cr + cl ) + regL cl + 2wcr ≤ −w − (cl − c∗ )cr + wcr + (cl − cr )cr ≤ 0, and thus can be ignored, yielding regT ST ≤ k2 (IL + IR ) = k2 IT , which completes the proof. Taking the expectation over (x, c) completes the proof. 3.2 Lower Bound The following simple example shows that the theorem is essentially tight in the worst case. Let k be a power of two, and let every label have cost 0 if it is is even, and 1 otherwise. The tree structure is a complete binary tree of depth log k with the nodes being paired in the order of their labels. Suppose that all pairwise classifications are correct, except for class k wins all its log k games leading to cost-sensitive multiclass regret 1. If T is the resulting filter tree, we have regT = 1, ST = k2 + log k − 1, and IT = log k, leading to reg S k−1 k = Ω( 2 log the reget ratio of ITT T ≤ k/2+log log k k ), almost matching the theorem’s k bound of 2 the regret ratio.
4 Error-Correcting Tournaments In this section we first state and then analyze error correcting tournaments. As this section builds on the previous section, understanding the previous should be considered prerequisite for reading this section. For simplicity, we work with only the multiclass case. An extension for cost-sensitive multiclass problems is possible using the importance weighting techniques of the previous section. 4.1 Algorithm Description An error-correcting tournament is one of a family of m-elimination tournaments where m is a natural number. An m-elimination tournament operates in two phases. The first phase consists of m single-elimination tournaments over the k labels where a label is paired against another label at most once per round. Consequently, only one of these single elimination tournaments has a simple binary tree structure—see for example Figure 2 for an m = 3 elimination tournament on k = 8 labels. There is substantial freedom in exactly how the pairings of the first phase are done—our bounds are
Error-Correcting Tournaments
257
1
2
3
4
5
6
7
8
dependent on the depth of any mechanism which pairs labels in m distinct single elimination tournaments. One such explicit mechanism is stated in [5]. Note that once an (x, y) example has lost m times, it is eliminated and no longer influences training at the nodes closer to the root. The second phase is a final elimination phase, where we select the winner from the m winners of the first phase. It consists of a redundant single-elimination tournament, where the degree of redundancy increases as the root is approached. To quantify the redundancy, let every subtree Q have a charge cQ equal to the number of leaves under the subtree. First phase winners at the leaves of final elimination tournament have charge 1. For any non-leaf node comparing subtree R to subtree L, the importance weight of a binary example is set to max{cR , cL }. For reference, in tournament applications, an importance weight can be expressed by playing games repeatedly where the winner of R must beat the winner of L cL times to advance, and vice versa. One complication arises: what happens when the two labels compared are the same? In this case, the importance weight is set to 0, indicating there is no preference in the pairing amongst the two choices.
Final Winner
Fig. 2. An example of a 3-elimination tournament on k = 8 players. There are m = 3 distinct single elimination tournaments in first phase—one as solid lines, one as dashed lines, and one as dotted lines. After that, a final elimination phase occurs over the three winners of the first phase. The final elimination tournament has an extra weighting on the nodes, detailed in the text.
4.2 Error Correcting Tournament Analysis A key concept throughout this section is the importance depth, defined as the worstcase length (number of games) of the overall tournament, where importance-weighted matches in the final elimination phase are played as repeated games. In Theorem 6 we prove a bound on the importance depth. The computational bound per example is essentially just the importance depth. Theorem 4. (Structural Depth Bound) For any m-elimination tournament, the training and test computation is O(m + ln k) per example.
Proof. The proof is by simplification of the importance depth bound (Theorem 6), which bounds the sum of importance weights at all nodes in the circuit. To see that the importance depth controls the computation, first note that the importance depth bounds the circuit depth since all importance weights are at least 1. At training time, any one example is used at most once per circuit level starting at the leaves. At testing time, an unlabeled example can have its label determined by traversing the structure from root to leaf.
4.3 Regret Analysis
Our regret theorem is the analogue of Corollary 1 for error-correcting tournaments. Using the one classifier trick detailed there, the reduction transforms a multiclass distribution D into an induced distribution ECT(D) over binary labeled examples. Let f_ECT denote the multiclass predictor induced by a binary classifier f. It is useful to have the notation ⌈m⌉₂ for the smallest power of 2 larger than or equal to m. Theorem 5. (Main Theorem) For all distributions D over k-class examples, all binary classifiers f, all m-elimination tournaments ECT, the ratio of reg(f_ECT, D) to reg(f, ECT(D)) is upper bounded by
2 + ⌈m⌉₂/m + k/2^m,   for all m ≥ 2 and k > 2;
4 + 2 ln k/m + 2√(ln k/m),   for all k ≤ 2^62 and m ≤ 4 log₂ k.
The first case shows that a regret ratio of 3 is achievable for very large m. The second case is the best bound for cases of common interest. For m = 4 ln k it gives a ratio of 5.5.
Proof. The proof holds for each input x, and hence in expectation over x. For a fixed x, we can define the regret of any label y as r_y = max_{y′∈{1,…,k}} D(y′ | x) − D(y | x). A node n comparing two labels y and y′ has regret r_n, which is |D(y | x) − D(y′ | x)| if the most probable label is not predicted, and 0 otherwise. The regret of a tree T is defined as r_T = Σ_{n∈T} r_n.
The first part of the proof is by induction on the tree structure F of the final phase. The invariant for a subtree Q of F won by label q is c_Q·r_q ≤ r_Q + Σ_{w∈Γ(Q)} r_{T_w}, where, for a leaf w, T_w is the first-phase single-elimination tournament won by w. When Q is a leaf w of F, we have c_Q·r_q = r_q ≤ r_{T_w}, where the inequality is from Corollary 1, noting that d times the average binary regret is the sum of binary regrets. Assume inductively that the hypothesis holds at node n for the right subtree R and the left subtree L of Q with respective winners q and l: c_R·r_q ≤ r_R + Σ_{w∈Γ(R)} r_{T_w} and c_L·r_l ≤ r_L + Σ_{w∈Γ(L)} r_{T_w}. Now, a chain of inequalities holds, completing the induction: r_Q + Σ_{w∈Γ(Q)} r_{T_w} ≥ c_L·r_n + r_R + r_L + Σ_{w∈Γ(R)} r_{T_w} + Σ_{w∈Γ(L)} r_{T_w} ≥ c_L·r_n + c_R·r_q + c_L·r_l ≥ c_Q·r_q. Here the first inequality uses the fact that the adversary must pay at least c_L·r_n to make q win. The second inequality follows by the inductive hypothesis. The third inequality comes from r_l + r_n ≥ r_q. To finish the proof, m·reg(f_ECT, D | x) = c_F·r_f ≤ r_F + Σ_{w∈Γ(F)} r_{T_w} ≤ d·reg(f, ECT(D | x)), where
d is the maximum importance depth and the last quantity follows from the folk theorem in [19]. Applying the importance depth bound (Theorem 6) and algebra completes the proof.
The depth bound follows from the following three lemmas.
Lemma 2. (First Phase Depth Bound) The importance depth of the first phase tournament is bounded by the minimum of
⌈log₂ k⌉ + m⌈log₂(log₂ k + 1)⌉;
1.5⌈log₂ k⌉ + 3m + 1;
k/2 + 2m;
and, for k ≤ 2^62 and m ≤ 4 log₂ k, 2(m − 1) + ln k + √(ln k (ln k + 4(m − 1))).
Proof. The depth of the first phase is bounded by the classical problem of robust minimum finding with low depth. The first three cases hold because any such construction upper bounds the depth of an error-correcting tournament, and one such construction has these bounds [5]. For the fourth case, we construct the depth bound by analyzing a continuous relaxation of the problem. The relaxation allows the number of labels remaining in each single-elimination tournament of the first phase to be broken into fractions. Relative to this version, the actual problem has two important discretizations:
1. When a single-elimination tournament has only a single label remaining, it enters the next single-elimination tournament. This can have the effect of decreasing the depth compared to the continuous relaxation.
2. When a single-elimination tournament has an odd number of labels remaining, the odd label does not play that round. Thus the number of players does not quite halve, potentially increasing the depth compared to the continuous relaxation.
In the continuous version, tournament i on round d has C(d, i−1)·k/2^d labels, where the first tournament corresponds to i = 1. Consequently, the number of labels remaining in any of the m tournaments is (k/2^d)·Σ_{i=1}^{m} C(d, i−1). We can get an estimate of the depth by finding the value of d such that this number is 1. This value of d can be found using the Chernoff bound. The probability that a coin with bias 1/2 has m − 1 or fewer heads in d coin flips is bounded by e^{−2d(1/2 − (m−1)/d)²}, and the probability that this occurs in k attempts is bounded by k times that. Setting this value to 1, we get ln k = 2d(1/2 − (m−1)/d)². Solving the equation for d gives d = 2(m − 1) + ln k + √(4(m − 1) ln k + (ln k)²). This last formula was verified computationally for k < 2^62 and m < 4 log₂ k by discretizing k into factors of 2 and running a simple program to keep track of the number of labels in each tournament at each level. For k ∈ {2^{l−1} + 1, …, 2^l}, we used a pessimistic value of k = 2^{l−1} + 1 in the above formula to compute the bound, and compared it to the output of the program for k = 2^l.
Lemma 3. (Second Phase Depth Bound) In any m-elimination tournament, the second phase has importance depth at most ⌈m⌉₂ − 1 rounds for m > 1.
Proof. When two labels are compared in round i ≥ 1, the importance weight of their comparison is at most 2^{i−1}. Thus we have Σ_{i=1}^{log₂⌈m⌉₂ − 1} 2^{i−1} + ⌈m⌉₂/2 = ⌈m⌉₂ − 1.
Putting everything together gives the importance depth theorem.
Theorem 6. (Importance Depth Bound) For all m-elimination tournaments, the importance depth is upper bounded by the minimum of
⌈log₂ k⌉ + m⌈log₂(log₂ k + 1)⌉ + ⌈m⌉₂;
1.5⌈log₂ k⌉ + 3m + ⌈m⌉₂;
k/2 + 2m + ⌈m⌉₂;
and, for k ≤ 2^62 and m ≤ 4 log₂ k, 2m + ⌈m⌉₂ + 2 ln k + 2√(m ln k).
Proof. We simply add the depths of the first and second phases from Lemmas 2 and 3. For the last case, we bound √(ln k + 4(m − 1)) ≤ √(ln k) + 2√m and eliminate subtractions in Lemma 3.
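The proof of Lemma 2 mentions a simple program that tracks the number of labels in each tournament at each level. The sketch below is our reconstruction of such a bookkeeping check in Python; the tie-handling details and the comparison against the continuous estimate are our assumptions, not the authors' exact code.

```python
import math

def first_phase_rounds(k, m):
    """Rounds until every single-elimination tournament is down to one label.
    counts[i] = number of labels currently holding exactly i losses."""
    counts = [k] + [0] * (m - 1)
    rounds = 0
    while any(c >= 2 for c in counts):
        nxt = [0] * m
        for i, c in enumerate(counts):
            nxt[i] += (c + 1) // 2        # winners; an odd label sits out
            if i + 1 < m:                  # an m-th loss eliminates the label
                nxt[i + 1] += c // 2
        counts = nxt
        rounds += 1
    return rounds

def continuous_estimate(k, m):
    """d = 2(m-1) + ln k + sqrt(4(m-1) ln k + (ln k)^2), from the proof."""
    lk = math.log(k)
    return 2 * (m - 1) + lk + math.sqrt(4 * (m - 1) * lk + lk * lk)

if __name__ == "__main__":
    for l in range(3, 16):
        k = 2 ** l
        m = max(2, int(2 * math.log(k)))   # a demo choice of m, ours only
        print(k, m, first_phase_rounds(k, m), round(continuous_estimate(k, m), 1))
```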
5 Lower Bound
All of our lower bounds hold for a somewhat more powerful adversary which is more natural in a game playing tournament setting. In particular, we disallow reductions which use importance weighting on examples, or equivalently, all importance weights are set to 1. Note that we can modify our upper bound to obey this constraint by transforming final elimination comparisons with importance weight i into 2i − 1 repeated comparisons and using the majority vote. This modified construction has an importance depth which is at most m larger, implying that the ratio of the adversary's and the reduction's regret increases by at most 1. The first lower bound says that for any reduction algorithm B, there exists an adversary A with average per-round regret r such that A can make B incur regret 2r, even if B knows r in advance. Thus an adversary who corrupts half of all outcomes can force a maximally bad outcome. In the bounds below, fB denotes the multiclass classifier induced by a reduction B using a binary classifier f. Theorem 7. For any deterministic reduction B from k-class classification with k > 2 to binary classification, there exists a choice of D and f such that reg(fB, D) ≥ 2 reg(f, B(D)). Proof. The adversary A picks any two labels i and j. All comparisons involving i but not j are decided in favor of i. Similarly for j. The outcome of comparing i and j is determined by the parity of the number of comparisons between i and j in some fixed serialization of the algorithm. If the parity is odd, i wins; otherwise, j wins. The outcomes of all other comparisons are picked arbitrarily. Suppose that the algorithm halts after some number of queries c between i and j. If neither i nor j wins, the adversary can simply assign probability 1/2 to i and j. The adversary pays nothing while the algorithm suffers loss 1, yielding a regret ratio of ∞. Assume without loss of generality that i wins. The depth of the circuit is either c or at least c + 1, because each label can appear at most once in any round. If the depth is
c, then since k > 2, some label is not involved in any query, and the adversary can set the probability of that label to 1, resulting in ρ(B) = ∞. Otherwise, A can set the probability of label j to be 1 while all others have probability 0. The total regret of A is at most (c + 1)/2, while the regret of the winning label is 1. Multiplying by the depth bound c + 1 gives a regret ratio of at least 2.
Note that the number of rounds in the above bound can depend on A. Next, we show that for any algorithm B taking the same number of rounds for any adversary, there exists an adversary A with a regret of roughly one third, such that A can make B incur the maximal loss, even if B knows the power of the adversary.
Lemma 4. For any deterministic reduction B to binary classification with number of rounds independent of the query outcomes, there exists a choice of D and f such that reg(fB, D) ≥ (3 − 2/k) reg(f, B(D)).
Proof. Let B take q rounds to determine the winner, for any set of query outcomes. We will design an adversary A which incurs regret r = qk/(3k − 2), such that A can make B incur the maximal loss of 1, even if B knows r.
The adversary's query answering strategy is to answer consistently with label 1 winning for the first 2(k − 1)r/k rounds, breaking ties arbitrarily. The total number of queries that B can ask during this stage is at most (k − 1)r, since each label can play at most once in every round, and each query occupies two labels. Thus the total amount of regret at this point is at most (k − 1)r, and there must exist a label i other than label 1 with at most r losses. In the remaining q − 2(k − 1)r/k = r rounds, A answers consistently with label i winning and all other skills being 0. Now if B selects label 1, A can set D(i | x) = 1 with r/q average regret from the first stage. If B selects label i instead, A can choose that D(1 | x) = 1. Since the number of queries between labels 1 and i in the second stage is at most r, the adversary incurs average regret at most r/q. If B chooses any other label to be the winner, the regret ratio is unbounded.
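As a quick sanity check on the bookkeeping in the proof of Lemma 4, the short script below (our own illustration) verifies the round-splitting identity q − 2(k − 1)r/k = r and the resulting ratio q/r = 3 − 2/k exactly, using rational arithmetic.

```python
from fractions import Fraction

def lemma4_check(q, k):
    """Verify the stage arithmetic of Lemma 4 for q rounds and k labels."""
    r = Fraction(q * k, 3 * k - 2)           # adversary's regret budget
    stage1 = Fraction(2 * (k - 1), k) * r    # rounds with label 1 winning
    assert q - stage1 == r                   # remaining rounds equal r
    assert Fraction(q) / r == 3 - Fraction(2, k)   # regret ratio 3 - 2/k
    return r

for q, k in [(10, 4), (100, 7), (33, 3)]:
    print(q, k, lemma4_check(q, k))
```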
References
1. Adler, M., Gemmell, P., Harchol-Balter, M., Karp, R., Kenyon, C.: Selection in the presence of noise: The design of playoff systems. In: SODA 1994 (1994)
2. Allwein, E., Schapire, R., Singer, Y.: Reducing multiclass to binary: A unifying approach for margin classifiers. Journal of Machine Learning Research 1, 113–141 (2000)
3. Aslam, J., Dhagat, A.: Searching in the presence of linearly bounded errors. In: STOC 1991 (1991)
4. Borgstrom, R., Rao Kosaraju, S.: Comparison-based search in the presence of errors. In: STOC 1993 (1993)
5. Denejko, P., Diks, K., Pelc, A., Piotrów, M.: Reliable minimum finding comparator networks. Fundamenta Informaticae 42, 235–249 (2000)
6. Dietterich, T., Bakiri, G.: Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research 2, 263–286 (1995)
7. Feige, U., Peleg, D., Raghavan, P., Upfal, E.: Computing with unreliable information. In: Symposium on Theory of Computing, pp. 128–137 (1990)
8. Foster, D., Hsu, D.: http://hunch.net/?p=468
9. Fox, J.: Applied regression analysis, linear models, and related methods. Sage Publications, Thousand Oaks (1997)
10. Guruswami, V., Sahai, A.: Multiclass learning, boosting, and error-correcting codes. In: COLT 1999 (1999)
11. Hastie, T., Tibshirani, R.: Classification by pairwise coupling. In: NIPS 1997 (1997)
12. Herbrich, R., Minka, T., Graepel, T.: TrueSkill(TM): A Bayesian skill rating system. In: NIPS 2007 (2007)
13. Hsu, D., Langford, J., Kakade, S., Zhang, T.: Multi-label prediction via compressed sensing (2009); arXiv:0902.1284v1
14. Langford, J., Beygelzimer, A.: Sensitive error correcting output codes. In: Auer, P., Meir, R. (eds.) COLT 2005. LNCS (LNAI), vol. 3559, pp. 158–172. Springer, Heidelberg (2005)
15. Langford, J., Zadrozny, B.: Estimating class membership probabilities using classifier learners. In: AISTATS 2005 (2005)
16. Ravikumar, B., Ganesan, K., Lakshmanan, K.B.: On selecting the largest element in spite of erroneous information. In: Brandenburg, F.J., Wirsing, M., Vidal-Naquet, G. (eds.) STACS 1987. LNCS, vol. 247, pp. 88–99. Springer, Heidelberg (1987)
17. Williamson, B.: Personal communication
18. Yao, A.C., Yao, F.F.: On fault-tolerant networks for sorting. SIAM Journal on Computing 14(1), 120–128 (1985)
19. Zadrozny, B., Langford, J., Abe, N.: Cost-sensitive learning by cost-proportionate example weighting. In: ICDM 2003 (2003)
Difficulties in Forcing Fairness of Polynomial Time Inductive Inference John Case and Timo Kötzing Department of Computer and Information Sciences, University of Delaware, Newark, DE 19716-2586, USA {case,koetzing}@cis.udel.edu Abstract. There are difficulties obtaining fair feasibility from polynomial time updated language learning in the limit from positive data. Pitt 1989 noted that unfair delaying tricks can achieve polynomial time updates but with no feasibility constraint on the whole learning process. In this context Yoshinaka 2009 makes a useful list of properties or restrictions towards true feasibility. He also provides interesting examples of fair polynomial time algorithms featuring particular uniformly polynomial time decidable hypothesis spaces, and each of his algorithms satisfies several of his properties. Yoshinaka claims that the combination of the three restrictions on polynomial time learners of consistency (which we call herein postdictive completeness), conservativeness and prudence is restrictive enough to stop Pitt’s delaying tricks from working. The present paper refutes the claim of the previous paragraph in three settings. In the setting of uniformly polynomial time decidable hypothesis spaces with a few effective closure properties, the three restrictions allow maximal unfairness. The other two settings involve certain other uniformly decidable hypothesis spaces and general language learning hypothesis spaces. In each of these settings, the three restrictions forbid some, but not all Pitt-style delaying tricks. Inside the proofs of each of our theorems asserting that the three restrictions do not forbid some or all delaying tricks, the witnessing learners can be seen to explicitly employ delaying tricks.
1 Introduction
For a class of (at least computably enumerable) languages L and an algorithmic learning function h, we say that h TxtEx-learns L [Gol67, JORS99] iff, for each L ∈ L, for every function T enumerating (or presenting) all and only the elements of L (with or without pauses), as h is fed the succession of values T(0), T(1), …, it outputs a corresponding succession of programs p(0), p(1), … from some hypothesis space, and, for some i0, for all i ≥ i0, p(i) is a correct program for L, and p(i + 1) = p(i). The function T as just above is called a text or presentation for L. TxtEx-learning is also called learning in the limit from positive data. We say that h TxtEx-learns L in polynomial time iff there is a polynomial Q such that, for each i, h computes p(i) within time Q(|T(0), T(1), …, T(i − 1)|).
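For intuition, here is a small Python sketch (our own illustration, not from the paper) of learning in the limit by enumeration: the learner keeps the first hypothesis in a fixed list that is consistent with all data seen so far. On any presentation of a language in the list, the conjectures converge, and each update is cheap—exactly the per-update notion of polynomial time discussed next.

```python
# Toy TxtEx-style learner over a finite hypothesis list (our illustration).
HYPOTHESES = [frozenset({"a"}), frozenset({"a", "b"}), frozenset({"a", "b", "c"})]

def learner(seen):
    """Return the index of the first hypothesis containing all data seen."""
    data = {x for x in seen if x != "#"}       # '#' marks a pause
    for i, h in enumerate(HYPOTHESES):
        if data <= h:
            return i
    return -1                                   # no consistent hypothesis

presentation = ["a", "#", "b", "a", "b", "#", "b"]   # a text for {a, b}
conjectures = [learner(presentation[:i]) for i in range(len(presentation) + 1)]
print(conjectures)    # stabilizes on index 1, i.e., the language {a, b}
```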
Pitt [Pit89] notes (in a slightly different context) that such a definition of polynomial time learning may not give one any feasibility restriction on the total time for successful learning. Here is informally why. Suppose h is any TxtEx-learner. Then, for suitable polynomial Q, a variant of learner h can delay outputting significant conjectures based on data σ until it has seen a much larger sequence of data τ, so that Q(|τ|) is enough time for h to think about σ as long as it needs. Pitt [Pit89] discusses some possible ways to forbid such unfair delaying tricks. More recently, Yoshinaka [Yos09] compiled a very useful list of properties to help toward achieving fairness and efficiency in polynomial time learners, including to avoid Pitt-style delaying tricks. In the second part of [Yos09], Yoshinaka provides a number of interesting example fair polynomial time learners, each satisfying several of these properties. In each of his example algorithms, the associated hypothesis space is uniformly polynomial time decidable.¹
In the present paper, we focus, for polynomial time learners, on three of Yoshinaka's properties: postdictive completeness², conservativeness, and prudence. Postdictive completeness [Bār74, BB75, Wie76, Wie78] requires that each hypothesis output by a learner correctly postdicts the input data on which that hypothesis is based. Conservativeness [Ang80] requires that each hypothesis may be changed only if it fails to predict a new datum. Prudence [Wei82, OSW86] requires that each output hypothesis be for a target that the learner actually learns.
Yoshinaka [Yos09] claims that, for efficient learning in the limit from positive data, the combination of postdictive completeness, conservativeness and prudence is restrictive enough to prevent all Pitt-style delaying tricks.
In the present paper, in several different settings (settings mostly as to kind of hypothesis spaces), we refute the claim of the immediately above paragraph. In one of our settings, uniformly polynomial time decidable hypothesis spaces with a few effective closure properties,³,⁴ the three restrictions allow maximal
¹ These spaces are such that there is a polynomial Q and an algorithm so that, from both an hypothesis i and an object x, the algorithm returns, within time Q(|i|, |x|), a correct decision as to whether x is in the language defined by hypothesis i.
² In the prior literature, except for [Ful88] and [CK08a, CK08b], what we call postdictive completeness is called consistency.
³ These effective closure properties pertain to obtaining finite languages and modifications of languages by finite languages.
⁴ The particular uniformly polynomial time hypothesis spaces Yoshinaka employs in the second half of [Yos09] do not have our few effective closure properties, but his algorithms would work essentially unchanged were one to extend his hypothesis spaces to ones with our few effective closure properties. Then his algorithms would not search or use the new hypotheses and would not learn any more languages. The space of CFGs with Prohibition discussed below in this section and in Section 2.1 further below would work as such an extension of both of Yoshinaka's hypothesis spaces. Yoshinaka does mention the possibility of extending his hypothesis spaces to provide an hypothesis for Σ*. We did not examine whether we could, in some cases, work with such an extension instead of our few effective closure properties. We also did not examine whether we can modify our (to be mentioned shortly) Theorem 13 to cover just his particular hypothesis spaces.
unfairness (Theorem 13 in Section 3 below).⁵ An example of our uniformly polynomial time decidable hypothesis spaces (with a few effective closure properties) employs efficiently coded DFAs. Another example employs an (also efficiently coded) interesting extension of context free grammars (CFGs), called CFGs with Prohibition [Bur05]. This latter example is treated in more detail after Definition 2 in Section 2.1 below. In all of our settings, any combination of just the two restrictions of conservativeness and prudence allows for arbitrary delaying tricks (Theorem 18 in Section 5). In each of our two settings besides the first setting of uniformly polynomial time decidable hypothesis spaces (with a few effective closure properties), postdictive completeness does strictly forbid some, but not all, Pitt-style delaying tricks.⁶ The two residual settings are: 1. TxtEx-learning with certain other uniformly decidable hypothesis spaces (Section 4), e.g., the (efficiently coded) explicitly clocked multi-tape Turing machines which halt in linear time [RC94, Chapter 6]⁷ and 2. TxtEx-learning with a general purpose hypothesis space (Section 5). The theorems that postdictive completeness forbids some delaying tricks in these last two settings are: Theorem 14 in Section 4 and Theorem 19 in Section 5. The theorems that postdictive completeness does not forbid all delaying tricks in these last two settings are: Theorem 15 in Section 4 and Theorem 22 in Section 5. Inside the proofs of each of our theorems asserting that the three restrictions do not forbid some or all delaying tricks, the witnessing learners can be seen to explicitly employ delaying tricks. Note that many of our delaying tricks involve “overlearning,” i.e., learning a larger class of languages than required. To avoid having to define successively each of a large number of criteria of successful learning (e.g., restricted variants of TxtEx-learning), we provide a modular approach to presenting such definitions. In our modular approach, we define names for “pieces” of our criteria (Section 2.1). Then, after that, each criterion needed is named by stringing together the relevant names of its pieces. For example, unrestricted TxtEx-learning in the present section will be later
⁵ It is an interesting open question, though, for our uniformly polynomial time decidable hypothesis spaces, whether the combination of postdictive completeness, conservativeness and prudence is so restrictive that any class of languages TxtEx-learnable employing such an hypothesis space and with those three restrictions is also TxtEx-learnable with an intuitively fair, different polynomial time learner respecting all three restrictions.
⁶ That is, in our residual two settings, of the three restrictions, postdictive completeness does improve fairness, but there can still be some residual unfair delaying tricks. For these residual settings, we did not examine the question of whether adding onto postdictive completeness, conservativeness and/or prudence provides a better degree of avoidance of delaying tricks than postdictive completeness alone. Again: we already know, though, that all three restrictions do not avoid all delaying tricks.
⁷ The associated class is not uniformly polynomial time decidable, by [RC94, Theorem 6.5].
named TxtGEx.⁸ A similar modular approach appears already in [CK08a, CK08b].
2 Mathematical Preliminaries
Any unexplained complexity-theoretic notions are from [RC94]. All unexplained general computability-theoretic notions are from [Rog67]. Strings herein are finite and over the alphabet {0, 1}. {0, 1}* denotes the set of all such strings; ε denotes the empty string. N denotes the set of natural numbers, {0, 1, 2, …}. We do not distinguish between natural numbers and their dyadic representations as strings.⁹ For each w ∈ {0, 1}* and n ∈ N, wⁿ denotes n copies of w concatenated end to end. For each string w, we define size(w) to be the length of w. Since we identify each natural number x with its dyadic representation, for all n ∈ N, size(n) denotes the length of the dyadic representation of n. For all strings w, we define |w| to be max{1, size(w)}.¹⁰ The symbols ⊆, ⊂, ⊇, ⊃ respectively denote the subset, proper subset, superset and proper superset relation between sets. For sets A, B, we let A \ B := {a ∈ A | a ∉ B}, Ā := N \ A and Pow(A) be the power set of A. The quantifier ∀∞ x means “for all but finitely many x”; the quantifier ∃∞ x means “for infinitely many x”. For any set A, card(A) denotes the cardinality of A. P and R denote, respectively, the set of all partial and of all total functions N → N ∪ {#}. dom and range denote, respectively, the domain and range of a given function. We sometimes denote a function f of n > 0 arguments x₁, …, xₙ in lambda notation (as in Lisp) as λx₁, …, xₙ f(x₁, …, xₙ). For example, with c ∈ N, λx c is the constantly c function of one argument. A function ψ is partial computable iff there is a deterministic, multi-tape Turing machine computing ψ. P and R denote, respectively, the set of all partial computable and the set of all total (partial) computable functions N → N. If f ∈ P is defined for some argument x, then we denote this fact by f(x)↓, and we say that f on x converges. We say that f ∈ P converges to p iff ∀∞ x : f(x)↓ = p; we write f → p to denote this.¹¹ ϕTM is the fixed programming system from [RC94, Chapter 3] for the partial computable functions N → N. This system is based on deterministic, multi-tape
10 11
In general, standard inductive inference criteria names will be changed to slightly different names in our modular approach. The dyadic representation of a natural number x := the x-th finite string over {0, 1} in length-lexicographical order, where the counting of strings starts with zero [RC94]. Hence, unlike with binary representation, lead zeros matter. This convention about |ε| = 1 helps with runtime considerations. f (x) converges should not be confused with f converges to.
Turing machines (TMs). In this system the TM-programs are efficiently given numerical names or codes.¹² ΦTM denotes the TM step counting complexity measure also from [RC94, Chapter 3] and associated with ϕTM. In the present paper, we employ a number of complexity bound results from [RC94, Chapters 3 & 4] regarding (ϕTM, ΦTM). These results will be clearly referenced as we use them. For simplicity of notation, hereafter, we write (ϕ, Φ) for (ϕTM, ΦTM). ϕp denotes the partial computable function computed by the TM-program with code number p in the ϕ-system, and Φp denotes the partial computable runtime function of the TM-program with code number p in the ϕ-system. The symbol # is pronounced pause and is used to symbolize “no new input data” in a text. Note that all (partial) computable functions are N → N. Whenever we want to consider (partial) computable functions on objects like finite sequences or finite sets, we assume those objects to be efficiently coded as natural numbers. We give such codings for finite sequences and finite sets below. For all p, Wp denotes the computably enumerable (ce) set dom(ϕp). E denotes the set of all ce sets. We say that e is an index (in W) for We. We fix the 1-1 and onto pairing function ⟨·, ·⟩ : N × N → N from [RC94], which is based on dyadic bit-interleaving. Pairing and unpairing is computable in linear time. Whenever we consider tuples of natural numbers as input to TMs, it is understood that the general coding function ⟨·, ·⟩ is used to (left-associatively) code the tuples into appropriate TM-input. We identify any function f ∈ P with its graph {⟨x, f(x)⟩ | x ∈ N}. A finite sequence is a mapping with a finite initial segment of N as domain (and with range contained in N ∪ {#}). ∅ denotes the empty sequence (and, also, the empty set). The set of all finite sequences is denoted by Seq. For each finite sequence σ, we will denote the first element, if any, of that sequence by σ(0), the second, if any, with σ(1) and so on. #elets(σ) denotes the number of elements in a finite sequence σ, that is, the cardinality of its domain. From now on, by convention, f, g and h with or without decoration range over (partial) functions N → N; x, y with or without decorations range over N. D with or without decorations ranges over finite subsets of N. Following [LV97], we fix a coding ⟨·⟩Seq of all sequences into N ∪ {#} (= {0, 1}* ∪ {#}) with the following properties. The set of all codes of sequences is decidable in linear time. The time to encode a sequence, that is, to compute λk, v₁, …, v_k ⟨v₁, …, v_k⟩Seq, is O(λk, v₁, …, v_k Σ_{i=1}^{k} |v_i|).
¹² This numerical coding guarantees that many simple operations involving the coding run in linear time. This is by contrast with historically more typical codings featuring prime powers and corresponding at least exponential costs to do simple things.
Therefore, the size of the codeword is also linear in the size of the elements: λk, v₁, …, v_k |⟨v₁, …, v_k⟩Seq| is O(λk, v₁, …, v_k Σ_{i=1}^{k} |v_i|).¹³ Furthermore,
∀σ : #elets(σ) ≤ |⟨σ⟩Seq|. (1)
Henceforth, we will many times identify a finite sequence σ with its code number ⟨σ⟩Seq. However, when we employ expressions such as σ(x), σ = f and σ ⊂ f, we consider σ as a sequence, not as a number. For a (partial) function g and i ∈ N, if ∀j < i : g(j)↓, then g[i] is defined to be the finite sequence g(0), …, g(i − 1).
D, with and without decorations, ranges over finite sets. We fix the following 1-1 coding for all finite subsets of N. For each non-empty finite set D = {x₀ < … < xₙ}, ⟨x₀, …, xₙ⟩Seq is the code for D, and ⟨⟩Seq is the code for ∅. Henceforth, we will many times identify a finite set D with its code number. However, when we employ expressions such as x ∈ D, card(D), max(D) and D ⊂ D′, we consider D and D′ as sets, not as numbers.
For each (possibly infinite) sequence q, let content(q) = (range(q) \ {#}). We define LinPrograms = {e | ∃a, b, ∀x : Φe(x) ≤ a|x| + b} and PolyPrograms = {e | ∃p polynomial ∀x ∈ N : Φe(x) ≤ p(|x|)}. Furthermore, we let LinF = {ϕe | e ∈ LinPrograms} and PF = {ϕe | e ∈ PolyPrograms}. For g ∈ PF we say that g is computable in polytime, or, also, feasibly computable. Recall that we have, by (1), ∀σ : #elets(σ) ≤ |σ|. With log we denote the floor of the base-2 logarithm, with the exception of log(0) = 0. For all e, x, t, we write ϕe(x)↓t iff Φe(x) ≤ t. Furthermore, we write
∀e, x, t : ϕe(x)↓t = ϕe(x), if Φe(x) ≤ t; 0, otherwise. (2)
The following lemma is used in many of our detailed proofs. The present paper, because of space limitations, omits many details of proofs. Nonetheless, we still include this lemma herein to give the reader some intuitions as to how to manage some missing details.
Lemma 1. Regarding time-bounded computability, we have the following.
– Equality checks and log are computable in linear time [RC94, Lemma 3.2].
– Conditional definition is computable in a time polynomial in the runtimes of its defining programs [RC94, Lemma 3.14].
– Bounded minimizations, and, hence, bounded maximizations are computable in a time polynomial in the runtimes of their defining programs [RC94, Lemma 3.15].
– Boolean combinations of predicates computable in polytime are computable in polytime [RC94, Lemma 3.18].
– From [RC94, Corollary 3.7], we have that λe, x, t ϕe(x)↓|t| and λe, x, t, z ϕe(x)↓|t| = z are computable in polynomial time.
¹³ For these O-formulas, |ε| = 1 helps.
– Our coding of finite sequences easily gives that the following functions are linear time computable: ∀x : 1 ≤ length(x̄), λ ⟨σ⟩Seq #elets(σ), and
λ ⟨σ⟩Seq, i → σ(i), if i < #elets(σ); 0, otherwise.
– Our coding above of finite sets enables content to be computable in polynomial time.¹⁴
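To illustrate the dyadic identification of numbers with strings (footnote 9) and the |·| convention above, here is a small Python sketch; it is our own illustration, and the helper names are ours.

```python
def dyadic(x: int) -> str:
    """x-th string over {0,1} in length-lexicographic order, from zero:
    0 -> '', 1 -> '0', 2 -> '1', 3 -> '00', 4 -> '01', ..."""
    return bin(x + 1)[3:]       # binary of x+1 with the leading '1' removed

def undyadic(w: str) -> int:
    """Inverse of dyadic; lead zeros matter, unlike in binary notation."""
    return int("1" + w, 2) - 1

def norm(w: str) -> int:
    """|w| = max(1, size(w)), so that |ε| = 1 for runtime bookkeeping."""
    return max(1, len(w))

assert [dyadic(x) for x in range(5)] == ["", "0", "1", "00", "01"]
assert all(undyadic(dyadic(x)) == x for x in range(1000))
assert undyadic("0") != undyadic("00")   # lead zeros are significant
assert norm("") == 1
```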
2.1 Learning Criteria Modules
In this section we give our modular definition of what a learning criterion is. After that we will put the modules together to obtain the actual criteria we need. As noted above, all standard inductive inference criteria names will be changed to slightly different names in our modular approach. Definition 2. An effective numbering of ce languages is a function V : N → E such that there is a function f ∈ P with ∀e, x : f (e, x)↓ ⇔ x ∈ V (e).15 For such numberings V , for each e, we will write Ve instead of V (e), and call e an index or an hypothesis (in V ) for Ve . Recall that we identify functions with their graphs. Therefore, ϕ and any other indexing for partial computable functions is considered an effective numbering. We use effective numberings as hypothesis spaces for our learners. We will sometimes require that {Ve | e ∈ N} is effectively closed under some finite modifications; precisely we will sometimes require ∃s∩ ∈ R : ∀e, D : Vs∩ (e,D) = Ve ∩ D ∧ ∃s∪ ∈ R : ∀e, D : Vs∪ (e,D) = Ve ∪ D ∧ ∃s\ ∈ R : ∀e, D : Vs\ (e,D) = Ve \ D.
(3)
Note that, in practice, many effective numberings of ce languages allow s∩ , s∪ and s\ as in (3) to be computable in polynomial or even linear time. Effective numberings include the following important examples. – W ((3) holds). – A canonical numbering of all regular languages, represented by efficiently coded DFAs (where membership is trivially uniformly polynomial time decidable and (3) holds). – For each pair of context free grammars (CFGs) G0 , G1 , we efficiently code (G0 , G1 ) to be an index for (L(G0 ) \ L(G1 )). Then the resulting numbering, in particular, has an index for each context free language. Furthermore, 14
15
¹⁴ This computation involves sorting. Selection sort can be done in quadratic time in the RAM model [Knu73], and adding an extra linear factor to translate from RAM complexity to deterministic multi-tape TM complexity [vEB90], we get selection sort in cubic (and, hence, polynomial) time measured by ΦTM.
¹⁵ Note that such a numbering does not necessarily need to be onto, i.e., a numbering might only number some of the ce languages, leaving out others.
membership is uniformly polynomial time decidable [HU79, Sch91], and (3) holds. As noted above, these grammars are called CFGs with Prohibition in [Bur05].¹⁶
Definition 3. Any set C ⊆ P is a learner admissibility restriction. Intuitively, a learner admissibility restriction defines which functions are admissible as potential learners. Two typical learner admissibility restrictions are P and R. When denoting criteria with P as the learner admissibility restriction, we will omit P.
Definition 4. Any function from E to Pow(R) is called a target presenter for the ce languages. The only target presenter used in this paper is Txt : E → Pow(R), L → {ρ ∈ R | content(ρ) = L}.
Definition 5. Every computable operator P × R → P² is called a sequence generating operator.¹⁷ Intuitively, a sequence generating operator defines how learner and presentation interact to generate two infinite sequences, one for learner-outputs (we call this sequence the learner-sequence) and one for learnee-outputs. For any sequence generating operator β, we define β1 and β2 such that β = λh, g (β1(h, g), β2(h, g)). We define the following sequence generating operators.
– Goldstyle: G : P × R → P × R, (h, g) → (λi h(g[i]), g).
– [JORS99] Set-driven: Sd : P × R → P × R, (h, g) → (λi h(content(g[i])), g).
– [JORS99] Partly set-driven: Psd : P × R → P × R, (h, g) → (λi h(content(g[i]), i), g).
Definition 6. Every subset of P² is called a sequence acceptance criterion. Intuitively, a sequence acceptance criterion defines what identification-sequences are considered a successful identification of a target. Any two such sequence acceptance criteria δ and δ′ can be combined by intersecting them. For ease of notation we write δδ′ instead of δ ∩ δ′. For each effective numbering of some ce languages V, we define the following sequence acceptance criteria.
– Explanatory: ExV = {(p, q) ∈ P² | ∃p′ : p → p′ ∧ Vp′ = content(q)}.
– Postdictive Completeness: PcpV = {(p, q) ∈ R² | ∀i : content(q[i]) ⊆ Vp(i)}.
– Conservativeness: ConvV = {(p, q) | ∀i : p(i) ≠ p(i + 1) ⇒ content(q[i + 1]) ⊈ Vp(i)}.
For any given target presenter α and a sequence generating operator β, we can turn a given sequence acceptance criterion δ into a learner admissibility restriction T δ by admitting only those learners that obey δ on all input:
¹⁶ Intuitively, G0 may “generate” an element, and G1 can correct it or exclude it. The concept of Prohibition Grammars is generalized in [CCJ09, CR09] and, there, they are called Correction Grammars.
¹⁷ Essentially, these computable operators are the recursive operators of [Rog67] but with two arguments and two outputs and restricted to the indicated domain.
T δ := {h ∈ P | ∀T ∈ range(α) : β(h, T ) ∈ δ}. We then speak of “total . . . .” For example, total postdictive completeness, i.e., T PcpV , requires postdictive completeness on any input data, including input data not necessarily taken from a target to be learned.18 Definition 7. A learning criterion (for short, criterion) is a 4-tuple consisting of a learner admissibility restriction, a target presenter, a sequence generating operator and a sequence acceptance criterion. Let C, α, β, δ be, respectively, a learner admissibility restriction, a target presenter, a sequence generating operator and a sequence acceptance criterion. For h ∈ P, L ∈ dom(α), we say that h (C, α, β, δ)-learns L iff: h ∈ C and, for all T ∈ α(L), β(h, T ) ∈ δ. For h ∈ P and L ⊆ dom(α) we say that h (C, α, β, δ)-learns L iff, for all L ∈ L, h (C, α, β, δ)-learns L. The set of (C, α, β, δ)-learnable sets of computable functions is Cαβδ := {L ⊆ E | ∃h ∈ P : h (C, α, β, δ)-learns L}. (4) We refer to the sets Cαβδ as in (4) as learnability classes. Instead of writing the tuple (C, α, β, δ), we will ambiguously write Cαβδ. For h ∈ P, the set of all computable learnees (C, α, β, δ)-learned by h is denoted by Cαβδ(h) := {L ∈ E | h (C, α, β, δ)-learns L}. Definition 8. We let Id be the function mapping a learning criterion (C, α, β, δ) to the set Cαβδ, as defined in (4). We define two versions of prudent learning as follows. For all C, α, β, δ, V , respectively, a learner admissibility restriction, a target presenter, a sequence generating operator, a sequence acceptance criterion and an effective numbering of ce languages, we let PrudV (C, α, β, δ) = {L ⊆ dom(α) |
∃h ∈ C : L ⊆ αβδ(h) ∧ ∀t ∈ L, ∀T ∈ α(t), ∀i : Vβ1 (h,T )(i) ∈ L},
and T PrudV (C, α, β, δ) = {L ⊆ dom(α) |
∃h ∈ C : L ⊆ αβδ(h) ∧ ∀e ∈ range(h) : Ve ∈ L}.
For D ∈ {Id, Prud, T Prud}, a learning criterion C and a learner h, we write DC instead of D(C); further, we let DC(h) denote the set of all targets learnable by h for criterion DC. We subscript an entire learning criterion with an effective numbering V to change all restrictions to expect hypotheses from V . For example, we write the criterion of TxtEx-learning, with V -indices for the hypothesis space, as TxtGExV . However, TxtGExV with the three restrictions of total postdictive completeness, total conservativeness and prudence (not total prudence) in our modular notation is written PrudT PcpT ConvTxtGExV , which we abbreviate as PrudT (PcpConv)TxtGExV . If, instead, we wanted this criterion but with total prudence in the place of prudence, it could be written T PrudT (PcpConv)TxtGExV . 18
¹⁸ Note that, while Yoshinaka [Yos09] essentially defines the total kinds of his postdictive completeness, conservativeness and prudence, his interesting algorithms, for which he claims these three restrictions, satisfy only the non-total versions.
3 Uniformly Polytime Decidable Hypothesis Spaces
For this section, we let U be an arbitrary, fixed effective numbering of some ce languages such that λe, x x ∈ Ue is computable in polynomial time. We call such a numbering uniformly polynomial time computable. Further suppose there is an r ∈ PF such that ∀D : Ur(D) = D, and suppose (3) in Section 2.1 holds for U. Codings for DFAs or CFGs with Prohibition are examples of such U (see Definition 2 in Section 2.1).
Interestingly, Theorem 10 below says that every conservative learner employing hypothesis space U can, without loss of generality, be assumed to be polynomial time, postdictively complete, prudent and set-driven. This leads to the main theorem in this section (Theorem 13): for hypothesis space U, no combination of the three restrictions of postdictive completeness, conservativeness and prudence will forbid arbitrary delaying tricks.
First, we show with a lemma how we can delay set-driven learning and preserve postdictive completeness and conservativeness. We use this lemma for the succeeding theorem.
Lemma 9. We have PFT(PcpConv)TxtSdExU = T(PcpConv)TxtSdExU.
Proof: “⊆” is immediate. Let h ∈ R and L = T(PcpConv)TxtSdExU(h). Fix a ϕ-program for h. Let P be a computable predicate such that
∀D′, D : P(D′, D) ⇔ [D′ ⊆ D ∧ h(D′)↓|D| ∧ D ⊆ Uh(D′)]. (5)
By Lemma 1, there is h′ ∈ PF such that
∀D : h′(D) = h(D′) for the ≤-maximal D′ ≤ |D| such that P(D′, D), if such a D′ exists; r(D), otherwise. (6)
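The following Python sketch is our own reconstruction of the Pitt-style delaying trick behind h′, with toy stand-ins for h, r, the step bound and the membership test of U (none of these names are from the paper): on input D, h′ only trusts conjectures of h on subsets D′ that are small and fast enough relative to |D|, and otherwise falls back to a grammar for D itself.

```python
from itertools import combinations

def h_prime(D, h, halts_fast, member, code, norm):
    """Delayed learner h' from Lemma 9 (our reconstruction): use h(D')
    only for the <=-maximal coded D' <= |D| on which h halts within |D|
    steps and whose conjecture covers D; otherwise fall back to r(D),
    here simply an index for the finite language D itself."""
    D = frozenset(D)
    best = None
    for k in range(len(D) + 1):
        for Dp in map(frozenset, combinations(sorted(D), k)):
            if code(Dp) > norm(D):                    # not yet affordable
                continue
            if not halts_fast(h, Dp, norm(D)):        # h too slow on D'
                continue
            if not all(member(h(Dp), x) for x in D):  # must cover the data
                continue
            if best is None or code(Dp) > code(best):
                best = Dp
    return ("h", h(best)) if best is not None else ("r", D)

# Toy instantiation (all stand-ins are ours): hypotheses are sets themselves.
norm = len
code = lambda S: sum(2 ** x for x in S)
member = lambda e, x: x in e
h = lambda Dp: frozenset(Dp)                  # base learner: guess the data
halts_fast = lambda h_, Dp, t: len(Dp) <= t   # pretend step-count bound
print(h_prime({1, 2}, h, halts_fast, member, code, norm))  # falls back to r
```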
We omit the proof that this delaying construction works.
Theorem 10. We have T PrudPFT(PcpConv)TxtSdExU = TxtGConvExU.
Proof: “⊆” is trivial. Regarding “⊇”: First we apply Proposition 16 to get total conservativeness. Then we use Theorem 17 to obtain total postdictive completeness. We use Theorem 20 to make the learner set-driven. By Lemma 9, such a learner can be delayed to be computable in polynomial time. By Proposition 21, this learner is automatically totally prudent.
Proposition 11 just below shows that any learner can be assumed partially set-driven, and, importantly, the transformation of a learner to a partially set-driven
learner preserves prudence. The proposition and its proof are somewhat analogous to [JORS99, Proposition 5.29] and its proof. However, our proof, unlike that of [JORS99, Proposition 5.29], does not require the hypothesis space to be paddable.
Proposition 11. Let D ∈ {Id, Prud, T Prud}. We have DTxtPsdExU = DTxtGExU.
We can delay partially set-driven learning just as we delayed set-driven learning in Lemma 9, resulting in Lemma 12 just below.
Lemma 12. We have PrudPFT PcpTxtPsdExU = PrudTxtGExU and T PrudPFT PcpTxtPsdExU = T PrudTxtGExU.
The next theorem is the main result of the present section. As noted in Section 1 above, it says that the three restrictions of postdictive completeness, conservativeness and prudence allow maximal unfairness — within the current setting of polynomial time decidable hypothesis spaces.
Theorem 13. Let δ ∈ {R², Pcp, Conv, PcpConv}, D ∈ {Id, Prud} and D′ ∈ {Id, T Prud}. Then DPFTxtGδExU = DTxtGδExU and D′PFT δTxtGExU = D′T δTxtGExU.
Proof: Use Theorem 10, as well as Theorem 18 and Lemma 12.
4 Other Uniformly Decidable Hypothesis Spaces
For this section, we let V : N → E range over effective numberings of some ce languages such that λe, x x ∈ Ve is computable (we call such a numbering uniformly decidable). Further suppose, for each such V, there is r ∈ R such that ∀D : Vr(D) = D.¹⁹ Examples of such numberings V include the classes of all linear time, polynomial time, … decidable languages (not uniformly linear time, polynomial time, … decidable), each represented by efficiently numerically coded programs in a suitable subrecursive programming system for deciding languages [RC94].
For uniformly decidable hypothesis spaces, we get mixed results. We have already seen from Theorem 13 in Section 3 above that there are uniformly decidable hypothesis spaces where we have arbitrary delaying for all combinations of postdictive completeness, conservativeness and prudence. Next is the first main
¹⁹ Note that, in practice, many effective numberings of some ce languages allow r to be computable in polynomial or even linear time.
theorem of the present section. It states that there are other uniformly decidable hypothesis spaces such that postdictive completeness, with or without any of conservativeness and prudence, forbids some delaying tricks. By contrast, according to Theorem 18 in Section 5, any combination of just the two restrictions of conservativeness and prudence allows for arbitrary delaying tricks.
Theorem 14. There exists a uniformly decidable numbering V such that, for each δ ∈ {R², Pcp, Conv, PcpConv}, D ∈ {Id, Prud} and D′ ∈ {Id, T Prud}, DPFTxtGδExV ⊂ DRTxtGδExV ⇔ δ ⊆ Pcp and D′PFT δTxtGExV ⊂ D′T δTxtGExV ⇔ δ ⊆ Pcp.
We can, and sometimes do, think of total function learning as a special case of TxtEx-learning thus. Suppose f is any (possibly, but not necessarily total) function mapping non-negative integers into the same. Recall that we identify f with its graph, {⟨x, y⟩ | f(x) = y}, where ⟨x, y⟩ is the numeric coding of (x, y) (Section 2). Then {⟨x, y⟩ | f(x) = y} is a sublanguage of the non-negative integers. Furthermore, programs for f are generally trivially intercompilable with programs or grammars for {⟨x, y⟩ | f(x) = y}. We sometimes refer to languages of the form {⟨x, y⟩ | f(x) = y} as single-valued languages.
Next is our second main result of the present section. It asserts the polynomial time learnability, with restrictions of postdictive completeness, conservativeness and prudence, of a uniformly decidable class of total single-valued languages which are (the graphs of) the linear time computable functions. Importantly, our proof of this theorem employs a Pitt-style delaying trick on an enumeration technique [Gol67, BB75], and our result, then, entails, as advertised in Section 1 above, that some delaying tricks are not forbidden in the setting of the present section.
Let θLtime be an efficiently coded programming system from [RC94, Chapter 6] for LinF. θLtime is based on multi-tape TM-programs each explicitly clocked to halt in linear time (in the length of its input). Let V^Ltime be the corresponding effective numbering of all and only those ce languages (whose graphs are) ∈ LinF. Note that V^Ltime does not satisfy the condition at the beginning of the present section on such V for obtaining codes of finite languages — since we have only infinite languages in V^Ltime. Instead, for V^Ltime, we have (and use) a linear time algorithm which, on any finite function F, outputs a V^Ltime-index for the zero-extension of F.
Theorem 15. LinF ∈ T PrudPFT(PcpConv)TxtGExV^Ltime.
The remainder of this section presents two results that are used elsewhere. They are put here to present them in more generality. They each hold for any V. The following proposition says that, for any uniformly decidable V, conservative learnability implies total conservative learnability. It is used for proving Theorem 10 in Section 3.
Proposition 16. We have T ConvTxtGExV = TxtGConvExV . The following theorem holds for all V and states that we can assume total postdictive completeness when learning with total conservativeness. Theorem 17. We have T (PcpConv)TxtGExV = T ConvTxtGExV .
5 Learning ce Languages
For the remainder of this section, let V be any effective numbering of some ce languages. Next (and mentioned in Section 1 above) is our first main result of the present section, which says that any combination of just the two restrictions of conservativeness and prudence allows for arbitrary delaying tricks.
Theorem 18. Let δ ∈ {R², Conv}, D ∈ {Id, Prud} and D′ ∈ {Id, T Prud}. Then DPFTxtGδExV = DTxtGδExV and D′PFT δTxtGExV = D′T δTxtGExV.
Our proof of the just above theorem uses delaying tricks similar to those in the proof of Lemma 9 in Section 3. Our next main result of the present section says, for the general effective numbering of all ce languages, W, combinations of postdictive completeness, conservativeness and prudence forbid some delaying tricks iff postdictive completeness is part of the combination.
Theorem 19. Let δ ∈ {R², Pcp, Conv, PcpConv}, D ∈ {Id, Prud} and D′ ∈ {Id, T Prud}. Then DPFTxtGδExW ⊂ DRTxtGδExW ⇔ δ ⊆ Pcp and D′PFT δTxtGExW ⊂ D′T δTxtGExW ⇔ δ ⊆ Pcp.
Our proof of the just above theorem makes crucial use of [CK08b, Theorem 5(a)] as well as Theorem 18 above. Theorem 20 just below says that certain kinds of learners can be assumed without loss of generality to be set-driven. This is interesting on its own, and is also of important technical use for proving Theorem 10 in Section 3.
Theorem 20. Let V be such that (3) holds. We have T(PcpConv)TxtSdExV = T(PcpConv)TxtGExV.
(7)
The following proposition shows that totally postdictively complete, totally conservative, set-driven learners are automatically totally prudent. This, again, is of important technical use for proving Theorem 10 in Section 3.
Proposition 21. Let δ be a sequence acceptance criterion, let C ⊆ P. Let h ∈ P. We have T PrudCT (PcpConv)TxtSdδExV (h) = CT (PcpConv)TxtSdδExV (h). Next is our last main result. As noted above in Section 1, this theorem says that, in the general setting of the present section, postdictive completeness does not forbid all delaying tricks. Theorem 22. We have LinF ∈ T PrudPFT (PcpConv)TxtGExW . Proof: The effective numbering V Ltime from Theorem 15 can be translated into the W -system in linear (and, hence, in polynomial) time.
References
[Ang80]
Angluin, D.: Inductive inference of formal languages from positive data. Information and Control 45, 117–135 (1980) [B¯ ar74] B¯ arzdi¸ nš, J.: Inductive inference of automata, functions and programs. In: Int. Math. Congress, Vancouver, pp. 771–776 (1974) [BB75] Blum, L., Blum, M.: Toward a mathematical theory of inductive inference. Information and Control 28, 125–155 (1975) [Bur05] Burgin, M.: Grammars with prohibition and human-computer interaction. In: Proceedings of the 2005 Business and Industry Symposium and the 2005 Military, Government, and Aerospace Simulation Symposium, pp. 143–147. Society for Modeling and Simulation (2005) [CCJ09] Carlucci, L., Case, J., Jain, S.: Learning correction grammars. Journal of Symbolic Logic 74(2), 489–516 (2009) [CK08a] Case, J., Kötzing, T.: Dynamic modeling in inductive inference. In: Freund, Y., Györfi, L., Turán, G., Zeugmann, T. (eds.) ALT 2008. LNCS (LNAI), vol. 5254, pp. 404–418. Springer, Heidelberg (2008) [CK08b] Case, J., Kötzing, T.: Dynamically delayed postdictive completeness and consistency in learning. In: Freund, Y., Györfi, L., Turán, G., Zeugmann, T. (eds.) ALT 2008. LNCS (LNAI), vol. 5254, pp. 389–403. Springer, Heidelberg (2008) [CR09] Case, J., Royer, J.: Program size complexity of correction grammars, Working draft (2009) [Ful88] Fulk, M.: Saving the phenomenon: Requirements that inductive machines not contradict known data. Information and Computation 79, 193–209 (1988) [Gol67] Gold, E.: Language identification in the limit. Information and Control 10, 447–474 (1967) [HU79] Hopcroft, J., Ullman, J.: Introduction to Automata Theory Languages and Computation. Addison-Wesley Publishing Company, Reading (1979) [JORS99] Jain, S., Osherson, D., Royer, J., Sharma, A.: Systems that Learn: An Introduction to Learning Theory, 2nd edn. MIT Press, Cambridge (1999)
[Knu73]
Knuth, D.: The Art of Computer Programming, Volume III: Sorting and Searching. Addison-Wesley, Reading (1973) [LV97] Li, M., Vitanyi, P.: An Introduction to Kolmogorov Complexity and Its Applications, 2nd edn. Springer, Heidelberg (1997) [OSW86] Osherson, D., Stob, M., Weinstein, S.: Systems that Learn: An Introduction to Learning Theory for Cognitive and Computer Scientists. MIT Press, Cambridge (1986) [Pit89] Pitt, L.: Inductive inference, DFAs, and computational complexity. In: Jantke, K.P. (ed.) AII 1989. LNCS, vol. 397, pp. 18–44. Springer, Heidelberg (1989) [RC94] Royer, J., Case, J.: Subrecursive Programming Systems: Complexity and Succinctness. Research monograph in Progress in Theoretical Computer Science. Birkhäuser, Boston (1994) [Rog67] Rogers, H.: Theory of Recursive Functions and Effective Computability. McGraw Hill, New York (1967); Reprinted by MIT Press, Cambridge, Massachusetts (1987) [Sch91] Schabes, Y.: Polynomial time and space shift-reduce parsing of arbitrary context-free grammars. In: Proceedings of the 29th annual meeting on Association for Computational Linguistics, pp. 106–113. Association for Computational Linguistics (1991) [vEB90] van Emde Boas, P.: Machine models and simulations. In: Van Leeuwen, J. (ed.) Handbbook of Theoretical Computer Science. Algorithms and Complexity, vol. A, pp. 3–66. MIT Press/Elsevier (1990) [Wei82] Weinstein, S.: Private communication at the Workshop on Learnability Theory and Linguistics, University of Western Ontario (1982) [Wie76] Wiehagen, R.: Limes-erkennung rekursiver funktionen durch spezielle strategien. Elektronische Informationverarbeitung und Kybernetik 12, 93–99 (1976) [Wie78] Wiehagen, R.: Zur Theorie der Algorithmischen Erkennung. PhD thesis, Humboldt University of Berlin (1978) [Yos09] Yoshinaka, R.: Learning efficiency of very simple grammars from positive data. Theoretical Computer Science 410, 1807–1825 (2009); In: Hutter, M., Servedio, R.A., Takimoto, E. (eds.) ALT 2007. LNCS (LNAI), vol. 4754, pp. 227–241. Springer, Heidelberg (2007)
Learning Mildly Context-Sensitive Languages with Multidimensional Substitutability from Positive Data Ryo Yoshinaka Graduate School of Information Science and Technology, Hokkaido University, North-14 West-9, Sapporo, Japan
[email protected]
Abstract. Recently Clark and Eyraud (2007) have shown that substitutable context-free languages, which capture an aspect of natural language phenomena, are efficiently identifiable in the limit from positive data. Generalizing their work, this paper presents a polynomial-time learning algorithm for new subclasses of mildly context-sensitive languages with variants of substitutability.
1 Introduction
It has been a long-term goal of grammatical inference to find a reasonable class of formal languages that are powerful enough for expressing natural languages and are efficiently learnable under a reasonable model of language acquisition. As Gold [10] showed that even the family of regular languages, which is located in the lowest level of the Chomsky hierarchy, is not identifiable in the limit from positive data, this learning model seems very restrictive. In spite of this difficulty, researchers have been striving to find rich classes of languages efficiently learnable in this model. Angluin's reversible languages [1] are the first nontrivial example of subclasses of regular languages that are efficiently identifiable in the limit from positive data. The literature has found other subclasses of regular languages, linear languages and context-free languages to be efficiently learnable under this model. In particular, Clark and Eyraud's work [7, 8] on substitutable context-free languages is noteworthy in regard to the close connection to natural languages. Their work has led to several fruitful results in grammatical inference [5, 9, 19], which target even larger classes of context-free languages with some special properties related to the substitutability. And now mildly context-sensitive languages have arisen as a topical target of grammatical inference in order to get even closer to natural languages [3, 14, 2, 12]. The goal of this paper is to present how to learn some specific kinds of mildly context-sensitive languages by developing Clark and Eyraud's technique for learning substitutable context-free languages. We introduce the notion of multidimensional substitutability as a generalization of substitutability and demonstrate that it closely relates to mildly context-sensitive languages. In fact the
role played by multidimensional substitutability in mildly context-sensitive languages is the exact analogue to that of substitutability in context-free languages, and that of reversibility in regular languages, as well. We would like the reader to recall that a regular language L is said to be zero-reversible if and only if, for any strings x, y, x′, y′,
xy, xy′, x′y ∈ L implies x′y′ ∈ L,
and a substitutable language L satisfies that
xyz, xy′z, x′yz′ ∈ L implies x′y′z′ ∈ L.
Our m-dimensional substitutability is roughly expressed as that
x₀y₁x₁…yₘxₘ, x₀y′₁x₁…y′ₘxₘ, x′₀y₁x′₁…yₘx′ₘ ∈ L implies x′₀y′₁x′₁…y′ₘx′ₘ ∈ L.
m-dimensional substitutability is a stronger restriction than substitutability. This definition itself does not give richer language classes, but in fact this allows us to infer mildly context-sensitive languages from finite sets of examples. Among several formalisms of mildly context-sensitive grammars, we pick multiple context-free grammars for representing target languages. Section 2 reviews the definition and some properties of multiple context-free grammars. Section 3 introduces the hierarchy of multidimensional substitutable languages and gives some examples and counterexamples of those languages. The main issue of this paper, learning multidimensional substitutable multiple context-free languages, is discussed in Section 4. We conclude this paper in Section 5 with discussing possible future directions of study.
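As a concrete illustration of this closure condition (our own sketch, not from the paper), the function below checks whether a finite sample violates 1-dimensional substitutability; the m-dimensional test generalizes by decomposing strings into m context/filler alternations.

```python
from itertools import product

def splits3(w):
    """All decompositions w = x + y + z with y nonempty."""
    n = len(w)
    for i in range(n + 1):
        for j in range(i + 1, n + 1):
            yield w[:i], w[i:j], w[j:]

def substitutability_violations(sample):
    """Witnesses xyz, xy'z, x'yz' in the sample whose required
    consequence x'y'z' is missing (the 1-dimensional case)."""
    sample = set(sample)
    bad = []
    for w1, w2, w3 in product(sample, repeat=3):
        for x, y, z in splits3(w1):
            for x2, y2, z2 in splits3(w2):
                if (x2, z2) == (x, z):            # y and y2 share a context
                    for x3, y3, z3 in splits3(w3):
                        if y3 == y and x3 + y2 + z3 not in sample:
                            bad.append((x, y, z, y2, x3, z3))
    return bad

# 'b' and 'ab' share the context (a, ε) in {'ab', 'aab'}, so closure
# would also demand strings like 'aaab'; the sample therefore violates it.
print(substitutability_violations({"ab", "aab"})[:3])
```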
2 Preliminaries
2.1 Definitions and Notations
The set of nonnegative integers is denoted by N and this paper will consider only numbers in N. The cardinality of a set S is denoted by |S|. If w is a string over an alphabet Σ, |w| denotes its length. ∅ is the empty set and λ is the empty string. Σ* denotes the set of all strings over Σ, Σ⁺ = Σ* − {λ} and Σᵏ = { w ∈ Σ* | |w| = k }. Any subset of Σ* is called a language (over Σ). If L is a finite language over Σ, its size is defined as ‖L‖ = |L| + Σ_{w∈L} |w|. For any x, x^⟨m⟩ means the m-tuple of x, while xᵐ denotes the usual concatenation of x, e.g., x^⟨3⟩ = ⟨x, x, x⟩ and x³ = xxx. Hence (Σ*)^⟨m⟩ is the set of m-tuples of strings over Σ, which are called m-words. Similarly we define (·)^⟨*⟩ and (·)^⟨+⟩, where, for instance, (Σ*)^⟨+⟩ denotes the set of all m-words for all m ≥ 1. For an m-word y = ⟨y₁, . . . , yₘ⟩, |y| denotes its length m and ‖y‖ denotes its size m + Σ_{1≤i≤m} |yᵢ|. If f is a function defined on k-tuples, we will write f(z₁, . . . , zₖ) for f(⟨z₁, . . . , zₖ⟩) for readability.
2.2 Identification in the Limit from Positive Data
Our learning criterion is identification in the limit from positive data (or, equivalently, from text) introduced by Gold [10]. Let G be any recursive set of finite descriptions, called grammars, and L be a function from G to non-empty languages over Σ. A learning algorithm A on G is an algorithm that computes a function from finite sequences of strings w₁, . . . , wₙ ∈ Σ* to G. We define a presentation of a language L to be an infinite sequence of elements (called positive examples) of L such that every element of L occurs at least once. Given a presentation, we can consider the sequence of hypotheses that the algorithm produces, writing Gₙ = A(w₁, . . . , wₙ) for the nth such hypothesis. The algorithm A is said to identify the class L of languages in the limit from positive data if for every L ∈ L and every presentation of L, there is an integer n₀ such that for all n > n₀, Gₙ = Gₙ₀ and L = L(Gₙ₀). For G′ ⊆ G satisfying L = { L(G) | G ∈ G′ }, one also says that A identifies G′ in the limit from positive data. For convenience, we often allow the learner to refer to the previous hypothesis Gₙ for computing Gₙ₊₁ in addition to w₁, . . . , wₙ₊₁. Obviously this relaxation does not affect the learnability of language classes. Moreover, learning algorithms in this paper compute hypotheses from a set of positive examples, identifying a sequence with the set consisting of the elements of the sequence.
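The convergence requirement can be phrased operationally. The following sketch (all names are hypothetical; the learner and the grammar-equality test are assumed to be supplied) illustrates how one would check a finite prefix of a presentation against this criterion:

```python
# Sketch: evaluating a learner on a finite prefix of a presentation.
# `learner` maps a finite list of positive examples to a grammar; both it
# and `same_grammar` are assumed to be supplied and are hypothetical here.

def hypotheses(learner, prefix):
    """The hypotheses G_1, ..., G_n produced on w_1, ..., w_n."""
    return [learner(prefix[:i + 1]) for i in range(len(prefix))]

def stable_on(learner, prefix, same_grammar, n0):
    """Identification in the limit requires some n0 with G_n = G_{n0} for
    all n > n0 (and L(G_{n0}) equal to the presented language); on a finite
    prefix we can only test the stabilization part."""
    gs = hypotheses(learner, prefix)
    return all(same_grammar(g, gs[n0]) for g in gs[n0:])
```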
2.3 Multiple Context-Free Grammars
A function f from (Σ*)^⟨m₁⟩ × · · · × (Σ*)^⟨mₙ⟩ to (Σ*)^⟨m⟩ is said to be linear regular if there is ⟨α₁, . . . , αₘ⟩ ∈ ((Σ ∪ { zᵢⱼ | 1 ≤ i ≤ n, 1 ≤ j ≤ mᵢ })*)^⟨m⟩ such that each variable zᵢⱼ occurs exactly once in α₁, . . . , αₘ and f(y₁, . . . , yₙ) = ⟨α₁[z := y], . . . , αₘ[z := y]⟩ for any yᵢ = ⟨yᵢ₁, . . . , yᵢₘᵢ⟩ ∈ (Σ*)^⟨mᵢ⟩ with 1 ≤ i ≤ n, where αₖ[z := y] denotes the string obtained by replacing each variable zᵢⱼ with the string yᵢⱼ. For example, f defined as f(⟨z₁₁, z₁₂⟩, ⟨z₂₁, z₂₂, z₂₃⟩) = ⟨z₁₂az₂₁bz₁₁, c, z₂₃z₂₂⟩ is linear regular, but g(⟨z₁₁, z₁₂⟩, ⟨z₂₁, z₂₂, z₂₃⟩) = ⟨z₁₁az₂₁bz₁₁, c, z₂₃z₂₂⟩ is not, because z₁₁ occurs twice and z₁₂ disappears in the right-hand side of the equality. The rank rank(f) of f is defined to be n and the size size(f) of f is m + |α₁ . . . αₘ|. A multiple context-free grammar (mcfg) is a tuple G = ⟨Σ, V_dim, F, P, S⟩, where
– Σ is a finite set of terminal symbols,
– V_dim = ⟨V, dim⟩ is the pair of a finite set V of nonterminal symbols and a function dim assigning a positive integer, called a dimension, to each element of V,
– F is a finite set of linear regular functions,¹
– P is a finite set of rules of the form A → f(B₁, . . . , Bₙ) where A, B₁, . . . , Bₙ ∈ V and f ∈ F maps (Σ*)^⟨dim(B₁)⟩ × · · · × (Σ*)^⟨dim(Bₙ)⟩ to (Σ*)^⟨dim(A)⟩,
– S ∈ V is called the start symbol, whose dimension is 1.
¹ We identify a function symbol with the function itself by convention.
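For concreteness, the following sketch (not from the paper; the template encoding is an assumption made here) applies a linear regular function given by its defining tuple ⟨α₁, . . . , αₘ⟩ and checks the each-variable-exactly-once requirement:

```python
# Sketch of applying a linear regular function given by a template.
# A template is a tuple of components; each component is a list whose items
# are terminal strings or variable pairs (i, j) standing for z_ij.

def is_linear_regular(template, arities):
    """Each variable z_ij (1 <= i <= n, 1 <= j <= arities[i-1]) must occur
    exactly once across all components."""
    seen = [tok for comp in template for tok in comp if isinstance(tok, tuple)]
    expected = [(i + 1, j + 1) for i, m in enumerate(arities) for j in range(m)]
    return sorted(seen) == sorted(expected)

def apply_function(template, args):
    """Substitute the argument tuples into the template."""
    def value(tok):
        if isinstance(tok, tuple):
            i, j = tok
            return args[i - 1][j - 1]
        return tok
    return tuple("".join(value(tok) for tok in comp) for comp in template)

# The example f from the text: f(<z11,z12>, <z21,z22,z23>) = <z12 a z21 b z11, c, z23 z22>
f = ([(1, 2), "a", (2, 1), "b", (1, 1)], ["c"], [(2, 3), (2, 2)])
assert is_linear_regular(f, [2, 3])
assert apply_function(f, [("x", "y"), ("u", "v", "w")]) == ("yaubx", "c", "wv")
```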
We will simply write V for V_dim if no confusion occurs. If rank(f) = 0 and f() = y, we may write A → y instead of A → f(). The dimension dim(G) of G is defined to be the maximum of dim(A) for A ∈ V and the rank rank(G) of G is the maximum of rank(f) for f ∈ F. The size ‖G‖ of G is defined as ‖G‖ = |P| + Σ_{ρ∈P} size(ρ) where size(A → f(B₁, . . . , Bₙ)) = size(f) + n + 1. For each A ∈ V, L(G, A) is the smallest set of dim(A)-words such that if A → f(B₁, . . . , Bₙ) is a rule and yᵢ ∈ L(G, Bᵢ), then f(y₁, . . . , yₙ) ∈ L(G, A). The language L(G) generated by G means the set { w ∈ Σ* | ⟨w⟩ ∈ L(G, S) }. L(G) is called a multiple context-free language (mcfl). Two grammars G and G′ are equivalent if L(G) = L(G′). We denote by G(p, r) the collection of mcfgs G such that dim(G) ≤ p and rank(G) ≤ r and define L(p, r) = { L(G) | G ∈ G(p, r) }. We also write G(p, *) = ⋃_{r∈N} G(p, r) and L(p, *) = ⋃_{r∈N} L(p, r). The class of context-free grammars is identified with G(1, *) and that of linear grammars corresponds to G(1, 1). It is well known that L(1, 1) ⊊ L(1, 2) = L(1, *). The following three languages, which are thought to be typical mildly context-sensitive languages [13], are all in L(2, 1):
{ aⁿbⁿcⁿ | n ≥ 0 },  { aᵐbⁿcᵐdⁿ | m, n ≥ 0 },  { ww | w ∈ Σ* }.
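As an illustration of how a dimension-2 nonterminal yields such behaviour, here is a small sketch (not from the paper, and ignoring the λ-freeness normalization discussed below) that enumerates the first of these languages with rules of an mcfg in G(2, 1):

```python
# Sketch: a 2-dimensional mcfg for { a^n b^n c^n | n >= 0 }, with a naive
# bottom-up enumeration of L(G, A) up to a depth bound.

def gen_A(depth):
    """L(G, A) for the rules A -> <λ, λ> and A -> g(A), g(<z1,z2>) = <a z1 b, c z2>."""
    words = {("", "")}
    for _ in range(depth):
        words |= {("a" + z1 + "b", "c" + z2) for (z1, z2) in words}
    return words

def gen_S(depth):
    """L(G) via S -> f(A) with f(<z1, z2>) = z1 z2."""
    return {z1 + z2 for (z1, z2) in gen_A(depth)}

assert gen_S(3) == {"", "abc", "aabbcc", "aaabbbccc"}
```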
Seki et al. [17] and Rambow and Satta [15] have investigated the hierarchy of mcfls.
Proposition 1 (Seki et al. [17]). For p ≥ 1, L(p, *) ⊊ L(p + 1, *). In fact { (aⁱbⁱ)^{p+1} | i ≥ 0 } ∈ L(p + 1, 1) − L(p, *).
Proposition 2 (Rambow and Satta [15]). For p ≥ 2, r ≥ 1, L(p, r) ⊊ L(p, r + 1) except for L(2, 2) = L(2, 3).
Furthermore Rambow and Satta show a trade-off between dimension and rank. This contrasts with Proposition 1.
Proposition 3 (Rambow and Satta [15]). For p ≥ 1, r ≥ 3 and 1 ≤ k ≤ r − 2, L(p, r) ⊆ L((k + 1)p, r − k).
Proposition 4 (Seki et al. [17]). Let p and r be fixed. It is decidable in O(‖G‖|w|^{p(r+1)}) time whether w ∈ L(G) for any mcfg G ∈ G(p, r) and w ∈ Σ*.
We close this subsection by introducing inessential restrictions on mcfgs. Let f be a linear regular function such that f(z₁, . . . , zₙ) = ⟨α₁, . . . , αₘ⟩ for zᵢ = ⟨zᵢ₁, . . . , zᵢₘᵢ⟩. If no αₖ from α₁, . . . , αₘ is λ, f is said to be λ-free. f is non-permuting if zᵢⱼ always occurs to the left of z_{i(j+1)} in α₁ . . . αₘ for any i, j with 1 ≤ i ≤ n and 1 ≤ j < mᵢ. f is moreover said to be non-merging if no αₖ has zᵢⱼz_{i(j+1)} as a substring for any i, j. An mcfg G is λ-free, non-permuting, or non-merging if all of its functions are λ-free, non-permuting, or non-merging, respectively. Note that all G ∈ G(1, *) are non-permuting and non-merging. It is known that every mcfg G ∈ G(p, r) has an equivalent λ-free mcfg G′ ∈ G(p, r) modulo λ [17].
Lemma 1. Every mcfg G ∈ G(p, r) has an equivalent non-permuting G′ ∈ G(p, r).
Proof. A permutation π on m-words is a bijective linear regular function of rank 1 on (Σ*)^⟨m⟩ such that π(⟨z₁, z₂, . . . , zₘ⟩) is defined to be ⟨z_{p₁}, . . . , z_{pₘ}⟩ for some p₁, . . . , pₘ with {p₁, . . . , pₘ} = {1, . . . , m}. We define G′ to have nonterminals A^π with dim(A^π) = dim(A) for all nonterminals A of G and all permutations π on dim(A)-words. For each rule A → f(B₁, . . . , Bₙ) of G and each permutation π on dim(A)-words, G′ has the rule of the form A^π → f^π(B₁^{π₁}, . . . , Bₙ^{πₙ}) where f^π is defined to satisfy f^π(y₁, . . . , yₙ) = π(f(π₁⁻¹(y₁), . . . , πₙ⁻¹(yₙ))) and each πᵢ, a permutation on dim(Bᵢ)-words, is chosen so that f^π is non-permuting. Indeed π₁, . . . , πₙ are uniquely determined by π and f. It is easy to see that for any permutation π on dim(A)-words, y ∈ L(G, A) iff π(y) ∈ L(G′, A^π). The start symbol of G′ is of course S^I where I is the unique permutation on 1-words, i.e., the identity. □
We note that ‖G′‖ ≤ p!·‖G‖.
Lemma 2. Every non-permuting mcfg G ∈ G(p, r) has an equivalent non-merging G′ ∈ G(p, r).
Proof. A merge μ on m-words is a linear regular function of rank 1 from (Σ*)^⟨m⟩ to (Σ*)^⟨k⟩ for some k with 1 ≤ k ≤ m such that μ(⟨z₁, z₂, . . . , zₘ⟩) is defined to be ⟨z₁ . . . z_{m₁}, z_{m₁+1} . . . z_{m₂}, . . . , z_{m_{k−1}+1} . . . zₘ⟩ for some m₁, . . . , m_{k−1} with 1 ≤ m₁ < · · · < m_{k−1} < m. We define G′ to have nonterminals A^μ for nonterminals A of G and merges μ on dim(A)-words. For each rule A → f(B₁, . . . , Bₙ) of G and each merge μ on dim(A)-words, G′ has the rule of the form A^μ → f^μ(B₁^{μ₁}, . . . , Bₙ^{μₙ}) where f^μ is defined to satisfy f^μ(μ₁(y₁), . . . , μₙ(yₙ)) = μ(f(y₁, . . . , yₙ)) and each μᵢ, a merge on dim(Bᵢ)-words, is chosen so that f^μ is non-merging. Indeed μ₁, . . . , μₙ are uniquely determined by μ and f. It is easy to see that for any merge μ on dim(A)-words, y ∈ L(G, A) iff μ(y) ∈ L(G′, A^μ). The start symbol of G′ is of course S^I where I is the unique merge on 1-words, i.e., the identity. □
We note that ‖G′‖ ≤ 2^{p−1}‖G‖. We say that a linear regular function f is good if it is λ-free, non-permuting and non-merging, and that an mcfg G is good if all of its functions are good. We assume that all mcfgs in this paper are good.
3 Multidimensional Substitutable Languages
3.1 Multidimensional Substitutability and Multiple Context-Free Hierarchy
This section introduces the notion of p-dimensional substitutability as a generalization of substitutability by Clark and Eyraud [8]. Let □ ∉ Σ be a new symbol, which represents a hole. If x ∈ (Σ ∪ {□})* contains m occurrences of □, then x is called an m-context. For an m-context x = x₀□x₁ . . . □xₘ with x₀, . . . , xₘ ∈ Σ* and an m-word y = ⟨y₁, . . . , yₘ⟩ ∈ (Σ*)^⟨m⟩, we define an operation ⊙ by x ⊙ y = x₀y₁x₁ . . . yₘxₘ. x ⊙ y is defined only when x contains exactly |y| occurrences of □. For a positive integer p, a language L is said to be pd-substitutable if and only if
x₁ ⊙ y₁, x₁ ⊙ y₂, x₂ ⊙ y₁ ∈ L implies x₂ ⊙ y₂ ∈ L
for any x₁, x₂ ∈ Σ*□(Σ⁺□)^{m−1}Σ*, y₁, y₂ ∈ (Σ⁺)^⟨m⟩ and m ≤ p. For notational convenience, we write Σ□^[m] for Σ*□(Σ⁺□)^{m−1}Σ*. By S(p) we denote the class of pd-substitutable languages. It is an immediate consequence of the definition that S(p + 1) ⊆ S(p) for any p ≥ 1, and in fact the inclusion is proper. Clark and Eyraud's original notion of substitutability [8] is our 1d-substitutability. Thus apparently our generalization of the notion of substitutability does not introduce richer classes of languages. The following example however demonstrates how nicely pd-substitutability works in p-dimensional mcfls.
Example 1. Let Σₘ = { aᵢ | 1 ≤ i ≤ 2m } ∪ { #ᵢ | 1 ≤ i < 2m } and Lₘ = { a₁ⁿ#₁a₂ⁿ#₂ . . . #₂ₘ₋₁a₂ₘⁿ | n ≥ 0 }. For any m, p ≥ 1, Lₘ ∈ S(p). Moreover any finite subset K ⊊ Lₘ is a member of S(m − 1), but K is not in S(m) if |K| ≥ 2. Let Kₘ = { #₁#₂ . . . #₂ₘ₋₁, a₁#₁a₂#₂ . . . #₂ₘ₋₁a₂ₘ }, for example. The least md-substitutable language including Kₘ is in fact Lₘ. Lₘ is a typical m-dimensional mcfl. In fact Lₘ ∈ L(m, 1) − L(m − 1, *). To learn Lₘ from finite examples, pd-substitutability with p < m is not a sufficiently strong assumption, while md-substitutability is better suited. One may think of md-substitutability in m-dimensional mcfls for m ≥ 1 as a generalization of 1d-substitutability in context-free languages, as well as an analogue of zero-reversibility in regular languages.
Example 2. Let Σ′ₘ = { aᵢ, bᵢ | 1 ≤ i ≤ m } ∪ { #ᵢ | 1 ≤ i < 2m } and
L′ₘ = { a₁^{n_a}#₁b₁^{n_b}#₂a₂^{n_a}#₃b₂^{n_b}#₄ . . . #₂ₘ₋₂aₘ^{n_a}#₂ₘ₋₁bₘ^{n_b} | n_a, n_b ≥ 0 }.
For any m, p ≥ 1, L′ₘ ∈ S(p). Let K′ₘ = { #₁ . . . #₂ₘ₋₁, a₁#₁b₁#₂ . . . #₂ₘ₋₂aₘ#₂ₘ₋₁bₘ }. Then K′ₘ ∈ S(m − 1) − S(m). The least md-substitutable language including K′ₘ is in fact L′ₘ.
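The closure effect used in both examples can be made concrete by brute force. The following sketch (an exponential-time illustration, not the paper's algorithm) implements ⊙ on explicitly given contexts and performs one pd-substitutability closure step on a finite sample:

```python
# Sketch: the operation ⊙ and one closure step of pd-substitutability on a
# finite sample. A context is encoded as a tuple (x0, ..., xm) of strings,
# with the m holes sitting between consecutive components.

from itertools import product

def plug(context, mword):
    """x ⊙ <y1, ..., ym> = x0 y1 x1 ... ym xm."""
    assert len(context) == len(mword) + 1
    out = context[0]
    for y, x in zip(mword, context[1:]):
        out += y + x
    return out

def decompositions(w, m):
    """All (context, m-word) pairs that plug to w, with y_i in Σ+ and the
    inner context parts x_1, ..., x_{m-1} in Σ+ (outer parts may be empty)."""
    n = len(w)
    for cuts in product(range(n + 1), repeat=2 * m):
        pos = (0,) + cuts + (n,)
        if any(pos[i] > pos[i + 1] for i in range(2 * m + 1)):
            continue
        parts = [w[pos[i]:pos[i + 1]] for i in range(2 * m + 1)]
        ctx, mword = tuple(parts[0::2]), tuple(parts[1::2])
        if all(mword) and all(ctx[1:-1]):
            yield ctx, mword

def substitutability_step(sample, p):
    """Add every x2 ⊙ y2 forced by x1 ⊙ y1, x1 ⊙ y2, x2 ⊙ y1 ∈ sample (m <= p)."""
    new = set(sample)
    for m in range(1, p + 1):
        table = {}                       # context -> set of m-words it plugs
        for w in sample:
            for ctx, mword in decompositions(w, m):
                table.setdefault(ctx, set()).add(mword)
        for ctx1, ys1 in table.items():
            for ctx2, ys2 in table.items():
                if ys1 & ys2:            # the contexts share some y1
                    for y2 in ys1:
                        new.add(plug(ctx2, y2))
    return new

assert "aaabbb" in substitutability_step({"ab", "aabb"}, 1)
```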
Thus those typical p-dimensional mcfls can be inferred from some finite subsets with pd-substitutability, if one can compute the least language in S(p) including an arbitrarily given finite language. Therefore we are concerned with the classes of languages that are in S(p) and at the same time in L(p, *). Let us denote SL(p, r) = S(p) ∩ L(p, r) and SL(p, *) = S(p) ∩ L(p, *). On the other hand, some other typical p-dimensional mcfls are not pd-substitutable. We say that two m-words y₁ and y₂ are substitutable for each other in L when for any x ∈ Σ□^[m] it holds that x ⊙ y₁ ∈ L iff x ⊙ y₂ ∈ L.
Example 3. The language L₂⁻ = { a₁ⁿ#a₂ⁿ#a₃ⁿ#a₄ⁿ | n ≥ 0 } is not 2d-substitutable. If a 2d-substitutable language L contains ### and a₁#a₂#a₃#a₄ as L₂⁻ does, then ⟨#, #⟩ and ⟨a₁#a₂, a₃#a₄⟩ should be substitutable for each other in L. This entails that a₁a₁#a₂a₂a₃#a₄a₃#a₄ ∈ L − L₂⁻.
The language L_reverse = { w#wᴿ | wᴿ is the reverse of w ∈ {a, b}* } is 1d-substitutable but not 2d-substitutable. Actually even { aⁿ#aⁿ | n ≥ 0 } is not 2d-substitutable. Suppose that a 2d-substitutable language L contains aaa#aaa. Then ⟨aa#, a⟩ and ⟨a, #aa⟩ are substitutable for each other, because of the shared 2-context a□a□a. At the same time aaa#aaa = a□aa□ ⊙ ⟨aa#, a⟩, so L must contain a□aa□ ⊙ ⟨a, #aa⟩ = aaaa#aa, too. This shows that even a singleton language may fail to be 2d-substitutable, which contrasts with the fact that every singleton is 1d-substitutable.
The language L_copy = { w#w | w ∈ {a, b}* } is not 1d-substitutable. If a 1d-substitutable language L contains a#a and b#b as L_copy does, they should be substitutable for each other. aa#aa ∈ L then entails ab#ba ∈ L − L_copy.
When the language L₁ = { a₁ⁿ#₁a₂ⁿ | n ≥ 0 } is generated by a context-free grammar, only the nesting structural interpretation is possible, while with a 2-dimensional mcfg, a cross-serial dependency is also a possible interpretation at the same time. One cannot decide which is the underlying structure from strings only. Actually if a 2d-substitutable language contains a₁#₁a₂ and a₁a₁#₁a₂a₂, both interpretations are inevitably induced.
3.2 Comparison with Simple External Contextual Languages
We will extend the operator ⊙ so that x ⊙ y is defined for x ∈ ((Σ ∪ {□})*)^⟨+⟩ if x contains exactly |y| occurrences of □. For instance, ⟨a, b□c□d, □e□⟩ ⊙ ⟨y₁, y₂, y₃, y₄⟩ = ⟨a, by₁cy₂d, y₃ey₄⟩. Simple external contextual (sec) languages are important mildly context-sensitive languages in the context of grammatical inference [3, 14, 2]. For p ≥ 1 and q ≥ 0, a p, q-sec grammar G over Σ is a pair ⟨B, C⟩ where B ∈ (Σ*)^⟨p⟩ and C ⊆ (Σ*□Σ*)^⟨p⟩ with |C| ≤ q. The p, q-sec language L(G) generated by a p, q-sec grammar G is defined as L(G) = { w₁ . . . wₚ ∈ Σ* | ⟨w₁, . . . , wₚ⟩ = x₁ ⊙ · · · ⊙ xₙ ⊙ B, xᵢ ∈ C, n ≥ 0 } (⊙ is associative). Let SEC(p, q) denote the class of p, q-sec languages. We note that SEC(p, q) ⊆ L(p, 1). The languages Lₘ in Example 1 and L′ₘ in Example 2 are in SEC(m, 1) ∩ SL(m, 1). However the classes SL(p, *) and SEC(p, *) are incomparable. The
regular language (ab*cd*)*e ∈ ⋂_{m∈N} SL(m, 1) is not a p, q-sec language for any p, q. On the other hand, L_reverse from Example 3 is in SEC(1, 2) and is not 2d-substitutable. { aⁿbⁿ | n ≥ 1 } ∈ SEC(1, 1) is not 1d-substitutable either.
4 Learning pd-Substitutable Multiple Context-Free Languages
4.1 Learning Algorithm
Let us arbitrarily fix positive integers p and r. This section presents an algorithm that learns the class SL(p, r). We do not yet have any grammatical characterization of this class; for mathematical completeness, we therefore define our learning target by saying that our target representations are mcfgs in G(p, r) generating pd-substitutable languages, though this property is not decidable. While we have S(p + 1) ⊊ S(p) and L(p, r) ⊊ L(p + 1, r), the classes SL(p, r) and SL(p + 1, r) are incomparable, unless r = 0. We remark that our algorithm is easily modified to learn the class SL(p, *) if we give up the polynomial-time computability, as we will discuss later. On the other hand, SL(*, r) = ⋃_{p∈N} SL(p, r) is not identifiable in the limit from positive data unless r = 0. Let L* = { aⁿbcⁿdeⁿ | n ≥ 0 }. It is easy to see that all finite subsets of L* are in SL(1, 0), while L* ∈ SL(2, 1). Our learning algorithm A(p, r) for SL(p, r), which is shown as Algorithm 1, is a natural generalization of Clark and Eyraud's original algorithm for SL(1, 2) = SL(1, *) [8]. If the new positive example is generated by the previous hypothesis, it keeps the hypothesis. Otherwise, A(p, r) computes an mcfg G(K) from the set K of positive examples given so far. The set of nonterminals is defined as
V_K = { y ∈ (Σ⁺)^⟨m⟩ | x ⊙ y ∈ K for some x ∈ Σ□^[m] and 1 ≤ m ≤ p } ∪ {S},
where dim(y) = |y|. We will write [[y]] instead of y to clarify that it is meant as a nonterminal symbol (indexed with y). P_K consists of the following rules:
– (Type I) [[y]] → f([[y₁]], . . . , [[yₙ]]) if there is a good function f of rank n ≤ r such that y = f(y₁, . . . , yₙ), where [[y]], [[y₁]], . . . , [[yₙ]] ∈ V_K − {S};
– (Type II) [[y]] → I_m([[y′]]), where I_m is the identity on m-words for m = |y| ≤ p, if there is x ∈ Σ□^[m] such that x ⊙ y, x ⊙ y′ ∈ K;
– (Type III) S → I₁([[w]]) if w ∈ K;
and F_K is the set of functions requested in the definition of P_K. As V_K is finite, F_K and P_K are also finite. Then G(K) = ⟨Σ, V_K, F_K, P_K, S⟩ ∈ G(p, r) is the conjecture by A(p, r). Instead of having rules [[y]] → I_m([[y′]]) of Type II, one may merge [[y]] and [[y′]] to downsize the output, as Clark and Eyraud do in [8].
Example 4. Let p = 2 and r = 1. Let us consider the grammar G(K) = ⟨Σ, V_K, F_K, P_K, S⟩ for K = { a#₁b#₂c#₃d, a#₁#₂c#₃, aa#₁b#₂cc#₃d }.
Algorithm 1. A(p, r)
Data: a sequence of strings w₁, w₂, . . .
Result: a sequence of mcfgs G₁, G₂, · · · ∈ G(p, r)
let Ĝ be an mcfg such that L(Ĝ) = ∅;
for n = 1, 2, . . . do
  read the next string wₙ;
  if wₙ ∉ L(Ĝ) then
    let Ĝ = G(K) where K = {w₁, . . . , wₙ};
  end if
  output Ĝ as Gₙ;
end for
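A direct rendering of this loop (with G(K) and the membership test of Proposition 4 assumed to be supplied as black boxes; all names hypothetical) might look as follows:

```python
# Sketch of Algorithm 1: `build_grammar` stands for the construction of G(K)
# and `generates` for the membership test, both assumed to be supplied.

def algorithm_1(stream, build_grammar, generates):
    """Yield the hypotheses G_1, G_2, ... on a stream of positive examples."""
    K, G = [], build_grammar([])      # G(∅) generates the empty language
    for w in stream:
        K.append(w)
        if not generates(G, w):       # keep the hypothesis if it already fits
            G = build_grammar(K)
        yield G
```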
We see that V_K contains the following four nonterminals among others:
[[a#₁#₂c#₃]], [[⟨a#₁, c#₃⟩]], [[⟨#₁b, #₃d⟩]], [[⟨#₁, #₃⟩]] ∈ V_K.
P_K contains at least the following four rules of Type I:
[[a#₁#₂c#₃]] → f([[⟨a#₁, c#₃⟩]])  where f(⟨z₁, z₂⟩) = z₁#₂z₂,
[[⟨a#₁, c#₃⟩]] → g([[⟨#₁, #₃⟩]])  where g(⟨z₁, z₂⟩) = ⟨az₁, cz₂⟩,
[[⟨#₁b, #₃d⟩]] → h([[⟨#₁, #₃⟩]])  where h(⟨z₁, z₂⟩) = ⟨z₁b, z₂d⟩,
[[⟨#₁, #₃⟩]] → ⟨#₁, #₃⟩,
as well as the following rules of Type II:
[[⟨#₁, #₃⟩]] → I₂([[⟨a#₁, c#₃⟩]]) due to (a□b#₂c□d) ⊙ ⟨#₁, #₃⟩, (a□b#₂c□d) ⊙ ⟨a#₁, c#₃⟩ ∈ K,
[[⟨#₁, #₃⟩]] → I₂([[⟨#₁b, #₃d⟩]]) due to (a□#₂c□) ⊙ ⟨#₁, #₃⟩, (a□#₂c□) ⊙ ⟨#₁b, #₃d⟩ ∈ K,
and their symmetries [[⟨a#₁, c#₃⟩]] → I₂([[⟨#₁, #₃⟩]]) and [[⟨#₁b, #₃d⟩]] → I₂([[⟨#₁, #₃⟩]]), too, and the rule S → I₁([[a#₁#₂c#₃]]) of Type III. Thus G(K) generates every string derived by the mcfg G* with the rules
S → f(A), A → g(A), A → h(A), A → ⟨#₁, #₃⟩,
where f, g, h denote the same functions as in G(K). We have L(G*) = { aᵐ#₁bⁿ#₂cᵐ#₃dⁿ | m, n ≥ 0 } ⊆ L(G(K)). Here many other nonterminals and rules of G(K) are suppressed, but indeed it holds that L(G(K)) = L(G*), as we will prove later. Note that L(G*) ∈ SL(2, 1).
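The Type II rules above can be found mechanically by collecting, for each m-context, the m-words it plugs into K. A brute-force sketch (with #₁, #₂, #₃ encoded as the digits 1, 2, 3; not the paper's implementation) recovering the first pair of Example 4:

```python
# Sketch: discovering Type II connections in K by enumerating context/m-word
# decompositions, as in the earlier closure sketch.

from itertools import product

def splits(w, m):
    n = len(w)
    for cuts in product(range(n + 1), repeat=2 * m):
        pos = (0,) + cuts + (n,)
        if all(pos[i] <= pos[i + 1] for i in range(2 * m + 1)):
            parts = [w[pos[i]:pos[i + 1]] for i in range(2 * m + 1)]
            ctx, mword = tuple(parts[0::2]), tuple(parts[1::2])
            if all(mword) and all(ctx[1:-1]):
                yield ctx, mword

def type_ii_pairs(K, p):
    pairs = set()
    for m in range(1, p + 1):
        table = {}
        for w in K:
            for ctx, mword in splits(w, m):
                table.setdefault(ctx, set()).add(mword)
        for ys in table.values():
            pairs |= {(y1, y2) for y1 in ys for y2 in ys if y1 != y2}
    return pairs

K = {"a1b2c3d", "a12c3", "aa1b2cc3d"}
assert (("1", "3"), ("a1", "c3")) in type_ii_pairs(K, 2)
```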
4.2 Correctness of the Algorithm
We first confirm that A(p, r) is consistent.
Lemma 3. K ⊆ L(G(K)) for any finite language K.
Proof. If w ∈ K, by definition G(K) has the rules S → I₁([[w]]) of Type III and [[w]] → w of Type I. □
We then show that the language of the conjectured grammar Ĝ of our algorithm A(p, r) is always a subset of the target language.
Lemma 4. For any L ∈ S(p) and any finite subset K of L, if w ∈ L(G(K), [[y]]) with [[y]] ∈ V_K − {S}, then y and w are substitutable for each other in L.
Proof. Let Ĝ = G(K). We prove the lemma by induction on the derivation of w ∈ L(G(K), [[y]]). Suppose that w = f(w₁, . . . , wₙ) ∈ L(Ĝ, [[y]]) due to the rule [[y]] → f([[y₁]], . . . , [[yₙ]]) of Type I and wᵢ ∈ L(Ĝ, [[yᵢ]]) for i = 1, . . . , n. Note that the base case is when n = 0. The presence of the rule implies the existence of x ∈ Σ□^[|w|] such that x ⊙ y = x ⊙ f(y₁, . . . , yₙ) ∈ K ⊆ L. The induction hypothesis says that yᵢ and wᵢ are substitutable for each other in L. Recall that the function f is designed to be good. This allows us the following inference:
x ⊙ f(y₁, y₂, . . . , yₙ) ∈ L ⟹ x ⊙ f(w₁, y₂, . . . , yₙ) ∈ L ⟹ . . . ⟹ x ⊙ f(w₁, . . . , wₙ₋₁, yₙ) ∈ L ⟹ x ⊙ f(w₁, . . . , wₙ₋₁, wₙ) = x ⊙ w ∈ L.
Thus y and w are substitutable for each other.
Suppose that w ∈ L(Ĝ, [[y]]) due to the rule [[y]] → I_m([[y′]]) of Type II and w ∈ L(Ĝ, [[y′]]). By the presence of the rule, y and y′ are substitutable for each other. By the induction hypothesis, y′ and w are substitutable for each other. Hence y and w are also substitutable for each other in L. □
Lemma 5. For any L ∈ S(p) and any finite subset K of L, it holds that L(G(K)) ⊆ L.
Proof. Let Ĝ = G(K). If w ∈ L(Ĝ), i.e., ⟨w⟩ ∈ L(Ĝ, S), then there is [[y]] ∈ V_K such that S → I₁([[y]]) is a rule of Type III of Ĝ and y ∈ K. By Lemma 4, y and w are substitutable for each other in L. y ∈ K ⊆ L implies that w ∈ L (take the context □). □
The conjectured language may be properly smaller than the target when the given data are not rich enough. We now define a finite subset of the target language which ensures correct convergence of the conjecture of our learning algorithm. For a good mcfg G ∈ G(p, r) generating the target language, we define K_G so that for each rule of G, it contains a shortest string from L(G) which is derived using that rule at least once. For the sake of rigorousness, we give a formal definition of K_G here. Let X(G, A/B) be defined by:
– □^⟨dim(A)⟩ ∈ X(G, A/A),
– if A → f(B₁, . . . , Bₙ) is a rule and x_j ∈ X(G, B_j/C) for some j ∈ {1, . . . , n} and yᵢ ∈ L(G, Bᵢ) for the other i = 1, . . . , j − 1, j + 1, . . . , n, then f(y₁, . . . , y_{j−1}, x_j, y_{j+1}, . . . , yₙ) ∈ X(G, A/C),
– nothing else is in X(G, A/B).
We then define the set K_G as follows:
y_A = min L(G, A),
x_A = min{ x ∈ Σ□^[dim(A)] | x ∈ X(G, S/A) },
K_G = { x_A ⊙ f(y_{B₁}, . . . , y_{Bₙ}) | A → f(B₁, . . . , Bₙ) ∈ P },
where min S for a set S of m-words means an element y from S whose size ‖y‖ is the smallest, and min S for a set S of m-contexts means an element x from S whose length |x| is the smallest.
Lemma 6. For any G ∈ G(p, r), if K_G ⊆ K, then L(G) ⊆ L(G(K)).
Proof. Let Ĝ = G(K). We show by induction that w ∈ L(G, A) implies w ∈ L(Ĝ, [[y_A]]). Because Ĝ has the rule S → I₁([[y_S]]), this proves the lemma. Suppose that w = f(w₁, . . . , wₙ) ∈ L(G, A) due to the rule A → f(B₁, . . . , Bₙ) and wᵢ ∈ L(G, Bᵢ) for i = 1, . . . , n. The base case is when n = 0. Let y = f(y_{B₁}, . . . , y_{Bₙ}). By definition we have x_A ⊙ y ∈ K and thus [[y]] → f([[y_{B₁}]], . . . , [[y_{Bₙ}]]) is a rule of Ĝ. By x_A ⊙ y_A ∈ K, Ĝ has the rule [[y_A]] → I_{dim(A)}([[y]]), too. Applying those two rules to wᵢ ∈ L(Ĝ, [[y_{Bᵢ}]]) for i = 1, . . . , n, which are obtained by the induction hypothesis, we have that w = f(w₁, . . . , wₙ) ∈ L(Ĝ, [[y_A]]). □
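The witnesses y_A can be computed by a shortest-first fixpoint over the rules, since the functions are λ-free and sizes only grow under composition. A sketch under a hypothetical grammar encoding (each rule a triple (A, f, [B₁, . . . , Bₙ]) with f a Python function on tuples):

```python
# Sketch: computing size-minimal members of L(G, A) for each nonterminal.

def size(mword):
    return len(mword) + sum(len(x) for x in mword)

def shortest_words(rules):
    best, changed = {}, True
    while changed:                    # relax until no smaller witness appears
        changed = False
        for A, f, Bs in rules:
            if all(B in best for B in Bs):
                candidate = f(*[best[B] for B in Bs])
                if A not in best or size(candidate) < size(best[A]):
                    best[A] = candidate
                    changed = True
    return best                       # best[A] is a size-minimal member of L(G, A)

# The grammar G* from Example 4, with the delimiters dropped for brevity.
rules = [("A", lambda: ("", ""), []),
         ("A", lambda z: ("a" + z[0] + "b", "c" + z[1]), ["A"]),
         ("S", lambda z: (z[0] + z[1],), ["A"])]
assert shortest_words(rules)["S"] == ("",)
```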
Corollary 1. For any mcfg G ∈ G(p, r) generating a language in S(p), A(p, r) identifies L(G) in the limit from positive data.
Proof. If the conjectured language L(G(K)) is not correct, Lemma 5 ensures the existence of w ∈ L(G) − L(G(K)). By K ⊆ L(G(K)) (Lemma 3), w has not yet appeared in K, and A(p, r) will see w later. Hence A(p, r) will discard the current conjecture at some point. Finally A(p, r) converges to the target language by Lemma 6. □
One can modify the learning algorithm so that it learns SL(p, *) by removing the restriction on the rank of the hypothesized grammar. The rank is then bounded by the length of a longest example given so far, because we still restrict the functions of grammars to be λ-free. Let us call the learning algorithm obtained in this way A(p, *).
Corollary 2. A(p, *) identifies SL(p, *) in the limit from positive data.
4.3 Efficiency of the Algorithm
We discuss in this subsection the efficiency of our learning algorithm in terms of the time for updating the conjecture and the amount of data needed for convergence. This measurement was proposed by de la Higuera [11]. His definition was initially designed for learning of regular languages and it is controversial whether it is suitable for learning non-regular languages. Wakatsuki and Tomita [18] have proposed to measure the complexity of an algorithm dealing with context-free grammars by the parameter called the maximal thickness t_G of G together with the size ‖G‖ of the grammar G. The thickness of a nonterminal symbol A is defined to
be the length of a shortest string derived from A, and t_G is the maximum of the thicknesses of the nonterminals. Instead of the original definition, we would like the thickness τ_G of a grammar G to be defined as the maximum of the thicknesses of the rules, where the thickness of a rule ρ is defined to be the length of a shortest string in L(G) that is derived using ρ at least once. It is easy to see that τ_G ≤ ‖G‖t_G. This works well for multiple context-free grammars as well as for context-free grammars. Hence a value is bounded by a polynomial in τ_G if and only if it is bounded by a polynomial in ‖G‖t_G. The following is our criterion for efficient learning, which is a slight modification of de la Higuera's definition [11].
Definition 1. A representation class G of mcfgs is identifiable in the limit from positive data with polynomial time and data if and only if there exists an algorithm A such that
1. given a set K of positive examples, A returns a hypothesis Ĝ in polynomial time in ‖K‖,
2. for each grammar G* ∈ G, there exists a finite set K* of examples such that
– |K*| is bounded by a polynomial in ‖G*‖ [4],
– ‖K*‖ is bounded by a polynomial in ‖G*‖τ_{G*},
– if K* ⊆ K ⊆ L(G*), A converges to a grammar Ĝ such that L(Ĝ) = L(G*).
Clark and Eyraud's [8] and Yoshinaka's [19] learning algorithms for (k, l-)substitutable context-free languages satisfy this definition.
Lemma 7. Our algorithm A(p, r) computes its hypothesis Ĝ in polynomial time in the total size of the given examples.
Proof. By Proposition 4, the membership of the new example w in the current hypothesis Ĝ is decidable in O(‖Ĝ‖|w|^{p(r+1)}) time. As we will see below, it holds that ‖Ĝ‖ ∈ O(|K|²ℓ_K^{2pr+2p+1}) where ℓ_K = max{ |w| | w ∈ K }. Thus the membership is decidable in O(|K|²ℓ_K^{3pr+3p+1}) time. Suppose that the new example w is not generated by the current hypothesis. Then A(p, r) computes G(K) = ⟨Σ, V_K, F_K, P_K, S⟩.
Each rule of Type I is constructed from a single word w ∈ K. If G(K) has [[y]] → f([[y₁]], . . . , [[yₙ]]), there is x = x₀□x₁ . . . □xₘ ∈ Σ□^[m] such that w = x ⊙ f(y₁, . . . , yₙ) ∈ K, where m = |y|. Here the occurrences of x₀, . . . , xₘ and yᵢⱼ from yᵢ = ⟨yᵢ₁, . . . , yᵢₘᵢ⟩ with 1 ≤ i ≤ n and 1 ≤ j ≤ mᵢ are pairwise non-overlapping in w. Let k = m + 1 + m₁ + · · · + mₙ denote the number of those substrings. The fragments of w that are not covered by those k substrings are from f itself. This factorization of w is determined by specifying where each of the k substrings starts and ends, except that the starting position of x₀ and the ending position of xₘ are predetermined. Thus there exist at most (|w| + 1)^{2k−2} ≤ (|w| + 1)^{2p(r+1)} such factorizations of w, because m, mᵢ ≤ p, n ≤ r and thus k ≤ p + 1 + pr. The size of the rule is bounded by O(|w|). Hence we need O(|K|ℓ_K^{2pr+2p+1}) time to compute rules of Type I.
Each rule of Type II is constructed by comparing two words w₁, w₂ ∈ K. There are at most (|wᵢ| + 1)^{2p} pairs of xᵢ and yᵢ to be considered such that xᵢ ⊙ yᵢ = wᵢ for each i = 1, 2. Determining whether x₁ = x₂ is done in linear time in |w₁| + |w₂|, and the size of the rule has the same bound O(|w₁| + |w₂|). Thus we need O(|w₁|^{2p}|w₂|^{2p}|w₁w₂|) time to construct all the possible rules of Type II from w₁, w₂ ∈ K. Hence we need O(|K|²ℓ_K^{4p+1}) time to compute rules of Type II. G(K) has exactly |K| rules of Type III, of size O(ℓ_K).
All in all, it takes O(|K|²ℓ_K^{2pr+2p+1}) time to construct G(K), and its size ‖G(K)‖ has the same bound. □
Hence A(p, r) updates its hypothesis quickly if p and r are small.
Lemma 8. |K_G| ≤ |P| and ‖K_G‖ ≤ |P|τ_G where G = ⟨Σ, V, F, P, S⟩.
Proof. Each rule of G determines one element of K_G, whose length is exactly the thickness of the rule. We have |K_G| ≤ |P| and ‖K_G‖ ≤ |K_G|τ_G ≤ |P|τ_G. □
The size of K_G does not depend on p and r, while the updating time is polynomial only when p and r are fixed. This contrasts with Yoshinaka's discussion of the learning efficiency of k, l-substitutable context-free languages [19], which are another extension of Clark and Eyraud's work [8]. His algorithm updates the conjecture in polynomial time independently of k and l, while the size of data for convergence is bounded by a polynomial whose degree is linear in k + l.
Theorem 1. The learning algorithm A(p, r) identifies SL(p, r) in the limit from positive data with polynomial time and data.
Concerning the algorithm A(p, *) for SL(p, *), its updating time is no longer bounded by a polynomial, while K_G still works well for A(p, *).
5 Discussions
This paper has demonstrated how Clark and Eyraud's approach with substitutability [8] works in learning mildly context-sensitive languages. pd-substitutability seems to fit nicely into p-dimensional mcfls as a generalization of 1d-substitutability in context-free languages, which is the exact analogue of reversibility in regular languages. The obtained learnable classes are however not rich: we have seen in Section 3 several rather simple languages that are not 2d-substitutable. pd-substitutability easily causes too much generalization from finite languages even when p = 2. The author hopes that this work provides a clue for further investigation of learning mildly context-sensitive languages, possibly in other learning schemes.
One naive trial for enriching the expressive power beyond 2d-substitutable languages might be considering the following property in addition to 1d-substitutability:
x₁y₁x₂y₂, x₁y′₁x₂y′₂, x′₁y₁x′₂y₂ ∈ L implies x′₁y′₁x′₂y′₂ ∈ L
for any x₁, x′₁, y₂, y′₂ ∈ Σ* and x₂, x′₂, y₁, y′₁ ∈ Σ⁺. This property is stronger than 1d-substitutability and slightly weaker than 2d-substitutability (and might be thought of as 2d-reversibility). However, this property is still too strong; neither { aⁿ#aⁿ | n ≥ 1 }, L_reverse nor L_copy satisfies it.
In order to control some kinds of dependent structures in pd-substitutable languages, Examples 1 and 2 insert delimiters #ᵢ. This trick is necessary even in 1d-substitutable languages. While { aⁿ#bⁿ | n ≥ 0 } is 1d-substitutable, { aⁿbⁿ | n ≥ 0 } is not. Yoshinaka's approach of k, l-substitutability [19] enables us to remove such delimiters. Thus again one may consider ⟨k₁, . . . , k₂ₘ⟩-substitutability:
x ⊙ (v ⊙ y), x ⊙ (v ⊙ y′), x′ ⊙ (v ⊙ y) ∈ L implies x′ ⊙ (v ⊙ y′) ∈ L
for any v ∈ (Σ^{k₁}□Σ^{k₂}) × · · · × (Σ^{k₂ₘ₋₁}□Σ^{k₂ₘ}). Indeed { aⁿbⁿcⁿdⁿ | n ≥ 1 } is 1^⟨4⟩-substitutable, but neither { aⁿbⁿaⁿbⁿ | n ≥ 1 } nor { aⁿ#aⁿ | n ≥ 1 } is k-substitutable for any k ∈ N⁴.
Clark et al. [9] have developed their work on substitutable context-free languages into learning a much richer class of context-free languages with positive examples and membership queries. Their approach could be generalized to mildly context-sensitive languages as well, where multidimensional substitutable languages should be regarded as a special case. This seems to be the most promising approach for future work.
Acknowledgement
The author is deeply grateful to Thomas Zeugmann and the anonymous reviewers for their valuable comments and advice. This work was supported by a Grant-in-Aid for Young Scientists (B-20700124) and a grant from the Global COE Program, "Center for Next-Generation Information Technology based on Knowledge Discovery and Knowledge Federation", from the Ministry of Education, Culture, Sports, Science and Technology of Japan.
References
1. Angluin, D.: Inference of reversible languages. Journal of the Association for Computing Machinery 29(3), 741–765 (1982)
2. Becerra-Bonache, L., Case, J., Jain, S., Stephan, F.: Iterative learning of simple external contextual languages. In: Freund, Y., Györfi, L., Turán, G., Zeugmann, T. (eds.) ALT 2008. LNCS (LNAI), vol. 5254, pp. 359–373. Springer, Heidelberg (2008)
3. Becerra-Bonache, L., Yokomori, T.: Learning mild context-sensitiveness: Toward understanding children's language learning. In: Paliouras, G., Sakakibara, Y. (eds.) ICGI 2004. LNCS (LNAI), vol. 3264, pp. 53–64. Springer, Heidelberg (2004)
4. Carme, J., Gilleron, R., Lemay, A., Niehren, J.: Interactive learning of node selecting tree transducer. Machine Learning 66(1), 33–67 (2007)
5. Clark, A.: PAC-learning unambiguous NTS languages. In: Sakakibara et al. [16], pp. 59–71
6. Clark, A., Coste, F., Miclet, L. (eds.): ICGI 2008. LNCS (LNAI), vol. 5278. Springer, Heidelberg (2008)
7. Clark, A., Eyraud, R.: Identification in the limit of substitutable context-free languages. In: Jain, S., Simon, H.U., Tomita, E. (eds.) ALT 2005. LNCS (LNAI), vol. 3734, pp. 283–296. Springer, Heidelberg (2005)
8. Clark, A., Eyraud, R.: Polynomial identification in the limit of context-free substitutable languages. Journal of Machine Learning Research 8, 1725–1745 (2007)
9. Clark, A., Eyraud, R., Habrard, A.: A polynomial algorithm for the inference of context free languages. In: Clark et al. [6], pp. 29–42
10. Gold, E.M.: Language identification in the limit. Information and Control 10(5), 447–474 (1967)
11. de la Higuera, C.: Characteristic sets for polynomial grammatical inference. Machine Learning 27, 125–138 (1997)
12. Kasprzik, A.: A learning algorithm for multi-dimensional trees, or: Learning beyond context-freeness. In: Clark et al. [6], pp. 111–124
13. Kudlek, M., Martín-Vide, C., Mateescu, A., Mitrana, V.: Contexts and the concept of mild context-sensitivity. Linguistics and Philosophy 26(6), 703–725 (2003)
14. Oates, T., Armstrong, T., Becerra-Bonache, L., Atamas, M.: Inferring grammars for mildly context sensitive languages in polynomial-time. In: Sakakibara et al. [16], pp. 137–147
15. Rambow, O., Satta, G.: Independent parallelism in finite copying parallel rewriting systems. Theoretical Computer Science 223(1–2), 87–120 (1999)
16. Sakakibara, Y., Kobayashi, S., Sato, K., Nishino, T., Tomita, E. (eds.): ICGI 2006. LNCS (LNAI), vol. 4201. Springer, Heidelberg (2006)
17. Seki, H., Matsumura, T., Fujii, M., Kasami, T.: On multiple context-free grammars. Theoretical Computer Science 88(2), 191–229 (1991)
18. Wakatsuki, M., Tomita, E.: A fast algorithm for checking the inclusion for very simple deterministic pushdown automata. IEICE Transactions on Information and Systems E76-D(10), 1224–1233 (1993)
19. Yoshinaka, R.: Identification in the limit of k, l-substitutable context-free languages. In: Clark et al. [6], pp. 266–279
Uncountable Automatic Classes and Learning
Sanjay Jain¹, Qinglong Luo¹, Pavel Semukhin², and Frank Stephan¹,²
¹ Department of Computer Science, National University of Singapore, Singapore 117417, Republic of Singapore
[email protected],
[email protected] 2 Department of Mathematics, National University of Singapore, Singapore 117543, Republic of Singapore
[email protected],
[email protected]
Abstract. In this paper we consider uncountable classes recognizable by ω-automata and investigate suitable learning paradigms for them. In particular, the counterparts of explanatory, vacillatory and behaviourally correct learning are introduced for this setting. Here the learner reads in parallel the data of a text for a language L from the class plus an ω-index α and outputs a sequence of ω-automata such that all but finitely many of these ω-automata accept the index α iff α is an index for L. It is shown that any class is behaviourally correct learnable if and only if it satisfies Angluin’s tell-tale condition. For explanatory learning, such a result needs that a suitable indexing of the class is chosen. On the one hand, every class satisfying Angluin’s tell-tale condition is vacillatory learnable in every indexing; on the other hand, there is a fixed class such that the level of the class in the hierarchy of vacillatory learning depends on the indexing of the class chosen. We also consider a notion of blind learning. On the one hand, a class is blind explanatory (vacillatory) learnable if and only if it satisfies Angluin’s tell-tale condition and is countable; on the other hand, for behaviourally correct learning there is no difference between the blind and non-blind version. This work establishes a bridge between automata theory and inductive inference (learning theory).
1 Introduction
Usually, in learning theory one considers classes consisting of countably many languages from some countable domain. A typical example here is the class of all recursive subsets of {0, 1, 2}*, the set of all finite strings over the alphabet {0, 1, 2}. However, each countably infinite domain has uncountably many subsets, and thus we miss out on many potential targets when we consider only countable classes. The main goal of this paper is to find a generalization of the classical model of learning which would be suitable for working with uncountable classes of languages. The classes which we consider can be uncountable, but they still
The first and fourth author are supported in part by NUS grant R252-000-308-112; the third and fourth author are supported by NUS grant R146-000-114-112.
have some structure, namely, they are recognizable by Büchi automata. We will investigate how the classical notions of learnability have to be adjusted in this setting in order to obtain meaningful results.
To explain our approach in more detail, we first give an overview of the classical model of inductive inference which is the underlying model of learning in our paper. Consider a class L = {Lᵢ}_{i∈I}, where each language Lᵢ is a subset of Σ*, the set of finite strings in an alphabet Σ. In the classical model of learning, which was introduced and studied by Gold [9], a learner M receives a sequence of all the strings from a given language L ∈ L, possibly with repetitions. Such a sequence is called a text for the language. After reading the first n strings from the text, the learner outputs a hypothesis iₙ about what the target language might be. The learner succeeds if it eventually converges to an index that correctly describes the language to be learned, that is, if limₙ iₙ = i and L = Lᵢ. If the learner succeeds on all texts for all languages from a class, then we say that it learns this class. This is the notion of explanatory learning (Ex). Such a model became the standard one for the learnability of countable classes. Besides Ex, several other paradigms for learning have been considered, e.g., behaviourally correct (BC) learning [3], vacillatory or finite explanatory (FEx) learning [8], partial identification (Part) [13] and so on. The indices that the learner outputs are usually finite objects like natural numbers or finite strings. For example, Angluin [1] initiated the research on learnability of uniformly recursive families indexed by natural numbers, and in their recent work Jain, Luo and Stephan [10] considered automatic indexings by finite strings in place of uniformly recursive ones. The collection of such finite indices is countable, and hence we can talk only about countable classes of languages. On the other hand, the collection of all the subsets of Σ* is uncountable, and it looks too restrictive to consider only countable classes. Because of this, it is interesting to find a generalization of the classical model which will allow us to study the learnability of uncountable classes. Below is an informal description of the learning model that we investigate in this paper.
First, since we are going to work with uncountable classes, we need uncountably many indices to index a class to be learned. For this purpose we will use infinite strings (or ω-strings) over a finite alphabet. There are computing machines, called Büchi automata or ω-automata, which can be used naturally for processing ω-strings. They were first introduced by Büchi [6,7] to prove the decidability of S1S, the monadic second-order theory of the natural numbers with the successor function S(x) = x + 1. Because of this and other decidability results the theory of ω-automata has become a popular area of research in theoretical computer science, see, e.g., [14]. So, we will assume that a class to be learned has an indexing by ω-strings which is Büchi recognizable.
The main difference between our model and the classical one is that the learner does not output hypotheses as it processes a text. The reason for this is that it is not possible to output an arbitrary infinite string in a finite amount of time. Instead, in our model the learner is presented with an index α and a text T, and it must decide whether T is a text for the set with the index α. During its
work, the learner outputs an infinite sequence of Büchi automata {Aₙ}_{n∈ω} such that Aₙ accepts the index α if and only if the learner at stage n thinks that T is indeed a text for the set with the index α. The goal of the learner is to converge in the limit to the right answer.
As one can see from the description above, the outputs of a learner take the form of ω-automata instead of just binary answers 'yes' or 'no'. We chose this definition due to the fact that a learner can read only a finite part of an infinite index in a finite amount of time. If we required that a learner output its 'yes' or 'no' answer based on such finite information, then our model would become too restrictive. On the other hand, a Büchi automaton allows a learner to encode additional infinitary conditions that have to be verified before the index is accepted or rejected, for example, whether the index contains infinitely many 1's or not. This approach makes a learner more powerful, and more nontrivial classes become learnable.
Probably the most interesting property of our model is that for many learning criteria, the learnability coincides with Angluin's classical tell-tale condition for the countable case (see the table at the end of this section). Angluin's condition states that for every set L from a class L, there is a finite subset D_L ⊆ L such that for any other L′ ∈ L with D_L ⊆ L′ ⊆ L we have that L′ = L. It is also well known that in the classical case, every r.e. class is learnable according to the criterion of partial identification. We will show that in our model every ω-automatic class can be learned according to this criterion. The results above show that the notions defined in this paper match the intuition of learnability, and that our model is a natural one and is suitable for investigating the learnability of uncountable classes of languages.
We also consider a notion of blind learning. A learner is called blind if it does not see an index presented to it. Such a learner can see only an input text, but nevertheless it must decide whether the index and the text match each other. It turns out that for the criterion of behaviourally correct learning, the blind learners are as powerful as the non-blind ones without even the need to change the indexing of a class, but for the other learning criteria this notion becomes more restrictive.
The reader can find all formal definitions of the notions discussed here and some necessary preliminaries in the next section. We summarize our results:

Criterion  | Condition        | Indexing | Theorem
Ex         | ATTC             | New      | 17, 20
FEx        | ATTC             | Original | 13, 20
BC         | ATTC             | Original | 20
Part       | Any class        | Original | 21
BlindBC    | ATTC             | Original | 18, 20
BlindEx    | ATTC & Countable | Original | 19
BlindFEx   | ATTC & Countable | Original | 19
BlindPart  | Countable        | Original | 22
In this table, the first column lists the learning criteria that we studied. Here, Ex stands for explanatory learning, BC for behaviourally correct learning, FEx
for finite explanatory or vacillatory learning, and Part for partial identification. A prefix Blind denotes the blind version of the corresponding criterion. The second column describes equivalent conditions for a given learning criterion. Here, ATTC means that the class must satisfy Angluin’s tell-tale condition, and Countable means that the class must be countable. The next column indicates whether the learner uses the original indexing of the class or a new one. The last column gives a reference to a theorem/corollary where the result is proved.
2 Preliminaries
An ω-automaton is essentially a finite automaton operating on ω-strings with an infinitary acceptance condition which decides, depending upon the infinitely often visited states, which ω-strings are accepted and which are rejected. For a general background on the theory of finite automata the reader is referred to [11].
Definition 1 ([6,7]). A nondeterministic ω-automaton is a tuple A = (S, Σ, I, T), where
(a) S is a finite set of states,
(b) Σ is a finite alphabet,
(c) I ⊆ S is the set of initial states, and
(d) T is the transition function T : S × Σ → P(S), where P(S) is the power set of S.
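For the deterministic automata used throughout the paper, acceptance on an ultimately periodic input u·v^ω is decidable, because the run eventually cycles and the cycle determines the infinity set Inf(r) introduced in Definition 2 below. A sketch (not from the paper; the transition-table encoding is an assumption made here):

```python
# Sketch: the run of a deterministic ω-automaton on the ω-string u·v^ω.

def inf_set(delta, s0, u, v):
    """delta maps (state, symbol) -> state; returns Inf(r) for the run on u·v^ω."""
    s = s0
    for a in u:                      # consume the finite prefix u
        s = delta[(s, a)]
    trace, seen, pos = [], {}, 0
    while (s, pos) not in seen:      # a (state, offset in v) pair must repeat
        seen[(s, pos)] = len(trace)
        trace.append(s)
        s = delta[(s, v[pos])]
        pos = (pos + 1) % len(v)
    return set(trace[seen[(s, pos)]:])   # states on the cycle = Inf(r)

def buchi_accepts(delta, s0, F, u, v):
    """Büchi acceptance: Inf(r) must meet F."""
    return bool(inf_set(delta, s0, u, v) & F)

def muller_accepts(delta, s0, table, u, v):
    """Muller acceptance: Inf(r) must be one of the designated state sets."""
    return frozenset(inf_set(delta, s0, u, v)) in table

# Example: ω-strings over {0, 1} with infinitely many 1's; states remember
# the last symbol read.
delta = {(q, a): "s1" if a == 1 else "s0" for q in ("s0", "s1") for a in (0, 1)}
assert buchi_accepts(delta, "s0", {"s1"}, [0], [0, 1])    # 0·(01)^ω
assert not buchi_accepts(delta, "s0", {"s1"}, [1], [0])   # 1·0^ω
```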
An automaton A is deterministic iff |I| = 1 and, for all s ∈ S and a ∈ Σ, |T(s, a)| = 1. An ω-string in an alphabet Σ is a function α : ω → Σ, where ω is the set of natural numbers. We often identify an ω-string with the infinite sequence α = α₀α₁α₂ . . . , where αᵢ = α(i). Let Σ* and Σ^ω denote the set of all finite strings and the set of all ω-strings over the alphabet Σ, respectively. We always assume that the elements of an alphabet Σ are linearly ordered. This order can be extended to the length-lexicographical order ≤llex on Σ*; here x ≤llex y iff |x| < |y| or |x| = |y| ∧ x ≤lex y, where ≤lex is the standard lexicographical order. Given an ω-automaton A = (S, Σ, I, T) and an ω-string α, a run of A on α is an ω-string r = s₀ . . . sₙsₙ₊₁ . . . ∈ S^ω such that s₀ ∈ I and for all n, sₙ₊₁ ∈ T(sₙ, αₙ). Note that if an ω-automaton A is deterministic, then for every α, there is a unique run of A on α. In this case we will use the notation St_A(α, k) to denote the state of A after it has read the first k symbols of α.
Definition 2. Let Inf(r) denote the infinity set of a run r, that is, Inf(r) = {s ∈ S : s appears infinitely often in r}. We define the following accepting conditions for the run r:
1) The Büchi condition is determined by a subset F ⊆ S. The run r is accepting iff Inf(r) ∩ F ≠ ∅.
2) The Muller condition is determined by a subset F ⊆ P(S). The run r is accepting iff Inf(r) ∈ F.
3) The Rabin condition is determined by Ω = {(L₁, R₁), . . . , (Lₕ, Rₕ)}, where all Lᵢ and Rᵢ are subsets of S. The run r is accepting iff there is an i such that 1 ≤ i ≤ h, Inf(r) ∩ Lᵢ = ∅ and Inf(r) ∩ Rᵢ ≠ ∅.
It can be shown that all these acceptance conditions are equivalent (see [11]). Therefore, we will say that an ω-automaton A accepts a string α iff there is a run of A on α that satisfies the chosen accepting condition defined above. Let L(A) denote the set of strings accepted by an automaton A. Furthermore, every ω-automaton is equivalent to a deterministic one with the Muller acceptance condition (again see [11]). Thus, if not explicitly stated otherwise, by an automaton we will always mean a deterministic ω-automaton with the Muller acceptance condition.
Definition 3 ([12])
1) A finite automaton is a tuple A = (S, Σ, I, T, F), where S, Σ, I and T are the same as in the definition of an ω-automaton, and F ⊆ S is the set of final states.
2) For a finite string w = a₀ . . . aₙ₋₁ ∈ Σ*, a run of A on w is a sequence s₀ . . . sₙ ∈ S* such that s₀ ∈ I and sᵢ₊₁ ∈ T(sᵢ, aᵢ) for all i ≤ n − 1. The run is accepting iff sₙ ∈ F. The string w = a₀ . . . aₙ₋₁ is accepted by A iff there is an accepting run of A on w.
Definition 4.
1) A convolution of k ω-strings α₁, . . . , αₖ ∈ Σ^ω is an ω-string ⊗(α₁, . . . , αₖ) in the alphabet Σᵏ defined as ⊗(α₁, . . . , αₖ)(n) = (α₁(n), . . . , αₖ(n)) for every n ∈ ω.
2) A convolution of k finite strings w₁, . . . , wₖ ∈ Σ* is a string ⊗(w₁, . . . , wₖ) of length l = max{|w₁|, . . . , |wₖ|} in the alphabet (Σ ∪ {#})ᵏ, where # is a new padding symbol, defined as ⊗(w₁, . . . , wₖ)(n) = (v₁(n), . . . , vₖ(n)) for every n < l, where for each i = 1, . . . , k and n < l,
vᵢ(n) = wᵢ(n) if n < |wᵢ|, and vᵢ(n) = # otherwise.
3) Correspondingly one defines the convolution of finite strings and ω-strings: one identifies each finite string σ with the ω-string σ#^ω and then forms the corresponding convolution of ω-strings.
4) A convolution of a k-ary relation R on finite or ω-strings is defined as ⊗R = {⊗(x₁, . . . , xₖ) : (x₁, . . . , xₖ) ∈ R}.
5) A relation R on finite or ω-strings is automatic iff its convolution ⊗R is recognizable by a finite or an ω-automaton, respectively.
For ease of notation, we often just write (x, y) instead of ⊗(x, y) and so on. It is well known that the automatic relations are closed under union, intersection, projection and complementation. In general, the following theorem holds, which we will often use in this paper.
Theorem 5 ([4,5]). If a relation R on ω-strings is definable from other automatic relations R₁, . . . , Rₖ by a first-order formula, then R itself is automatic.
Remark 6.
1) If we use additional parameters in a first-order definition of R, then the parameters must be ultimately periodic strings.
2) Furthermore, in a definition of R we can use first-order variables of two sorts, namely, one ranging over ω-strings and one ranging over finite strings. We can do this because every finite string v can be identified with its ω-expansion v#^ω, and the set of all ω-expansions of the finite strings in an alphabet Σ is automatic.
A class L is a collection of sets of finite strings over some alphabet Γ, i.e., L ⊆ P(Γ*). An indexing for a class L is an onto mapping f : I → L, where I is the set of indices. We will often denote the indexing as {L_α}_{α∈I}, where L_α = f(α). An indexing {L_α}_{α∈I} is automatic iff I is an automatic subset of Σ^ω for some alphabet Σ and the relation {(x, α) : x ∈ L_α} is automatic. A class is automatic iff it has an automatic indexing. If it is not stated otherwise, all indexings and all classes considered herein are assumed to be automatic.
Example 7. Here are some examples of automatic classes:
1) the class of all open intervals I = {q ∈ D : p < q < r} of dyadic rationals where the border points p and r can be any real numbers;
2) the class of such intervals where r − p is either 1 or 2 or 3;
3) the class of all sets of finite strings which are given as the prefixes of an infinite sequence;
4) the class of all sets of natural numbers in unary coding.
A text is an ω-string T of the form T = u₀, u₁, u₂, . . . , such that each uᵢ is either equal to the pause symbol # or belongs to Γ*, where Γ is some alphabet. We call uᵢ the ith input of the text. The content of a text T is the set content(T) = {uᵢ : uᵢ ≠ #}. If content(T) is equal to a set L ⊆ Γ*, then we say that T is a text for L.
Definition 8. Let Γ and Σ be alphabets for sets and indices, respectively. A learner is a Turing machine M that has the following:
1) two read-only tapes: one for an ω-string from Σ^ω representing an index and one for a text for a set L ⊆ Γ*;
2) one write-only output tape on which M writes a sequence of automata (in a suitable coding);
3) one read-write working tape.
Let Ind(M, α, T, s) and Txt(M, α, T, s) denote the number of symbols read in the index and text tapes by learner M up to step s when it processes an index α and a text T. Without loss of generality, we will assume that
lim_{s→∞} Ind(M, α, T, s) = lim_{s→∞} Txt(M, α, T, s) = ∞
for any α and T. By M(α, T, k) we denote the kth automaton output by learner M when processing an index α and a text T. Without loss of generality, for the learning criteria considered in this paper, we assume that M(α, T, k) is defined for all k.
Definition 9 (see [3,8,9,13]). Let a class L = {L_α}_{α∈I} (together with its indexing) and a learner M be given. We say that
1) M BC-learns L iff for any index α ∈ I and any text T with content(T) ∈ L, there exists n such that for every m ≥ n,
M(α, T, m) accepts α iff L_α = content(T).
2) M Ex-learns L iff for any index α ∈ I and any text T with content(T) ∈ L, there exists n such that for every m ≥ n, M(α, T, m) = M(α, T, n) and
M(α, T, m) accepts α iff L_α = content(T).
3) M FEx-learns L iff M BC-learns L and for any α ∈ I and any text T with content(T) ∈ L, the set {M(α, T, n) : n ∈ ω} is finite.
4) M FExₖ-learns L iff M BC-learns L and for any α ∈ I and any text T with content(T) ∈ L, there exists n such that |{M(α, T, m) : m ≥ n}| ≤ k.
5) M Part-learns L iff for any α ∈ I and any T with content(T) ∈ L, there exists a unique automaton A such that M outputs A infinitely often, and
A accepts α iff L_α = content(T).
Here the abbreviations BC, Ex, FEx and Part stand for 'behaviourally correct', 'explanatory', 'finite explanatory' and 'partial identification', respectively; 'finite explanatory learning' is also called 'vacillatory learning'. We will also use the notations BC, Ex, FEx, FExₖ and Part to denote the collections of classes (with corresponding indexings) that are BC-, Ex-, FEx-, FExₖ- and Part-learnable, respectively.
Definition 10. A learner is called blind if it does not see the tape which contains an index. The classes that are blind BC-, Ex-, etc. learnable are denoted as BlindBC, BlindEx, etc., respectively.
Definition 11 ([1]). We say that a class L satisfies Angluin's tell-tale condition iff for every L ∈ L there is a finite D_L ⊆ L such that for every L′ ∈ L, if D_L ⊆ L′ ⊆ L then L′ = L. Such a D_L is called a tell-tale set for L.
Fact 12 ([1]). If a class L is BC-learnable, then L satisfies Angluin's tell-tale condition.
The converse will also be shown to be true; hence for automatic classes one can equate "L is learnable" with "L satisfies Angluin's tell-tale condition". Note that the second and the third class given in Example 7 satisfy Angluin's tell-tale condition.
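For a small, explicitly given finite family, Angluin's condition can be checked by brute force; the automatic case instead needs automata constructions, so the following sketch (with a hypothetical encoding as Python sets) merely illustrates Definition 11:

```python
# Sketch: checking Angluin's tell-tale condition for a finite family of
# finite sets, directly following Definition 11.

from itertools import chain, combinations

def has_telltale(L, family):
    """Is there a finite D ⊆ L such that no L' in the family satisfies
    D ⊆ L' ⊊ L?"""
    elems = sorted(L)
    subsets = chain.from_iterable(combinations(elems, r)
                                  for r in range(len(elems) + 1))
    for D in map(set, subsets):
        if not any(D <= Lp < L for Lp in family):
            return True
    return False

def satisfies_attc(family):
    return all(has_telltale(L, family) for L in family)

assert satisfies_attc([{1}, {1, 2}])
```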
3 Vacillatory Learning
In the following it is shown that every learnable class can even be vacillatorily learned and that the corresponding FEx-learner uses, over all possible inputs, only a fixed number of automata.
Theorem 13. Let {L_α}_{α∈I} be a class that satisfies Angluin's tell-tale condition. Then there are finitely many automata A₁, . . . , A_c and an FEx-learner M for the class {L_α}_{α∈I} with the property that for any α ∈ I and any text T for a set from {L_α}_{α∈I}, the learner M oscillates only between some of the automata A₁, . . . , A_c on α and T.
Proof. Let M be a deterministic automaton recognizing the relation {(x, α) : x ∈ L_α}, and let N be a deterministic automaton recognizing { (x, α) : {y ∈ L_α : y ≤llex x} is a tell-tale for L_α }. Such an N exists since the relation is first-order definable from 'x ∈ L_α' and ≤llex by the formula:
N accepts (x, α) ⟺ ∀α′ ∈ I ( if ∀y ((y ∈ L_α & y ≤llex x) → y ∈ L_{α′}) & ∀y (y ∈ L_{α′} → y ∈ L_α), then ∀y (y ∈ L_α ↔ y ∈ L_{α′}) ).
For each α ∈ I, consider an equivalence relation ≡_{M,α} defined as
x ≡_{M,α} y ⟺ there is a t > max{|x|, |y|} such that St_M(⊗(x, α), t) = St_M(⊗(y, α), t).
An equivalence relation ≡N,α is defined in a similar way.
Note that the number of equivalence classes of ≡_{M,α} is bounded by the number of states of M, and for every x, y, if x ≡_{M,α} y then x ∈ L_α ↔ y ∈ L_α. Therefore, L_α is the union of finitely many equivalence classes of ≡_{M,α}. Let m and n be the number of states of M and N, respectively. Consider the set of all finite tables U = {U_{i,j} : 1 ≤ i ≤ m, 1 ≤ j ≤ n} of size m × n such that each U_{i,j} is either equal to a subset of {1, . . . , i} or to a special symbol Reject. With each such table U we will associate an automaton A as described below.
The algorithm for learning {L_α}_{α∈I} is now roughly as follows. On every step, the learner M reads a finite part of the input text and based on this information constructs a table U. After that M outputs the automaton associated with U.
First, we describe the construction of an automaton A for each table U. For every α ∈ I, let m(α) and n(α) be the numbers of equivalence classes of ≡_{M,α} and ≡_{N,α}, respectively. Also, let x₁, . . . , x_{m(α)} be the length-lexicographically least representatives of the equivalence classes of ≡_{M,α} such that x₁ <llex · · · <llex x_{m(α)}. The automaton A associated with U is defined so that
A accepts α ⟺ U_{m(α),n(α)} is a subset of {1, . . . , m(α)} such that L_α = {y : y ≡_{M,α} x_k for some k ∈ U_{m(α),n(α)}}.
Let EqSt_M(α, x, y, z) be the relation defined as
EqSt_M(α, x, y, z) ⟺ St_M(⊗(x, α), |z|) = St_M(⊗(y, α), |z|).
The relation EqSt_N(α, x, y, z) is defined similarly. Note that these relations are automatic. Instead of constructing A explicitly, we will show that the language which A needs to recognize is first-order definable from EqSt_M(α, x, y, z), EqSt_N(α, x, y, z) and the relations recognized by M and N. First, note that the equivalence relation x ≡_{M,α} y can be defined by the formula ∃z (|z| > max{|x|, |y|} and EqSt_M(α, x, y, z)). Similarly one can define x ≡_{N,α} y. The fact that ≡_{M,α} has exactly k many equivalence classes can be expressed by the formula:
ClNum_{M,k}(α) = ∃x₁ . . . ∃xₖ ( ⋀_{1≤i<j≤k} xᵢ ≢_{M,α} xⱼ & ∀y ⋁_{1≤i≤k} y ≡_{M,α} xᵢ ).
Again, ClNum N,k (α) expresses the same fact for ≡N,α . Finally, the fact that A accepts α can be expressed by the following first-order formula: ClNum M,i (α) & ClNum N,j (α) & ∃x1 . . . ∃xi (i,j) : Ui,j =Reject
x1
1≤k≤i
k∈Ui,j
∀y (y
xk ) .
We now describe the algorithm for learning the class {Lα}α∈I. We will use the notation x ≡M,α,s y as an abbreviation of "there is a t such that s ≥ t > max{|x|, |y|} and StM(⊗(x, α), t) = StM(⊗(y, α), t)." As before, let m and n be the numbers of states of the automata M and N, respectively.

At step s, M computes the ≤llex least representatives of the equivalence classes of ≡M,α,s and ≡N,α,s on the strings of length at most s. In other words, it computes x1, ..., xp and y1, ..., yq such that a) x1 is the empty string, and b) xk+1 is the ≤llex least x >llex xk such that |x| ≤ s and x ≢M,α,s xi for all i ≤ k. If such an x does not exist, then the process stops. The sequence y1, ..., yq is computed in a similar way.

Next, M constructs a table U of size m × n. For every i and j, the value of Ui,j is defined as follows. If i > p or j > q, then let Ui,j = Reject. Otherwise, let τs be the initial segment of the input text T consisting of the first s strings in the text T. Check whether the following two conditions are satisfied:

1) for every x, x′ ≤llex yj, if x ≡M,α,s x′, then x ∈ content(τs) iff x′ ∈ content(τs);
2) for every k ≤ i and every y, if y ∈ content(τs) and y ≡M,α,s xk, then xk ∈ content(τs).

If yes, then let Ui,j = {k : k ≤ i and xk ∈ content(τs)}. Otherwise, let Ui,j = Reject. After U is constructed, M outputs the automaton A associated with U as described above. As the number of different possible tables U is finite, the number of distinct automata output by M is finite.

Let M(α, T, s) be the automaton output by the learner M at step s when processing the index α and the text T. To prove that the algorithm is correct we need to show that for every α ∈ I and every text T such that content(T) ∈ {Lα}α∈I,

a) if content(T) = Lα then for almost all s, M(α, T, s) accepts α;
b) if content(T) ≠ Lα then for almost all s, M(α, T, s) rejects α.

Recall that m(α) and n(α) are the numbers of equivalence classes of ≡M,α and ≡N,α, respectively. Note that there is a step s0 after which the values x1, ..., xm(α) and y1, ..., yn(α) computed by M no longer change and are the ≤llex least representatives of the equivalence classes of ≡M,α and ≡N,α.
Suppose first that content(T) = Lα. Then there is a step s1 ≥ s0 such that for all s ≥ s1: 1) for every k ≤ m(α), xk ∈ content(τs) iff xk ∈ content(T); 2) for every x, x′ ≤llex yn(α), if x ≡M,α,s x′, then x ∈ content(τs) iff x′ ∈ content(τs); 3) for every k ≤ m(α) and every y, if y ∈ content(τs) and y ≡M,α,s xk, then xk ∈ content(τs). The last two conditions are satisfied since content(T) = Lα is the union of finitely many ≡M,α equivalence classes. Therefore, on every step s ≥ s1, the learner M constructs a table U such that Um(α),n(α) = {k : k ≤ m(α) and xk ∈ content(T)}. By our construction of the automaton A associated with U, A accepts α if Lα = {y : y ≡M,α xk for some xk ∈ content(T)}. But since content(T) = Lα, this condition is satisfied.

Now suppose that content(T) ≠ Lα. Note that for every s ≥ s0, the yn(α) computed by M at step s has the property that Dα = {x ∈ Lα : x ≤llex yn(α)} is a tell-tale set for Lα. This follows from the definition of the automaton N and the fact that yn(α) is the ≤llex largest among the representatives of the ≡N,α equivalence classes.

First, consider the case when Dα ⊈ content(T), that is, there is x ∈ Lα, x ≤llex yn(α), but x ∉ content(T). Let s1 ≥ s0 be such that x ≡M,α,s1 xk for some k ≤ m(α). Note that xk ≤llex x since xk is the minimal representative in its equivalence class. If for some s2 ≥ s1, xk ∈ content(τs2), then from this step on Um(α),n(α) will be equal to Reject, and M(α, T, s) will reject α for all s ≥ s2. If xk ∉ content(T), then for all s ≥ s1, M(α, T, s) will reject α, either because Um(α),n(α) = Reject at step s, or because k ∉ Um(α),n(α) while it should be in Um(α),n(α), since both x and xk are in Lα.

Now suppose that Dα ⊆ content(T). Since Dα is a tell-tale set for Lα and content(T) ≠ Lα, there is x ∈ content(T) \ Lα. Let s1 ≥ s0 be such that x ∈ content(τs1) and x ≡M,α,s1 xk for some k ≤ m(α). If xk ∉ content(T), then for every s ≥ s1, Um(α),n(α) = Reject and M(α, T, s) will reject α. If there is s2 ≥ s1 such that xk ∈ content(τs2), then for every s ≥ s2 either Um(α),n(α) = Reject or k ∈ Um(α),n(α). In both cases M(α, T, s) will reject α, since xk ∉ Lα.

Definition 14. 1) Let α ∈ {0, 1, ..., k}ω and β ∈ {1, ..., k}ω. The function fα,β is defined as follows:

fα,β(n) = α(m), if m = min{x ≥ n : α(x) ≠ 0} exists; fα,β(n) = lim supx→∞ β(x), if such an m does not exist.

Let Lα,β be the set of all nonempty finite prefixes of fα,β, that is, Lα,β = {fα,β(0) ... fα,β(n) : n ∈ ω}.

2) Define the class Lk as follows: Lk = {Lα,β : α ∈ {0, 1, ..., k}ω, β ∈ {1, ..., k}ω}.

Note that the class Lk is uncountable and automatic.

Theorem 15. For every k ≥ 2, the class Lk is in FExk \ FExk−1.
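To make Definition 14 concrete, the following Python sketch (an illustration added here, not from the original) computes the first few prefixes of fα,β. To keep everything finite, α is assumed to be zero beyond a known bound, and the value lim sup β(x) is supplied directly rather than computed.

```python
def f_prefixes(alpha, limsup_beta, support_bound, N):
    """First N nonempty prefixes of f_{alpha,beta} (Definition 14).

    alpha: function N -> {0,...,k}, assumed zero beyond support_bound,
    so m = min{x >= n : alpha(x) != 0} is decidable.
    limsup_beta: the value lim sup beta(x), given directly here
    (in the paper beta is an infinite sequence over {1,...,k}).
    """
    values = []
    for n in range(N):
        m = next((x for x in range(n, support_bound + 1) if alpha(x) != 0), None)
        values.append(alpha(m) if m is not None else limsup_beta)
    return ["".join(map(str, values[:i + 1])) for i in range(N)]

# alpha = 2,0,1,0,0,... and lim sup beta = 3 give f = 2,1,1,3,3,...
alpha = lambda x: {0: 2, 2: 1}.get(x, 0)
print(f_prefixes(alpha, limsup_beta=3, support_bound=2, N=5))
# ['2', '21', '211', '2113', '21133']
```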
Remark 16. The last result can be strengthened in the following sense: for every k ≥ 1 there is an indexing {Lβ}β∈I of the class L = { {α0α1α2 ... αn−1 : n ∈ ω} : α ∈ {1, 2}ω } such that {Lβ}β∈I is FExk+1-learnable but not FExk-learnable. That is, the class can be kept fixed and only the indexing has to be adjusted.
4 Explanatory Learning
The main result of this section is that for every learnable class, there is an indexing such that the class with this indexing is explanatorily learnable. Furthermore, one can observe that the learner, as above, on any text T for a language in the class and an index α, first might output automata which reject α, then automata which accept α, and at the end again automata which reject α; so, in short, the sequence is of the form "reject–accept–reject" (or a subsequence of this).

Theorem 17. If a class L = {Lα}α∈I satisfies Angluin's tell-tale condition, then there is an indexing for L such that L with this indexing is Ex-learnable.

Proof. Let M be a deterministic automaton recognizing {(x, α) : x ∈ Lα}, and let QM be its set of states. The set J of new indices for L will consist of convolutions ⊗(α, β, γ), where α ∈ I, β ∈ {0, 1}ω defines a tell-tale set for Lα, and γ ∈ (P(QM))ω keeps track of the states of M when it reads ⊗(x, α) for some finite strings x ∈ Lα. To simplify the notation we will write (α, β, γ) instead of ⊗(α, β, γ). Formally, J is defined as follows:

(α, β, γ) ∈ J ⇐⇒ α ∈ I, β = 0n1ω for the minimal n such that {x ∈ Lα : |x| < n} is a tell-tale set for Lα, and for every k, γ(k) = {q ∈ QM : ∃x ∈ Lα (|x| ≤ k and StM(⊗(x, α), k) = q)}.
We want to show that J is automatic. Again, it is enough to show that it is first-order definable from other automatic relations. We can rewrite the definition for β as

β ∈ 0∗1ω and ∀σ ∈ 0∗ [ (σ ⊆ β & σ0 ⊄ β) → {x ∈ Lα : |x| < |σ|} is a tell-tale set for Lα ].

The first-order definition of a tell-tale set is given at the beginning of the proof of Theorem 13. All other relations in this definition are clearly automatic. The definition for γ can be written as

⋀_{q∈QM} ∀σ ∈ 0∗ [ q ∈ γ(|σ|) ↔ ∃x ∈ Lα ( |x| ≤ |σ| & StM(⊗(x, α), |σ|) = q ) ].
For every q ∈ QM, there are automata Aq and Bq that recognize the relations {(σ, γ) : σ ∈ 0∗ & q ∈ γ(|σ|)} and {(σ, x, α) : σ ∈ 0∗ & StM(⊗(x, α), |σ|) = q}.
Therefore, J is first-order definable from automatic relations, and hence is itself automatic. We define a new indexing {Hα,β,γ}(α,β,γ)∈J for the class L as follows: Hα,β,γ = Lα. Clearly, this indexing is automatic since x ∈ Hα,β,γ ⇐⇒ x ∈ Lα and (α, β, γ) ∈ J.

We now describe a learner M that can Ex-learn the class L in the new indexing. Let A be an automaton that recognizes the set J, and let Z be an automaton that rejects all ω-strings. The learner M will output only the automata A and Z, in a sequence Z–A–Z (or a subsequence of this). In other words, M can start by outputting the automaton Z, then change its mind to A, and then again change its mind to Z, after which it will output Z forever.

When an index (α, β, γ) is given to the learner M, it always assumes that β and γ are correctly defined from α. Otherwise, it does not matter which automaton M outputs in the limit, since both A and Z will reject the index (α, β, γ). We now show that for every finite string x,

x ∈ Lα ⇐⇒ StM(⊗(x, α), |x|) ∈ γ(|x|),

provided that γ is correct. Indeed, if x ∈ Lα, then StM(⊗(x, α), |x|) ∈ γ(|x|) by the definition of γ. On the other hand, if StM(⊗(x, α), |x|) ∈ γ(|x|), then, again by the definition of γ, there is y ∈ Lα with |y| ≤ |x| such that StM(⊗(y, α), |x|) = StM(⊗(x, α), |x|). Therefore, after |x| many steps the run of M on ⊗(x, α) coincides with the run on ⊗(y, α). Hence M accepts ⊗(x, α), and x is in Lα.

At every step s, M reads the first s inputs x1, ..., xs from the input text. Then M outputs A if the following conditions hold:

– There exists n ≤ s such that 0n1 ⊆ β.
– For every i with xi ≠ #, xi belongs to Lα according to γ, i.e., StM(⊗(xi, α), |xi|) ∈ γ(|xi|).
– For every x with |x| < n, if x belongs to Lα according to γ, then x ∈ {x1, ..., xs}.

Otherwise, M outputs Z. This concludes step s. Note that M changes from Z to A or from A to Z at most once. Thus it always converges to one of these automata. If the index (α, β, γ) is not in J, then M always rejects it. If (α, β, γ) ∈ J, then for every x, we have that x ∈ Lα according to γ iff x is indeed in Lα. Moreover, the set Dn = {x : |x| < n and x ∈ Lα according to γ} is a tell-tale set for Lα, where n is such that β = 0n1ω.
Let T be the input text. If content(T) = Hα,β,γ, then there is a step s ≥ n such that Dn is contained in {x1, ..., xs}. Therefore, M will output only A from step s onward. If content(T) ≠ Hα,β,γ, then Dn ⊈ content(T) or content(T) ⊈ Hα,β,γ. In the first case, M will output Z on every step. In the second case, there are a step s and an xi ∈ {x1, ..., xs} such that xi ≠ # and xi is not in Lα according to γ. Therefore, M will output Z from step s onward. This proves the correctness of the algorithm.
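For illustration (an addition, not part of the original proof), the decision M makes at step s can be rendered as the following Python sketch; gamma_member is a black box standing for the automatic test StM(⊗(x, α), |x|) ∈ γ(|x|), and n is the position of the first 1 in β, if already visible.

```python
def output_at_step(n, gamma_member, data, strings_below_n):
    """One step of the Z-A-Z learner from the proof of Theorem 17 (sketch).

    n: length of the 0-block of beta = 0^n 1^w, or None if no prefix
       0^n 1 has been observed yet.
    gamma_member(x): whether x belongs to L_alpha "according to gamma".
    data: the inputs x_1,...,x_s seen so far, with pauses '#' removed.
    strings_below_n: the finitely many strings of length < n.
    Returns 'A' (accept the index) or 'Z' (reject it).
    """
    if n is None:
        return 'Z'
    if not all(gamma_member(x) for x in data):
        return 'Z'          # some observed datum lies outside L_alpha
    if any(gamma_member(x) and x not in data for x in strings_below_n):
        return 'Z'          # the tell-tale D_n has not fully appeared yet
    return 'A'
```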
5 Blind Learning
Blind learning is distinguished from ordinary learning in that the learner itself does not see the index; the learner therefore has to code all the necessary information into the automata, which then decide whether the index is correct or incorrect. In the case of behaviourally correct learning, this is done by coding more and more finite information, in such a way that almost all automata recognize an incorrect index and reject it (where the point from which on this is recognized depends on the index). In the case of explanatory learning, this is impossible, and hence one has to simulate a traditional learner (for countable classes) and to code its conjecture into the automaton, which then checks whether the index provided is equivalent to the one to which the traditional learner has converged; hence explanatorily learnable classes have to be countable.

Theorem 18. If a class L = {Lα}α∈I satisfies Angluin's tell-tale condition, then L is BlindBC-learnable.

Theorem 19. For every class L = {Lα}α∈I, the following are equivalent:
1) L is BlindEx-learnable.
2) L is BlindFEx-learnable.
3) L is at most countable and satisfies Angluin's tell-tale condition.

The following corollary summarizes the main results of the previous sections.

Corollary 20. For every automatic class L, the following are equivalent:
1) L satisfies Angluin's tell-tale condition.
2) L is BC-learnable.
3) L is BlindBC-learnable.
4) L is FEx-learnable.
5) L is Ex-learnable in a suitable indexing.
Proof. The implications 3) ⇒ 2) and 4) ⇒ 2) are trivial; 2) ⇒ 1) and 5) ⇒ 1) follow from Fact 12; 1) ⇒ 3) follows from Theorem 18; 1) ⇒ 4) follows from Theorem 13; and 1) ⇒ 5) follows from Theorem 17.
6 Partial Identification
Partial identification is, in the traditional setting of inductive inference, a learning criterion where the learner outputs on every text of an r.e. language infinitely many (not necessarily distinct) hypotheses such that exactly one hypothesis occurs infinitely often, and that hypothesis is correct. There is a recursive learner succeeding on all r.e. sets; hence this concept is omniscient in the traditional setting. Also in our model, every automatic class is partially identifiable.

Theorem 21. Every class with every given automatic indexing is Part-learnable.

Theorem 22. A class L = {Lα}α∈I is in BlindPart if and only if it is at most countable.
References

1. Angluin, D.: Inductive inference of formal languages from positive data. Information and Control 45(2), 117–135 (1980)
2. Bárány, V., Kaiser, Ł., Rubin, S.: Cardinality and counting quantifiers on omega-automatic structures. In: Proceedings of the 25th International Symposium on Theoretical Aspects of Computer Science, STACS 2008, pp. 385–396 (2008)
3. Bārzdiņš, J.: Two theorems on the limiting synthesis of functions. Theory of Algorithms and Programs 1, 82–88 (1974)
4. Blumensath, A., Grädel, E.: Automatic structures. In: 15th Annual IEEE Symposium on Logic in Computer Science, Santa Barbara, CA, pp. 51–62. IEEE Computer Society Press, Los Alamitos (2000)
5. Blumensath, A., Grädel, E.: Finite presentations of infinite structures: automata and interpretations. Theory of Computing Systems 37(6), 641–674 (2004)
6. Büchi, J.R.: Weak second-order arithmetic and finite automata. Zeitschrift für Mathematische Logik und Grundlagen der Mathematik 6, 66–92 (1960)
7. Büchi, J.R.: On a decision method in restricted second order arithmetic. In: Logic, Methodology and Philosophy of Science (Proceedings 1960 International Congress), pp. 1–11. Stanford University Press, Stanford (1962)
8. Case, J.: The power of vacillation in language learning. SIAM Journal on Computing 28(6), 1941–1969 (1999) (electronic)
9. Gold, E.M.: Language identification in the limit. Information and Control 10, 447–474 (1967)
10. Jain, S., Luo, Q., Stephan, F.: Learnability of automatic classes. Technical Report TRA1/09, School of Computing, National University of Singapore (2009)
11. Khoussainov, B., Nerode, A.: Automata theory and its applications. Birkhäuser Boston, Inc., Boston (2001)
12. Khoussainov, B., Nerode, A.: Automatic presentations of structures. In: Leivant, D. (ed.) LCC 1994. LNCS, vol. 960, pp. 367–392. Springer, Heidelberg (1995)
13. Osherson, D.N., Stob, M., Weinstein, S.: Systems that learn. An introduction to learning theory for cognitive and computer scientists. Bradford Book—MIT Press, Cambridge (1986)
14. Vardi, M.Y.: The Büchi complementation saga. In: Thomas, W., Weil, P. (eds.) STACS 2007. LNCS, vol. 4393, pp. 12–22. Springer, Heidelberg (2007)
Iterative Learning from Texts and Counterexamples Using Additional Information

Sanjay Jain¹⋆ and Efim Kinber²

¹ School of Computing, National University of Singapore, Singapore 117417, Republic of Singapore
[email protected]
² Department of Computer Science, Sacred Heart University, Fairfield, CT 06825-1000, U.S.A.
[email protected]

⋆ Supported in part by NUS grant number R252-000-308-112.
Abstract. A variant of iterative learning in the limit (cf. [LZ96]) is studied when a learner gets negative examples refuting conjectures containing data in excess of the target language and uses additional information of the following four types: a) memorizing up to n input elements seen so far; b) up to n feedback membership queries (testing if an item is a member of the input seen so far); c) the number of input elements seen so far; d) the maximal element of the input seen so far. We explore how additional information available to such learners (defined and studied in [JK07]) may help. In particular, we show that adding the maximal element or the number of elements seen so far helps such learners to infer any indexed class of languages class-preservingly (using a descriptive numbering defining the class) — as proved in [JK07], this is not possible without using additional information. We also study how, in the given context, different types of additional information fare against each other, and establish hierarchies of learners memorizing n + 1 versus n input elements seen and asking n + 1 versus n feedback membership queries.
1 Introduction
In this paper, we study some variants of learning in the limit from positive data and negative counterexamples to conjectures, with restricted access to input data. The general framework for study of learning in the limit was introduced in [Gol67]. In Gold’s original model, TxtEx, a learner is able to hold full input data seen so far in its long-term memory. However, this assumption is apparently too strong for modeling many learning and cognitive processes. Wiehagen in [Wie76] (see also [LZ96]) suggested a model for learning in the limit where the long-term memory of the learners is limited to what they can store in their conjectures. These learners are called iterative learners. This learning model, while strongly limiting long-term memory, still makes salient an important aspect of learnability in the limit: its incremental character. Some variants of iterative learning proved to be quite useful in the context of applied machine learning (for example, [LZ06]
applies the idea of iterative learning in the context of training Support Vector Machines). The iterative learning model has been used for the study of learnability from all positive examples (the corresponding formal model being denoted TxtIt) as well as from all positive and negative examples (denoted InfIt, see [LZ92]). One can argue that TxtIt may be too weak (a learner gets only positive data and can memorize only a very limited amount of input), whereas InfIt may be too strong: it is hard to conceive of a realistic learning process where the learner would be able to get access to full negative data. For example, children learning languages, while getting some negative data (in the form of corrections by parents or teachers), never get the full set of negative data.

In [JK08], the model TxtEx was extended to allow negative counterexamples to conjectures by a learner. This model is an example of active learning, where a learner communicates with a teacher (formally, an oracle), making queries and getting responses from the teacher. Active learning as a general framework for the study of learning processes was introduced by D. Angluin in [Ang88] and has been widely utilized in various studies of theoretical and applied models of learnability from examples since then. The model of iterative learning from full positive data and negative counterexamples, NCIt (NC here stands for "negative counterexample"), defined in [JK07], actually combines two approaches: Gold's framework (as the learner incrementally gets access to full positive data) and active learning (the learner, using subset queries, checks with the teacher whether each conjecture contains data in excess of the target language; if it does, the learner gets a negative counterexample showing an error). In linguistic terms, non-grammatical sentences in conjectures are thus being corrected. It should be noted that K. Popper [Pop68] regarded refutation of overgeneralizing conjectures as a vital part of learning and discovery processes.

In this paper, we extend the NCIt model to incorporate some additional features. Specifically, we consider the following two extensions of this model: in addition to subset queries (for conjectures), the learner

a) can ask up to n feedback queries: whether the queried element belongs to the input seen so far;
b) can store up to n input elements seen so far in its long-term memory (note that when the long-term memory used by a learner is n-bounded, if the memory is full then, in order to save a new input datum, the learner must sacrifice at least one element currently stored in the memory).

In the context of iterative learning of languages from positive data, these two types of "looking back" (in the context of feedback — using just one query per conjecture) were defined in [LZ96] (an earlier variant of memory-bounded learning can be found in [OSW86], and the idea of feedback learning goes back to [Wie76], where it was applied in the context of learning recursive functions in the limit). Both these concepts were reformalized (the former named n-feedback learning, and the latter named n-bounded memory learning) and thoroughly studied and discussed in [CJLZ99]. Motivation for these sorts of learnability models,
as discussed in [CJLZ99], comes from the rapidly developing field of knowledge discovery in databases, which includes, in particular, data mining, knowledge extraction, information discovery, data pattern processing, information harvesting, etc. Many of these tasks represent interactive incremental iterative processes (cf., e.g., [BA96] and [FPSS96]), working on huge data sets, finding regularities, and verifying them on small samples of the overall data. While the authors in [CJLZ99] explore the aforementioned formalizations of "looking back" at small (uniformly limited by some upper bound n) portions of input data in the context of regular iterative learning, we, in this research, allow the learner to test with the teacher whether conjectures contain data in excess of the target language. Our learners may also be allowed to memorize some "bounds" derived from the input data seen so far — in the form of the maximal element or the length of the input seen so far (the latter type of additional information for iterative learners was first considered in [CM08]).

In this research, we study how the aforementioned types of additional information can enhance the capabilities of the NCIt-learnability model in general, and how they, while helping a learner, fare against each other. Specifically, in Section 3, we discover some general effects of additional information on NCIt-learners. In particular, it was established in [JK07] that iterative learners getting access to full positive and full negative data are, surprisingly, weaker than NCIt-learners (note that the latter get negative data just in the form of a finite set of negative counterexamples — and only when these negative data are really necessary). We now show that when learners getting full positive and full negative data are allowed to memorize just one input datum or ask just one feedback membership query, they can sometimes learn more than any learner that gets access to full positive data, that can use negative counterexamples (to conjectures), and that can store all data seen so far in its long-term memory (see Theorem 7).

A known capability of NCIt-learners (established in [JK07]) is of special importance for many practical classes of languages: they can learn every indexed class of languages (that is, any class of recursive languages where it is decidable, for any index k and any element m, whether m is a member of the language with index k; examples of such classes are the classes of all regular languages and of all pattern languages [Ang80]). However, as established in [JK07], NCIt-learners sometimes cannot learn an indexed class class-preservingly (cf. [LZZ08]) — that is, they cannot learn by using, as hypothesis space, a descriptive numbering defining just the target class. It turns out that this limitation of NCIt-learners persists even if they can make n feedback membership queries (see Theorem 10). However, class-preserving learning becomes possible if an NCIt-learner gets access to either the maximal element or the number of elements seen so far (see Theorem 9).

In Section 4, we strengthen some results in [CJLZ99], establishing non-trivial hierarchies of NCIt-learners using n-feedback queries or n-bounded memory, based on the number n (see Theorems 11 and 12). Our examples of classes witnessing the hierarchies in question also show that additional information in the form of the maximal element seen so far and the number of elements seen
so far might not match the help that an NCIt-learner gets in the form of one extra feedback membership query, or one extra long-term memory cell.

In Section 5, we study tradeoffs between different types of additional information used by NCIt-learners (the main purpose of this study is to make salient the advantages of each type of additional information for the learners in question). In particular, similarly to corresponding results in [CJLZ99], we show that one memory cell used by an NCIt-learner can give more help than any n feedback membership queries (even in the presence of the maximal element and the number of elements seen so far), see Theorem 13, and, conversely, one feedback membership query can give more help than n-bounded memory (plus the maximal element and the number of elements seen so far), see Theorem 14. Interestingly, the maximal element seen so far alone can give more help than any number of feedback membership queries, see Theorem 17. Also, the number of elements and the maximal element seen so far combined can provide more help than any bounded number of memory cells or feedback membership queries, see Theorem 19. We also show how an extra memory cell can simulate the maximal element for NCIt-learners using n memory cells, see Proposition 15. We also obtain some partial results for other possible tradeoffs.
2 Preliminaries

2.1 Notation
For any unexplained recursion-theoretic notation we refer the reader to [Rog67]. The symbol N denotes the set of natural numbers, {0, 1, 2, 3, ...}. Languages are subsets of N. Symbols ∅, ⊆, ⊂, ⊇, and ⊃ respectively denote the empty set, subset, proper subset, superset, and proper superset. The cardinality of a set S is denoted by card(S). The maximum and minimum of a set are denoted by max(·) and min(·), respectively, where max(∅) = 0 and min(∅) = ∞. ∀∞ denotes 'for all but finitely many'. We let Dx denote the finite set with canonical index x [Rog67].

We let ⟨·, ·⟩ stand for an arbitrary, computable, 1–1 mapping from N × N onto N which is increasing in both its arguments [Rog67]. The pairing function can be extended to n-tuples in a natural way (for example, by using ⟨x, y, z⟩ = ⟨x, ⟨y, z⟩⟩). By Wi we denote the i-th r.e. language in some fixed acceptable programming system. We also say that i is a grammar for Wi. E denotes the set of all r.e. languages. L, with or without decorations, ranges over E. Classes L of languages, with or without decorations, range over subsets of E. χL denotes the characteristic function of L, and L̄ = N − L, that is, the complement of L. L is said to be an indexed family iff there exists an indexing L0, L1, ... of all and only the languages in L such that for some recursive function f, f(i, x) = χLi(x).
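One standard realization of such a pairing function (an illustrative aside; the paper only requires the stated properties) is Cantor's:

```python
def pair(x: int, y: int) -> int:
    """Cantor pairing: computable, 1-1, onto N, and increasing in both arguments."""
    return (x + y) * (x + y + 1) // 2 + y

def triple(x: int, y: int, z: int) -> int:
    """Extension to triples via <x, y, z> = <x, <y, z>>."""
    return pair(x, pair(y, z))

assert pair(0, 0) == 0 and pair(1, 0) == 1 and pair(0, 1) == 2
```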
2.2 Basic Definitions for Learning
A text T is a mapping from N into (N ∪ {#}). T (i) represents the (i + 1)-th element in the text. We let T , with or without decorations, range over texts.
content(T) denotes the set of natural numbers in the range of T. A text T is for a language L iff content(T) = L. Intuitively, T(i) denotes the element presented to the learner at time i, and #'s represent pauses in the presentation of data. T[n] denotes the initial sequence of T of length n, that is, T[n] = T(0)T(1) ... T(n−1). SEQ = {T[n] : n ∈ N, T is a text}. The empty sequence is denoted by λ. σ, τ, α range over SEQ. στ denotes the concatenation of σ and τ.

An informant [Gol67] I is a mapping from N to (N × {0, 1}) ∪ {#} such that for no x ∈ N both (x, 0) and (x, 1) are in the range of I. content(I) = the set of pairs in the range of I. We say that I is an informant for L iff content(I) = {(x, χL(x)) : x ∈ N}. Intuitively, informants give both all positive and all negative data for the language being learned. I[n] denotes the first n elements of the informant I.

An inductive inference machine (IIM) [Gol67] learning from texts is an algorithmic device which computes a (possibly partial) mapping from SEQ into N. One can similarly define learners from informants and other modes of input as considered below. We use the terms learner and learning machine as synonyms for inductive inference machine. We let M range over IIMs. M(T[n]) (or M(I[n])) is interpreted as the grammar (index for an accepting program) conjectured by the IIM M on the initial sequence T[n] (or I[n]). We say that M converges on T to i (written: M(T)↓ = i) iff (∀∞ n)[M(T[n]) = i]. Convergence on informants is similarly defined.

There are several criteria for an IIM to be successful on a language. In this paper we will be mainly concerned with explanatory (abbreviated Ex) criteria of learning.

Definition 1. [Gol67, CL82] (a) M TxtEx-identifies an r.e. language L (written: L ∈ TxtEx(M)) just in case for all texts T for L, M(T[n]) is defined for all n and (∃i : Wi = L)(∀∞ n)[M(T[n]) = i].
(b) M TxtEx-identifies a class L of r.e. languages (written: L ⊆ TxtEx(M)) just in case M TxtEx-identifies each language from L.
(c) TxtEx = {L ⊆ E : (∃M)[L ⊆ TxtEx(M)]}.

One can similarly define the learning criterion InfEx for learning from informants instead of texts. Next we consider iterative learning.

Definition 2. [Wie76, LZ96] (a) M is iterative iff there exists a partial recursive function F such that, for all T and n, M(T[n + 1]) = F(M(T[n]), T(n)). Here M(λ) is viewed as some predefined hypothesis.
(b) M TxtIt-identifies L iff M is iterative and M TxtEx-identifies L.
(c) TxtIt = {L : (∃M)[M TxtIt-identifies L]}.

InfIt can be defined similarly.

Intuitively, an iterative learner [Wie76, LZ96] is a learner whose hypothesis depends only on its last conjecture and the current input. That is, for some recursive
function F, for n ≥ 0, M(T[n+1]) = F(M(T[n]), T(n)). Here, note that M(T[0]) is predefined to be some constant value. We will often identify F above with M (that is, use M(p, x) = F(p, x) to describe M(T[n + 1]), where p = M(T[n]) and x = T(n)). This is for ease of notation. Context determines which interpretation of the learner M is meant.

For Ex models of learning (for learning from texts or informants, or their variants when learning from positive data and negative counterexamples, as defined below), one may assume without loss of generality that the learners are total, that is, defined on all initial segments of all texts (see, for example, [OSW86]). However, for iterative learning one cannot assume so. Thus, we explicitly require in the definition that iterative learners are defined on all inputs which are initial segments of texts (informants) for a language in the class. Note that, although it is not stated explicitly, an It-type learner might store some input data in its conjecture (the conjecture thus serving as a limited long-term memory). However, the amount of stored data cannot grow indefinitely, as the learner must stabilize on one (right) conjecture.

Learning with feedback and learning with bounded memory are generalizations of iterative learning where the learner has access to some past data using queries or via some finite amount of memory. Thus, in feedback learning an iterative learner is additionally allowed to query whether some elements were present in the past input data. In bounded memory learning, an iterative learner is able to memorize in its memory some (bounded) finite number of data items (in addition to its latest conjecture). Below are the formal definitions.

Definition 3. [CJLZ99] (a) Suppose M is a learning machine (for a class L of languages). We say that M is an m-feedback learner iff there exist partial recursive functions F and Q such that for all L ∈ L and all texts T for L,
(i) for all n: Q(M(T[n]), T(n))↓ ∈ N^m, and
(ii) if Q(M(T[n]), T(n)) = (x1, x2, ..., xm), then M(T[n+1]) = F(M(T[n]), T(n), y1, y2, ..., ym), where yi = 1 iff xi ∈ content(T[n]).
(b) We say that M TxtIt-identifies L with m-feedback iff M TxtEx-identifies L and M is an m-feedback learner. Such learners M are also called TxtIt-learners using m-feedback.

Definition 4. [LZ96] (a) Suppose M is a learning machine (for a class L of languages). We say that M is an m-memory-bounded learner iff there exist a (partial) recursive memory function mem (mapping finite sequences to finite sets) and partial recursive functions F, F′ such that for all L ∈ L and all texts T for L,
(i) for all n: mem(T[n])↓ ⊆ content(T[n]) and card(mem(T[n])) ≤ m;
(ii) for all n: mem(T[n + 1]) = F′(M(T[n]), T(n), mem(T[n]))↓, and mem(T[n + 1]) − mem(T[n]) ⊆ {T(n)};
(iii) M(T[n + 1]) = F(M(T[n]), T(n), mem(T[n]))↓.
(b) We say that M TxtIt-identifies L with m-memory iff M TxtEx-identifies L and M is an m-memory-bounded learner. Such learners M are also called TxtIt-learners using m-memory or m-memory bounded TxtIt-learners.
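The interaction prescribed by Definition 4 can be pictured as the following simulation loop (a sketch added for illustration; F and Fmem are whatever partial recursive functions the learner designer supplies, with Fmem playing the role of F′):

```python
def run_m_memory_bounded(F, Fmem, text, m, init_hyp=0):
    """Run an m-memory-bounded iterative learner on a finite text prefix.

    Both the new hypothesis and the new memory are computed from the
    previous hypothesis, the current datum, and the previous memory,
    as in Definition 4; the asserts check the memory constraints.
    """
    hyp, mem = init_hyp, frozenset()
    for datum in text:
        new_mem = Fmem(hyp, datum, mem)
        assert len(new_mem) <= m            # card(mem) <= m
        assert new_mem - mem <= {datum}     # only the current datum enters
        hyp, mem = F(hyp, datum, mem), new_mem
    return hyp
```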
In both the above definitions, M(T[0]) is some fixed initial hypothesis. Again, we often identify the learner M with the function F (along with identifying mem with F′) as defined above, and the context determines which interpretation of the learner M is meant. One can similarly define feedback and memory-bounded learning for learning from informants.

Besides the above models of learning, we sometimes allow the learner access to the maximal element in the input seen so far, or to the number of elements in the input seen so far, as an additional input. In the sequel, we will typically refer to the "maximal element seen so far" and the "number of elements seen so far" as simply the "maximal element" and, respectively, the "number of elements".
2.3 Learning with Negative Counterexamples
In this section we consider our models of learning from full positive data and negative counterexamples, as given by [JK08]. Intuitively, for learning with negative counterexamples, we may consider the learner being provided a text, one element at a time, along with a negative counterexample to the latest conjecture, if any. (One may view this negative counterexample as a response of the teacher to the subset query which tests whether the language generated by the conjecture is a subset of the target language.) One may model the list of negative counterexamples as a second text T′ being provided to the learner. Thus the IIMs get as input two texts, one for positive data and the other for negative counterexamples. We say that M(T, T′) converges to a grammar i iff (∀∞ n)[M(T[n], T′[n]) = i].

First, we define the model of learning from positive data and negative counterexamples. NC in the definition below stands for negative counterexample.

Definition 5. [JK08] (a) M NCEx-identifies a language L (written: L ∈ NCEx(M)) iff for all texts T for L and for all T′ satisfying the condition

(T′(n) ∈ Sn, if Sn ≠ ∅) and (T′(n) = #, if Sn = ∅), where Sn = L̄ ∩ WM(T[n],T′[n]),

M(T, T′) converges to a grammar i such that Wi = L.
(b) M NCEx-identifies a class L of languages (written: L ⊆ NCEx(M)) iff M NCEx-identifies each language in the class.
(c) NCEx = {L : (∃M)[L ⊆ NCEx(M)]}.

For ease of notation, we sometimes write M(T[n], T′[n]) also as M(T[n]), where we separately describe how the counterexamples T′(n) are presented to the conjecture of M on input T[n]. One can similarly define NCIt-learning, where the learner's output depends only on the previous conjecture, the latest positive datum, and the counterexample provided.
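The teacher's side of Definition 5 can be sketched as follows (an illustration, not the authors' construction; for runnability the conjectured languages and the target are finite Python sets rather than r.e. sets, and W must map the initial hypothesis to the empty set):

```python
def nc_session(learner_step, text, target, W, init_hyp=None):
    """Feed a learner positive data plus negative counterexamples.

    At each step the teacher computes S_n = complement(target) & W(hyp),
    i.e., the excess of the current conjecture, answers with some member
    of S_n (here: the least one) or '#' if S_n is empty, and the learner
    updates on the new datum together with that answer.
    """
    hyp = init_hyp
    for datum in text:
        excess = W(hyp) - target          # S_n of Definition 5
        cex = min(excess) if excess else '#'
        hyp = learner_step(hyp, datum, cex)
    return hyp
```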
Definition 6. [JK07] (a) M is iterative (for learning from positive data and negative counterexamples) iff there exists a partial recursive function F such that, for all T, T′ and n, M(T[n+1], T′[n+1]) = F(M(T[n], T′[n]), T(n), T′(n)). Here M(λ, λ) is some predefined constant.
(b) M NCIt-identifies L iff M is iterative and M NCEx-identifies L.
(c) NCIt = {L : (∃M)[M NCIt-identifies L]}.

We will often identify F above with M (that is, use M(p, x, y) = F(p, x, y) to describe M(T[n + 1], T′[n + 1]), where p = M(T[n], T′[n]), x = T(n), and y = T′(n)). This is for ease of notation. One should also note that the NCIt model is equivalent to allowing finitely many subset queries (with counterexamples for the answer "no") in iterative learning.

One can extend the above definition to NCIt-learning with m-feedback or m-memory, by allowing the learner M up to m queries about whether some element x has appeared in the previous text, or by allowing the learner M to remember up to m elements of the past data. The resulting criteria are called NCIt-learning with m-feedback and NCIt-learning with m-memory, respectively. The resulting learners are called m-feedback NCIt-learners (or NCIt-learners using m-feedback) and m-memory bounded NCIt-learners (or NCIt-learners using m-memory), respectively. It follows from the definitions that NCIt-learning is contained in NCIt-learning using m-feedback and in NCIt-learning using m-memory, which, in turn, are contained in NCEx.
3 Some General Effects of Additional Information on NCIt-learning
In this section, we look at some known capabilities of NCIt-learners and explore whether they hold when a learner has access to additional information. It was shown in [JK07] that the capabilities of NCIt-learners exceed those of InfIt-learners. In this section, we show that if an InfIt-learner can store just one element seen so far, or can use just one feedback query, then it can sometimes learn more than any NCEx-learner (which can memorize the whole input seen so far!). However, total InfIt-learners having access to the maximal element can still be simulated by NCIt-learners having access to the maximal element.

An important result established in [JK07] is that NCIt-learners can infer any indexed class of recursive languages. However, it is also shown in [JK07] that, surprisingly, such NCIt-learners cannot learn indexed classes class-preservingly (cf. [LZZ08]), that is, using a numbering of languages containing exactly the target class (and no other languages). Still, class-preserving learnability is important, as any natural hypothesis space for an indexed class is class-preserving. We now show that NCIt-learners can learn indexed classes class-preservingly if they have access to the maximal element or to the number of elements seen so far. However, adding the capability of using n feedback queries might not be enough to help an NCIt-learner infer an indexed class class-preservingly.
3.1 Informants versus Negative Counterexamples
First we show how storing just one element seen so far, or using one feedback query, can make an InfIt-learner stronger than any learner storing the whole input seen so far but getting only positive data and negative counterexamples.

Theorem 7. There exists a class which can be learnt by a 1-memory bounded (or 1-feedback) InfIt-learner, but which cannot be learnt by an NCEx-learner.

Let A be a semi-recursive, nonrecursive r.e. set such that for every x ∈ N, either both 2x and 2x + 1 are in A or both 2x and 2x + 1 are not in A. Let L = {A ∪ {y} : y ∈ N}. We leave it to the reader to verify that the class L can be iteratively learnt from an informant using 1-feedback or 1-bounded memory, and cannot be NCEx-learnt.

Still, as the next theorem demonstrates, total InfIt-learners (that is, the ones that are defined on all, even, possibly, non-valid inputs — that is, data which does not represent a possible previous conjecture, a new input element, and the maximal element possible in a valid learning process for a language in the class being learnt) can be simulated by NCIt-learners if both have access to the maximal element. For learning from informants, the maximal element present in the input is the maximal y such that (y, 0) or (y, 1) is present in the input given so far.

Theorem 8. Any class which is InfIt-learnable using the maximal element by a total learner is also NCIt-learnable using the maximal element.
3.2 Indexed Families
Unlike the case of NCIt-learnability (without access to additional information), class-preserving learnability of indexed classes can be achieved if an NCIt-learner has access to the maximal element or to the number of elements seen.

Theorem 9. (a) Every indexed family can be NCIt-identified (using a class-preserving hypothesis space) given the maximal element seen so far.
(b) Every indexed family can be NCIt-identified (using a class-preserving hypothesis space) given the number of elements seen so far.

Proof. (a) Suppose L is an indexed family, and L0, L1, ... is a listing of it such that "x ∈ Li" can be effectively decided in x and i. Let Li[m] denote {x ∈ Li : x ≤ m}. The conjectures of the learner will be of the form p(j, S, X), where p(j, S, X) is a grammar for Lj, and S, X are finite sets with some properties. Suppose T is an input text for a language L, where T(n) = xn. Inductively, if p(jn, Sn, Xn) is output after T[n] has been seen, then the following invariants will hold.

(A) For each j ∈ Sn, Lj ⊆ L, and Xn ⊆ L.
(B) content(T[n]) ⊆ Xn ∪ ⋃_{j∈Sn} Lj.
(C) For all j < jn, Lj ≠ L.
(D) If jn ∉ Sn, then either n = 0 or jn = jn−1 + 1.
(E) Xn ⊆ Xn+1, Sn ⊆ Sn+1, jn ≤ jn+1.
Initially, M(λ) = (0, ∅, ∅). On the input p(jn, Sn, Xn), the new element xn, the counterexample yn, and the maximal element m seen so far, the learner does the following:

(i) If yn = #, then Sn+1 = Sn ∪ {jn}; otherwise Sn+1 = Sn.
(ii) If (Xn ∪ {xn} ∪ ⋃_{j∈Sn} Lj[m]) − {#} ⊆ Ljn and yn = #, then jn+1 = jn and Xn+1 = Xn. Otherwise, jn+1 = jn + 1 and Xn+1 = (Xn ∪ {xn}) − {#}.

It is easy to verify that the invariants are satisfied. Furthermore, jn never goes beyond the minimal grammar for L (see invariant (C)). Thus, the sequence of the jn converges, and Sn and Xn converge as well (as Xn+1 ≠ Xn implies jn+1 ≠ jn, and Sn ⊆ {j : j ≤ jn}, and using invariants (D) and (E)). Moreover, the last conjecture is correct by (A) and (B), and using (Xn ∪ {xn} ∪ ⋃_{j∈Sn} Lj[m]) − {#} ⊆ Ljn from clause (ii) (as there is no further mind change).

(b) The only change is in (ii) above, which is replaced by the following (m below denotes the number of elements seen so far by the learner):

(ii) If the first m elements of (Xn ∪ {xn} ∪ ⋃_{j∈Sn} Lj) − {#} are included in Ljn and yn = #, then jn+1 = jn and Xn+1 = Xn. Otherwise, jn+1 = jn + 1 and Xn+1 = (Xn ∪ {xn}) − {#}.

The rest of the proof is similar to part (a), and we omit the details.

Still, any n feedback queries might not help to achieve class-preserving learnability of indexed classes by NCIt-learners.

Theorem 10. There exists an indexed family which cannot be learnt by an NCIt-learner with n-feedback using a class-preserving hypothesis space.
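For concreteness, the update step from the proof of Theorem 9(a) can be transcribed as the following sketch (an addition: the indexed family is given as a list of finite Python sets, standing in for decidable membership; None encodes a pause # and an absent counterexample):

```python
def update(j, S, X, x, cex, m, L):
    """One step of the learner of Theorem 9(a); returns the data (j, S, X)
    carried by the next conjecture p(j, S, X). m is the maximal element
    seen so far, supplied as additional information."""
    pool = X | ({x} if x is not None else set())
    pool |= {y for i in S for y in L[i] if y <= m}   # union of the L_i[m]
    if cex is None:                                  # step (i): L_j not refuted
        S = S | {j}
    if cex is None and pool <= L[j]:                 # step (ii): keep j and X
        return j, S, X
    return j + 1, S, X | ({x} if x is not None else set())
```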
4 Hierarchy of n-Feedback and n-Memory Learners
In this section, we show that, in the context of NCIt-learnability, n + 1 stored input elements provide more capability than n stored input elements, and n + 1 feedback queries provide more capability than n feedback queries. Note that, on the negative sides of both results, neither NCIt-learners storing just up to n elements seen, nor NCIt-learners using just up to n feedback queries, can be helped even if they have access to the maximal element and the number of elements seen so far. On the other hand, the learners witnessing the positive sides of both results do not need access to negative counterexamples (refuting conjectures containing data in excess of the target language).

Theorem 11. Fix n ∈ N. There exists a class L such that
(a) L can be iteratively learnt by an (n + 1)-feedback learner;
(b) L cannot be NCIt-learnt using n feedback queries, even if the maximal element and the number of elements in the input seen so far are given to the learner as additional information.

Theorem 12. Let n ∈ N. There exists a class L such that
(a) L can be iteratively learnt using (n + 1)-bounded memory;
(b) L cannot be learnt by an NCIt-learner using n-memory, even if the learner is given the number of elements and the maximal element seen so far as additional information.

Proof. (sketch) Let

L1 = {L : (∃e)[∅ ⊂ L ⊆ {⟨e, j, x⟩ : j, x ∈ N} and We = L and, for all x, card({j : ⟨e, j, x⟩ ∈ L, j ≥ 1}) ≤ n + 1 or ( card({j : ⟨e, j, x⟩ ∈ L, j ≥ 1}) = n + 2 and Σ_{⟨e,j,x⟩∈L} j is a prime number )]}.

Let L2 = {L : (∃e, x)[We ∈ L1, x > max({x′ : ⟨e, j, x′⟩ ∈ We, j ≥ 1}), card({j : ⟨e, j, x⟩ ∈ L, j ≥ 1}) = n + 2, Σ_{⟨e,j,x⟩∈L} j is not a prime number, and L = We ∪ {⟨e, j, x⟩ : j ≥ 1, ⟨e, j, x⟩ ∈ L}]}.

Let L = L1 ∪ L2. It can be shown that L can be iteratively learnt using (n + 1)-bounded memory.

For the diagonalization against a learner M, we use Kleene's recursion theorem [Rog67] to construct a set We in stages s = 0, 1, 2, ..., along with initial segments σs and a counterexample function fs. Let Wes denote the part of We enumerated before stage s, Es = range(fs), and let xs denote the least number such that Wes ∪ Es ⊆ {⟨e, j, x⟩ : x < xs}. Initially, We0 contains ⟨e, 0, 0⟩, σ0 is a sequence with content {⟨e, 0, 0⟩}, and f0(i) = # for all i.

In stage s, the algorithm updates the above parameters as follows. Define ws and w′s to be large enough numbers such that (ws + |σs| + 2) < (ws choose n+1), and for all distinct c, c′ ≤ (n + 1) · 2ws, there exists a p with 2ws < p ≤ w′s such that c + p is a prime but c′ + p is not a prime. Let ms > xs be such that ⟨e, 0, ms⟩ > max(content(σs) ∪ {⟨e, j, xs⟩ : 1 ≤ j ≤ w′s}). Let τs = σs ⟨e, 0, ms⟩, and enumerate ⟨e, 0, ms⟩ into We. Dovetail the following two searches:

(a) search for an initial segment σ of τs such that fs(M(σ)) = # but WM(σ) − content(τs) ≠ ∅;
(b) search for a τ such that content(τ) − content(τs) ⊆ {⟨e, j, xs⟩ : 1 ≤ j ≤ w′s}, M(τ) ≠ M(τs), and either (i) card(content(τ) − content(τs)) ≤ n + 1, or (ii) card(content(τ) − content(τs)) = n + 2 and Σ_{⟨e,j,x⟩∈content(τ)−content(τs)} j is a prime number.

Here we assume that search (a) has some priority in the sense that if one can find such a σ within s steps, then (a) succeeds first with the shortest such σ. In case (a) succeeds first, we let σs+1 = τs, Wes+1 = content(τs), and fs+1(M(σ)) = the element found in WM(σ) − content(τs) (the rest of fs+1 is the same as fs). In case (b) succeeds first, we let σs+1 = τ, Wes+1 = content(σs+1), and fs+1 = fs.

Now one can show that if there are infinitely many stages, then M does not converge on ⋃s σs. On the other hand, if there are only finitely many stages, then one can show that, for some appropriate distinct S, S′ ⊆ {2i : 1 ≤ i ≤ ws} and a corresponding p, αS, αS′ with content(αX) = {⟨e, j, xs⟩ : j ∈ X} (for X = S or S′), one has that M(τs αS ⟨e, p, xs⟩∞) = M(τs αS′ ⟨e, p, xs⟩∞), though τs αS ⟨e, p, xs⟩∞ and τs αS′ ⟨e, p, xs⟩∞ are texts for different languages in L. We omit the details.
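The number-theoretic separation used in the choice of w′s can be checked empirically by a small search (an added sketch; the bounds in the example call are illustrative):

```python
def is_prime(n):
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True

def separating_p(c1, c2, lo, hi):
    """Find p with lo < p <= hi such that c1 + p is prime but c2 + p is not,
    as required for distinct candidate sums c1, c2 in the proof sketch."""
    for p in range(lo + 1, hi + 1):
        if is_prime(c1 + p) and not is_prime(c2 + p):
            return p
    return None

print(separating_p(4, 6, 8, 200))   # 9: 4+9 = 13 is prime, 6+9 = 15 is not
```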
5 Advantages of Different Types of Additional Information over Other Types
In this section we study tradeoffs between different types of additional information in the context of NCIt-learnability.

5.1 Comparison of Feedback and Memory Bounded Learning
The results of this subsection significantly strengthen corresponding results given in [CJLZ99]. Namely, they demonstrate that, in the context of NCIt-learnability, just one stored input element can provide more than any n feedback queries (even if, in addition, the learner has access to the maximal element and the number of elements seen so far), and, conversely, one feedback query can do more than any n stored input elements (plus, additionally, the maximal element and the number of elements seen so far). Moreover, the iterative learners witnessing the positive sides of these results do not use negative counterexamples to conjectures containing extra elements.

Theorem 13. There exists an L which can be iteratively learnt by a 1-memory bounded learner, but which cannot be NCIt-learnt using n-feedback (even if the learner is given the maximal element and the number of elements in the input so far as additional information).

Theorem 14. There exists an L which can be iteratively learnt by a 1-feedback learner, but which cannot be NCIt-learnt by an n-memory bounded learner (even if the learner is given the maximal element and the number of elements in the input so far as additional information).
5.2 Advantages of Using Maximal Element/Number of Elements
The results of this subsection demonstrate various advantages that NCIt-learners can get while using the maximal element and/or the number of elements as additional information. The following proposition works if the memory, instead of being a set, is allowed to be a multiset: when updating the memory, if a new input element is greater than the current maximal one, the learner must replace the old maximum by the new one; however, the learner may also decide to store a separate copy of the new element (for reasons different from it being maximal), so that this copy would not be sacrificed when a new, greater element appeared. It is open at present whether this proposition holds if memory is just a set, as in the current paper.

Proposition 15. Any n-bounded memory learner with the maximal element in the input as additional information can be simulated by an (n+1)-bounded memory learner that uses the extra memory cell for the maximal element seen, as long as the memory of the learner is considered as a multiset rather than just a set.
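The simulation behind Proposition 15 is simple enough to sketch (an added illustration; F is the simulated learner's step function, expecting the maximal element as a fourth argument, and the memory is a list used as a multiset whose last slot is reserved for the running maximum):

```python
def step_reserving_max(F, hyp, datum, mem):
    """One step of the simulating (n+1)-bounded learner.

    mem[:-1] are the n cells of the simulated learner; mem[-1] holds the
    maximal element seen so far, updated with the current datum (the only
    element a step may add), so F never needs the extra input.
    """
    cells, max_seen = mem[:-1], mem[-1]
    max_seen = max(max_seen, datum)
    new_hyp, new_cells = F(hyp, datum, cells, max_seen)
    return new_hyp, new_cells + [max_seen]
```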
Our next result shows that adding access to the maximal element increases the learning capability of NCIt-learners storing up to n input elements seen so far. Moreover, a learner witnessing the positive side of the result does not need access to negative counterexamples refuting conjectures containing data in excess of the language to be learned.

Theorem 16. There exists a class L which can be iteratively learnt by an n-bounded memory learner with the maximal element as additional input, but cannot be NCIt-learnt by an n-bounded memory learner.

Our next two results demonstrate that an NCIt-learner having access to just the maximal element or the number of elements seen so far can sometimes do more than any NCIt-learner using up to n feedback queries. First, as the next theorem demonstrates, NCIt-learners (or even iterative learners — not using negative counterexamples to conjectures) using the maximal element as additional information can sometimes learn more than NCIt-learners using n feedback queries and getting the number of elements as additional information. However, we were not able to achieve a result of similar strength while faring the number of elements seen so far against n feedback queries and the maximal element as additional information. Whether this is possible remains open.

Theorem 17. There exists a class L which can be iteratively learnt when the learner is provided the maximal element in the input so far, but the class L cannot be NCIt-learnt using n-feedback, for any n, even if the learner is given the number of elements in the input as additional information.

Theorem 18. There exists an L which can be NCIt-learnt using the number of elements in the input as additional information, but, for all n, L cannot be NCIt-learnt using n-feedback.

Note that, obviously, the maximal element can always be memorized by a learner and, thus, cannot add more to the learning power of iterative learners than even one memory cell for storing input elements. Therefore, we explore whether the number of elements seen so far can give an NCIt-learner more advantages than n memorized input elements. We were able to achieve only a partial solution — showing that the number of elements and the maximal element (or one memory cell) together can provide more power to NCIt-learners than n memorized input elements.

Theorem 19. There exists a class L such that L can be NCIt-learnt using 1-memory (or the maximal element) and the number of elements, but cannot be learnt by an n-feedback or n-memory bounded learner in the NCIt manner, even if it is given the maximal element.

Can the maximal element give more power to NCIt-learners than the number of elements seen so far? The answer to this question is positive — even if the learners using the maximal element are just iterative (not using negative counterexamples to conjectures): it immediately follows from Theorem 17. However, we do not
know whether the number of elements can give more in the context of NCIt-learnability than the maximal element. We have a partial solution to the above problem when one considers iterative learners rather than NCIt-learners.

Theorem 20. (a) Suppose L can be NCIt-identified using the number of elements, where the learner converges on all inputs (here the text input would be from the target class, but the number of elements may sometimes not be valid — we still expect the learner to converge). Then L can be NCIt-identified using access to the maximal element.
(b) There exists an L such that
(i) L can be iteratively learnt when given the number of elements in the input seen so far as additional information (such a learner, however, may not be total);
(ii) for all n, L cannot be iteratively learnt by an n-feedback learner even if it gets the maximal element as additional information;
(iii) for all n, L cannot be iteratively learnt by an n-memory bounded learner.
6 Conclusions
As we have shown, additional information of the types studied in this paper can add interesting new capabilities to iterative learners getting negative examples to conjectures containing data in excess of the target language. Some problems related to comparisons of the help provided by additional information remain open (they are mentioned in Section 5), and solving these problems can offer new (and, possibly, unexpected) insight into the advantages of using additional information of certain types for the learners in question.

Similarly to [JK07], one might also consider different types of negative examples (refuting conjectures containing extra elements) given to iterative learners, and explore how these different types of negative examples may interplay with different types of additional information. Yet another interesting area of research is studying iterative learnability with counterexamples and additional information for specific indexed classes of languages (for example, regular languages or patterns) — as we have shown, all such classes are learnable class-preservingly using the maximal element or the number of elements as additional information, and, therefore, one can now study if and when learnability of such classes may be efficient.

A general open problem for iterative learners of any type using additional (bounded) memory is whether a multiset-type memory (where a learner can store the same input item several times; for example, the learner may decide to store, say, 10 copies of the next input element) can have an advantage over a set-type memory (where every item is stored just once). We have not been able to find an answer to this very interesting problem.

Acknowledgments. The authors are grateful to the anonymous referees of ALT 2009 for many useful remarks and suggestions. We specially thank a referee for a simpler proof of Theorem 7.
References

[Ang80] Angluin, D.: Finding patterns common to a set of strings. Journal of Computer and System Sciences 21, 46–62 (1980)
[Ang88] Angluin, D.: Queries and concept learning. Machine Learning 2, 319–342 (1988)
[BA96] Brachman, R., Anand, T.: The process of knowledge discovery in databases: A human centered approach. In: Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P., Uthurusamy, R. (eds.) Advances in Knowledge Discovery and Data Mining, pp. 37–58. AAAI Press, Menlo Park (1996)
[CJLZ99] Case, J., Jain, S., Lange, S., Zeugmann, T.: Incremental concept learning for bounded data mining. Information and Computation 152(1), 74–110 (1999)
[CL82] Case, J., Lynes, C.: Machine inductive inference and language identification. In: Nielsen, M., Schmidt, E.M. (eds.) ICALP 1982. LNCS, vol. 140, pp. 107–115. Springer, Heidelberg (1982)
[CM08] Case, J., Moelius, S.: U-shaped, iterative, and iterative-with-counter learning. Machine Learning 72, 63–88 (2008)
[FPSS96] Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P.: From data mining to knowledge discovery. In: Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P., Uthurusamy, R. (eds.) Advances in Knowledge Discovery and Data Mining, pp. 1–34. AAAI Press, Menlo Park (1996)
[Gol67] Gold, E.M.: Language identification in the limit. Information and Control 10, 447–474 (1967)
[JK07] Jain, S., Kinber, E.: Iterative learning from positive data and negative counterexamples. Information and Computation 205(12), 1777–1805 (2007)
[JK08] Jain, S., Kinber, E.: Learning languages from positive data and negative counterexamples. Journal of Computer and System Sciences 74(4), 431–456 (2008); Special Issue: Carl Smith memorial issue
[LZ92] Lange, S., Zeugmann, T.: Types of monotonic language learning and their characterization. In: Proceedings of the Fifth Annual Workshop on Computational Learning Theory, pp. 377–390. ACM Press, New York (1992)
[LZ96] Lange, S., Zeugmann, T.: Incremental learning from positive data. Journal of Computer and System Sciences 53, 88–103 (1996)
[LZ06] Li, Y., Zhang, W.: Simplify support vector machines by iterative learning. Neural Information Processing – Letters and Reviews 10, 11–17 (2006)
[LZZ08] Lange, S., Zeugmann, T., Zilles, S.: Learning indexed families of recursive languages from positive data: A survey. Theoretical Computer Science 397, 194–232 (2008)
[OSW86] Osherson, D., Stob, M., Weinstein, S.: Systems that Learn: An Introduction to Learning Theory for Cognitive and Computer Scientists. MIT Press, Cambridge (1986)
[Pop68] Popper, K.: The Logic of Scientific Discovery, 2nd edn. Harper Torch Books, New York (1968)
[Rog67] Rogers, H.: Theory of Recursive Functions and Effective Computability. McGraw-Hill, New York (1967); Reprinted by MIT Press (1987)
[Wie76] Wiehagen, R.: Limes-Erkennung rekursiver Funktionen durch spezielle Strategien. Journal of Information Processing and Cybernetics (EIK) 12, 93–99 (1976)
Incremental Learning with Ordinal Bounded Example Memory
Lorenzo Carlucci
Department of Computer Science, University of Rome “La Sapienza”, Via Salaria 113, 00198 Roma, Italy
[email protected]
Abstract. A Bounded Example Memory learner is a learner that operates incrementally and maintains a memory of finitely many data items. The paradigm is well-studied and known to coincide with set-driven learning. A hierarchy of stronger and stronger learning criteria is obtained when one considers, for each k ∈ N, iterative learners that can maintain a memory of at most k previously processed data items. We report on recent investigations of extensions of the Bounded Example Memory model where a constructive ordinal notation is used to bound the number of times the learner can ask for proper global memory extensions.
1 Introduction
In many learning contexts a learner is confronted with the task of inductively forming hypotheses while being presented with an incoming stream of data. In such contexts the learning process can be said to be successful if, eventually, the hypotheses that the learner forms provide a correct description of the observed stream of data. Each single step of the learning process in this scenario involves an observed data item and the formation of a new hypothesis. It is very reasonable to assume that a real-world learner - be it artificial or human - has memory limitations. A learner with memory limitations is a learner that is unable to store complete information about the previous stages of the learning process. Each stage of the learning process is completely described by the flow of data seen so far and the sequence of the learner's hypotheses so far. The action of a learner with memory limitations, at each step of the learning process, is completely determined by a limited portion of the previous stages of the learning process. Let us call intensional memory the learner's memory of its own previously issued hypotheses. Let us call extensional memory the learner's memory of previously observed data items. In the context of Gold's formal theory of language learning [6], models with restrictions on intensional and on extensional memories have been studied. In [9] the paradigm of Bounded Example Memory is introduced. A bounded example
Partially supported by grant number 1339 of the John Templeton Foundation, and by a Telecom Italia “Progetto Italia” Fellowship.
memory learner is a learner whose intensional memory is, at each step of the learning process, limited to remembering its own previous hypothesis, and whose extensional memory is limited to the storage of a finite number of previously observed data items. At each step of the learning process, such a learner must decide, based on (i) knowledge of its own previous hypothesis, (ii) the content of its current memory, and (iii) the currently observed data item, whether to change its hypothesis and whether to store in memory the currently observed data item. For each number k one can similarly define a k-bounded example memory learner as a bounded example memory learner whose memory can never exceed size k. For k = 0 one obtains the paradigm of iterative learning [12], in which the learner has no extensional memory and can only remember its own previous conjecture. One of the main results of [9] is the following. For every k, there is a class of languages that can be learned by a bounded example memory learner with memory k + 1 but not by any bounded example memory learner with memory k. [3] and the recent [7] present further results on this and related models. In this paper we present some results on a new extension of the Bounded Example Memory paradigm. Following a suggestion in [3], we investigate a paradigm in which the learner is allowed to change its mind on how many data items to store in memory as a function of some constructive ordinal α. Ordinals are canonical representatives of well-orderings. A constructive ordinal can be defined as the order-type of a computable well-ordering of the natural numbers. Equivalently, constructive ordinals are those ordinals that have a program (a notation) specifying how to build them from below using standard operations such as successor and constructive limit. Every constructive ordinal is countable, and notations for constructive ordinals are algorithmic finite objects. For each initial segment of the constructive ordinals a univalent system of notations can be defined. On the other hand, a universal (not univalent) system of notations containing at least one notation for every constructive ordinal has been defined by Kleene. For more details, see, e.g., [10]. For the sake of this paper, ordinals can be treated in an informal way: we blur the distinction between a constructive ordinal and a notation for it. The treatment can be made rigorous without effort and without altering our results. Count-down from ordinal notations has been applied in a number of ways in algorithmic learning theory, starting with [4], where ordinal notations are used to bound the number of mind-changes that a learning machine is allowed to make on its way to convergence. A different use of ordinal notations is in the recent [2]. For every (notation for a) constructive ordinal α, the paradigm of α-bounded example memory is defined. Intuitively, a learner with example memory bounded by α must (algorithmically) count down from (a notation for) α each time a proper global memory extension occurs during the learning process (i.e., each time the size of the memory set becomes strictly larger than the sizes of all previous memory sets). We show that this paradigm is strictly stronger than k-bounded example memory but strictly weaker than finitely-bounded example memory (with no form of a priori bound on memory size). We also show that the concept of ordinal bounded example memory gives rise to a hierarchy and we
exhibit a hierarchy up through ordinal ω^2. We do not prove a general hierarchy result for all constructive ordinals, but we believe that such a result is within the reach of the methods of the present paper.
2 Preliminaries
Unexplained notation follows Rogers [10]. N denotes the set of natural numbers {0, 1, 2, . . . }. N+ denotes the set of positive natural numbers. The set of finite subsets of N is denoted by Fin(N). We use the following set-theoretic notations: ∅ (empty set), ⊆ (subset), ⊂ (proper subset), ⊇ (superset), ⊃ (proper superset). If X and Y are sets, then X ∪ Y, X ∩ Y, and X − Y denote the union, the intersection, and the difference of X and Y, respectively. We use Z = X ∪̇ Y to abbreviate (Z = X ∪ Y ∧ X ∩ Y = ∅). The cardinality of a set X is denoted by card(X). By card(X) ≤ ∗ we indicate that the cardinality of X is finite. We let λx, y.⟨x, y⟩ stand for a standard pairing function. We extend the notation to pairing of n-tuples of numbers in the straightforward way. We denote by π_i^n (i ≤ n) the projection function of an n-tuple onto its i-th component. We omit the superscript when clear from context. We use α, β to range over constructive ordinals. We blur the distinction between ordinals and their notations. We use O to denote the set of constructive ordinals. This symbol traditionally denotes Kleene's universal system of notations for constructive ordinals; this system would be used in a completely rigorous presentation of our results. We fix an acceptable programming system ϕ_0, ϕ_1, . . . for the partial computable functions of type N → N. We denote by W_i the domain of the i-th partial computable function ϕ_i. We could equivalently (modulo isomorphism of numberings) define W_i as the set generated by grammar i. A language is a subset of N. We are only interested in recursively enumerable languages, whose collection we denote by E. The symbol L ranges over elements of E; 𝓛 ranges over subsets of E, called language classes. Let λx, y.pad(x, y) be an injective padding function (i.e., W_pad(x,y) = W_x). A sequence is a mapping from an initial segment of N+ into N_#, where # is a reserved symbol which we call the pause symbol. We use N_# to abbreviate N ∪ {#}. The symbols σ, τ range over sequences. content(σ) denotes the range of σ minus the # symbol. |σ| denotes the length of σ. We use ⊆, ⊂ for sequence containment and proper containment, respectively. A text is a mapping from N+ into N_#. The symbol t ranges over texts. If t = (x_i)_{i∈N+} is a text, t[n] denotes the initial segment of t of length n, i.e., the sequence (x_1 x_2 . . . x_n). We use · for concatenation. If the range of t minus the # symbol is equal to L, then we say that t is a text for L. A language learning machine is a partial computable function mapping finite sequences to natural numbers. We now define the basic paradigm of explanatory identification from text [5]. Definition 1 (Gold, [5]). Let M be a language learning machine, let L be a language, let 𝓛 be a language class.
(1) M TxtEx-identifies L if and only if, for every text t = (x_i)_{i∈N+} for L, there exists n ∈ N+ such that W_{M(t[n])} = L and, for all n′ ≥ n, M(t[n′]) = M(t[n]).
(2) M TxtEx-identifies 𝓛 if and only if M TxtEx-identifies L for all L ∈ 𝓛.
(3) TxtEx(M) = {L : M TxtEx-identifies L}.
(4) TxtEx = {𝓛 : (∃M)[𝓛 ⊆ TxtEx(M)]}.
We now define the paradigm of iterative learning. This is the basic paradigm of incremental learning upon which the paradigms of bounded example memory learning are built.
Definition 2 (Wiehagen, [12]). Let M : N × N_# → N be a partial computable function, let j_0 ∈ N, let L be a language.
(1) (M, j_0) TxtIt-identifies L if and only if, for each text t = (x_i)_{i∈N+} for L, the following hold.
(i) For each n ∈ N, M_n(t) is defined, where M_0(t) = j_0 and M_{n+1}(t) = M(M_n(t), x_{n+1}) = j_{n+1}.
(ii) (∃n ∈ N)[W_{j_n} = L ∧ (∀n′ ≥ n)[j_{n′} = j_n]].
(2) For M, j_0 as above, TxtIt(M, j_0) = {L : (M, j_0) TxtIt-identifies L}.
(3) (M, j_0) TxtIt-identifies 𝓛 if and only if 𝓛 ⊆ TxtIt(M, j_0).
(4) TxtIt = {𝓛 : (∃M, j_0)[𝓛 ⊆ TxtIt(M, j_0)]}.
TxtIt is known to be strictly contained in TxtEx. We observe that a function M as in the previous definition can be used to define a language learning machine M as follows: M(t[0]) = M_0(t) = j_0 and, for all n ∈ N, M(t[n + 1]) = M(M(t[n]), x_{n+1}). Note that this machine is uniquely determined by M and j_0. For every 𝓛 such that (M, j_0) TxtIt-identifies 𝓛, we also say that M TxtIt-identifies 𝓛.
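To make the iterative dynamics concrete, the following minimal sketch (in Python) simulates a TxtIt-learner on a text; the transition function and the toy conjecture encoding are assumptions of this illustration, not constructions from the paper.

def run_iterative(M, j0, text):
    # Feed the text x_1, x_2, ... one item at a time; '#' is the pause
    # symbol. The next conjecture depends only on the previous conjecture
    # and the current datum: M_{n+1}(t) = M(M_n(t), x_{n+1}).
    conjectures = [j0]
    j = j0
    for x in text:
        j = M(j, x)
        conjectures.append(j)
    return conjectures

# A toy transition function that conjectures the maximum element seen so far.
toy_M = lambda j, x: j if x == '#' else max(j, x)
print(run_iterative(toy_M, 0, [3, '#', 5, 2]))   # [0, 3, 3, 5, 5]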
3 Ordinal Bounded Example Memory Learning
The paradigm of Bounded Example Memory was introduced in [9] and further investigated in [3] and in the recent [7]. A bounded example memory learner is an iterative learner that is allowed to store at most k data items chosen from the input text.
Definition 3 (Lange and Zeugmann, [9]). Let k ∈ N+ ∪ {∗}, where ∗ is a new symbol.
(1) Let M : (N × Fin(N)) × N_# → N × Fin(N) be a partial computable function, let j_0 ∈ N, let L be a language. (M, j_0) Bem_k-identifies L if and only if, for each text t = (x_i)_{i∈N+} for L, the following hold.
(i) For each n ∈ N, M_n(t) is defined, where M_0(t) = (j_0, ∅) and M_{n+1}(t) = M(M_n(t), x_{n+1}) = (j_{n+1}, S_{n+1}).
(ii) S_0 = ∅ and, for all n ∈ N, S_{n+1} ⊆ S_n ∪ {x_{n+1}}.
(iii) For all n ∈ N, card(S_{n+1}) ≤ k.
(iv) (∃n ∈ N)[W_{j_n} = L ∧ (∀n′ ≥ n)[j_{n′} = j_n]].
(2) We say that (M, j_0) Bem_k-identifies 𝓛 if and only if (M, j_0) Bem_k-identifies L for every L ∈ 𝓛.
A machine of the appropriate type that satisfies points (i)–(iii) above is referred to as a Bem_k-learner. By [8], Bem_∗ is known to coincide with set-driven learning [11]. With a slight abuse of notation we sometimes use Bem_0 to denote TxtIt. We now introduce an extension of the Bounded Example Memory model.
Definition 4. Let α be a fixed constructive ordinal (notation). Let M : (N × Fin(N) × O) × N_# → N × Fin(N) × O be a partial computable function. Let j_0 ∈ N, let L be a language.
(1) We say that (M, j_0) OBem_α-identifies L if and only if, for every text t = (x_j)_{j∈N+} for L, points (i) to (v) below hold.
(i) For all n ∈ N, M_n(t) is defined, where M_0(t) = (j_0, S_0, α_0), S_0 = ∅, α_0 ≤ α, and M_{n+1}(t) = M(M_n(t), x_{n+1}) = (j_{n+1}, S_{n+1}, α_{n+1}).
(ii) S_{n+1} ⊆ S_n ∪ {x_{n+1}}.
(iii) α_n ≥ α_{n+1}.
(iv) α_n > α_{n+1} if and only if card(S_{n+1}) > max({card(S_i) : i ≤ n}).
(v) (∃n)(∀n′ ≥ n)[j_{n′} = j_n ∧ W_{j_n} = L].
(2) We say that (M, j_0) OBem_α-identifies 𝓛 if and only if (M, j_0) OBem_α-identifies L for every L ∈ 𝓛.
A machine of the appropriate type that satisfies points (i)–(iv) above is referred to as an OBem_α-learner. OBem_α-learning is a species of incremental learning: each new hypothesis depends only on the previous hypothesis, the current memory, and the current data item. The above definition can be simplified in case the following is true. Call cumulative a bounded example memory learner that never erases an element from memory without replacing it with a new one. If cumulative learning does not restrict learning power, then in point (iv) it is sufficient to ask that card(S_{n+1}) > card(S_n). For I ∈ {Bem_k, Bem_∗, OBem_α}, M of the appropriate type and j_0 ∈ N:
– We write I(M, j_0) for {L : (M, j_0) I-identifies L}, and
– We write I for {𝓛 : (∃M, j_0)[𝓛 ⊆ I(M, j_0)]}.
We write M(t) to indicate the conjecture to which M converges while processing text t. We always assume that such a conjecture exists when we use this notation. We state some basic facts in the following lemma.
Lemma 1. For all k ∈ N+ and all constructive ordinals α, β, the following hold.
(1) OBem_k = Bem_k.
(2) If α < β, then OBem_α ⊆ OBem_β.
(3) OBem_α ⊆ Bem_∗.
Proof. The proof is omitted for brevity. Note that to go from a Bem_k-learner to an OBem_k-learner, one just needs to keep track of the maximum cardinality of a memory set, a quantity which eventually stabilizes and can thus be padded into the conjecture as long as needed. To go from an OBem_k-learner to a Bem_k-learner, one dually pads the ordinal counter into the next conjecture. This also is a quantity that eventually stabilizes on all relevant texts.
As a word of caution, note that a rigorous version of point (2) would read: for all notations a, b, respectively for α and β, such that a <_O b, we have OBem_a ⊆ OBem_b.
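To illustrate the count-down of Definition 4, the following sketch encodes a notation for an ordinal ω·m + k below ω^2 (which suffices for the hierarchy of Section 5) as the pair (m, k), and decreases it once per proper global memory extension; this pair representation and the choice of the fresh finite part on a limit step (here: the current time t) are assumptions of this illustration, not Kleene's O.

def decrement(alpha, t):
    # Strictly decrease the notation (m, k), read as omega*m + k.
    m, k = alpha
    if k > 0:
        return (m, k - 1)        # successor step: down to omega*m + (k - 1)
    if m > 0:
        return (m - 1, t)        # limit step: down to omega*(m - 1) + t
    raise ValueError("counter exhausted: no further memory extension allowed")

# Count-down from omega*2 + 1, with extensions at times 4, 9 and 17:
alpha = (2, 1)
for t in (4, 9, 17):
    alpha = decrement(alpha, t)  # (2, 0), then (1, 9), then (1, 8)

Since the pairs (m, k) decrease lexicographically, only finitely many decrements are possible; this well-foundedness is exactly what bounds the number of proper global memory extensions.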
4 Learning with ω-Bounded Example Memory
We prove that learners with ω-bounded example memory can learn strictly more than learners with bounded finite memory. Still, OBem_ω is strictly weaker than Bem_∗; the same is actually true for all OBem_α. We start by recalling the definitions of the classes used in [9] to show that TxtIt ⊂ Bem_1 ⊂ Bem_2 ⊂ · · ·. Where Lange and Zeugmann [9] use symbols a, b, we use 3, 2, and where they use string concatenation we use exponentiation. For p ∈ N we use {p}^+ to denote the set {p, p^2, p^3, . . . }. Let p_0, p_1, . . . be the enumeration of the prime numbers in increasing order. We denote the set {p_i, p_i^2, p_i^3, . . . } by {p_i}^+.
Definition 5 (Class 𝓛_k). Let k ∈ N+. 𝓛_k is the class consisting of the following languages.
– L_1 = {p_1}^+,
– L_{(j,ℓ_1,...,ℓ_k)} = {p_1^1, . . . , p_1^j} ∪ {p_0^j} ∪ {p_1^{ℓ_1}, . . . , p_1^{ℓ_k}} for all j, ℓ_1, . . . , ℓ_k ∈ N+.
Note that 𝓛_k ⊂ 𝓛_{k+1} and that, in fact, 𝓛_{k+1} = {L ∪ {p_1^ℓ} : ℓ ∈ N+, L ∈ 𝓛_k}. One of the main results in [9] is the following theorem.
Theorem 1 (Lange and Zeugmann, [9]). For all k ∈ N, 𝓛_{k+1} ∈ (Bem_{k+1} − Bem_k).
For a language L and a constructive ordinal α, we denote by L^{[α]} (the α-tagged variant of L) the language obtained from L by replacing each element x by ⟨α, x⟩ (i.e., L^{[α]} = {α} × L). For a language class 𝓛, we denote by 𝓛^{[α]} the class {L^{[α]} : L ∈ 𝓛}.
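For concreteness, the following sketch renders the languages of Definition 5 above with p_0 = 2 and p_1 = 3, the first two primes; the helper names are illustrative only.

def L_finite(j, ls):
    # L_{(j, l_1, ..., l_k)} = {3^1, ..., 3^j} u {2^j} u {3^l : l in ls}.
    return {3**i for i in range(1, j + 1)} | {2**j} | {3**l for l in ls}

def in_L1(x):
    # Membership in the infinite language L_1 = {3, 3^2, 3^3, ...}.
    if x < 3:
        return False
    while x % 3 == 0:
        x //= 3
    return x == 1

print(sorted(L_finite(2, [5, 7])))   # [3, 4, 9, 243, 2187]

The marker 2^j reveals j, while the exponents ℓ_1, . . . , ℓ_k must be memorized, which is the intuition behind Theorem 1 above.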
We now define a generalization of 𝓛_k.
Definition 6 (Class 𝓒_k). Let k ∈ N+. 𝓒_k is the class (𝓛_k)^{[k]}.
𝓒_k is the “k-tagged” variant of 𝓛_k. We denote by 𝓒_{k,≥d}, for d ∈ N, the subclass of 𝓒_k containing (L_1)^{[k]} and the (L_{(j,ℓ_1,...,ℓ_k)})^{[k]} with j ≥ d. For succinctness, we sometimes denote (L_1)^{[k]} by C_k and (L_{(j,ℓ_1,...,ℓ_k)})^{[k]} by C_{(j,ℓ_1,...,ℓ_k)}. Since the proof of Theorem 1 in [9] is essentially asymptotic, the following proposition holds.
Proposition 1. For all d, k ∈ N+, 𝓒_{k+1,≥d} ∈ (Bem_{k+1} − Bem_k).
For the sake of the present section, we identify 𝓒_k with the class obtained from it by replacing each element ⟨k, p_1^s⟩ by p_k^s, and each ⟨k, p_0^t⟩ by p_0^t, for ease of notation. Let 𝓒_ω = ∪_{k∈N+} 𝓒_k.
Theorem 2. 𝓒_ω ∈ (OBem_ω − ∪_{k∈N+} OBem_k).
Proof. Let j, k ∈ N+. For a set X ⊆ {p_k}^+ such that card(X) ≤ k, we write C_{(j,X)} for the set {p_k^1, . . . , p_k^j} ∪ {p_0^j} ∪ X. We first show 𝓒_ω ∈ OBem_ω. Let X be a finite subset of C_k of cardinality ≤ k and let s ∈ N+. We define the set update(X, k, p_k^s) as the set containing the (at most) k elements of X ∪ {p_k^s} with the largest exponents. Formally, update(X, k, p_k^s) = X ∪ {p_k^s} if card(X) < k, and otherwise update(X, k, p_k^s) = {p_k^{z_2}, . . . , p_k^{z_{k+1}}}, where X = {p_k^{x_1}, . . . , p_k^{x_k}} with x_1 < · · · < x_k and {x_1, . . . , x_k} ∪ {s} = {z_1, . . . , z_{k+1}} with z_1 < · · · < z_{k+1}. For technical convenience we define update(X, k, a) = X for all a ∉ {p_k}^+.
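The following sketch implements the update operation just defined on the set of stored exponents (identifying p_k^s with s); the function name is an assumption of this illustration.

def update_exponents(stored, k, s):
    # Insert exponent s, then keep only the k largest exponents.
    return set(sorted(stored | {s})[-k:])

mem = set()
for s in (4, 1, 9, 2, 7):        # exponents arriving on the text
    mem = update_exponents(mem, 3, s)
print(sorted(mem))               # [4, 7, 9]: the three maximal exponents

Keeping the maximal exponents is what allows the learner described below to eventually hold the relevant elements of the target language in memory.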
We now define a learner M and a j_0 ∈ N such that (M, j_0) OBem_ω-identifies 𝓒_ω. M's conjectures have the form j_n = pad(c_n, ⟨A_n, B_n⟩), where A_n, B_n ∈ N, and
– A_n records the exponent of the first element of the form p_0^j seen (A_n = 0 as long as no such element has appeared),
– B_n records the subscript of the first element of the form p_k^a seen (B_n = 0 as long as no such element has appeared).
For every text t = (x_i)_{i∈N+}, we define M_0(t) = (j_0, S_0, α_0), where A_0 = B_0 = 0, α_0 = ω, c_0 = an index for ∅, and M_{n+1}(t) = (pad(c_{n+1}, ⟨A_{n+1}, B_{n+1}⟩), S_{n+1}, α_{n+1}), where A_{n+1}, B_{n+1}, S_{n+1}, α_{n+1}, c_{n+1} are defined as follows. Let k, j, a below indicate elements of N+. For every n ∈ N, A_{n+1} = j if x_{n+1} = p_0^j, and A_{n+1} = A_n otherwise; B_{n+1} = k if x_{n+1} = p_k^a, and B_{n+1} = B_n otherwise. We also define α_n^- to be α_n if card(S_n) ≥ card(S_{n+1}), and α_n − 1 otherwise. We complete the description of M's behaviour by the following case distinction.
(Case 1) If (A_n = 0 ∧ x_{n+1} = p_k^a), then S_{n+1} = update(S_n, k, x_{n+1}), α_{n+1} = k, c_{n+1} = an index for C_k.
(Case 2) If ((⟨A_n, B_n⟩ = ⟨j, 0⟩ ∨ ⟨A_n, B_n⟩ = ⟨j, k⟩) ∧ x_{n+1} = p_k^a), then S_{n+1} = update(S_n, k, x_{n+1}), α_{n+1} = k, c_{n+1} = an index for C_{(j,S_{n+1})}.
(Case 3) If (⟨A_n, B_n⟩ = ⟨0, k⟩ ∧ x_{n+1} = p_0^j), then S_{n+1} = update(S_n, k, x_{n+1}), α_{n+1} = α_n^-, c_{n+1} = an index for C_{(j,S_{n+1})}.
(Case 4) If (⟨A_n, B_n⟩ = ⟨j, k⟩ ∧ x_{n+1} = p_k^a), then S_{n+1} = update(S_n, k, x_{n+1}), α_{n+1} = α_n^-, c_{n+1} = an index for C_{(j,S_{n+1})}.
(Case 5) Otherwise, S_{n+1} = S_n, α_{n+1} = α_n, c_{n+1} = c_n.
The above cases are exhaustive and M is an OBem_ω-learner for 𝓒_ω. Let t be a text for L ∈ 𝓒_ω. Suppose first that t is a text for C_k for some k > 0. Then the first case M enters is (Case 1); afterwards, M always enters either (Case 1) or (Case 4) and thus stabilizes on a conjecture for C_k. Suppose now that t is a text for C_{(j,ℓ_1,...,ℓ_k)} for some k, j, ℓ_1, . . . , ℓ_k ∈ N+. If the first non-trivial element of the text t is p_0^j, then M enters (Case 4) and pads A_n = j into its next conjecture. As soon as the first element of the form p_k^s appears, M enters (Case 2), stores p_k^s in memory and outputs a canonical index for C_{(j,p_k^s)}. Afterwards, M always enters either (Case 5) or (Case 4). If the first non-trivial element of the text t is p_k^s, then M first enters (Case 1). From then on, until p_0^j eventually appears for the first time in t, M pads k into B_n, stores the k maximal elements seen and conjectures C_k. As soon as p_0^j appears in t, M enters (Case 3) and afterwards always enters either (Case 5) or (Case 4). Thus M eventually stabilizes on an index for C_{(j,ℓ_1,...,ℓ_k)}.
We now prove that 𝓒_ω ∉ ∪_{k∈N+} OBem_k. Suppose that 𝓒_ω ∈ OBem_k for some k, as witnessed by (M, j_0). Then 𝓒_ω ∈ Bem_k. But 𝓒_{k+1} ⊆ 𝓒_ω, and 𝓒_{k+1} is not Bem_k-identifiable. A contradiction.
With a very minor change the above proof shows that 𝓒_ω is learnable by an OBem_ω-learner with temporary memory as defined in [7]. We now observe that ordinal bounded example memory learning does not exhaust Bem_∗, i.e., set-driven learning.
Theorem 3. For all α ∈ O, (Bem_∗ − OBem_α) ≠ ∅.
Proof. Consider the following class from [9]. For each j ∈ N, let L_j = {2}^+ − {2^{j+1}}, and let 𝓛^- = {L_j : j ∈ N}. This class is obviously in Bem_∗, but it is shown in [9] not to be in ∪_k Bem_k. To show that 𝓛^- ∉ OBem_α, we can argue exactly as in the proof of Theorem 5, Claim 2 in [9]. Suppose otherwise, as witnessed by (M, j_0). Let σ be a locking sequence of the second type for M on L_0. Let M_{|σ|}(σ) = (j_{|σ|+1}, S_{|σ|+1}, α_{|σ|+1}). Let β = α_{|σ|+1} ≤ α be such that M's ordinal counter is equal to β for all extensions of σ in L_0. Then, on all extensions of σ in L_0, M does not make any proper global memory extension. Thus, M's memory on all such extensions is bounded by b = max({card(S_i) : i ≤ |σ|}). We omit further details for brevity.
We now prove a technical lemma that will be used in Section 5 below. Let M : (N × Fin(N) × O) × N_# → N × Fin(N) × O be a partial computable function.
Let j_0 ∈ N. We say that M is well-behaved on a text t = (x_i)_{i∈N+} if and only if (M, j_0) satisfies conditions (i)–(iv) of Definition 4; in other words, (M, j_0) is an OBem_α-learner on t, for some α.
Lemma 3 (Extraction Lemma). Let 𝓒 be a class of languages, σ a finite sequence, M a function of the appropriate type, and β a constructive ordinal, such that the following properties hold for all L ∈ 𝓒 and all t such that σ · t is a text for L:
– W_{M(σ·t)} = L,
– M is well-behaved on σ · t,
– π_3(M_{|σ|}(σ)) ≤ β.
Then there exist M̃ and j_0 ∈ N such that 𝓒 ⊆ OBem_{β+b}(M̃, j_0), where b = max({card(π_2(M_i(σ))) : i ≤ |σ|}).
Proof. We start with an observation. If we define a map M̃ by M̃_0(t) = M_{|σ|}(σ) and M̃_{n+1}(t) = M_{|σ|+n+1}(σ · t), then we do not necessarily obtain a function satisfying point (iv) in the definition of an OBem_α-learner on t. This is because M's memory after processing σ, i.e., S_{|σ|+1}, may be a non-empty subset of content(σ). The idea of the simulation for proving the lemma is the following. If M makes no memory extension while processing σ (i.e., b = 0), then (M̃, j_{|σ|+1}) is the desired OBem_β-learner. Otherwise, while processing t with M̃ and simulating M on σ · t, we can dynamically transfer to M̃ the part of M's memory on the current initial segment of σ · t that can qualify as a part of M̃'s memory on the corresponding initial segment of t, while padding the residual part into the next conjecture. The residual part is eventually stable. How this is done step-by-step is described below. The number of proper global memory extensions made by M̃ is at most the number of proper global memory extensions made by M beyond σ plus the number of proper global memory extensions made by M while processing σ, i.e., b. Therefore we can define an appropriate ordinal counter for M̃ starting at α_{|σ|+1} + b. We now prove the claim in detail. Let M_{|σ|}(σ) be (j_{|σ|+1}, S_{|σ|+1}, α_{|σ|+1}). Let s = |σ| + 1. We distinguish two cases.
(Case 1) M makes no memory extension while processing σ. See the above discussion.
(Case 2) Not (Case 1). By hypothesis, card(S_s) ≤ b. Set M̃_0(t) = (pad(j_s, ⟨S_s, α_s, b_0, m_0⟩), ∅, α̃_0), where b_0 = 0, m_0 = b, α̃_0 = α_s + b, and
M̃_{n+1}(t) = M̃((j̃_n, S̃_n, α̃_n), x_{n+1}) = (j̃_{n+1}, S̃_{n+1}, α̃_{n+1}) = (pad(j_{s+n}, ⟨R_{n+1}, α_{s+(n+1)}, b_{n+1}, m_{n+1}⟩), S̃_{n+1}, α̃_{n+1}),
where b_{n+1} records the maximum cardinality of a memory set of M̃, m_{n+1} records the maximum cardinality of a memory set of M beyond σ, and the other quantities are defined as follows.
R_0 = S_s, S̃_0 = ∅, and R_{n+1}, S̃_{n+1} are defined according to the following case distinction.
(Case i) (x_{n+1} ∉ S_{s+n}) ∧ (x_{n+1} ∈ S_{s+(n+1)}).
(Case i.a) x_{n+1} enters S_{s+(n+1)} by substituting an element of S_{s+n}. Thus S_{s+n} = S where x_{n+1} ∉ S, and S_{s+(n+1)} = S′ ∪̇ {x_{n+1}} for some S′ ⊂ S.
(Case i.b) x_{n+1} enters S_{s+(n+1)} as a new element without substituting any element, i.e., S_{s+n} = S, x_{n+1} ∉ S, S_{s+(n+1)} = S′ ∪̇ {x_{n+1}}, S′ = S.
(Case ii) (x_{n+1} ∈ S_{s+n}) ∧ (x_{n+1} ∈ S_{s+(n+1)}). Thus S_{s+n} = S ∪̇ {x_{n+1}} and S_{s+(n+1)} = S′ ∪̇ {x_{n+1}} for some S′ ⊆ S.
(Case iii) (x_{n+1} ∈ S_{s+n}) ∧ (x_{n+1} ∉ S_{s+(n+1)}). Thus S_{s+n} = S, x_{n+1} ∈ S, and S_{s+(n+1)} = S′, for some S′ ⊂ S with x_{n+1} ∉ S′.
(Case iv) (x_{n+1} ∉ S_{s+n}) ∧ (x_{n+1} ∉ S_{s+(n+1)}). Thus S_{s+n} = S and S_{s+(n+1)} = S′, for some S′ ⊆ S, and x_{n+1} ∉ S′ ∪ S.
We set S̃_{n+1} = (S̃_n ∩ S′) ∪ {x_{n+1}} in (Case i) and (Case ii), and we set S̃_{n+1} = (S̃_n ∩ S′) in (Case iii) and (Case iv). We set R_{n+1} = (S′ − S̃_n). One can always recover the memory content S_{s+(n+1)} as R_{n+1} ∪ S̃_{n+1}. The S̃_n's satisfy the conditions on memory, while the R_{n+1}'s eventually stabilize. The ordinal counter of M̃ is initialized at α̃_0 = α_s + b. Each time a proper global extension of the memory S̃_n of M̃ occurs, the ordinal counter is updated as follows. If the extension corresponds to an extension of M's memory before σ (this can happen at most b times), then the second component is decreased by 1. If the extension corresponds to an extension of M's memory beyond σ, then the first component is decreased, emulating the corresponding ordinal counter of M (which is padded into the previous conjecture).
Lemma 3 above lends itself to a number of variations. E.g., one can conclude from the same hypotheses that there exist M̂ and j_0 ∈ N witnessing that 𝓒_σ = {L − content(σ) : L ∈ 𝓒} is in OBem_β. This can be seen as follows: for every text t for (L − content(σ)), σ · t is a text for L, and for all L ∈ 𝓒_σ, L ∩ content(σ) = ∅. Thus, no element of σ is ever transferred to bounded example memory in the process defining M̃ in the above proof. Therefore no such element contributes a proper memory extension. M̂ can be defined similarly to M̃ with the following extras: M's conjecture is always padded into the hypothesis, and f(i) is output instead of i, where f is an injective computable function such that for all x, W_{f(x)} = (W_x − content(σ)) (f exists by the S-m-n Theorem [10]). Also, if β in Lemma 3 is < ω, then there exists an s ∈ N such that max({card(π_2(M_i(σ · t))) : i ∈ N}) ≤ s, and the conclusion of the lemma is that there exists an OBem_s-learner for 𝓒.
5 Hierarchy Results above ω
We first exhibit a family of language classes witnessing that OBem_ω ⊂ OBem_{ω+1} ⊂ · · · ⊂ OBem_{ω+k} ⊂ · · · ⊂ OBem_{ω+ω}. Then we indicate how to extend it up through ω^2.
Definition 7 (Class 𝓒_{ω+k}). For k ∈ N+, 𝓒_{ω+k} is the class of all languages L such that L = L_a ∪ L_b where
– L_a is empty or in (𝓒_ω)^{[ω]}, and
– L_b is empty or in 𝓒_k.
Thus, 𝓒_{ω+k} consists of the following languages, for every choice of i, j, h, ℓ_1, . . . , ℓ_i, m_1, . . . , m_k in N+.
– C_i^{[ω]} = {⟨ω, i, p_1⟩, ⟨ω, i, p_1^2⟩, ⟨ω, i, p_1^3⟩, . . . },
– C_{(j,m_1,...,m_i)}^{[ω]} = {⟨ω, i, p_1⟩, . . . , ⟨ω, i, p_1^j⟩} ∪ {⟨ω, i, p_0^j⟩} ∪ {⟨ω, i, p_1^{m_1}⟩, . . . , ⟨ω, i, p_1^{m_i}⟩},
– C_k = {⟨k, p_1⟩, ⟨k, p_1^2⟩, ⟨k, p_1^3⟩, . . . },
– C_{(j,m_1,...,m_k)} = {⟨k, p_1⟩, . . . , ⟨k, p_1^j⟩} ∪ {⟨k, p_0^j⟩} ∪ {⟨k, p_1^{m_1}⟩, . . . , ⟨k, p_1^{m_k}⟩},
– C_i^{[ω]} ∪ C_k, C_{(h,ℓ_1,...,ℓ_i)}^{[ω]} ∪ C_{(j,m_1,...,m_k)},
– C_i^{[ω]} ∪ C_{(j,m_1,...,m_k)}, C_{(h,ℓ_1,...,ℓ_i)}^{[ω]} ∪ C_k.
𝓒_{ω+(k+1)} as just defined contains more languages than strictly needed to show (OBem_{ω+(k+1)} − OBem_{ω+k}) ≠ ∅, yet we have chosen this definition for uniformity with the extensions to higher ordinals. Let us consider the following subclass of 𝓒_{ω+1}. Let d ∈ N+:
{L ∈ 𝓒_{ω+1} : {⟨1, p_1^1⟩, . . . , ⟨1, p_1^d⟩} ⊆ L}.
This class contains the following language class, for each s ∈ N+:
{C_1, C_{(d,d)} ∪ C_{s+1}^{[ω]}, C_{(d,d)} ∪ C_{(h,ℓ_1,...,ℓ_{s+1})}^{[ω]} : h, ℓ_1, . . . , ℓ_{s+1} ∈ N+}.
For each s ∈ N+, let us denote the latter class by 𝓒_{s+1}^{[ω]} ⊕ {⟨1, p_1^1⟩, . . . , ⟨1, p_1^d⟩}. In fact, this class is the same as the following class:
{C_1} ∪ {L ∪ C_{(d,d)} : L ∈ (𝓒_{s+1})^{[ω]}}.
The proof of Theorem 1 from [9] can be easily adapted to show the following.
Proposition 2. For each d ∈ N+ and each s ∈ N+, 𝓒_{s+1}^{[ω]} ⊕ {⟨1, p_1⟩, . . . , ⟨1, p_1^d⟩} ∈ (Bem_{s+1} − Bem_s).
We use this fact essentially in the proof of the following theorem.
Theorem 4. 𝓒_{ω+1} ∈ (OBem_{ω+1} − OBem_ω).
Proof. To see why 𝓒_{ω+1} ∈ OBem_{ω+1}, it is sufficient to notice that a learner M can pad into its previous hypothesis the following information and act accordingly using a straightforward case distinction.
– Whether an element of the form ⟨1, p_1^a⟩ has appeared,
– Whether an element of the form ⟨ω, s, p_1^a⟩ has appeared,
– Whether an element of the form ⟨1, p_0^e⟩ has appeared,
– Whether an element of the form ⟨ω, 1, p_0^e⟩ has appeared.
We now prove 𝓒_{ω+1} ∉ OBem_ω. Suppose otherwise, as witnessed by (M, j_0). Without loss of generality suppose α_0 = ω. Let σ be a locking sequence (of the first type) for M on L_1^{[1]}. Let σ′ be the repletion of σ. Then σ′ is also a locking sequence for M on L_1^{[1]} and content(σ′) = {⟨1, p_1^1⟩, . . . , ⟨1, p_1^d⟩} for d = max({i : ⟨1, p_1^i⟩ ∈ content(σ)}). We distinguish two cases.
(Case 1) M's memory undergoes at least one extension while processing σ′. Let b be the value of M's counter after processing σ′. Then the following holds, because (M, j_0) OBem_ω-identifies 𝓒_{ω+1} by hypothesis:
(∀L ∈ 𝓒_{ω+1} : content(σ′) ⊆ L)(∀τ ⊂ L | τ ⊃ σ′)[card(π_2(M_{|τ|}(τ))) ≤ b].
By choice of σ′, S := {L ∈ 𝓒_{ω+1} : content(σ′) ⊆ L} ⊇ 𝓒_s^{[ω]} ⊕ {⟨1, p_1^1⟩, . . . , ⟨1, p_1^d⟩} for every s ∈ N+, in particular for s = b + 1. Now the conditions of Lemma 3 apply by taking σ, 𝓒, β in the statement of that lemma to be, respectively, σ′, 𝓒_{b+1}^{[ω]} ⊕ {⟨1, p_1^1⟩, . . . , ⟨1, p_1^d⟩} and b. In fact, for all L ∈ 𝓒_{b+1}^{[ω]} ⊕ {⟨1, p_1^1⟩, . . . , ⟨1, p_1^d⟩} and all t such that σ′ · t is a text for L,
– W_{M(σ′·t)} = L,
– M is well-behaved on σ′ · t,
– π_3(M_{|σ′|}(σ′)) ≤ b.
The first and second items are true because M by hypothesis OBem_ω-identifies 𝓒_{ω+1}; the third item is true by the hypothesis of the present case. Let b′ be the maximum cardinality attained by a memory set of M while processing σ′. By Lemma 3, one can define M̃ and j_0 ∈ N such that (M̃, j_0) OBem_{(b′+b)}-identifies S. But 𝓒_{(b′+b+1)}^{[ω]} ⊕ {⟨1, p_1^1⟩, . . . , ⟨1, p_1^d⟩} ⊆ S, and 𝓒_{(b′+b+1)}^{[ω]} ⊕ {⟨1, p_1^1⟩, . . . , ⟨1, p_1^d⟩} ∉ OBem_{(b′+b)}, by Proposition 2. A contradiction.
(Case 2) M's memory undergoes no extension while processing σ′. We distinguish two subcases.
(Case 2.1) For every extension τ of σ′ in C_1, M's memory is not extended while processing τ. Then σ′ · ⟨1, p_1^{d+1}⟩ and σ′ · ⟨1, p_1^{d+2}⟩ are equal for M. Then M cannot distinguish between the texts σ′ · ⟨1, p_1^{d+1}⟩ · ⟨1, p_0^d⟩ · #^∞ and σ′ · ⟨1, p_1^{d+2}⟩ · ⟨1, p_0^d⟩ · #^∞, respectively for L_{(d,d+1)}^{[1]} and L_{(d,d+2)}^{[1]}, both in 𝓒_{ω+1}.
(Case 2.2) There exists an extension τ of σ′ in C_1 such that M's memory undergoes an extension while processing τ. Then τ is a locking sequence for M on C_1 to which (Case 1) applies.
Theorem 5. For all k ∈ N, 𝓒_{ω+(k+1)} ∈ (OBem_{ω+(k+1)} − OBem_{ω+k}).
Proof. The base case k = 0 is Theorem 4. For the case k > 1 one can argue as follows. Suppose by way of contradiction that (M, j_0) witnesses 𝓒_{ω+(k+1)} ∈
OBem_{ω+k}. Let σ be a locking sequence (of the first type) for M on C_{k+1}. Consider the following cases.
(Case 1) For every extension σ′ ⊇ σ in C_{k+1}, M makes no memory extension while processing σ′. Then M can be fooled as an iterative learner as in Case 2.1 of Theorem 4 above. The relevant languages here are L_{(k+1,1,...,k,k+1)}^{[k+1]} and L_{(k+1,1,...,k,k+2)}^{[k+1]}.
(Case 2) Not (Case 1), and for some extension σ′ of σ in C_{k+1}, M makes more than k memory extensions while processing σ′. Thus, M commits to finite memory b for some b ∈ N. Then one can argue as in Case 1 of Theorem 4 above, considering the class of those languages in 𝓒_{ω+(k+1)} that contain content(σ′).
(Case 3) Not (Case 1) and not (Case 2). Then there exists an extension σ′ of σ in C_{k+1} such that M makes at least one memory extension while processing σ′, and for all extensions σ′′ of σ′ in C_{k+1}, M makes at most k memory extensions while processing σ′′. Then one can argue as in the proof of Theorem 1 (Claim 3 of Theorem 5 in [9]). The point is that the number of possible sets extending content(σ′) by adding k + 1 elements of the form ⟨k + 1, p_1^t⟩ with d < t ≤ d + 3n (where d = max({i : ⟨k + 1, p_1^i⟩ ∈ content(σ′)})) grows as $\binom{3n}{k+1}$ in n, while the number of possible memory contents of M beyond σ′ on such sets is $\sum_{i=0}^{k}\binom{3n}{i}$, which is asymptotically smaller. This allows one to select appropriate sets in 𝓒_{ω+(k+1)} which M fails to distinguish on two texts extending σ′.
Let 𝓒_{ω+ω} = ∪_{k∈N+} 𝓒_{ω+k}.
Theorem 6. 𝓒_{ω+ω} ∈ (OBem_{ω+ω} − ∪_{k∈N+} OBem_{ω+k}).
Proof. 𝓒_{ω+ω} ∈ OBem_{ω+ω} is easy. At step n, M can pad into its conjecture a quadruple ⟨A_n, B_n, J_n, H_n⟩ that keeps track of the following information, and act accordingly.
– A_n records the minimal x > 0 such that an ⟨x, p_1^a⟩ has occurred.
– B_n records the minimal x > 0 such that an ⟨ω, x, p_1^a⟩ has occurred.
– J_n records the minimal z > 0 such that an ⟨i, p_0^z⟩ has occurred.
– H_n records the minimal z > 0 such that an ⟨ω, i, p_0^z⟩ has occurred.
It is easy to see that 𝓒_{ω+ω} ∉ ∪_{k∈N+} OBem_{ω+k}. Suppose otherwise. Then for some k ∈ N, there exists (M, j_0) witnessing 𝓒_{ω+ω} ∈ OBem_{ω+k}. But 𝓒_{ω+(k+1)} ⊆ 𝓒_{ω+ω}. A contradiction to Theorem 5.
We now indicate how to extend the hierarchy up through ordinal ω^2. A similar pattern can possibly be adapted to yield a general non-collapsing result.
Definition 8 (Class 𝓒_{ω·m+k}). For m, k ∈ N+, the class 𝓒_{ω·m+k} is the class containing all languages L such that L = L_{a_1} ∪ · · · ∪ L_{a_{m+1}}, where, for 1 ≤ i ≤ m, L_{a_i} ∈ ∪_{n∈N+} (𝓛_n)^{[ω·i]} or is empty, and L_{a_{m+1}} ∈ 𝓒_k. The class 𝓒_{ω·(m+1)} is defined as ∪_{k∈N+} 𝓒_{ω·m+k}.
Theorem 7. For all m, k ∈ N+,
(1) 𝓒_{ω·m+(k+1)} ∈ (OBem_{ω·m+(k+1)} − OBem_{ω·m+k}), and
(2) 𝓒_{ω·m} ∈ (OBem_{ω·m} − ∪_{β<ω·m} OBem_β).
Proof. The base cases are given by Theorem 5 and Theorem 6, respectively. Learnability is straightforward: an element ⟨ω·i, s, p_1^t⟩ informs the learner that s elements of type ω·i are to be memorized, using one of the m available additive terms of the form ω in the ordinal counter. We sketch the general unlearnability proof for point (1). Given, by way of contradiction, a candidate OBem_{ω·m+k}-learner (M, j_0) for 𝓒_{ω·m+(k+1)}, we consider a locking sequence σ for M on C_{k+1} and distinguish three cases.
(Case 1) For every extension σ′ ⊇ σ in C_{k+1}, M makes no memory extension while processing σ′. Then M can be fooled as an iterative learner as in Case 2.1 of Theorem 4 above.
(Case 2) Not (Case 1), and for some extension σ′ of σ in C_{k+1}, M makes more than k memory extensions while processing σ′. For some m′ ≤ m and s ∈ N, M commits to using memory bounded by ω·m′ + s beyond σ′. Then one can argue analogously to Case 1 of Theorem 4 above, considering the class of languages in 𝓒_{ω·m+(k+1)} that contain content(σ′). This class can be shown to include a class that is hard for OBem_{ω·m′}-learners.
(Case 3) Not (Case 1) and not (Case 2). Then there exists an extension σ′ of σ in C_{k+1} such that M makes at least one memory extension while processing σ′, and for all extensions σ′′ of σ′ in C_{k+1}, M makes at most k memory extensions while processing σ′′. Then one can use the counting argument from the proof of Theorem 1 (Claim 3 of Theorem 5 in [9]), as done in Theorem 5 above.
6 Conclusion
We have introduced a proper extension of the Bounded Example Memory model featuring algorithmic count-down from constructive ordinals to bound the number of proper, global memory extensions an incremental learner is allowed on its way to convergence. We have shown that the concept gives rise to criteria that lie strictly between the finite Bounded Example Memory hierarchy ∪_k Bem_k and set-driven learning Bem_∗. We have exhibited a hierarchy of learning criteria up through ordinal ω^2. We are confident that the general problem - given constructive ordinals α > β, is it the case that (OBem_α − OBem_β) ≠ ∅? - can be attacked using similar methods. We also plan to investigate ordinal versions of feedback learning from [3]. An interesting side-question is: are learners with cumulative memory as powerful as learners that have the freedom to erase memory content? Acknowledgments. The author thanks the ALT 2009 anonymous referees for useful comments. Special thanks go to one of the referees, who also suggested how to extend the results of the present paper. Doing justice to this suggestion would have required a substantial reworking of the presentation and will be taken up in future work.
References
[1] Blum, L., Blum, M.: Toward a mathematical theory of inductive inference. Information and Control 28, 125–155 (1975)
[2] Carlucci, L., Case, J., Jain, S.: Learning correction grammars. In: Bshouty, N., Gentile, C. (eds.) Proceedings of the 20th Annual Conference on Learning Theory, San Diego, USA, pp. 203–217 (2007)
[3] Case, J., Jain, S., Lange, S., Zeugmann, T.: Incremental concept learning for bounded data mining. Information and Computation 152(1), 74–110 (1999)
[4] Freivalds, R., Smith, C.H.: On the role of procrastination for machine learning. Information and Computation 107, 237–271 (1993)
[5] Gold, E.M.: Language identification in the limit. Information and Control 10, 447–474 (1967)
[6] Jain, S., Osherson, D., Royer, J., Sharma, A.: Systems that Learn: An Introduction to Learning Theory, 2nd edn. MIT Press, Cambridge (1999)
[7] Lange, S., Moelius, S.E., Zilles, S.: Learning with temporary memory. In: Freund, Y., Györfi, L., Turán, G., Zeugmann, T. (eds.) ALT 2008. LNCS (LNAI), vol. 5254, pp. 449–463. Springer, Heidelberg (2008)
[8] Kinber, E., Stephan, F.: Language learning from texts: mind-changes, limited memory, and monotonicity. Information and Computation 123(2), 224–241 (1995)
[9] Lange, S., Zeugmann, T.: Incremental learning from positive data. Journal of Computer and System Sciences 53(1), 88–103 (1996)
[10] Rogers, H.: Theory of Recursive Functions and Effective Computability. McGraw-Hill, New York (1967); Reprinted by MIT Press (1987)
[11] Wexler, K., Culicover, P.W.: Formal Principles of Language Acquisition. MIT Press, Cambridge (1980)
[12] Wiehagen, R.: Limes-Erkennung rekursiver Funktionen durch spezielle Strategien. Elektronische Informationsverarbeitung und Kybernetik 12(1/2), 93–99 (1976)
Learning from Streams
Sanjay Jain^1, Frank Stephan^2, and Nan Ye^3
^1 Department of Computer Science, National University of Singapore, Singapore 117417, Republic of Singapore
[email protected] 2 Department of Computer Science and Department of Mathematics, National University of Singapore, Singapore 117417, Republic of Singapore
[email protected] 3 Department of Computer Science, National University of Singapore, Singapore 117417, Republic of Singapore
[email protected]
Abstract. Learning from streams is a process in which a group of learners separately obtain information about the target to be learned, but they can communicate with each other in order to learn the target. We are interested in machine models for learning from streams and study their learning power (as measured by the collection of learnable classes). We study how the power of learning from streams depends on the two parameters m and n, where n is the number of learners which track a single stream of input each and m is the number of learners (among the n learners) which have to find, in the limit, the right description of the target. We study for which combinations m, n and m′, n′ the following inclusion holds: every class learnable from streams with parameters m, n is also learnable from streams with parameters m′, n′. For the learning of uniformly recursive classes, we get a full characterization which depends only on the ratio m/n; but for general classes the picture is more complicated. Most of the noninclusions in team learning carry over to noninclusions with the same parameters in the case of learning from streams; but only few inclusions are preserved and some additional noninclusions hold. Besides this, we also relate learning from streams to various other closely related and well-studied forms of learning: iterative learning from text, learning from incomplete text and learning from noisy text.
1 Introduction
The present paper investigates the scenario where a team of learners observes data from various sources, called streams, so that only the combination of all these data gives the complete picture of the target to be learnt; in addition, the communication abilities between the team members are limited. Examples of such a scenario are the following: some scientists perform experiments to study a phenomenon, but no one has the budget to do all the necessary experiments and
Supported in part by NUS grant number R252-000-308-112. Supported in part by NUS grant number R146-000-114-112.
therefore they share the results; various earth-bound telescopes observe an object in the sky, where each telescope can see the object only during some hours a day; several space ships jointly investigate a distant planet. This concrete setting is put into the abstract framework of inductive inference as introduced by Gold [2,5,9]: the target to be learnt is modeled as a recursively enumerable set of natural numbers (which is called a “language”); the team of learners has to find in the limit an index for this set in a given hypothesis space. This hypothesis space might be either an indexed family or, in the most general form, just a fixed acceptable numbering of all r.e. sets. Each team member gets as input a stream whose range is a subset of the set to be learnt; but all team members together see all the elements of the set to be learnt. Communication between the team members is modeled by allowing each team member to finitely often make its data available to all the other learners. The notion described above is denoted as [m, n]StreamEx-learning, where n is the number of team members and m is the minimum number of learners out of these n which must converge to the correct hypothesis in the limit. Note that this notion of learning from streams is a variant of team learning, denoted as [m, n]TeamEx, which has been extensively studied [1,11,15,16,18,19]; the main difference between the two notions is that in team learning, all members see the same data, while in learning from streams, each team member sees only a part of the data and can exchange with the other team members only finitely much information. In the following, Ex denotes the standard notion of learning in the limit from text; this notion coincides with [1, 1]StreamEx. In related work, Baliga, Jain and Sharma [4] investigated a model of learning from various sources of inaccurate data where most of the data sources are nearly accurate. We start with giving the formal definitions in Section 2. In Section 3 we first establish a characterization result for learning indexed families. Our main theorem in this section, Theorem 7, gives a tell-tale-like characterization of learning from streams for indexed families: an indexed family 𝓛 = {L_0, L_1, . . .} is [m, n]StreamEx-learnable iff it is [1, ⌊n/m⌋]StreamEx-learnable iff there exists a uniformly r.e. sequence E_0, E_1, . . . of finite sets such that E_i ⊆ L_i and there are at most ⌊n/m⌋ many languages L ∈ 𝓛 with E_i ⊆ L ⊆ L_i. Thus, for indexed families, the power of learning from streams depends only on the success ratio. Additionally, we show that for indexed families, the hierarchy for stream learning is similar to the hierarchy for team function learning (see Corollary 9); note that there is an indexed family in [m, n]TeamEx − [m, n]StreamEx iff m/n ≤ 1/2. We further show (Theorem 11) that a class 𝓛 can be noneffectively learned from streams iff each language in 𝓛 has a finite tell-tale set [2] with respect to the class 𝓛, though these tell-tale sets may not be uniformly recursively enumerable from their indices. Hence the separation among different stream learning criteria is due to computational rather than information-theoretic reasons. In Section 4 we consider the relationship between stream learning criteria with different parameters, for general classes of r.e. languages. Unlike in the indexed family case, we show that more streaming is harmful (Theorem 13): there are classes of languages which can be learned by all n learners when the data is
divided into n streams, but which cannot be learned even by one of the learners when the data is divided into n′ > n streams. Hence, for learning r.e. classes, [1, n]StreamEx and [1, n′]StreamEx are incomparable for different n, n′ ≥ 1. This stands in contrast to the learning of indexed families, where [1, n]StreamEx is properly contained in [1, n + 1]StreamEx for each n ≥ 1. Theorem 14 shows that requiring a smaller number of machines to be successful gives more power to stream learning even if the success ratio is sometimes high: for each m there exists a class which is [m, n]StreamEx-learnable for all n ≥ m but not [m + 1, n′]StreamEx-learnable for any n′ ≥ 2m. In Section 5 we first show that stream learning is a proper restriction of team learning in the sense that [m, n]StreamEx ⊂ [m, n]TeamEx as long as 1 ≤ m ≤ n and n > 1. We also show how to carry over several separation results from team learning to learning from streams, as well as give one simulation result which carries over. In particular, we show in Theorem 17 that if m/n > 2/3 then [m, n]StreamEx = [n, n]StreamEx. Also, in Theorem 19 we show that if m/n ≤ 2/3 then [m, n]StreamEx ⊈ Ex. One can similarly carry over several more separation results from team learning. One could consider streaming of data as some form of “missing data”, as each individual learner does not get to see all the data which is available, even though potentially any particular datum can be made available to all the learners via synchronization. Iterative learning studies a similar phenomenon from a different perspective: though the (single) learner gets all the data, it cannot remember all of its past data; its new conjecture depends only on its immediately preceding conjecture and the new datum. We show in Theorem 20 that in the context of iterative learning, learning from streams is not restrictive (and is advantageous in some cases, as Corollary 8 can be adapted for iterative stream learners). We additionally compare stream learning with learning from incomplete or noisy data as considered in [8,13].
2 Preliminaries and Model for Stream Learning
For any unexplained recursion-theoretic notation, the reader is referred to the textbooks of Rogers [17] and Odifreddi [12]. The symbol N denotes the set of natural numbers, {0, 1, 2, 3, . . .}. Subsets of N are referred to as languages. The symbols ∅, ⊆, ⊂, ⊇ and ⊃ denote the empty set, subset, proper subset, superset and proper superset, respectively. The cardinality of a set S is denoted by card(S). max(S) and min(S), respectively, denote the maximum and minimum of a set S, where max(∅) = 0 and min(∅) = ∞. dom(ψ) and ran(ψ) denote the domain and range of ψ. Furthermore, ⟨·, ·⟩ denotes a recursive 1–1 and onto pairing function [17] from N × N to N which is increasing in both its arguments:
⟨x, y⟩ = (x + y)(x + y + 1)/2 + y. The pairing function can be extended to n-tuples by taking ⟨x_1, x_2, . . . , x_n⟩ = ⟨x_1, ⟨x_2, . . . , x_n⟩⟩.
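A minimal sketch of this pairing function and its inverse follows; the names pair and unpair are not from the paper.

def pair(x: int, y: int) -> int:
    # Cantor pairing: a recursive bijection N x N -> N, increasing in
    # both arguments.
    return (x + y) * (x + y + 1) // 2 + y

def unpair(z: int):
    # Invert pair by locating the diagonal w = x + y containing z.
    w = 0
    while (w + 1) * (w + 2) // 2 <= z:
        w += 1
    y = z - w * (w + 1) // 2
    return (w - y, y)

assert all(unpair(pair(x, y)) == (x, y) for x in range(50) for y in range(50))

The extension to n-tuples nests the function, e.g. pair(x1, pair(x2, x3)) for triples.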
The information available to the learner is a sequence consisting of exactly the elements of the language being learned. In general, any sequence T over N ∪ {#} is called a text, where # indicates a pause in information presentation. T(t) denotes the (t + 1)-st element in T and T[t] denotes the initial segment of T of length t. Thus T[0] is the empty sequence. ctnt(T) denotes the set of numbers in the text T. If σ is an initial segment of a text, then ctnt(σ) denotes the set of numbers in σ. Let SEQ denote the set of all initial segments. For σ, τ ∈ SEQ, σ ⊆ τ denotes that σ is an initial segment of τ. |σ| denotes the length of σ. A learner from texts is an algorithmic mapping from SEQ to N ∪ {?}. Here the output ? of the learner is interpreted as “no conjecture at this time.” For a learner M, one can view the sequence M(T[0]), M(T[1]), . . . as a sequence of conjectures (grammars) made by M on T. Intuitively, successful learning is characterized by the sequence of conjectured hypotheses eventually stabilizing on correct ones. The concepts of stabilization and correctness can be formulated in various ways, and we will be mainly concerned with the notion of explanatory (Ex) learning. The conjectures of learners are interpreted as grammars in a given hypothesis space H, which is always a recursively enumerable family of r.e. languages (in some cases, we even take the hypothesis space to be a uniformly recursive family, also called an indexed family). Unless specified otherwise, the hypothesis space is taken to be a fixed acceptable numbering W_0, W_1, . . . of all r.e. sets.
Definition 1 (Gold [9]). Given a hypothesis space H = {H_0, H_1, . . .} and a language L, a sequence of indices i_0, i_1, . . . is said to be an Ex-correct grammar sequence for L if there exists s such that for all t ≥ s, H_{i_t} = L and i_t = i_s. A learner M Ex-learns a class 𝓛 of languages iff for every L ∈ 𝓛 and every text T for L, M on T outputs an Ex-correct grammar sequence for L. We use Ex also to denote the collection of language classes which are Ex-learnt by some learner.
Now we consider learning from streams. For this, the learners get streams of texts as input, rather than just one text.
Definition 2. Let n ≥ 1. T = (T_1, . . . , T_n) is said to be a streamed text for L if ctnt(T_1) ∪ . . . ∪ ctnt(T_n) = L. Here n is called the degree of dispersion of the streamed text. We sometimes call a streamed text just a text, when it is clear from the context what is meant. Suppose T = (T_1, . . . , T_n) is a streamed text. Then, for all t, σ = (T_1[t], . . . , T_n[t]) is called an initial segment of T. Furthermore, we define T[t] = (T_1[t], . . . , T_n[t]). We define ctnt(T[t]) = ctnt(T_1[t]) ∪ . . . ∪ ctnt(T_n[t]) and similarly for the content of streamed texts. We let SEQ_n = {(σ_1, σ_2, . . . , σ_n) : σ_1, σ_2, . . . , σ_n ∈ SEQ and |σ_1| = |σ_2| = . . . = |σ_n|}. For σ = (σ_1, σ_2, . . . , σ_n) and τ = (τ_1, τ_2, . . . , τ_n), we say that σ ⊆ τ if σ_i ⊆ τ_i for i ∈ {1, . . . , n}.
Let 𝓛 be a language collection and H be a hypothesis space. When learning from streams, a team M_1, . . . , M_n of learners accesses a streamed text T = (T_1, . . . , T_n) and works as follows. At time t, each learner M_i sees as input T_i[t] plus the initial segment T[sync_t], outputs a hypothesis h_{i,t} and might update sync_{t+1} to t. Here, initially sync_0 = 0 and sync_{t+1} = sync_t whenever no team member updates sync_{t+1} at time t.
Assume that 1 ≤ m ≤ n. A team (M_1, . . . , M_n) [m, n]StreamEx-learns 𝓛 iff for every L ∈ 𝓛 and every streamed text T for L, (a) there is a maximal t such that sync_{t+1} = t, and (b) for at least m indices i ∈ {1, 2, . . . , n}, the sequence of hypotheses h_{i,0}, h_{i,1}, . . . is an Ex-correct sequence for L. We let [m, n]StreamEx denote the collection of language classes which are [m, n]StreamEx-learnt by some team. The ratio m/n is called the success ratio of the team. Note that a class 𝓛 is [1, 1]StreamEx-learnable iff it is Ex-learnable. A further important notion is that of team learning [18]. This can be reformulated in our setting as follows: 𝓛 is [m, n]TeamEx-learnable iff there is a team of learners (M_1, . . . , M_n) which [m, n]StreamEx-learn every language L ∈ 𝓛 from every streamed text (T_1, . . . , T_n) for L with T_1 = T_2 = · · · = T_n (and thus each T_i is a text for L). For notational convenience we sometimes use M_i(T[t]) = M_i(T_1[t], . . . , T_n[t]) (along with M_i(T_i[t], T[sync_t])) to denote M_i's output at time t when the team M_1, . . . , M_n gets the streamed text T = (T_1, . . . , T_n) as input. Note that here the learner sees several inputs rather than just one input as in the case of learning from texts (Ex-learning); it will be clear from context which kind of learner is meant. One can regard the updating of sync_{t+1} to t as synchronization, as the data available to any of the learners is passed to every learner. Thus, for ease of exposition, we often refer to the updating of sync_{t+1} to t by M_i as a request for synchronization by M_i. Note that in our models there is no more synchronization after some finite time. If one allowed synchronization without such a constraint, then the learners could synchronize at every step and thus there would be no difference from the team learning model. Furthermore, in our model there is no restriction on how the data is distributed among the learners. This is assumed to be done in an adversarial manner, with the only constraint being that every datum appears in some stream. A stronger form would be that the data is distributed via some fixed mechanism (for example, x, if present, is assigned to the stream x mod n + 1). We will not be concerned with such distributions but only point out that learning in such a scenario is easier. The following proposition is immediate from Definition 2.
Proposition 3. Suppose 1 ≤ m ≤ n. Then the following statements hold.
(a) [m, n]StreamEx ⊆ [m, n]TeamEx.
(b) [m + 1, n + 1]StreamEx ⊆ [m, n + 1]StreamEx.
(c) [m + 1, n + 1]StreamEx ⊆ [m, n]StreamEx.
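The following sketch simulates the round structure described above: at each time t every learner sees its own stream up to t plus the synchronized prefix T[sync_t], and any learner may request that sync be advanced. The learner interface (a step method) is a hypothetical stand-in for the machines M_1, . . . , M_n.

def run_stream_protocol(learners, streams, rounds):
    # streams: one iterator per learner; learners[i].step(own, shared)
    # returns (hypothesis, request_sync).
    n = len(learners)
    prefixes = [[] for _ in range(n)]
    history = []                     # T(1), T(2), ... as tuples across streams
    sync = 0                         # sync_0 = 0
    hypotheses = [[] for _ in range(n)]
    for t in range(1, rounds + 1):
        row = tuple(next(s) for s in streams)
        history.append(row)
        for i in range(n):
            prefixes[i].append(row[i])   # own prefix T_i[t]
        shared = history[:sync]          # synchronized prefix T[sync_t]
        requests = [False] * n
        for i in range(n):
            h, requests[i] = learners[i].step(prefixes[i], shared)
            hypotheses[i].append(h)
        if any(requests):                # some member updates sync_{t+1} to t
            sync = t
    return hypotheses

Success in the [m, n]StreamEx sense then requires that sync is updated only finitely often and that at least m of the hypothesis sequences converge to a correct index.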
The following definitions of stabilizing and locking sequences are generalizations of similar definitions for learning from texts.
Definition 4 (Based on Blum and Blum [5], Fulk [7]). Suppose that L is a language and M_1, . . . , M_n are learners. Then σ = (σ_1, . . . , σ_n) is called a stabilizing sequence for M_1, . . . , M_n on L for [m, n]StreamEx-learning iff ctnt(σ) ⊆ L and there are at least m numbers i ∈ {1, . . . , n} such that for all streamed texts T for L with σ = T[|σ|] and for all t ≥ |σ|, when M_1, . . . , M_n are fed the streamed text T, for sync_t and h_{i,t} as defined in Definition 2, (a) sync_t ≤ |σ| and (b) h_{i,t} = h_{i,|σ|}. A stabilizing sequence σ is called a locking sequence for M_1, . . . , M_n on L for [m, n]StreamEx-learning iff in (b) above h_{i,|σ|} is additionally an index for L (in the hypothesis space used). The following fact is based on a result of Blum and Blum [5].
Fact 5. Assume that 𝓛 is [m, n]StreamEx-learnable by M_1, . . . , M_n. Then there exists a locking sequence σ for M_1, M_2, . . . , M_n on each L ∈ 𝓛.
3 Some Characterization Results
In this section we first consider a characterization of learning from streams for indexed families. Our characterization is similar in spirit to Angluin's characterization of learnable indexed families.
Definition 6 (Angluin [2]). 𝓛 is said to satisfy the tell-tale set criterion if for every L ∈ 𝓛 there exists a finite set D_L ⊆ L such that for any L′ ∈ 𝓛 with L′ ⊇ D_L, we have L′ ⊄ L. D_L is called a tell-tale set of L. {D_L : L ∈ 𝓛} is called a family of tell-tale sets of 𝓛.
Angluin [2] used the term exact learning to refer to learning using the language class to be learned as the hypothesis space, and she showed that a uniformly recursive language class 𝓛 is exactly Ex-learnable iff it has a uniformly recursively enumerable family of tell-tale sets [2]. A similar characterization holds for noneffective learning [10, pp. 42–43]: any class 𝓛 of r.e. languages is noneffectively Ex-learnable iff 𝓛 satisfies the tell-tale criterion. For learning from streamed text, we have the following corresponding characterization.
Theorem 7. Suppose k ≥ 1, 1 ≤ m ≤ n and 1/(k+1) < m/n ≤ 1/k. Suppose 𝓛 = {L_0, L_1, . . .} is an indexed family where one can effectively (in i, x) test whether x ∈ L_i. Then 𝓛 ∈ [m, n]StreamEx iff there exists a uniformly r.e. sequence E_0, E_1, . . . of finite sets such that for each i, E_i ⊆ L_i and there are at most k sets L ∈ 𝓛 with E_i ⊆ L ⊆ L_i.
Proof. (⇒): Suppose M_1, M_2, . . . , M_n witness that 𝓛 is in [m, n]StreamEx. Consider any L_i ∈ 𝓛. Let σ = (σ_1, σ_2, . . . , σ_n) be a stabilizing sequence for M_1, M_2, . . . , M_n on L_i. Fix any j such that 1 ≤ j ≤ n and, for all streamed texts T for L_i which extend σ and all t ≥ |σ|, M_j(T[t]) = M_j(σ). Let T_r = σ_r · #^∞ for r ∈ {1, . . . , n} − {j}. Thus, for any L ∈ 𝓛 and text T_j for L such that T_j extends σ_j and ctnt(σ) ⊆ L ⊆ L_i, we have that m of M_1, . . . , M_n on (T_1, . . . , T_n) converge to grammars for L. Since the sequence of grammars output by M_r on (T_1, T_2, . . . , T_n) is independent of the L chosen above (with the only constraint being that L satisfies ctnt(σ) ⊆ L ⊆ L_i), we have that there can be at most ⌊n/m⌋ such
L ∈ 𝓛. Now note that a stabilizing sequence σ for M_1, M_2, . . . , M_n on L_i can be found in the limit. Let σ^s denote the s-th approximation to σ. Then one can let E_i = ∪_{s∈N} ctnt(σ^s) ∩ L_i.
(⇐): Assume without loss of generality that the L_i are pairwise distinct. Let E_{i,s} denote E_i enumerated within s steps by the uniform process for enumerating all the E_i's. Now, the learners M_1, . . . , M_n work as follows on a streamed text T. The learners keep variables i_t, s_t along with sync_t. Initially i_0 = s_0 = 0. At time t ≥ 0 the learner M_j does the following: if E_{i_t,s_t} ⊈ ctnt(T[sync_t]) or E_{i_t,s_t} ≠ E_{i_t,t} or ctnt(T_j[t]) ⊈ L_{i_t}, then synchronize and let i_{t+1}, s_{t+1} be such that ⟨i_{t+1}, s_{t+1}⟩ = ⟨i_t, s_t⟩ + 1. Note that ⟨i_t, s_t⟩ can be recovered from T[sync_t]. Note also that for an input streamed text T for L_i, the values of ⟨i_t, s_t⟩ converge as t goes to ∞: otherwise sync_t also diverges, and once sync_t is large enough so that E_{i′} ⊆ ctnt(T[sync_t]) and one considers ⟨i_t, s_t⟩ for which i_t = i′ and E_{i′,s} = E_{i′,s_t} for s ≥ s_t, the conditions above ensure that ⟨i_t, s_t⟩ and sync_t do not change any further. Furthermore, i′ = lim_{t→∞} i_t satisfies E_{i′} ⊆ L_i ⊆ L_{i′}. The output conjectures of the learners at time t are determined as follows: let S be the set of (up to) k least elements j below t such that each j ∈ S satisfies E_{i_t,s_t} ⊆ L_j ∩ {x : x ≤ t} ⊆ L_{i_t} ∩ {x : x ≤ t}. Then we allocate, for each j ∈ S, m learners to output grammars for L_j. It is easy to verify that, for large enough t, i_t and s_t will have stabilized, say to i′ and s′ respectively, and S will contain every j such that E_{i′} ⊆ L_j ⊆ L_{i′}. Thus, the team M_1, M_2, . . . , M_n will [m, n]StreamEx-learn each L_j such that E_{i′} ⊆ L_j ⊆ L_{i′} (the input language L_i is one such L_j). The theorem follows from the above analysis.
Here note that the direction (⇒) of the theorem holds even for arbitrary classes 𝓛 of r.e. languages, rather than just indexed families. The direction (⇐) does not hold for arbitrary classes of r.e. languages. Furthermore, the learning algorithm given above for the direction (⇐) uses the indexed family 𝓛 itself as the hypothesis space: so this is exact learning.
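A sketch of the conjecture-allocation step from the (⇐) direction above: the candidates consistent with the current finite set are computed up to the bound t, and each candidate is backed by m of the n learners. The decision procedure member(j, x) for x ∈ L_j and all names here are illustrative stand-ins for the r.e. approximations used in the proof.

def allocate_conjectures(E, i, t, member, k, m, n):
    # Return, for each of the n learners, the candidate index it backs.
    below = set(range(t + 1))
    L_i = {x for x in below if member(i, x)}
    candidates = []
    for j in range(t + 1):
        L_j = {x for x in below if member(j, x)}
        if E <= L_j <= L_i:          # E within L_j within L_i, up to bound t
            candidates.append(j)
        if len(candidates) == k:     # at most the k least such indices
            break
    return [candidates[r // m] if r // m < len(candidates) else None
            for r in range(n)]

Since m/n ≤ 1/k gives k·m ≤ n, the allocation never runs out of learners.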
n m
≥ k+1 >
n m .
Let L
– the sets {2e + 2x : x ∈ N} for all e; – the sets {2e + 2x : x ≤ |We | + r} for all e ∈ N and r ∈ {1, 2, . . . , k}; – all finite sets containing at least one odd element. Then L ∈ [m, n]StreamEx − [m , n ]StreamEx and L can be chosen as an indexed family. Proof sketch. First we show that L ∈ [1, k + 1]StreamEx. For each e and for each L ⊆ {2e, 2e + 2, 2e + 4, . . .} with {2e} ⊆ L, let EL = {2e}; also, for any language L ∈ L containing an odd number, let EL = L. Now, for an appropriate indexing L0 , L1 , . . . of L, {ELi : i ∈ N} is a collection of uniformly r.e. finite sets and for each L ∈ L, there are at most k + 1 sets L ∈ L such that EL ⊆ L ⊆ L. Thus, L ∈ [1, k+1]StreamEx by Theorem 7. On the other hand, for each L ∈ L, one cannot effectively (in indices for L) enumerate a finite subset EL of L such
Learning from Streams
345
that EL ⊆ L ⊆ L for at most k languages L ∈ L. We omit the details and the proof that L can be chosen as an indexed family. Corollary 9. Let IND denote the collection of all indexed families. Suppose 1 ≤ m ≤ n and 1 ≤ m ≤ n . Then [m, n]StreamEx∩IND ⊆ [m , n ]StreamEx n n ≤ m ∩ IND iff m . Remark 10. One might also study the inclusion problem for IND with respect to related criteria. One of them being conservative learning [2], where the additional requirement is that a team member Mi of a team M1 , . . . , Mn can change its hypothesis from Ld to Le only if it has seen, either in its own stream or in the synchronized part of all streams, some datum x ∈ / Ld . If one furthermore requires that the learner is exact, that is, uses the hypothesis space given by the indexed family, then one can show that there are more breakpoints than in the case of usual team learning. For example, there is a class which under these assumptions is conservatively [2, 3]StreamEx-learnable but not conservatively learnable. The indexed family L = {L0 , L1 , . . .} witnessing this separation is defined as follows. Let Φ be a Blum complexity measure. For e ∈ N and a ∈ {1, 2}, L3e+a is {e, e + 1, e + 2, . . .} if Φe (e) = ∞ and L3e+a is {e, e + 1, e + 2, . . .} − {Φe (e) + e + a} if Φe (e) < ∞. Furthermore, the sets L0 , L3 , L6 , . . . form a recursive enumeration of all finite sets D for which there is an e with Φe (e) < ∞, min(D) = e and max(D) ∈ {Φe (e) + e + 1, Φe (e) + e + 2}. Note that the usage of the exact hypothesis space is essential for this remark. However, the earlier results of this section do not depend on the choice of the 1 m hypothesis space. Assume that there is a k ∈ {1, 2, 3, . . .} with m n ≤ k < n . Then, similarly to Corollary 8, one can show that some class is conservatively [m, n]StreamEx-learnable but not conservatively [m , n ]StreamEx-learnable. The following result follows using the proof of Theorem 7 for noneffective learners. For noneffective learners one can consider every class as an indexed family. Furthermore, finitely many elements can be added to Ei to separate Li from the finitely many subsets of it which contain Ei and are proper subsets of Li — thus giving us a tell-tale set for Li . Theorem 11. Suppose 1 ≤ m ≤ n. L is noneffectively [m, n]StreamEx-learnable iff L satisfies Angluin’s tell-tale set criterion. The above theorem shows that any separation between learning from streams with different parameters must be due to computational difficulties. Remark 12. Behaviourally correct learning (Bc-learning) requires a learner to eventually output only correct hypotheses. Thus, the learner semantically converges to a correct hypothesis, but may not converge syntactically (see [6,14] for a formal definition). Suppose n ≥ 1. If an indexed family is [1, n]StreamExlearnable, then it is Bc-learnable using an acceptable numbering as hypothesis space. This follows from the fact that an indexed family is Bc-learnable using an acceptable numbering as hypothesis space iff it satisfies the noneffective tell-tale criterion [3].
346
4
S. Jain, F. Stephan, and N. Ye
Relationship between Various StreamEx-criteria
In this and the next section, for m, n, m , n with 1 ≤ m ≤ n and 1 ≤ m ≤ n , we consider the relationship between [m, n]StreamEx and [m , n ]StreamEx. We shall develop some basic theorems to show how the degree of dispersion, the success ratio and the number of successful learners required, affect the ability to learn from streams. First, we show that the degree of dispersion plays an important role in the power of learning from streams. The next theorem shows that for any n, there are classes which are learnable from streams when the degree of dispersion is not more than n, but are not learnable from streams when the degree of dispersion is larger than n, irrespective of the success ratio. Theorem 13. Forany n ≥ 1, there exists a language class L such that L ∈ [n, n]StreamEx − n >n [1, n ]StreamEx. Proof. Consider the class L = L1 ∪ L2 , where L1 = {L : L = Wmin(L) ∧ ∀x[card({(n + 1)x, . . . , (n + 1)x + n} ∩ L) ≤ 1]} and L2 = {L : ∃x [{(n + 1)x, . . . , (n + 1)x + n} ⊆ L] and L = Wx for the least such x}. It is easy to verify that L can be [n, n]StreamEx-learnt. The learners can use synchronization to first find out the minimal element e in the input language; thereafter, they can conjecture e, until one of the learners (in its stream) observes (n+1)x+j and (n+1)x+j for some x, j, j , where j = j and j, j ≤ n; in this case the learners use synchronization to find and conjecture (in the limit) the minimal x such that {(n + 1)x, . . . , (n + 1)x + n} is contained in the input language. Now suppose by way of contradiction that L is [1, n ]StreamEx-learnable by M1 , . . . , Mn for some n > n. We will use Kleene’s recursion theorem to construct a language in L which is not [1, n ]StreamEx-learned by M1 , . . . , Mn . First, we give an algorithm to construct in stages a set Se depending on a parameter e. At stage s, we construct (σ1,s , . . . , σn ,s ) ∈ SEQn where we will always have that σi,s ⊆ σi,s+1 . – Stage 0: (σ1,0 , σ2,0 , . . . , σn ,0 ) = (e, #, . . . , #). Enumerate e into Se . – Stage s > 0. Let σ = (σ1,s−1 , . . . , σn ,s−1 ). Search for a τ = (τ1 , . . . , τn ) ∈ SEQn , such that (i) for i ∈ {1, . . . , n }, σi,s−1 ⊂ τi , (ii) min(ctnt(τ )) = e and (iii) for all x, card({y : y ≤ n, (n + 1)x + y ∈ ctnt(τ )}) ≤ 1, and one of the following holds: (a) One of the learners requests for synchronization after τ is given as input to the learners M1 , . . . , Mn . (b) All the learners make a mind change between seeing σ and τ , that is, for all i with 1 ≤ i ≤ n , for some τ with σ ⊆ τ ⊆ τ , Mi (σ) = Mi (τ ). If one of the searches succeeds, then let σi,s = τi , enumerate ctnt(τ ) into Se and go to stage s + 1.
Learning from Streams
347
If each stage finishes, then by Kleene’s recursion theorem, there exists an e such that We = Se and thus We ∈ L1 . For i ∈ {1, . . . , n }, let Ti = s σi,s . Now, either the learners M1 , . . . , Mn synchronize infinitely often or each of them makes infinitely many mind changes when the streamed text T = (T1 , T2 , . . . , Tn ) is given to them as input. Hence M1 , . . . , Mn do not [1, n ]StreamEx-learn We ∈ L1 . Now suppose stage s starts but does not finish. Let σ = (σ1,s−1 , σ2,s−1 , . . . , σn ,s−1 ). Thus, as the learners only see their own texts and the data given to every learner up to the point of last synchronization, we have that for some j with 1 ≤ j ≤ n , for all τ = (τ1 , τ2 , . . . , τn ) extending σ = (σ1,s−1 , σ2,s−1 , . . . , σn ,s−1 ), such that min(ctnt(τ )) = e and for all x, i, card({y : y ≤ n, (n + 1)x + y ∈ ctnt(σ) ∪ ctnt(τi )}) ≤ 1, (a) none of the learners synchronize after seeing τ and (b) Mj does not make a mind change between σ and τ . Let rem(i) = i mod (n + 1). Let xs = 1 + max(ctnt(σ)). For 1 ≤ i ≤ n , such that rem(i) = rem(j), let Ti be an extension of σi,s such that ctnt(Ti ) − ctnt(σi,s ) = {(n + 1)(xs + x) + rem(i) : x ∈ N}. For i ∈ {1, . . . , n } with rem(i) = rem(j) and i = j, we let Ti = σi,s #∞ . We will choose Tj below such that σj,s−1 ⊆ Tj and ctnt(Tj ) − ctnt(σj,s−1 ) = {(n + 1)(xs + x) + rem(j) : xs + x ≥ k}, for some k > xs . Let pi be the grammar which Mi outputs in the limit, if any, when the team M1 , . . . , Mn is provided with the input (T1 , . . . , Tn ). As the learner Mi only sees Ti and the synchronized part of the streamed texts, by (a) and (b) above, we have that none of the members of team synchronize beyond σ and the learner Mj converges to the same grammar as it did after the team is provided with input σ, irrespective of which k > xs is chosen. Now, by Kleene’s recursion theorem there exists a k> xs such that Wk = ctnt(σj,s ) ∪ {(n + 1)(xs + x) + rem(j) : xs + x ≥ k} ∪ i∈{1,2,...,n }−{j} ctnt(Ti ) and Wk ∈ {Wpi : 1 ≤ i ≤ n }. Hence Wk ∈ L2 and Wk is not [1, n ]StreamEx-learnt by M1 , . . . , Mn . The theorem follows from the above analysis. The following result shows that the number of successful learners affects learnability from streams crucially. Theorem 14. Suppose k ≥ 1. Then, there exists an L such that for all n ≥ k and n ≥ 2k, L ∈ [k, n]StreamEx but L ∈ [k + 1, n ]StreamEx. Proof. Let k be as in the statement of the theorem. Let ψ be a partial recursive function such that ran(ψ) ⊆ {1, . . . , 2k}, the complement of dom(ψ) is infinite and for any r.e. set S such that S ∩ C is infinite, S ∩ B is nonempty, where B = { x, y : ψ(x) = y} and C = { x, j : x ∈ dom(ψ), 1 ≤ j ≤ 2k}. Note that one can construct such a ψ in a way similar to the construction of simple sets. Let Ax = B ∪ { x, j : 1 ≤ j ≤ 2k}. Let L = {B} ∪ {Ax : x ∈ dom(ψ)}. We claim that L ∈ [k, n]StreamEx for all n ≥ k and that L ∈ [k + 1, n ]StreamEx for all n ≥ 2k. We construct M1 , . . . , Mk which [k, n]StreamEx-learn L as follows.
348
S. Jain, F. Stephan, and N. Ye
On input T [t] = (T1 [t], . . . , Tn [t]), the learners synchronize if for some i, ctnt(Ti [t − 1]) does not contain x, j and x, j with j = j , but ctnt(Ti [t]) does contain such x, j and x, j . If synchronization has happened (in some previous step), then the learners output a grammar for B ∪ { x, j : 1 ≤ j ≤ 2k}, where x is the unique number such that x, j and x, j are in the synchronized text for some j = j . Otherwise, M1 , . . . , Mk output a grammar for B and each Mi with k + 1 ≤ i ≤ n does the following: it first looks for the least x such that x, j ∈ ctnt(Ti [t]) for some j, and x is not verified to be in dom(ψ) in t steps; then Mi outputs a grammar for Ax if such an x is found, and outputs ? if no such x is found. If the learners ever synchronize, then clearly all learners correctly learn the target language. Suppose no synchronization happens. If the language is B, then M1 , . . . , Mk correctly learn the input language. If the language is Ax for some x∈ / dom(ψ), then n ≥ 2k (otherwise synchronization would have happened) and at least k learners among Mk+1 , . . . , Mn eventually see exactly one pair of the form x, j, where 1 ≤ j ≤ 2k, and these learners will correctly learn the input language. Now suppose by way of contradiction that a team (M1 , . . . , Mn ) of learners [k + 1, n ]StreamEx-learns L. By Fact 5, there exists a locking sequence σ = (σ1 , . . . , σn ) for the learners M1 , . . . , Mn on B. Let S ⊆ {1, . . . , n } be of size k + 1 such that the learners Mi , i ∈ S, do not make a mind change beyond σ on any streamed text T for B which extends σ. By definition of ψ, there must be only finitely many x, j ∈ C such that the learners M1 , M2 , . . . , Mn synchronize or one of the learners Mi , i ∈ S, makes a mind change beyond σ on any streamed text extending σ for B ∪ { x, j} — otherwise we would have an infinite r.e. set S consisting of such pairs, with S ⊆ C but S ∩ B = ∅, a contradiction to the definitions of ψ, B, C. Let X be the set of these finitely many x, j. Let Z be the set of x such that, for some i with 1 ≤ i ≤ n , the grammar output by Mi on input σ is for Ax , or the grammar output by Mi (in the limit) on input σi #∞ (with the last point of synchronization being before all of input σ is seen) is for Ax . Select some z ∈ / dom(ψ) such that z ∈ Z and (z, j) ∈ X for any j. Now we construct a streamed text extending σ for Az on which the learners fail. Let S ⊇ S be a subset of {1, 2, . . . , n } of size 2k. If i is the j-th element of S then choose Ti such that Ti extends σi and ctnt(Ti ) = B ∪ { z, j} else (when i ∈ / S ) ∞ let Ti = σi # . Thus, T = (T1 , . . . , Tn ) is a streamed text for Az . However, only the learners Mi with i ∈ S − S can converge to correct grammars for Az (as the learners Mi with i ∈ S or i ∈ S , would not have converged to a grammar for Az by definition of z, X and Z above). It follows that L ∈ / [k + 1, n ]StreamEx.
5
Learning from Streams versus Team Learning
Team learning is a special form of learning from streams, in which all learners receive the same complete information about the underlying reality, thus team
Learning from Streams
349
learnability provides upper bounds for learnability from streams with the same parameters. These upper bounds are strict. Theorem 15. Suppose 1 ≤ m ≤ n and n > 1. Then [m, n]StreamEx ⊂ [m, n]TeamEx. Remark 16. Another question is how this transfers to the learnability of in1 dexed families. If m n > 2 and L is an indexed family, then L ∈ [m, n]StreamEx iff L ∈ [m, n]TeamEx iff L ∈ Ex. But if 1 ≤ m ≤ n2 , then the class L consisting of N and all its finite subsets is [1, 2]TeamEx-learnable and [m, n]TeamExlearnable but not [m, n]StreamEx-learnable. Below we will show how several results from team learning can be carried over to the stream learning situation. It was previously shown that in team learning, when the success ratios exceed a certain threshold, then the exact success ratio does not affect learnability any longer. Using a similar majority argument, we can show similar collapsing results for learning from streams (Theorem 17 and Theorem 18). Theorem 17. Suppose 1 ≤ m ≤ n. If [n, n]StreamEx.
m n
>
2 3,
then [m, n]StreamEx =
Theorem 18. Suppose 1 ≤ m ≤ n and k ≥ 1. Then [ 2k 3 (n − m) + km, kn]StreamEx ⊆ [m, n]StreamEx. One can also carry over several diagonalization results from team learning to learning from streams. An example is the following. Theorem 19. For all j ∈ N, [j + 2, 2j + 3]StreamEx ⊆ [j + 1, 2j + 1]TeamEx. The class witnessing the separation is Lj = {L : card(L) ≥ j + 3 and if e0 < . . . < ej+2 are the j +3 smallest elements of L, then either [We0 = . . . = Wej+1 = L] or [at least one of e0 , . . . , ej+1 is a grammar for L and Wej+2 is finite and max(Wej+2 ) is a grammar for L]}. We omit the details of the proof.
6
Iterative Learning and Learning from Inaccurate Texts
In this section, the notion of learning from streams is compared with other notions of learning where the data is used by the learner in more restricted ways or the data is presented in more adversarial manner than in the standard case of learning. The first notion to be dealt with is iterative learning where the learner only remembers the most recent hypothesis, but does not remember any past data [20]. Later, we will consider other adversary input forms: for example the case of incomplete texts where finitely many data-items might be omitted [8,13] or noisy texts where finitely many data-items (not in the input language) might be added to the input text.
350
S. Jain, F. Stephan, and N. Ye
The motivation for iterative learning is the following: When humans learn, they do not memorize all past observed data, but mainly use the hypothesis they currently hold, together with new observations to formulate new hypotheses. Many scientific results can be considered to be obtained in iterative fashion. Iterative learning for learning from a single stream/text was previously modeled by requiring the learners to be a function of the previous hypothesis and the current observed data. Formally, a single-stream learner M : (N∪{#})∗ → (N∪{?}) is iterative if there exists a recursive function F : (N∪{?})×(N∪{#}) → N∪{?} such that on a text T , M (T [0]) =? and for t > 0, M (T [t]) = F (M (T [t− 1]), T (t)). For notational simplicity, we shall write F (M (T [t−1]), T (t)) as M (M (T [t−1]), T (t)). We can similarly define iterative learning from several streams by requiring each learner’s hypothesis to be a recursive function of its previous hypothesis and the set of the newest datum received by each learner — here, when synchronization happens, the learners only share the latest data seen by the learners rather than the whole history of data seen. Iterative learning can be considered as a form of information incompleteness as the learner(s) do not memorize all the past observed data. Interestingly, every iteratively learnable class is learnable from streams irrespective of the parameters. Theorem 20. For any n ≥ 1, every language class Ex-learnable by an iterative learner is iteratively [n, n]StreamEx-learnable. Proof. Suppose L is Ex-learnable by an iterative learner M . We construct M1 , . . . , Mn which [n, n]StreamEx-learn L. We maintain the invariant that each Mi outputs the same grammar g at each time step. Initially g =?. At any time t, suppose Mi receives a datum xti , previous hypothesis is g and the synchronized data, if any, was dt1 , dt2 , . . . , dtn . The output conjecture of the learners is g = g, if there is no synchronized data; otherwise the output conjecture of the learners is g = M (. . . M (M (g, dt1 )dt2 ) . . . dtn ). The learner Mi requests for synchronization if M (g , xti ) = g . Clearly M1 , . . . , Mn form a team of iterative learners from streams and always output the same hypothesis. Furthermore, it can be seen that if M on the text T1 (0)T2 (0) . . . Tn (0)T1 (1)T2 (1) . . . Tn (1) . . . converges to a hypothesis, then the sequence of hypothesis output by learners M1 , M2 , . . . , Mn also converges to the same hypothesis. Thus, if M iteratively learns the input language, then M1 , M2 , . . . , Mn also iteratively [n, n]StreamEx-learn the input language. Now we compare learning from streams with learning from an incomplete or noisy text. Formally, a text T ∈ (N ∪ {#})∞ is an incomplete text for L iff L ⊇ ctnt(T ) and L − ctnt(T ) is finite [8,13]. A text for L is noisy iff ctnt(T ) ⊆ L and ctnt(T ) − L is finite [13]. Ex-learning from incomplete or noisy texts is the same as Ex-learning except that the texts are now incomplete texts or noisy texts, respectively. In the following we investigate the relationships of these criteria with learning from streams. We show that learning from streams is incomparable to learning from incomplete or noisy texts. The nature of information incompleteness in learning from an incomplete text is very different from the incompleteness caused by streaming of data, because
Learning from Streams
351
streaming only spreads information, but does not destroy information (Theorem 11), while the incompleteness in an incomplete text involves the destruction of information. This difference is made precise by the following incomparability results. Proposition 21. Suppose that L consists of L0 = N and all sets Lk+1 = {1 +
x, y : x ≤ k ∧ y ∈ N}. Then L ∈ [n, n]StreamEx for any n ≥ 1 but L can neither be Ex-learnt from noisy text nor from incomplete text. Furthermore, L is iteratively learnable. For the separations in the converse direction, one cannot use indexed families as every indexed family Ex-learnable from normal text is already learnable from streams; obviously this implication survives when learnability from normal text is replaced by learnability from incomplete or noisy text. Remark 22. Suppose n ≥ 2. Then the cylindrification of the class L from Theorem 13 is Ex-learnable from incomplete text but not [1, n]StreamEx-learnable. Here the cylindrification of the class L is just the class of all sets { x, y : x ∈ L ∧ y ∈ N} with L ∈ L. Incomplete texts for a cylindrification of such a set L can be translated into standard texts for L and so the learnability from incomplete texts can be established; the diagonalization against the stream learners carries over. It is known that learnability from noisy text is possible only if for every two different sets L, L in the class the differences L − L and L − L are both infinite. This is a characterization for the case of indexed families, but it is only a necessary but not sufficient criterion for classes in general. For example if a class L consists of sets Lx = { x, y : y ∈ N − {ax }} without any method to obtain ax from x in the limit, then learnability from noisy text is lost. Theorem 23. There is a class L which is learnable from noisy text but not [1, n]StreamEx-learnable for any n ≥ 2.
7
Conclusion
In this paper we investigated learning from several streams of data. For learning indexed families, we characterized the classes which are [m, n]StreamExlearnable using a tell-tale like characterization: An indexed family L = {L0 , L1 , n . . .} is [m, n]StreamEx-learnable iff it is [1, m ]StreamEx-learnable iff there exists a uniformly r.e. sequence E0 , E1 , . . . of finite sets such that Ei ⊆ Li and n there are at most m many languages L in L such that Ei ⊆ L ⊆ Li . For general classes of r.e. languages, our investigation shows that the power of learning from streams depends crucially on the degree of dispersion, the success ratio and the number of successful learners required. Though higher degree of dispersion is more restrictive in general, we show that any class of languages which is iteratively learnable is also iteratively learnable from streams even if one requires all the learners to be successful. There are several open problems and our results suggest that there may not be a simple way to complete the picture of relationship between various [m, n]StreamEx learning criteria.
352
S. Jain, F. Stephan, and N. Ye
References 1. Ambainis, A.: Probabilistic inductive inference: a survey. Theoretical Computer Science 264, 155–167 (2001) 2. Angluin, D.: Inductive inference of formal languages from positive data. Information and Control 45, 117–135 (1980) 3. Baliga, G., Case, J., Jain, S.: The synthesis of language learners. Information and Computation 152, 16–43 (1999) 4. Baliga, G., Jain, S., Sharma, A.: Learning from multiple sources of inaccurate data. SIAM Journal on Computing 26, 961–990 (1997) 5. Blum, L., Blum, M.: Toward a mathematical theory of inductive inference. Information and Control 28, 125–155 (1975) 6. Case, J., Lynes, C.: Machine inductive inference and language identification. In: Nielsen, M., Schmidt, E.M. (eds.) ICALP 1982. LNCS, vol. 140, pp. 107–115. Springer, Heidelberg (1982) 7. Fulk, M.: Prudence and other conditions on formal language learning. Information and Computation 85, 1–11 (1990) 8. Fulk, M., Jain, S.: Learning in the presence of inaccurate information. Theoretical Computer Science 161, 235–261 (1996) 9. Mark Gold, E.: Language identification in the limit. Information and Control 10, 447–474 (1967) 10. Jain, S., Osherson, D., Royer, J.S., Sharma, A.: Systems That Learn: An Introduction to Learning Theory, 2nd edn. MIT Press, Cambridge (1999) 11. Jain, S., Sharma, A.: Team learning of computable languages. Theory of Computing Systems 33, 35–58 (2000) 12. Odifreddi, P.: Classical Recursion Theory. North-Holland, Amsterdam (1989) 13. Osherson, D., Stob, M., Weinstein, S.: Systems That Learn: An Introduction to Learning Theory for Cognitive and Computer Scientists. MIT Press, Cambridge (1986) 14. Osherson, D., Weinstein, S.: Criteria of language learning. Information and Control 52, 123–138 (1982) 15. Pitt, L.: Probabilistic inductive inference. Journal of the ACM 36, 383–433 (1989) 16. Pitt, L., Smith, C.H.: Probability and plurality for aggregations of learning machines. Information and Computation 77, 77–92 (1988) 17. Rogers, H.: Theory of Recursive Functions and Effective Computability. McGrawHill, New York (1967); Reprinted in MIT Press (1987) 18. Smith, C.H.: The power of pluralism for automatic program synthesis. Journal of the ACM 29, 1144–1165 (1982) 19. Smith, C.H.: Three decades of team learning. In: Arikawa, S., Jantke, K.P. (eds.) AII 1994 and ALT 1994. LNCS, vol. 872, pp. 211–228. Springer, Heidelberg (1994) 20. Wiehagen, R.: Limes-Erkennung rekursiver Funktionen durch spezielle Strategien. Elektronische Informationsverbarbeitung und Kybernetik (EIK) 12, 93–99 (1976)
Smart PAC-Learners Hans Ulrich Simon Fakult¨ at f¨ ur Mathematik, Ruhr-Universit¨ at Bochum, 44780 Bochum, Germany
[email protected]
Abstract. The PAC-learning model is distribution-independent in the sense that the learner must reach a learning goal with a limited number of labeled random examples without any prior knowledge of the underlying domain distribution. In order to achieve this, one needs generalization error bounds that are valid uniformly for every domain distribution. These bounds are (almost) tight in the sense that there is a domain distribution which does not admit a generalization error being significantly smaller than the general bound. Note however that this leaves open the possibility to achieve the learning goal faster if the underlying distribution is “simple”. Informally speaking, we say a PAC-learner L is “smart” if, for a “vast majority” of domain distributions D, L does not require significantly more examples to reach the “learning goal” than the best learner whose strategy is specialized to D. In this paper, focusing on sample complexity and ignoring computational issues, we show that smart learners do exist. This implies (at least from an information-theoretical perspective) that full prior knowledge of the domain distribution (or access to a huge collection of unlabeled examples) does (for a vast majority of domain distributions) not significantly reduce the number of labeled examples required to achieve the learning goal.
1
Introduction
We are concerned with sample-efficient strategies for properly PAC-learning a finite class H of concepts over a finite domain X. In the general PAC-learning framework, a learner is exposed to a worst-case analysis by asking the following question: what is the smallest sample size m = mH (ε, δ) such that, for every target concept h ∈ H and every distribution D on X, the probability (taken over m randomly chosen and correctly labeled examples) for returning an ε-accurate hypothesis is at least 1 − δ? It is well-known [3], [4] that (up to some logarithmic factors) there are matching upper and lower bounds on mH (ε, δ). The proof for the lower bound makes use of a fiendish distribution Dε∗ which makes the learning task quite hard. The lower bound is remains valid when Dε∗ is known to the learner. While this almost completely determines the sample size that is required in the worst-case, it leaves open the question whether the learning goal can be achieved faster when the underlying domain distribution D is significantly simpler than Dε∗ . Furthermore, if it can be achieved faster, it leaves open the question whether this can be
This work was supported by the Deutsche Forschungsgemeinschaft Grant SI 498/8-1.
R. Gavald` a et al. (Eds.): ALT 2009, LNAI 5809, pp. 353–367, 2009. c Springer-Verlag Berlin Heidelberg 2009
354
H.U. Simon
exploited only by a learner that is specialized to D or if it can be as well exploited by a “smart” PAC-learner who has no prior knowledge about D. This is precisely the question that we try to answer in this paper. Our main result: In our paper, it will be convenient to think of the target class H, the sample size m and the accuracy parameter ε as fixed, respectively, and to figure out the smallest value for δ such that m examples suffice to meet the ∗ ∗ (ε, δ)-criterion of PAC-learning.1 Let δD = δD (ε, m) denote the smallest value for δ that can be achieved by a learner who is specialized to D. A general PAClearner who must cope with arbitrary distributions can clearly not be specialized to every distribution at the same time. Nevertheless we can associate a quantity L L δD = δD (ε, m) with L that is defined as the smallest value of δ that such that m examples are sufficient to meet the (ε, δ)-criterion provided that L was exposed to the domain distribution D. Ideally, we would like to prove that there is a PAC-learner L and a constant c (not even depending on H) such that, for every L ∗ domain distribution D, δD ≤ c · δD . This would be a strong result as it is basically saying that, for every D, the learning progress of L (without prior knowledge of D) is made roughly at the same speed as the learning progress of the learner whose strategy is perfectly specialized to D. Our result is however slightly weaker than that. We can show that there is a PAC-learner L such that, L ∗ for a “vast majority” of domain distributions D, δD ≤ 2c · δD (where c is a constant that grows with the desired “vastness” of the majority). The formal statement is found in Corollary 3. Related papers: Our main result makes a contribution to a discussion about the power of semi-supervised learning that was raised in [1]. The authors pursue the question whether unlabeled data (which are usually cheap) provably help to save labeled data (which are usually expensive). Moreover, they pursue this question in a passive learning model (like PAC-learning) where the labeled data are generated at random (and the learner has no control about the data-generation process). They present some particular concept classes for which they can prove that unlabeled data do not significantly help. Their analysis uses a nice “rescaling trick” that works however only for one-dimensional Euclidean domains. They conjecture that unlabeled data do not significantly help for a much wider family of concept classes. Our main result supports this conjecture for proper learning (as opposed to agnostic learning) and for (arbitrary!) finite classes. Our results are also weakly related to [2] where upper and lower bounds (in terms of cover- and packing-numbers associated with H and D) on the sample size are presented when H is PAC-learned under a fixed distribution D. If these bounds were tight, one line of attack for proving our results could have been to design a general PAC-learner that, when exposed to D, achieves the learning goal with a sample size that does not significantly exceed the lower bound on the sample size in the fixed-distribution setting. However, since the upper and lower bounds in [2] leave a significant gap (although being related by a polynomial of small degree), this approach does not look very promising. 1
This is obviously equivalent to discussing the sample size as a function in ε and δ.
Smart PAC-Learners
355
Structure of the paper: Section 2 clarifies the notation that is used throughout the paper. Section 3 is devoted to learning under a fixed distribution D. This setting is cast as a zero-sum game between two players, the learner and her opponent, such that the Minimax Theorem from game theory applies. This leads ∗ ∗ to a nice characterization of δD = δD (ε, m). It is furthermore shown that, when the opponent makes his draw first, there is a strategy for the learner that, despite of being not defined in terms of D, does not perform much worse than the best strategy that is specialized to D. Section 4 is devoted to the proof of the main result. To this end, we first treat the case of finitely many distributions and cast the resulting task for a general PAC-learner again as a zero-sum game. Another application of the Minimax Theorem brings us then in the position to prove an important result for a learner who simultaneously copes with finitely many distributions. The case of (infinitely many) arbitrary distributions is finally treated in a similar fashion by invocation of a continuity argument. At the end of Section 4, the reader will find our main results. Section 5 is devoted to some final discussions and open problems.
2
Notations
We assume that the reader is familiar with the PAC-learning framework and knows the Minimax Theorem from game-theory. Throughout the paper, we use the following notation: – X denotes a finite domain. – H denotes a finite concept class over domain X. Thus, every h ∈ H is a function of the form h : X → {0, 1}. – D denotes a domain distribution. – m denotes the sample size. – ε denotes the accuracy parameter. – δ denotes the confidence parameter which bounds from above the “expected failure rate” where “failure” means the delivery of an ε-inaccurate hypothesis. – (x, b) ∈ X m × {0, 1}m denotes a labeled sample. – For x = (ξ1 , . . . , ξm ) ∈ X m and h ∈ H, we set h(x) = (h(ξ1 ), . . . , h(ξm )) ∈ {0, 1}m. – As usual, a learning function is a mapping of the form L : X m × {0, 1}m → H, i.e., it maps labeled samples to hypotheses. – For two hypotheses h, h ∈ H, h ⊕ h := {ξ ∈ X : h(ξ) = h (ξ)} denotes their symmetric difference. Recall that h is called ε-accurate for h w.r.t. D if D(h ⊕ h ) ≤ ε.
356
H.U. Simon
A deterministic learner can be identified with a learning function (if computational issues are ignored). We consider, however, randomized learners. Each-one of these can be identified with a probability distribution over the set of all learning functions.
3
Learning under a Fixed Distribution Revisited
Consider a fixed distribution D on X that is known to the learner. Let L1 , . . . , LM be a list of all learning functions mapping a labeled sample of size m to a hypothesis from H = {h1 , . . . , hN }. For every ε > 0, i = 1, . . . , M , j = 1, . . . , N , x ∈ X m , ε,x,b [i, j] be the Bernoulli variable indicating whether the hyand b ∈ {0, 1}m, let ID pothesis Li (x, b) is ε-inaccurate w.r.t. target concept hj and domain distribution D, i.e., 1 if Li (x, b) is ε-inaccurate for hj w.r.t. D ε,x,b . ID [i, j] = 0 otherwise Now, let
ε,x,hj (x) ε,x,h (x) [i, j] = Dm (x)ID j [i, j] Aε,m D [i, j] := Ex∈D m ID
(1)
x
=
Pr [Li (x, hj (x)) is ε-inaccurate for hj w.r.t. D] .
x∈Dm
(2)
If D, ε, m are obvious from context, we omit these letters as subscripts or superscripts in what follows. A randomized learner is given by a vector p ∈ [0, 1]M M that assigns a probability pi to every learning function Li (so that i=1 pi = 1). Thus, we may identify learners with mixed strategies for the “row-player” in the zero-sum game associated with A. We may view the “column-player” in this game as an opponent of the learner. A mixed strategy for the opponent is given by a vector q ∈ [0, 1]N that assigns an a`-priori probability qj to every possible N target concept hj (so that j=1 qj = 1). The well-known Minimax Theorem states that min max p Aq = max min p Aq. (3) p
q
q
p
∗ ∗ (ε, m). Note that δD (ε, m) In the sequel, we denote the optimal value in (3) by δD coincides with the smallest value for the confidence parameter δ such that every target concept from H can be inferred up to accuracy ε from a random sample of size m (w.r.t. the fixed distribution D) with a probability at least 1−δ of success. ∗ Thus the definition of δD (ε, m) given here is consistent with the definition given in the introduction. Recall that by “expected failure rate (w.r.t. D, m, ε)” we mean the probability p for delivering an ε-inaccurate hypothesis. We denote by δD (ε, m) the expected failure rate of the learner with mixed strategy p (with the opponent making the second draw). Clearly, p ∗ (ε, m) = min δD (ε, m) = min max p Aj δD p
p
j=1,...,N
where Aj denotes the j’th column of matrix A = Aε,m D .
Smart PAC-Learners
357
Since a mixed strategy for the learner is a distribution over learning functions (mapping a labeled sample to a hypothesis), we may equivalently think of the learner as waiting for a random labeled sample (x, b) and then playing a mixed strategy that depends on (x, b). In order to formalize this intuition, we consider the new payoff-matrix A˜ = A˜εD given by 1 if hi is ε-inaccurate for hj w.r.t. D ˜ . A[i, j] = 0 otherwise ˜ We associate the following game with A: 1. The opponent selects a vector q ∈ [0, 1]N specifying a`-priori probabilities for the target concept. Note that this implicitly determines – the probability Q(b|x) = qj j:hj (x)=b
of labeling a given sample x by b, – and the a`-posteriori probabilities qj if hj (x) = b Q(j|x, b) = Q(b|x) 0 otherwise
(4)
for target concept hj given the labeled sample (x, b). For sake of a compact notation, let q˜(x, b) denote the vector whose j’th component is Q(j|x, b). 2. A labeled sample (x, b) is produced at random with probability Pr(x, b) = Dm (x)Q(b|x).
(5)
3. The learner chooses a vector p˜(x, b) ∈ [0, 1]N (that may depend on D, q and ˜ (x, b)) specifying her mixed strategy w.r.t. payoff-matrix A. ˜q (x, b) so that her expected loss, averaged 4. The learner suffers loss p˜(x, b) A˜ ˜q (x, b). over all labeled samples, evaluates to x,b Pr(x, b)˜ p(x, b) A˜ ˜ respectively, are simply called In the sequel, the games associated with A and A, ˜ A-game and A-game, respectively. Lemma 1. Let q ∈ [0, 1]N be an arbitrary but fixed mixed strategy for the learner’s opponent. Then every mixed strategy p ∈ [0, 1]M for the learner in ˜ the A-game can be mapped to a mixed strategy p˜ for the learner in the A-game so that ˜q (x, b) . p Aq = Pr(x, b)˜ p(x, b) A˜ (6) x,b
Moreover, this mapping p → p˜ is surjective, i.e., every mixed strategy for the ˜ learner in the A-game has a pre-image (so that the optimal values in both games are the same).
358
H.U. Simon
Proof. For every probability vector p ∈ [0, 1]M and every labeled sample (x, b), we define the corresponding probability vector p˜(x, b) ∈ [0, 1]N as follows: p˜i (x, b) = pi (7) i:Li (x,b)=hi
Note that
N
p˜i (x, b) =
M
i =1
pi = 1 .
i=1
The following computation verifies (6):
(1 )
p Aq =
m
D (x)
x
=
D (x)
=
=
N
j:hj (x)=b
i =1
i:Li (x,b)=hi
N
Dm (x)
Dm (x) Dm (x)
N
I x,b [i, j] pi qj
˜ ,j] =A[i
˜ , j]qj A[i
j:hj (x)=b i =1
pi
i:Li (x,b)=hi
˜ , j]˜ pi (x, b)qj A[i
i =1 j:hj (x)=b
x,b (4 )
I x,b [i, j]pi qj
j:hj (x)=b i=1
x,b (7)
M
m
x,b
=
I x,hj (x) [i, j]pi qj
j=1 i=1
x,b
=
M N
Dm (x)Q(b|x)
N N
˜ , j]˜ pi (x, b)Q(j|x, b) A[i
i =1 j=1
x,b
(5 ) ˜q (x, b) = Pr(x, b)˜ p(x, b) A˜ x,b
As for the second part of the lemma, consider a mixed strategy p˜ (x, b) of the ˜ learner in the A-game. We shall specify a mixed strategy p of the learner in the A-game such that function p˜(x, b) computed according to (7) coincides with p˜ (x, b). To this end, let us make the notational convention p˜hi := p˜i and let us choose p as follows: pi = p˜Li (x,b) (x, b). (8) x,b
Note that, with this choice of p, M i=1
M
N
(8 ) pi = p˜Li (x,b) (x, b) = p˜i (x, b) = 1. i=1 x,b
x,b i =1
=1
(9)
Smart PAC-Learners
359
In the second-last equation, we used the distributive law where the reader should note the one-to-one correspondence between the set of all learning functions and the free combination of (number of labeled samples many) hypotheses taken from H. A computation similar to (9) verifies now the desired coincidence between p˜ and p˜ : (7 ) p˜i (x , b ) =
pi
i:Li (x ,b )=hi (8 )
=
=
i:Li (x ,b )=hi
x,b
p˜i (x , b )
p˜Li (x,b) (x, b)
p˜Li (x,b) (x, b)
i:Li (x ,b )=hi (x,b) =(x ,b ) N
= p˜i (x , b )
(x,b) =(x ,b )
i=1
p˜i (x, b)
=1
= p˜i (x , b ) In the second-last equation, we used the distributive law where the reader should note the one-to-one correspondence between the set of all learning functions with a fixed value on one sample (x , b ) and the free combination of (number of labeled samples minus 1 many) hypotheses taken from from H. As a corollary to Lemma 1, we obtain ⎡ ⎤ ∗ ⎣ δD Pr(x, b) · min p˜ A˜ε,m (ε, m) = max min p Aε,m ˜(x, b) ⎦ . D q = max D q q
p
q
x,b
p˜
We close this section with a result that prepares the ground for our analysis of general PAC-learners in the next section: Lemma 2. Let ε > 0 be a given accuracy and m ≥ 1 a given sample size. For every probability vector q ∈ [0, 1]N , and every domain distribution D, the following holds: ∗ Pr(x, b)˜ q (x, b) A˜2ε ˜(x, b) ≤ 2δD (ε, m) (10) Dq x,b
Proof. Recall that q˜(x, b) is the vector that assigns the `a-posteriori probability Q(j|(x, b) for being the target concept to every hypothesis hj . Since the a-posteriori probabilities outside the version space ` V := {h ∈ H : h(x) = b}
360
H.U. Simon
are zero, only target concepts in V can contribute to the left hand-side in (10). In the remainder of the proof, we simply write A˜ε instead of A˜εD , and A˜εi denotes the i’th row of this matrix. In the A˜ε -game, the opponent makes the first draw by choosing a (prior) probability vector q ∈ [0, 1]N . The following “Bayesian strategy” for the learner minimizes (6): for a given labeled sample (x, b) pick a hypothesis h∗ = hi∗ (x,b) ∈ H which maximizes the total a`-posteriori probability of hypotheses that are ε-close to h∗ w.r.t. D, i.e., ⎫ ⎧ ⎬ ⎨ i∗ (x, b) = arg max Q(j|x, b) . i=1,...,N ⎩ ⎭ j:D(hi ⊕hj )≤ε
It follows that
∗ Pr(x, b)A˜εi∗ (x,b) q˜(x, b) ≤ δD (ε, m) .
(11)
x,b
We are now prepared to verify (10). We call a hypothesis from the version space V an “(D, x, b, ε)-exception” if it is not ε-close to h∗ w.r.t. D. Note that A˜ε∗ q˜(x, b) coincides with the total a`-posteriori probability of (D, x, b, ε)i (x,b)
exceptions. Consider now the strategy p˜ = q˜ for the learner. Given (x, b), she ˆ = hi at random with probability Q(i|x, b). The following picks a hypothesis h observation, which is a simple application of the triangle inequality, is crucial: if ˆ ∈ V is not an (D, x, b, ε)-exception, then h ˆ is 2ε-close w.r.t. D to every hypothh ˆ and the target esis in V that is not an (D, x, b, ε)-exception either. Thus, if h concept are both picked at random according to q˜(x, b), then the probability for ˆ being 2ε-inaccurate is bounded from above by twice the total probability of h (D, x, b, ε)-exceptions, i.e., q˜(x, b)A˜2ε q˜(x, b) ≤ 2Aεi∗ (x,b) q˜(x, b). This, combined with (11), concludes the proof of the lemma.
It is important to note that no knowledge of D is required to play the strategy ˜ p˜ = q˜ in the A-game (with the opponent making the first draw). Nevertheless, as made precise in Lemma 2, this is a reasonably good strategy for any underlying domain distribution.
4
Smart PAC-Learners
Let us first consider learners that cope with an arbitrary but fixed finite list D1 , . . . , DC of (not necessarily distinct) distributions on X.2 We shall define a suitable payoff-matrix R blockwise so that R = [R(1) , . . . , R(C) ] and R(k) is the block reserved for distribution Dk . Every block has M rows (one row for every learning function) and N columns (one column for every possible target concept). Choosing R(k) = Aε,m Dk (compare with the previous section) would lead 2
We shall later extend these considerations to arbitrary distributions.
Smart PAC-Learners
361
to mixed strategies for the PAC-learner that put too much emphasis on fiendish domain distributions. Such strategies are not likely to succeed with considerably fewer sample points when the underlying distribution D happens to be simple. Assuming ∗ (A1) ∀k = 1, . . . , C, δD (ε, m) = 0, k
the following payoff-matrix is a better choice: 1 2ε,m ∗ (ε, m) · ADk [i, j] δD k 1 Pr [Li (x, hj (x)) · is 2ε-inaccurate for hj w.r.t. Dk ] = ∗ δDk (ε, m) x∈Dkm
R(k) [i, j] :=
∗ (ε, m) challenges the learner to put more Intuitively, the scaling factor 1/δD k emphasis on benign distributions. Note furthermore that we penalize the learner only if her hypothesis is 2ε-inaccurate (as opposed to ε-inaccurate). This leaves some slack which helps the learner to compensate for not knowing D. ∗ In the sequel, we simply write δk∗ (ε, m) instead of δD (ε, m) and A(k) instead k (k)
(k)
denotes the j’th column in A(k) and, similarly, Rj denotes the of A2ε,m Dk . Aj j’th column in R(k) . A mixed strategy for the learner is a probability vector p ∈ [0, 1]M (as in Section 3). A mixed strategy for the opponent is a proba(k) bility vector q = [q (1) , . . . , q (C) ] ∈ [0, 1]CN where qj denotes the probability for choosing domain distribution Dk and target concept hj . According to the Minimax Theorem, the following holds: min max p Rq = max min p Rq. p
q
q
p
(12)
Let ρ∗ (ε, m) denote the optimal value in (12). The following quantities refer to a learner who makes the first draw and applies the mixed strategy p: ρp (ε, m) := max
max p Rj
(k)
j=1,...,N k=1,...,C
p
ρ¯ (ε, m) := max
j=1,...,N
∗
C 1 (k) · p Rj C k=1
p
Clearly, ρ (ε, m) = minp ρ (ε, m). Let us make perfectly clear the connection between these quantities and our p concept of a smart PAC-learner. We denote by δj,k (ε, m) the expected failure rate (w.r.t. m, ε) of a PAC-learner with mixed strategy p when the target concept is hj and the domain distribution is Dk .3 It follows from the definition of A(k) = A2ε,m Dk that (k) p δj,k (2ε, m) = p Aj . Thus, according to the definition of R(k) , 3
In contrast to the previous section, here the learner has no prior knowledge of Dk .
362
H.U. Simon p δj,k (2ε, m)
max
max
δk∗ (ε, m) p δj,k (2ε, m)
j=1,...,N k=1,...,C
δk∗ (ε, m)
= p Rj , (k)
= ρp (ε, m),
p C 1 δj,k (2ε, m) · = ρ¯p (ε, m). max j=1,...,N C δk∗ (ε, m) k=1
It becomes obvious now that the quantities ρp (ε, m) and ρ¯p (ε, m) measure how well the general PAC-learner with mixed strategy p (and accuracy parameter 2ε) competes against the best learner with full prior knowledge of the domain distribution (and accuracy parameter ε). We call ρp (ε, m) the worst performance ratio and ρ¯p (ε, m) the average performance ratio of the mixed strategy p (although both quantities refer to the worst-case as far as the choice of the target concept is concerned). In the sequel, a learner with mixed strategy p is identified with p so that we can speak of a performance ratio of the learner. A very strong result would be ρ∗ (ε, m) ≤ c for some small constant c, which would mean that there exists a learner (mixed strategy) p whose worst performance ratio is bounded by c. But, since this is somewhat overambitious, we pursue a weaker goal in the following and analyze the average performance ratio, ρ¯p (ε, m), instead. We make use of the (obvious) fact that ρ¯∗ (ε, m) := min ρ¯p (ε, m) p
¯ is the optimal value in the R-game for C ¯ := 1 R R(k) . C
(13)
k=1
With this notation, the following holds: ¯ Lemma 3. For every mixed strategy q of the opponent in the R-game, there ¯ ≤ 2. exists a mixed strategy p for the learner such that p Rq ¯ Proof. From the decomposition (6), we get the following decomposition of p Rq: C ) 1 (k) ¯ (13 = p R q p Rq C k=1
=
C 1 1 (k) q ∗ (ε, m) p A C δD k k=1
(6)
=
C 1 1 Pr(x, b)˜ p(x, b)A˜(k) q˜(x, b). ∗ k C δDk (ε, m) k=1
x,b
˜, p˜ are Here, Prk (x, b) = Dkm (x)Q(b|x), A˜(k) = A˜2ε Dk , and the quantities Q(b|x), q derived from q and p, respectively, as explained in the previous section. According to Lemma 2,
Smart PAC-Learners
x,b
∗ Pr(x, b)˜ q (x, b)A˜(k) q˜(x, b) ≤ 2δD (ε, m). k k
363
(14)
According to Lemma 1, there exists a mixed strategy p for the learner such that p˜ = q˜. With this choice of p, we get ¯ = p Rq
C (14) 1 1 ˜(k) q˜(x, b) ≤ 2, Pr (x, b)˜ q (x, b) A ∗ (ε, m) k C δD k k=1
x,b
as desired.
¯ j denote the j’th column of matrix R. ¯ The Minimax Theorem applied to Let R ¯ the R-game allows us to infer from Lemma 3 the following Corollary 1. ρ¯∗ (ε, m) ≤ 2, i.e., there exists a learner (mixed strategy) p whose average performance ratio is bounded by 2. So far, we have assumed that there is a finite list of distributions and the domain distribution is taken from this list. We now extend these considerations to arbitrary distributions. Recall that our domain X is finite, say X = {ξ1 , . . . , ξd }. The domain distributions are in one-to-one correspondence with the vectors taken from the probability simplex Δ := {z ∈ [0, 1]d : z1 + · · · + zd = 1}. Specifically, Dz (ξν ) = zν for ν = 1, . . . , d. Note that instead of finitely many block matrices R(1) , . . . , R(C) , as before, we now have a system of infinitely many matrices R(z) . Let f : Δ → R+ be a continuous function that satisfies f (z) dz = 1 (15) z∈Δ
so that it can serve as a density function. For sake of simple notation, we set ∗ δz∗ := δD (ε, m). z
For every ζ > 0 and every E ⊆ Δ, let Pr(E) := z∈E
f (z) dz,
Δζ := {z ∈ Δ : δz∗ ≥ ζ}.
(16)
(17) (18)
The former Assumption (A1) is replaced now by the following assumption: (A2) limζ→0 Pr(Δζ ) = 1. (A2) implies that Pr(Δζ ) > 0 for every sufficiently small ζ > 0, which is assumed in the sequel. Since f (z)/ Pr(Δζ ) is a continuous density function on Δζ , we can now use 1 ¯ ζ := · f (z)R(z) dz (19) R Pr(Δζ ) z∈Δζ as a payoff-matrix (where this matrix-equation is understood entry-wise).
364
H.U. Simon
Lemma 4. The integral on the right hand-side in (19) exists. Proof. Since f (z) is continuous and Δ is compact, f is continuous and bounded on Δ. Furthermore, for every z ∈ Δζ , each entry of R(z) is bounded by 1/ζ. The lemma would be rather obvious if the functions R(z) [i, j] were continuous in z. Although this is not the case in general, we may exploit the fact that discontinuities occur only in the set E= Eij 1≤i<j≤N
where Eij := {z ∈ Δ : Dz (hi ⊕ hj ) = ε}. Note that the sets Eij are of Lebesgue-measure zero, and so is E. For this reason, integrating over Δζ leads to the same result as integrating over Δζ \ E, which, by construction, is a set without discontinuities. The average performance ratio of a mixed strategy p for the learner refers now to the density function f (z)/ Pr(Δζ ) and must therefore be redefined as follows: 1 (z) ¯ζ · f (z)p Rj dz = max p R ρ¯pζ (ε, m) := max j j=1,...,N Pr(Δζ ) j=1,...,N z∈Δζ ¯ ζ denotes the j’th column of R ¯ ζ . With this notation, we get where R j Corollary 2. For every sufficiently small ζ > 0, there exists a learner (mixed strategy) p whose average performance ratio is bounded by 2. Proof. The crucial observation is that Lemma 3 is still correct when we define ¯ := R ¯ ζ according to (19). The only modification in the proof of this lemma R is the substitution of integrals for sums. Thus the Minimax Theorem applies to ¯ the R-game and Corollary 2 is obtained. Assumption (A2) and Corollary 2 combined with Markov’s Inequality immediately lead to Corollary 3. For every continuous function f : Δ → R+ satisfying (15) and for every constant c > 0, there exists a mixed strategy p for the learner such that, for j = 1, . . . , N , Pr(z ∈ Δ : p Rj
(z)
1 ≤ 2c) > 1 − . c
(20)
¯=R ¯ ζ . Corollary 2 combined with Markov’s Inequality Proof. Let ζ > 0 and R shows that there exists a mixed strategy p for the learner such that, for j = 1, . . . , N , 1 (z) Pr(z ∈ Δζ : p Rj > 2c) < . c By assumption (A2), we may conclude that
Smart PAC-Learners
Pr(z ∈ Δ : p Rj
(z)
> 2c) ≤ Pr(z ∈ Δζ : p Rj
(z)
> 2c) + Pr(Δ \ Δζ ) <
365
1 c
provided that ζ > 0 is sufficiently small. From this, Corollary 3 is immediate. According to Corollary 3, there exists a mixed strategy p for a learner without any prior knowledge of the domain distribution such that, in comparison to the best learner with full prior knowledge of the domain distribution, a performance ratio of 2c is achieved for the “vast majority” of distributions. The total probability mass of distributions (measured according to density function f (z)) not belonging to the “vast majority” is bounded by 1/c. So Corollary 3 is the result that we had announced in the introduction.
5
Discussion of Assumption (A2) and Open Problems
We claim that Assumption (A2) is not very restrictive and provide some intuition why this might be true. For every γ > 0, let Δγ := {z ∈ Δ| ∀ν = 1, . . . , d : γ ≤ zν ≤ 1 − γ}. Assume that z ∈ Δγ . Pick ν(z) ∈ {1, . . . , d} such that zν(z) = min{z1 , . . . , zd }. Clearly, 1 γ ≤ zν(z) ≤ . d For sake of brevity, let ξ := ξν(z) . For b ∈ {0, 1}, consider the set H(ξ, b) := {h ∈ H : h(ξ) = b}. If the opponent finds two hypotheses h, h ∈ H(ξ, b) such that Dz (h ⊕ h ) > 2ε,
(21)
he can assign `a-priori probability 1/2 to h and h , respectively, and achieve the following: With a probability of at least γ m , the sample x is of the form x = (ξ, . . . , ξ) ∈ m X so that the learner cannot distinguish between h and h . Conditioned to x = (ξ, . . . , ξ) ∈ X m , the learner will therefore fail with probability at least 1/2 (regardless of her strategy). Thus, the overall expected failure rate is at least γ m /2. The punchline of this discussion is the following implication: γm ∗ (22) z ∈ Δγ ∧ 2ε < max max D (g ⊕ g ) ⇒ δ (ε, m) ≥ z z 2 b∈{0,1} g,g ∈H(ξν(z) ,b)
(A3)
Condition (A3) looks wild but it is essentially saying that the knowledge of a single labeled example should not trivialize the resulting version space (in terms of its diameter) too much.
366
H.U. Simon
Define K(H) as the smallest number K such that, for every ξ ∈ X, there exist g+ ∈ H(ξ, 1), g− ∈ H(ξ, 0) and g1 , . . . , gK ∈ H which satisfy the following condition: ∀ξ ∈ X \ {ξ}, ∃κ ∈ {1, . . . , K} : (g+ (ξ) = gκ (ξ) ∧ g+ (ξ ) = gκ (ξ )) = gκ (ξ )) ∨ (g− (ξ) = gκ (ξ) ∧ g− (ξ ) If suitable functions g+ , g− , g1 , g2 , . . . cannot be found for some ξ ∈ X, we set K(H) = ∞ by default. Lemma 5. Assume that K(H) < ∞. Then, for every z ∈ Δγ and every d ≥ 2, the following holds: max
max
b∈{0,1} g,g ∈H(ξν(z) ,b)
Dz (g ⊕ g ) ≥
1 1 − 1/d ≥ K(H) 2K(H)
Proof. According to the definition of ν(z), Dz (X \ {ξν(z) }) ≥ 1 − 1/d. Let us set ξ := ξν(z) , let K = K(H), and let g+ , g− , g1 , . . . , gK be the functions chosen in accordance with the definition of K(H). Without loss of generality, we may assume that g1 , . . . , gK agree with g+ on ξ whereas gK +1 , . . . , gK agree with g− on ξ. Note that the “disagreement sets” g+ ⊕ g1 , . . . , g+ ⊕ gK ; g− ⊕ gK +1 , . . . , g− ⊕ gK cover X \ {ξ}. Thus, by the pigeon-hole principle, there must exist a hypothesis g ∈ {g+ , g− } and g ∈ {g1 , . . . , gK } such that g(ξ) = g (ξ) but the disagreement set g ⊕ g has probability mass at least (1 − 1/d)/K. Lemma 5 combined with the implication (22) yields Corollary 4. If K(H) < ∞, then the following holds: ∀γ > 0, ∀z ∈ Δγ , ∀ε <
1 γm : δz∗ (ε, m) ≥ 2K(H) 2
We remind the reader to (16) and to the definition of Δζ in (18). Since γ m /2 ≥ ζ is equivalent to γ ≥ (2ζ)1/m , we obtain Corollary 5. Let K(H) < ∞ and ε < 1/(2K(H)). Define γ(ζ) := (2ζ)1/m . Then, Δζ ⊇ Δγ(ζ) . Since γ(ζ) → 0 for ζ → 0 and obviously limγ→0 Pr(Δγ ) = 1 for every underlying continuous density function, we finally get Corollary 6. Let K(H) < ∞. Then, for every ε < 1/(2K(H)), H satisfies Assumption (A2). A simple hypothesis class should intuitively have a good chance to trivialize the version space resulting from a single labeled example (and, thus, a good chance
Smart PAC-Learners
367
to violate Assumption (A2)). But even for the almost trivial class of half-intervals {1, . . . , r}, 0 ≤ r ≤ d, our sufficient condition K(H) < ∞ applies. This can be seen as follows: Pick an arbitrary ξ ∈ {1, . . . , d}, and designate the following half-intervals: g+ := {1, . . . , ξ} , g− := ∅, g1 := {1, . . . , d} , g2 := {1, . . . , ξ − 1}. Pick an arbitrary ξ ∈ {1, . . . , d} \ {ξ}. For ξ > ξ, we get g+ (ξ) = g1 (ξ) = 1 and g+ (ξ ) = 0 = 1 = g1 (ξ ). For ξ < ξ, we get g− (ξ) = g2 (ξ) = 0 and g− (ξ ) = 0 = 1 = g2 (ξ ). We conclude that K(H) = 2 for the class of halfintervals. Open problems: – For every finite hypothesis class, we have shown the mere existence of a learner (mixed strategy) whose average performance ratio is bounded by 2. Gain more insight how this strategy actually works and check under which conditions it can be implemented efficiently. – Prove or disprove that there exists a learner (mixed strategy) whose worst performance ratio is bounded by a small constant. – Prove or disprove our claim that assumption (A2) is not very restrictive.
References 1. Ben-David, S., Lu, T., P´ al, D.: Does unlabeled data provably help? Worst-case analysis of the sample complexity of semi-supervised learning. In: Proceedings of the 21st Annual Conference on Learning Theory, pp. 33–44 (2008) 2. Benedek, G.M., Itai, A.: Learnability with respect to fixed distributions. Theoretical Computer Science 86(2), 377–389 (1991) 3. Blumer, A., Ehrenfeucht, A., Haussler, D., Warmuth, M.K.: Learnability and the Vapnik-Chervonenkis dimension. Journal of the Association on Computing Machinery 36(4), 929–965 (1989) 4. Ehrenfeucht, A., Haussler, D., Kearns, M., Valiant, L.: A general lower bound on the number of examples needed for learning. Information and Computation 82(3), 247–261 (1989)
Approximation Algorithms for Tensor Clustering Stefanie Jegelka1 , Suvrit Sra1 , and Arindam Banerjee2 1
Max Planck Institute for Biological Cybernetics, 72076 T¨ ubingen, Germany {jegelka,suvrit}@tuebingen.mpg.de 2 Univ. of Minnesota, Twin Cities, Minneapolis, MN, USA
[email protected]
Abstract. We present the first (to our knowledge) approximation algorithm for tensor clustering—a powerful generalization to basic 1D clustering. Tensors are increasingly common in modern applications dealing with complex heterogeneous data and clustering them is a fundamental tool for data analysis and pattern discovery. Akin to their 1D cousins, common tensor clustering formulations are NP-hard to optimize. But, unlike the 1D case, no approximation algorithms seem to be known. We address this imbalance and build on recent co-clustering work to derive a tensor clustering algorithm with approximation guarantees, allowing metrics and divergences (e.g., Bregman) as objective functions. Therewith, we answer two open questions by Anagnostopoulos et al. (2008). Our analysis yields a constant approximation factor independent of data size; a worst-case example shows this factor to be tight for Euclidean co-clustering. However, empirically the approximation factor is observed to be conservative, so our method can also be used in practice.
1 Introduction
Tensor clustering is a recent generalization of the basic one-dimensional clustering problem, and it seeks to partition an order-m input tensor into coherent sub-tensors while minimizing some cluster quality measure [1,2]. For example, in co-clustering, which is a special case of tensor clustering with m = 2, one simultaneously partitions rows and columns of an input matrix to obtain coherent submatrices, often while minimizing a Bregman divergence [3,4]. Being generalizations of the 1D case, common tensor clustering formulations are also NP-hard to optimize. But despite the existence of a vast body of research on approximation algorithms for 1D clustering problems (e.g., [5,6,7,8,9,10]), there seem to be no published approximation algorithms for tensor clustering. Even for (2D) co-clustering, there are only two recent attempts [11] and [12] (from 2008). Both prove an approximation factor of 2α_1 for Euclidean co-clustering given an α_1-approximation for k-means, and show constant approximation factors for ℓ_1-norm ([12] only for binary matrices) and ℓ_p-norm [11] based variants. Tensor clustering is a basic data analysis task with growing importance; several domains now deal frequently with tensor data, e.g., data mining [13], computer graphics [14], and computer vision [2]. We refer the reader to [15] for a
recent survey about tensors and their applications. The simplest tensor clustering scenario, namely, co-clustering (also known as bi-clustering) is more established [12,4,16,17,18]. Tensor clustering is less well known, though several researchers have considered it before [1,2,19,20,21].

1.1 Contributions
The main contribution of this paper is the analysis of an approximation algorithm for tensor clustering that achieves an approximation ratio of O(p(m)α), where m is the order of the tensor, p(m) = m or p(m) = m^{log_2 3}, and α is the approximation factor of a corresponding 1D clustering algorithm. Our results apply to a fairly broad class of objective functions, including metrics such as ℓ_p norms, Hilbertian metrics [22,23], and divergence functions such as Bregman divergences [24] (with some assumptions). As corollaries, our results solve two open problems posed by [12], viz., whether their methods for Euclidean co-clustering could be extended to Bregman co-clustering, and if one could extend the approximation guarantees to tensor clustering. The bound also gives insight into properties of the tensor clustering problem. We give an example for the tightness of our bound for squared Euclidean distance, and provide an experimental validation of the theoretical claims, which forms an additional contribution.
2 Background and Problem
Traditionally, "center" based clustering algorithms seek partitions of columns of an input matrix X = [x_1, . . . , x_n] into clusters C = {C_1, . . . , C_K}, and find "centers" μ_k that minimize the objective

J(C) = \sum_{k=1}^{K} \sum_{x \in C_k} d(x, \mu_k),    (2.1)

where the function d(x, y) measures cluster quality. The "center" μ_k of cluster C_k is given by the mean of the points in C_k when d(x, y) is a Bregman divergence [25]. Co-clustering extends (2.1) to seek simultaneous partitions (and centers μ_IJ) of rows and columns of X, so that the objective function

J(C) = \sum_{I,J} \sum_{i \in I, j \in J} d(x_{ij}, \mu_{IJ}),    (2.2)

is minimized; μ_IJ denotes the (scalar) "center" of the cluster described by the row and column index sets, viz., I and J. We generalize formulation (2.2) to tensors in Section 2.2 after introducing some background on tensors.
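To make (2.2) concrete, the following is a minimal sketch of ours (not from the paper) that evaluates the block-average co-clustering objective for squared error; the function and variable names are hypothetical.

```python
import numpy as np

# Sketch of objective (2.2) with d(x, y) = (x - y)^2; row_labels and
# col_labels are hypothetical cluster assignments for rows and columns.
def coclustering_objective(X, row_labels, col_labels):
    total = 0.0
    for I in np.unique(row_labels):
        for J in np.unique(col_labels):
            block = X[np.ix_(row_labels == I, col_labels == J)]
            mu = block.mean()                   # scalar "center" of cluster (I, J)
            total += ((block - mu) ** 2).sum()  # sum of divergences to the center
    return total
```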
2.1 Tensors
An order-m tensor A may be viewed as an element of the vector space R^{n_1 × · · · × n_m}. An individual entry of A is given by the multiply-indexed value a_{i_1 i_2 ... i_m}, where i_j ∈ {1, . . . , n_j} for 1 ≤ j ≤ m. For us, the most important tensor operation
is multilinear matrix multiplication, which generalizes matrix multiplication [26]. Matrices act on other matrices by either left or right multiplication. Similarly, for an order-m tensor, there are m dimensions on which a matrix may act. For A ∈ R^{n_1 × n_2 × · · · × n_m}, and matrices P_1 ∈ R^{p_1 × n_1}, . . ., P_m ∈ R^{p_m × n_m}, multilinear multiplication is defined by the action of the P_i on the different dimensions of A, and is denoted by A' = (P_1, . . . , P_m) · A ∈ R^{p_1 × · · · × p_m}. The individual components of A' are given by

a'_{i_1 i_2 ... i_m} = \sum_{j_1, ..., j_m = 1}^{n_1, ..., n_m} p^{(1)}_{i_1 j_1} · · · p^{(m)}_{i_m j_m} a_{j_1 ... j_m},

where p^{(k)}_{ij} denotes the ij-th entry of matrix P_k. The inner product between two tensors A and B is defined as

⟨A, B⟩ = \sum_{i_1, ..., i_m} a_{i_1 ... i_m} b_{i_1 ... i_m},    (2.3)
and this inner product satisfies the following natural property (which generalizes the familiar ⟨Ax, By⟩ = ⟨x, AᵀBy⟩):

⟨(P_1, . . . , P_m) · A, (Q_1, . . . , Q_m) · B⟩ = ⟨A, (P_1ᵀQ_1, . . . , P_mᵀQ_m) · B⟩.    (2.4)
Moreover, the Frobenius norm is ‖A‖² = ⟨A, A⟩. Finally, we define an arbitrary divergence function d(X, Y) as an elementwise sum of individual divergences, i.e.,

d(X, Y) = \sum_{i_1, ..., i_m} d(x_{i_1 ... i_m}, y_{i_1 ... i_m}),    (2.5)
and we will define the scalar divergence d(x, y) as the need arises.

2.2 Problem Formulation
Let A ∈ R^{n_1 × · · · × n_m} be an order-m tensor that we wish to partition into coherent sub-tensors (or clusters). In 3D, we divide a cube into smaller cubes by cutting orthogonal to (i.e., along) each dimension (Fig. 1). A basic approach is to minimize the sum of the divergences between individual (scalar) elements in each cluster to their corresponding (scalar) cluster "centers". Readers familiar with [4] will recognize this to be a "block-average" variant of tensor clustering. Assume that each dimension j (1 ≤ j ≤ m) is partitioned into k_j clusters. Let C_j ∈ {0, 1}^{n_j × k_j} be the cluster indicator matrix for dimension j, where the ik-th entry of such a matrix is one if and only if index i belongs to the k-th cluster (1 ≤ k ≤ k_j) for dimension j. Then, the tensor clustering problem is (cf. (2.2)):

minimize_{C_1, ..., C_m, M}  d(A, (C_1, . . . , C_m) · M),   s.t.  C_j ∈ {0, 1}^{n_j × k_j},    (2.6)
where the tensor M collects all the cluster “centers.”
3 Algorithm and Analysis
Given formulation (2.6), our algorithm, which we name Combination Tensor Clustering (CoTeC), follows the simple outline:
1. Cluster along each dimension j, using an approximation algorithm to obtain clustering C_j; let C = (C_1, . . . , C_m).
2. Compute M = argmin_{X ∈ R^{k_1 × · · · × k_m}} d(A, C · X).
3. Return the tensor clustering (C_1, . . . , C_m) (with representatives M).

(A code sketch of these steps follows Fig. 1 below.)

Remark 1. Instead of clustering one dimension at a time in Step 1, we can also cluster along t dimensions simultaneously. In such a t-dimensional clustering of an order-m tensor, we form groups of order-(m − t) tensors.
Fig. 1. CoTeC: Cluster along dimensions one (C1), two (C2), three (C3) separately and combine the results; μ3,1,3 is the mean of sub-tensor (cluster) (3,1,3). The various clusters in the final tensor clustering are color coded to indicate combination of contributions from clusters along each dimension.
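The following is a minimal sketch of the outline above for the squared Euclidean case with t = 1, using scikit-learn's KMeans as a hypothetical 1D subroutine; it is an illustration under those assumptions, not the authors' implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def cotec(A, ks, seed=0):
    # Step 1: cluster along each dimension independently (t = 1); the
    # "points" of dimension j are its slices, flattened to vectors.
    labels = []
    for j, k in enumerate(ks):
        Xj = np.moveaxis(A, j, 0).reshape(A.shape[j], -1)
        labels.append(KMeans(n_clusters=k, n_init=10,
                             random_state=seed).fit(Xj).labels_)
    # Step 2: for squared error the optimal representative M of each
    # sub-tensor is its mean; accumulate sums and counts per cluster cell.
    M, counts = np.zeros(ks), np.zeros(ks)
    for idx in np.ndindex(*A.shape):
        cell = tuple(labels[j][idx[j]] for j in range(A.ndim))
        M[cell] += A[idx]
        counts[cell] += 1
    M /= np.maximum(counts, 1)
    # Step 3: return the dimension-wise clusterings and representatives.
    return labels, M
```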
Our algorithm might be counterintuitive to some readers, as merely clustering along individual dimensions and then combining the results is against the idea of "co"-clustering, where one simultaneously clusters along different dimensions. However, our analysis shows that dimension-wise clustering suffices to obtain strong approximation guarantees for tensor clustering—a fact often observed empirically too. It is also easy to see that CoTeC runs in time O((m/t) T(t)), if the subroutine for dimension-wise clustering takes T(t) time. The main contribution of this paper is the following approximation guarantee for CoTeC, which we prove in the remainder of this section.

Theorem 1 (Approximation). Let A be an order-m tensor and let C_j denote its clustering along the jth subset of t dimensions (1 ≤ j ≤ m/t), as obtained from a multiway clustering algorithm with guarantee α_t.¹ Let C = (C_1, . . . , C_{m/t})

¹ We say an approximation algorithm has guarantee α if it yields a solution that achieves an objective value within a factor O(α) of the optimum.
denote the induced tensor clustering, and J_OPT(m) the objective of the best m-dimensional clustering. Then,

J(C) ≤ p(m/t) ρ_d α_t J_OPT(m),    (3.1)

with
1. ρ_d = 1 and p(m/t) = 2^{log_2(m/t)} = m/t if d(x, y) = (x − y)²,
2. ρ_d = 1 and p(m/t) = 3^{log_2(m/t)} = (m/t)^{log_2 3} if d(x, y) is a metric.²

Thm. 1 is quite general, and it can be combined with some natural assumptions (see §3.3) to yield results for tensor clustering with general divergence functions (though ρ_d might be greater than 1). For particular choices of d one can perhaps derive tighter bounds, though for squared Euclidean distances, we provide an explicit example (Fig. 2) that shows the bound to be tight in 2D.

3.1 Analysis: Theorem 1, Euclidean Case
We begin our proof with the Euclidean case, i.e., d(x, y) = (x − y)². Our proof is inspired by the techniques of [12]. We establish that given a clustering algorithm which clusters along t of the m dimensions at a time³ with an approximation factor of α_t, CoTeC achieves an objective within a factor O((m/t) α_t) of the optimal. For example, for t = 1 we can use the seeding methods of [8,9] or the stronger approximation algorithms of [5]. We assume without loss of generality (wlog) that m = 2^h t for an integer h (otherwise, pad in empty dimensions). Since for the squared Frobenius norm, each cluster "center" is given by the mean, we can recast Problem (2.6) into a more convenient form. To that end, note that the individual entries of the means tensor M are given by (cf. (2.2))

M_{I_1 ... I_m} = \frac{1}{|I_1| · · · |I_m|} \sum_{i_1 ∈ I_1, ..., i_m ∈ I_m} a_{i_1 ... i_m},    (3.2)
with index sets I_j for 1 ≤ j ≤ m. Let C̄_j be the normalized cluster indicator matrix obtained by normalizing the columns of C_j, so that C̄_jᵀC̄_j = I_{k_j}. Then, we can rewrite (2.6) in terms of projection matrices P_j as:

minimize_{C = (C̄_1, ..., C̄_m)}  J(C) = ‖A − (P_1, . . . , P_m) · A‖²,   s.t.  P_j = C̄_j C̄_jᵀ.    (3.3)
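The normalized indicator matrices in (3.3) behave as claimed; a quick numerical sanity check of ours, with made-up labels:

```python
import numpy as np

labels = np.array([0, 0, 1, 2, 1])      # 5 points, 3 clusters (made up)
C = np.eye(3)[labels]                   # n x k cluster indicator matrix
Cbar = C / np.sqrt(C.sum(axis=0))       # normalize columns
P = Cbar @ Cbar.T                       # P averages within clusters
assert np.allclose(Cbar.T @ Cbar, np.eye(3))   # orthonormal columns
assert np.allclose(P @ P, P)                   # idempotent: a projection
```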
Lemma 1 (Pythagorean). Let P = (P_1, . . . , P_t), P^⊥ = (I − P_1, . . . , I − P_t) be collections of projection matrices P_j, and S and R be arbitrary collections of m − t projection matrices. Then,

‖(P, S) · A + (P^⊥, R) · B‖² = ‖(P, S) · A‖² + ‖(P^⊥, R) · B‖².

² The results can be trivially extended to λ-relaxed metrics that satisfy d(x, y) ≤ λ(d(x, z) + d(z, y)); the corresponding approximation factor just gets scaled by λ.
³ One could also consider clustering differently sized subsets of the dimensions, say {t_1, . . . , t_r}, where t_1 + · · · + t_r = m. However, this requires unilluminating notational jugglery, which we can skip for simplicity of exposition.
Proof. Using ‖A‖² = ⟨A, A⟩ we can rewrite the l.h.s. as

‖(P, S) · A + (P^⊥, R) · B‖² = ‖(P, S) · A‖² + ‖(P^⊥, R) · B‖² + 2⟨(P, S) · A, (P^⊥, R) · B⟩,

from which the last term is immediately seen to be zero using Property (2.4) and the fact that P_jᵀ(I − P_j) = P_j(I − P_j) = 0.

Some more notation. Since we cluster along t dimensions at a time, we recursively partition the initial set of all m dimensions until (after log(m/t) + 1 steps) the sets of dimensions have length t. Let l denote the level of recursion, starting at l = log(m/t) = h and going down to l = 0. At level l, the sets of dimensions will have length 2^l t (so that for l = 0 we have t dimensions). We represent each clustering along a subset of 2^l t dimensions by its corresponding 2^l t projection matrices. We gather these projection matrices into the collection P_i^l (note boldface), where the index i ranges from 1 to 2^{h−l}. We also need some notation to represent a complete tensor clustering along all m dimensions, where only a subset of 2^l t dimensions are clustered. We pad the collection P_i^l with m − 2^l t identity matrices for the non-clustered dimensions, and call this padded collection Q_i^l.
j Pj = Pj (I − Pj ) = 0. Some more notation. Since we cluster along t dimensions at a time, we recursively partition the initial set of all m dimensions until (after log(m/t) + 1 steps), the sets of dimensions have length t. Let l denote the level of recursion, starting at l = log(m/t) = h and going down to l = 0. At level l, the sets of dimensions will have length 2l t (so that for l = 0 we have t dimensions). We represent each clustering along a subset of 2l t dimensions by its corresponding 2l t projection matrices. We gather these projection matrices into the collection Pil (note boldface), where the index i ranges from 1 to 2h−l . We also need some notation to represent a complete tensor clustering along all m dimensions, where only a subset of 2l t dimensions are clustered. We pad the collection Pil with m − 2l t identity matrices for the non-clustered dimensions, and call this padded collection Qli . With recursive partitioning of the dimensions, Qli subsumes Q0j for 2l (i − 1) < j ≤ 2l i, i.e., Qli =
2l i j=2l (i−1)+1
Q0j .
At level 0, the algorithm yields the collections Q_i^0 and P_i^0. The remaining clusterings are simply combinations, i.e., products of these level-0 clusterings. We denote the collection of m − 2^l t identity matrices (of appropriate size) by I^l, so that Q_1^l = (P_1^l, I^l). Accoutered with our notation, we now prove the main lemma that relates the combined clustering to its sub-clusterings.

Lemma 2. Let A be an order-m tensor and m ≥ 2^l t. The objective function for any 2^l t-dimensional clustering P_i^l = (P^0_{2^l(i−1)+1}, . . . , P^0_{2^l i}) can be bounded via the sub-clusterings along only one set of dimensions of size t as

‖A − Q_i^l · A‖² ≤ 2^l max_{2^l(i−1) < j ≤ 2^l i} ‖A − Q_j^0 · A‖².    (3.4)
We can always (wlog) permute dimensions so that any set of 2^l clustered dimensions maps to the first 2^l ones. Hence, it suffices to prove the lemma for i = 1, i.e., the first 2^l dimensions.

Proof. We prove the lemma for i = 1 by induction on l.
Base: Let l = 0. Then Q_1^l = Q_1^0, and (3.4) holds trivially.
Induction: Assume the claim holds for l ≥ 0. Consider a clustering P_1^{l+1} = (P_1^l, P_2^l), or equivalently Q_1^{l+1} = Q_1^l Q_2^l. Using P + P^⊥ = I, we decompose A as

A = (P_1^{l+1} + P_1^{l+1,⊥}, I^{l+1}) · A
  = (P_1^l + P_1^{l,⊥}, P_2^l + P_2^{l,⊥}, I^{l+1}) · A
  = (P_1^l, P_2^l, I^{l+1}) · A + (P_1^l, P_2^{l,⊥}, I^{l+1}) · A + (P_1^{l,⊥}, P_2^l, I^{l+1}) · A + (P_1^{l,⊥}, P_2^{l,⊥}, I^{l+1}) · A
  = Q_1^l Q_2^l · A + Q_1^l Q_2^{l,⊥} · A + Q_1^{l,⊥} Q_2^l · A + Q_1^{l,⊥} Q_2^{l,⊥} · A,
where Q_1^{l,⊥} = (P_1^{l,⊥}, I^l). Since Q_1^{l+1} = Q_1^l Q_2^l, the Pythagorean property (Lemma 1) yields

‖A − Q_1^{l+1} · A‖² = ‖Q_1^l Q_2^{l,⊥} · A‖² + ‖Q_1^{l,⊥} Q_2^l · A‖² + ‖Q_1^{l,⊥} Q_2^{l,⊥} · A‖².
Combining the above equalities with the assumption (wlog) ‖Q_1^{l,⊥} Q_2^l · A‖² ≥ ‖Q_1^l Q_2^{l,⊥} · A‖², we obtain the inequalities

‖A − Q_1^l Q_2^l · A‖² ≤ 2 (‖Q_1^{l,⊥} Q_2^l · A‖² + ‖Q_1^{l,⊥} Q_2^{l,⊥} · A‖²)
  = 2 ‖Q_1^{l,⊥} Q_2^l · A + Q_1^{l,⊥} Q_2^{l,⊥} · A‖² = 2 ‖Q_1^{l,⊥} (Q_2^l + Q_2^{l,⊥}) · A‖²
  = 2 ‖Q_1^{l,⊥} · A‖² = 2 ‖A − Q_1^l · A‖² ≤ 2 max_{j∈{1,2}} ‖A − Q_j^l · A‖²
  ≤ 2 · 2^l max_{1 ≤ j ≤ 2^{l+1}} ‖A − Q_j^0 · A‖²,
where the last step follows from the induction hypothesis (3.4), and the two norm terms in the first line are combined using the Pythagorean Property.
Proof (Thm. 1, Case 1). Let m = 2^h t. Using an algorithm with guarantee α_t, we cluster each subset (indexed by i) of t dimensions to obtain Q_i^0. Let S_i be the optimal sub-clustering of subset i, i.e., the result that Q_i^0 would be if α_t were 1. We bound the objective for the collection of all m sub-clusterings P_1^h = Q_1^h as

‖A − Q_1^h · A‖² ≤ 2^h max_j ‖A − Q_j^0 · A‖² ≤ 2^h α_t max_j ‖A − S_j · A‖².    (3.5)
The first inequality follows from Lemma 2, while the last inequality follows from the α_t approximation factor that we used to get sub-clustering Q_j^0. So far we have related our approximation to an optimal sub-clustering along a set of dimensions. Let us hence look at the relation between such an optimal sub-clustering S of the first t dimensions (via permutation, these dimensions correspond to an arbitrary subset of size t), and the optimal tensor clustering F along all the m = 2^h t dimensions. Recall that a clustering can be expressed by either the projection matrices collected in Q_1^l, or by cluster indicator matrices C̄_i together with the mean tensor M, so that (C̄_1, . . . , C̄_{2^l t}, I^l) · M = Q_1^l · A. Let C_j^S and C_j^F be the dimension-wise cluster indicator matrices for S and F, respectively. By definition, S solves

min_{C_1, ..., C_t, M} ‖A − (C_1, . . . , C_t, I^0) · M‖²,   s.t.  C_j ∈ {0, 1}^{n_j × k_j},

which makes S even better than the sub-clustering (C_1^F, . . . , C_t^F) induced by the optimal m-dimensional clustering F. Thus,

‖A − S · A‖² ≤ min_M ‖A − (C_1^F, . . . , C_t^F, I^0) · M‖²
  ≤ ‖A − (C_1^F, . . . , C_t^F, I^0)(I, . . . , I, C_{t+1}^F, . . . , C_m^F) · M^F‖²
  = ‖A − F · A‖²,    (3.6)
where M^F is the tensor of means for the optimal m-dimensional clustering. Combining (3.5) with (3.6) yields the final bound for the combined clustering C = Q_1^h:

J_m(C) = ‖A − Q_1^h · A‖² ≤ 2^h α_t ‖A − F · A‖² = 2^h α_t J_OPT(m),

which completes the proof of the theorem.
Tightness of Bound. How tight is the bound for CoTeC implied by Thm. 1? The following example shows that for Euclidean co-clustering, i.e., m = 2, the bound is tight. Specifically, for every 0.25 > γ > 0, there exists a matrix for which the approximation is as bad as J(C) = (m − γ) J_OPT(m). Let ε be such that γ = 2ε(1 + ε)^{−2}. The optimal 1D row clustering C_1 for the matrix in Figure 2 groups rows {1, 2} and {3, 4} together, and the optimal column clustering is C_2 = ({a, b}, {c, d}). The co-clustering loss for the combination is J_2(C_1, C_2) = 8 + 8ε². The optimal co-clustering, grouping columns {a, d} and {b, c} (and rows as in C_1), achieves an objective of J_OPT(2) = 4(1 + ε)². Relating these results, we get J_2(C_1, C_2) = (2 − γ) J_OPT(m).

[Fig. 2. A 4 × 4 matrix (rows 1–4, columns a–d) with co-clustering approximation factor 2 − 2ε(1 + ε)^{−2}.]

However, this example is a worst-case scenario; the average factor is much better in practice, as revealed by our experiments (§4). The latter, combined with the structure of this negative example, suggests that with some assumptions on the data, one can probably obtain tighter bounds. Also note that the bound holds for a CoTeC-like scheme treating dimensions separately, but not necessarily for all approximation algorithms.

3.2 Analysis: Theorem 1, Metric Case
Now we present our proof of Thm. 1 for the case where d(x, y) is a metric. For this case, recall that the tensor clustering problem is

minimize_{(C_1, ..., C_m), M}  J(C) = d(A, (C_1, . . . , C_m) · M),   s.t.  C_j ∈ {0, 1}^{n_j × k_j}.    (3.7)

Since in general the best representative M is not the mean tensor, we cannot use the shorthand P · A for M, so the proof is different from the Euclidean case. The following lemma is the basis of the induction for this case of Thm. 1.

Lemma 3. Let A be of order m = 2^h t, and R_i^l the clustering of the i-th subset of 2^l t dimensions (for l < h) with an approximation guarantee of α_{2^l t}—R_i^l combines the C_j in a manner analogous to how Q_i^l combines projection matrices. Then the combination R^{l+1} = R_i^l R_j^l, i ≠ j, satisfies

min_M d(A, R^{l+1} · M) ≤ 3 α_{2^l t} min_M d(A, F^{l+1} · M),

where F^{l+1} is the optimal joint clustering of the dimensions covered by R^{l+1} (as before, we always assume that R_i^l and R_j^l cover disjoint subsets of dimensions).
Proof. Without loss of generality, we prove the lemma for R_1^{l+1} = R_1^l R_2^l. Let M_i^l = argmin_X d(A, R_i^l · X) be the associated representatives for i = 1, 2, and S_i^l the optimal 2^l-dimensional clusterings. Further let F_1^{l+1} = F_1^l F_2^l be the optimal 2^{l+1}-dimensional clustering. The following step is vital in relating objective values of R_1^{l+1} and S_i^l. The optimal sub-clusterings will eventually be bounded by the objective of the optimal F_1^{l+1}. Let L = 2^{l+1}, and

M̂ = argmin_{X ∈ R^{k_1 × ... × k_L × n_{L+1} × ... × n_m}} d(R_1^l M_1^l, R_1^l R_2^l · X).

Let i, j be multi-indices running over dimensions 1 to 2^l, and 2^l + 1 to 2^{l+1}, respectively; let r be the multi-index covering the remaining m − L dimensions. The multi-indices of the clusters defined by R_1^l and R_2^l, respectively, are I and J. Since M̂ is the element-wise minimum, we have

d(R_1^l · M_1^l, R_1^l R_2^l · M̂) ≤ \sum_{I,J} \sum_{i ∈ I, r} \sum_{j ∈ J} min_{μ_{IJr} ∈ R} d((μ_1^l)_{Ijr}, μ_{IJr})
  ≤ \sum_{I,J} \sum_{i ∈ I, r} \sum_{j ∈ J} d((μ_1^l)_{Ijr}, (μ_2^l)_{iJr}) = d(R_1^l · M_1^l, R_2^l · M_2^l).
Using this relation and the triangle inequality, we can now relate the objectives for the combined clustering and for the optimal sub-clusterings:

min_{M^{l+1}} d(A, R_1^l R_2^l · M^{l+1}) ≤ d(A, R_1^l R_2^l · M̂)
  ≤ d(A, R_1^l · M_1^l) + d(R_1^l · M_1^l, R_1^l R_2^l · M̂)
  ≤ d(A, R_1^l · M_1^l) + d(R_1^l · M_1^l, R_2^l · M_2^l)
  ≤ 2 d(A, R_1^l · M_1^l) + d(A, R_2^l · M_2^l)
  ≤ 2 α_{2^l t} min_{X_1} d(A, S_1^l · X_1) + α_{2^l t} min_{X_2} d(A, S_2^l · X_2).    (3.8)
However, owing to the optimality of S_1^l, we have

min_{X_1^l} d(A, S_1^l · X_1^l) ≤ min_{Y^l} d(A, F_1^l · Y^l) ≤ min_{Y^{l+1}} d(A, F_1^l F_2^l · Y^{l+1}),

and analogously for S_2^l. Plugging this inequality into (3.8) we get

min_{M^{l+1}} d(A, R_1^l R_2^l · M^{l+1}) ≤ 3 α_{2^l t} min_{Y^{l+1}} d(A, F_1^l F_2^l · Y^{l+1}) = 3 α_{2^l t} min_{Y^{l+1}} d(A, F_1^{l+1} · Y^{l+1}).
Proof (Thm. 1, Case 2). Given Lemma 3, the proof of Thm. 1 for the metric case follows easily by induction if we hierarchically combine the sub-clusterings and use α_{2^{l+1} t} = 3 α_{2^l t}, for l ≥ 0, as stated by the lemma.
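Unrolling this recursion over the h = log_2(m/t) levels makes the metric-case factor of Thm. 1 explicit (a one-line derivation we add for clarity):

α_{2^h t} = 3 α_{2^{h−1} t} = · · · = 3^h α_t = 3^{log_2(m/t)} α_t = (m/t)^{log_2 3} α_t,

which matches p(m/t) = 3^{log_2(m/t)} in case 2 of the theorem.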
3.3 Implications
We now mention several important implications of Theorem 1.
Clustering with Bregman divergences. Bregman divergence based clustering and co-clustering are well-studied problems [25,4]. Here, the function d(x, y) is parametrized by a strictly convex function f [24], so that d(x, y) = B_f(x, y) = f(x) − f(y) − ∇f(y)ᵀ(x − y). Under the assumption (also see [5,6])

σ_L ‖x − y‖² ≤ B_f(x, y) ≤ σ_U ‖x − y‖²    (3.9)

on the curvature of the divergence B_f(x, y), we can invoke Thm. 1 with ρ_d = σ_U/σ_L. The proofs are omitted for brevity, and may be found in [27]. We would like to stress that such curvature bounds seem to be necessary to guarantee constant approximation factors for the underlying 1D clustering—this intuition is reinforced by the results of [28], who avoided such curvature assumptions and had to be content with a non-constant O(log n) approximation factor for information theoretic clustering.

Clustering with ℓ_p-norms. Thm. 1 (metric case) immediately yields approximation factors for clustering with ℓ_p-norms. We note that for binary matrices, using t = 2 and the results of [11] we can obtain the slightly stronger guarantee J(C) ≤ 3^{log_2(m)−1} (1 + √2) α_1 J_OPT(m).

Exploiting 1D clustering results. Substituting the approximation factors α_1 of existing 1D clustering algorithms in Thm. 1 (with t = 1) instantly yields specific bounds for corresponding tensor clustering algorithms. Table 1 summarizes these results; however, we omit proofs for lack of space—see [27] for details.

Table 1. Approximation guarantees for Tensor Clustering Algorithms. K* denotes the maximum number of clusters, i.e., K* = max_j k_j; c is some constant.

Problem Name               | Approx. Bound                                   | Proof
Metric tensor clustering   | J(C) ≤ m(1 + ε) J_OPT(m)                        | Thm. 1 + [6]
Bregman tensor clustering  | E[J(C)] ≤ 8mc(log K* + 2) J_OPT(m)              | (3.9), Thm. 1 + [7]
Bregman tensor clustering  | J(C) ≤ m σ_U σ_L^{−1} (1 + ε) J_OPT(m)          | (3.9), Thm. 1 + [5]
Bregman co-clustering      | Above two results with m = 2                    | as above
Hilbertian metrics         | E[J(C)] ≤ 8m(log K* + 2) J_OPT(m)               | See [27]
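To make the curvature condition (3.9) concrete, here is a small sketch of ours computing B_f for two standard choices of f; the remark about KL constants assumes a domain bounded away from 0, which is our illustrative assumption, not a claim from the paper.

```python
import numpy as np

def bregman(f, grad_f, x, y):
    # B_f(x, y) = f(x) - f(y) - <grad f(y), x - y>
    return f(x) - f(y) - np.dot(grad_f(y), x - y)

# f(x) = ||x||^2 gives B_f(x, y) = ||x - y||^2, so (3.9) holds with
# sigma_L = sigma_U = 1 (hence rho_d = 1).
sq, grad_sq = lambda x: np.dot(x, x), lambda x: 2 * x

# Negative entropy f(x) = sum x_i log x_i gives (unnormalized) KL; on a
# coordinate-wise bounded domain, (3.9) holds with constants derived from
# the second derivative 1/x (illustrative assumption).
ent = lambda x: np.sum(x * np.log(x))
grad_ent = lambda x: np.log(x) + 1

x, y = np.array([0.3, 0.7]), np.array([0.5, 0.5])
print(bregman(sq, grad_sq, x, y))    # 0.08 == ||x - y||^2
print(bregman(ent, grad_ent, x, y))  # KL(x || y) since x, y sum to 1
```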
4 Experimental Results
Our bounds depend strongly on the approximation factor α_t of an underlying t-dimensional clustering method. In our experiments, we study this close dependence for t = 1, wherein we compare the tensor clusterings arising from different 1D methods of varying sophistication. Keep in mind that the comparison of the 1D methods is to see their impact on the tensor clustering built on top of them. Our experiments reveal that the empirical approximation factors are usually smaller than the theoretical bounds, and these factors depend on statistical
properties of the data. We also observe the linear dependence of the CoTeC objectives on the associated 1D objectives, as suggested by Thm. 1 (for Euclidean) and Table 1 (2nd row, for KL Divergence). Further comparisons show that in practice, CoTeC is competitive with a greedy heuristic SiTeC (Simultaneous Tensor Clustering), which simultaneously takes all dimensions into account, but lacks theoretical guarantees. As expected, initializing SiTeC with CoTeC yields lower final objective values using fewer "simultaneous" iterations.

We focus on Euclidean distance and KL Divergence to test CoTeC. To study the effect of the 1D method, we use two seeding methods, uniform and distance based (weighted farthest first) drawing. The latter ensures 1D approximation factors for E[J(C)] by [7] for Euclidean clustering and by [8,9] for KL Divergence.

[Fig. 3. Tensor clustering variants: each of the two seedings (uniform, distance-specific), by itself or followed by 1D k-means, yields a CoTeC variant (r, s, rk, sk); each CoTeC variant followed by SiTeC yields a SiTeC variant (rc, sc, rkc, skc).]

We use each seeding by itself and as an initialization for k-means to get four 1D methods for each divergence (see Fig. 3). We refer to the CoTeC combination of the corresponding independent 1D clusterings by abbreviations: (1) 'r': uniformly sample centers from the data points and assign each point to its closest center; (2) 's': sample centers with distance-specific seeding [7,8,9] (a sketch of this seeding appears below) and assign each point to its closest center; (3) 'rk': initialize Euclidean or Bregman k-means with 'r'; (4) 'sk': initialize Euclidean or Bregman k-means with 's'. The SiTeC method we compare to is the minimum sum-squared residue co-clustering of [29] for Euclidean distances in 2D, and a generalization of Algorithm 1 of [4] for 3D and Bregman 2D clustering. Additionally, we initialize SiTeC with the outcome of each of the four CoTeC variants, which yields four versions (of SiTeC), namely, rc, sc, rkc, and skc, initialized with the results of 'r', 's', 'rk', and 'sk', respectively. These variants inherit the guarantees of CoTeC, as they monotonically decrease the objective value.
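For illustration, a minimal sketch of the distance-based ('s') seeding in the style of k-means++ [7]; the function name and the squared-distance weighting for the Euclidean case are our assumptions.

```python
import numpy as np

def dist_seeding(X, k, rng=None):
    # Weighted farthest-first drawing: each new center is sampled with
    # probability proportional to the squared distance to current centers.
    rng = rng or np.random.default_rng(0)
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)
```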
4.1 Experiments on Synthetic Data
For a controlled setting with synthetic data, we generate tensors A of size 75 × 75 × 50 and 75 × 75, for which we randomly choose a 5 × 5 × 5 tensor of means M and cluster indicator matrices C_i ∈ {0, 1}^{n_i × 5}. For clustering with Euclidean distances we add Gaussian noise (from N(0, σ²) with varying σ) to A, while for KL Divergences we use the sampling method of [4] with varying noise. For each noise level to test, we repeat the 1D seeding 20 times on each of five generated tensors and average the resulting 100 objective values. To estimate the approximation factor α_m on a tensor, we divide the achieved objective J(C) by the objective value of the "true" underlying tensor clustering.
0.4
0.2
Fig. 4. Approximation factors for 3D clustering (left) and co-clustering (right) with increasing noise. Top row: Euclidean distances, bottom row: KL Divergence. The x axis shows σ, the y axis the empirical approximation factor. Table 2. (i) Improvement of CoTeC and SiTeC variants upon ‘r’ in %; the respective reference value (J2 for ‘r’) is shaded in gray. (ii) Average number of SiTeC iterations. (i)
k1 k 2 20 3 x=r x=s 20 6 x=r x=s 50 3 x=r x=s 50 6 x=r x=s
Bcell, Euc. CoTeC x xk 5.75 · 105 31.66 18.83 32.24 5.56 · 105 49.13 34.97 50.55 5.63 · 105 31.10 15.25 32.58 5.18 · 105 47.55 36.22 49.83
SiTeC xc xkc 20.05 33.05 24.61 33.36 35.26 50.37 43.93 51.66 14.77 31.76 19.14 33.17 34.63 48.41 43.77 50.55
(i)
k 1 k2 20 3 x=r x=s 20 6 x=r x=s 50 3 x=r x=s 50 6 x=r x=s
Bcell, KL CoTeC x xk 3.37 · 10−1 17.59 10.54 18.44 3.15 · 10−1 18.62 11.76 20.52 3.20 · 10−1 15.70 9.61 17.24 2.85 · 10−1 16.38 11.86 18.63
SiTeC xc xkc 22.23 23.26 22.99 22.98 24.51 25.43 25.69 26.23 20.12 21.07 20.85 21.33 21.61 22.57 23.24 23.13
(ii) k1 k2 rc rkc sc skc rc rkc sc skc (ii) k1 k2 20 3 7.0 ± 1.4 2.0 ± 0.2 3.9 ± 1.0 2.2 ± 0.5 20 3 10.6 ± 2.8 7.5 ± 2.0 7.4 ± 1.8 7.0 ± 2.2 20 6 11.3 ± 2.3 2.6 ± 0.8 5.1 ± 2.0 2.7 ± 0.7 20 6 12.6 ± 3.4 8.8 ± 2.9 8.4 ± 2.1 8.1 ± 2.0 50 3 50 3 6.2 ± 1.9 2.0 ± 0.0 3.5 ± 2.0 2.0 ± 0.0 9.1 ± 2.3 6.2 ± 1.3 6.9 ± 1.8 6.0 ± 1.3 50 6 50 6 10.5 ± 1.8 7.7 ± 2.1 8.1 ± 2.3 6.9 ± 1.0 8.1 ± 2.1 2.1 ± 0.3 4.1 ± 1.6 2.0 ± 0.0
empirical approximation factor α ˆ m for Euclidean distance and KL Divergence. Qualitatively, the plots for tensors of order 2 and 3 do not differ. In all settings, the empirical factor remains below the theoretical factor. The reason for decreasing approximation factors with higher noise could be lower accuracy of the estimates of JOPT on the one hand, and more similar objective values for all clusterings on the other hand. With low noise, distance-specific
380
S. Jegelka, S. Sra, and A. Banerjee
seeding s yields better results than uniform seeding r, and adding k-means on top (rk,sk) improves the results of both. With Euclidean distances, CoTeC with wellinitialized 1D k-means (sk) competes with SiTeC. For KL Divergence, though, SiTeC still improves on sk, and with high noise levels, 1D k-means does not help: both rk and sk are as good as their seeding only counterparts r, s. 4.2
Experiments on Biological Data
We further assess the behavior of our method with gene expression data4 from multiple sources [30,31,32]. For brevity, we only introduce two of the data sets here for which we present more detailed results; more datasets and experiments are described in [27]. The matrix Bcell [30] is a (1332×62) lymphoma microarray dataset of chronic lymphotic leukemia, diffuse large B-cell leukemia and follicular lymphoma. The order-3 tensor Interferon consists of gene expression levels from MS patients treated with recombinant human interferon beta [32]. After removal of missing values, a complete 6 × 21 × 66 tensor remained. For experiments with KL Divergence, we normalized all tensors to have their entries sum up to one. Since our analysis concerns the objective function J(C) alone, we disregard the “true” labels, which are available for only one of the dimensions. For each data set, we repeat the sampling of centers 30 times and average the resulting objective values. Panel (i) in Table 2 (order-2), and in Table 3 (order-3) show the objective value for the simplest CoTeC variant ‘r’ as a baseline, and the relative improvements achieved by other methods. The methods are encoded as x, xk, xc, xkc, where x stands for r or s, depending on the row in the table. Table 3. (i) Improvement of CoTeC and SiTeC variants upon ‘r’ in %; the respective reference value (J3 for ‘r’) is shaded in gray Interferon, KL x xk (i) k1 k2 k3 2 2 2 x=r 9.71 · 10−1 38.58 x=s 25.07 36.67 2 2 3 x=r 8.17 · 10−1 41.31 33.63 43.90 x=s 2 2 4 x=r 7.11 · 10−1 39.79 38.01 46.09 x=s
xc 42.46 43.53 46.06 46.82 44.05 51.30
xkc 43.53 43.74 46.31 47.16 45.62 51.35
Figure 5 summarizes the average improvements for all five order-2 data sets studied in [27]. Groups indicate methods, and colors indicate seeding techniques. On average, a better seeding improves the results for all methods: the gray bars are higher than their black counterparts in all groups. Just as for synthetic data, 1D k-means improves the CoTeC results here too. SiTeC (groups 3 and 4) is better than CoTeC with mere seeding (r, s, group 1).
⁴ We thank Hyuk Cho for kindly providing us his preprocessed 2D data sets.
Fig. 5. (i) % improvement of the objective J_2(C) with respect to uniform 1D seeding (r), averaged over all order-2 data sets and parameter settings (details in [27]). (ii) Average number of SiTeC iterations, in % with respect to initialization by r.
Notably, for Euclidean distances, combining good 1D clusters obtained by k-means (rk, sk, group 2) is on average better than SiTeC initialized with simple seeding (rc, sc, group 3). For KL Divergences, on the other hand, SiTeC still outperforms all CoTeC variations. Given the limitation to single dimensions, CoTeC performs surprisingly well in comparison to SiTeC. Additionally, SiTeC initialized with CoTeC converges faster to better solutions, further underscoring the utility of CoTeC.

Relation to 1D Clusterings. Our experiments support the theoretical results and the intuitive expectation that better 1D clusterings yield better CoTeC solutions. Can we quantify this relation? Theorem 1 suggests a linear dependence of the order-m factor α_m on α_1. However, these factors are difficult to check empirically when optimal clusterings are unknown. However, on one matrix J_OPT(2)/J_OPT(1) is constant, so if the approximation factors are tight (up to a constant factor), the ratio

J_2(C_1, C_2)/J_1(C_i) ≈ (α_2/α_1) · J_OPT(2)/J_OPT(1),    i = 1, 2,

only depends on α_2/α_1. Stating α_2 = 2α_1 ρ_d, Thm. 1 predicts J_2/J_1 to be independent of the 1D method, i.e., of α_1, and constant on one matrix. The empirical ratios J_2/J_1 in Figure 6 support this prediction, which suggests that for CoTeC the quality of the multi-dimensional clustering directly depends on the quality of its 1D components, both in theory and in practice.
Fig. 6. Left: average improvement of 1D clusterings (components) with respect to ‘r’. Right: average ratio J2 /J1 , both for the same clusterings as in Figure 5.
5 Conclusion
In this paper we presented CoTeC, a simple and, to our knowledge, the first approximation algorithm for tensor clustering, which yielded approximation results for Bregman co-clustering and tensor clustering as special cases. We proved an approximation factor that grows linearly with the order of the tensor, and showed tightness of the factor for the 2D Euclidean case (Fig. 2), though empirically the observed factors are usually smaller than suggested by the theory. Our worst-case example also illustrates the limitation of CoTeC, namely, that it ignores the interaction between clusterings along multiple dimensions. Thm. 1 thus gives hints as to how much information maximally lies in this interaction. Analyzing this interplay could potentially lead to better approximation factors, e.g., by developing a co-clustering specific seeding technique. Using such an algorithm as a subroutine in CoTeC will yield a hybrid that combines CoTeC's simplicity with better approximation guarantees.

Acknowledgment. AB was supported in part by NSF grant IIS-0812183.
References

1. Banerjee, A., Basu, S., Merugu, S.: Multi-way Clustering on Relation Graphs. In: SIAM Conf. Data Mining, SDM (2007)
2. Shashua, A., Zass, R., Hazan, T.: Multi-way Clustering Using Super-Symmetric Non-negative Tensor Factorization. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3954, pp. 595–608. Springer, Heidelberg (2006)
3. Dhillon, I.S., Mallela, S., Modha, D.S.: Information-theoretic co-clustering. In: KDD, pp. 89–98 (2003)
4. Banerjee, A., Dhillon, I.S., Ghosh, J., Merugu, S., Modha, D.S.: A Generalized Maximum Entropy Approach to Bregman Co-clustering and Matrix Approximation. JMLR 8, 1919–1986 (2007)
5. Ackermann, M.R., Blömer, J.: Coresets and Approximate Clustering for Bregman Divergences. In: ACM-SIAM Symp. on Disc. Alg., SODA (2009)
6. Ackermann, M.R., Blömer, J., Sohler, C.: Clustering for metric and non-metric distance measures. In: ACM-SIAM Symp. on Disc. Alg. (SODA) (April 2008)
7. Arthur, D., Vassilvitskii, S.: k-means++: The Advantages of Careful Seeding. In: ACM-SIAM Symp. on Discrete Algorithms (SODA), pp. 1027–1035 (2007)
8. Nock, R., Luosto, P., Kivinen, J.: Mixed Bregman clustering with approximation guarantees. In: Daelemans, W., Goethals, B., Morik, K. (eds.) ECML/PKDD 2008, Part II. LNCS (LNAI), vol. 5212, pp. 154–169. Springer, Heidelberg (2008)
9. Sra, S., Jegelka, S., Banerjee, A.: Approximation algorithms for Bregman clustering, co-clustering and tensor clustering. Technical Report 177, MPI for Biological Cybernetics (2008)
10. Ben-David, S.: A framework for statistical clustering with constant time approximation algorithms for K-median and K-means clustering. Mach. Learn. 66(2-3), 243–257 (2007)
11. Puolamäki, K., Hanhijärvi, S., Garriga, G.C.: An approximation ratio for biclustering. Inf. Process. Letters 108(2), 45–49 (2008)
12. Anagnostopoulos, A., Dasgupta, A., Kumar, R.: Approximation algorithms for co-clustering. In: Symp. on Principles of Database Systems, PODS (2008)
13. Zha, H., Ding, C., Li, T., Zhu, S.: Workshop on Data Mining using Matrices and Tensors. In: KDD (2008)
14. Hasan, M., Velazquez-Armendariz, E., Pellacini, F., Bala, K.: Tensor Clustering for Rendering Many-Light Animations. In: Eurographics Symp. on Rendering, vol. 27 (2008)
15. Kolda, T.G., Bader, B.W.: Tensor Decompositions and Applications. SIAM Review 51(3) (to appear, 2009)
16. Hartigan, J.A.: Direct clustering of a data matrix. J. of the Am. Stat. Assoc. 67(337), 123–129 (1972)
17. Cheng, Y., Church, G.: Biclustering of expression data. In: Proc. ISMB, pp. 93–103. AAAI Press, Menlo Park (2000)
18. Dhillon, I.S.: Co-clustering documents and words using bipartite spectral graph partitioning. In: KDD, pp. 269–274 (2001)
19. Bekkerman, R., El-Yaniv, R., McCallum, A.: Multi-way distributional clustering via pairwise interactions. In: ICML (2005)
20. Agarwal, S., Lim, J., Zelnik-Manor, L., Perona, P., Kriegman, D., Belongie, S.: Beyond pairwise clustering. In: IEEE CVPR (2005)
21. Govindu, V.M.: A tensor decomposition for geometric grouping and segmentation. In: IEEE CVPR (2005)
22. Schölkopf, B., Smola, A.: Learning with Kernels. MIT Press, Cambridge (2001)
23. Hein, M., Bousquet, O.: Hilbertian metrics and positive definite kernels on probability measures. In: AISTATS (2005)
24. Censor, Y., Zenios, S.A.: Parallel Optimization: Theory, Algorithms, and Applications. Oxford University Press, Oxford (1997)
25. Banerjee, A., Merugu, S., Dhillon, I.S., Ghosh, J.: Clustering with Bregman Divergences. JMLR 6(6), 1705–1749 (2005)
26. de Silva, V., Lim, L.H.: Tensor Rank and the Ill-Posedness of the Best Low-Rank Approximation Problem. SIAM J. Matrix Anal. & Appl. 30(3), 1084–1127 (2008)
27. Jegelka, S., Sra, S., Banerjee, A.: Approximation algorithms for Bregman co-clustering and tensor clustering (2009); arXiv:cs.DS/0812.0389v3
28. Chaudhuri, K., McGregor, A.: Finding metric structure in information theoretic clustering. In: Conf. on Learning Theory, COLT (July 2008)
29. Cho, H., Dhillon, I.S., Guan, Y., Sra, S.: Minimum Sum Squared Residue based Co-clustering of Gene Expression data. In: SDM, pp. 114–125 (2004)
30. Kluger, Y., Basri, R., Chang, J.T.: Spectral biclustering of microarray data: Coclustering genes and conditions. Genome Research 13, 703–716 (2003)
31. Cho, H., Dhillon, I.: Coclustering of human cancer microarrays using minimum sum-squared residue coclustering. IEEE/ACM Trans. Comput. Biol. Bioinf. 5(3), 385–400 (2008)
32. Baranzini, S.E., et al.: Transcription-based prediction of response to IFNβ using supervised computational methods. PLoS Biology 3(1) (2004)
Agnostic Clustering

Maria Florina Balcan¹, Heiko Röglin², and Shang-Hua Teng³

¹ College of Computing, Georgia Institute of Technology
[email protected]
² Department of Quantitative Economics, Maastricht University
[email protected]
³ Computer Science Department, University of Southern California
[email protected]
Abstract. Motivated by the principle of agnostic learning, we present an extension of the model introduced by Balcan, Blum, and Gupta [3] on computing low-error clusterings. The extended model uses a weaker assumption on the target clustering, which captures data clustering in presence of outliers or ill-behaved data points. Unlike the original target clustering property, with our new property it may no longer be the case that all plausible target clusterings are close to each other. Instead, we present algorithms that produce a small list of clusterings with the guarantee that all clusterings satisfying the assumption are close to some clustering in the list, proving both upper and lower bounds on the length of the list needed.
1 Introduction
Problems of clustering data from pairwise distance or similarity information are ubiquitous in science. Typical examples of such problems include clustering proteins by function, images by subject, or documents by topic. In many of these clustering applications there is an unknown target or desired clustering, and while the distance information among data is merely heuristically defined, the real goal in these applications is to minimize the clustering error with respect to the target clustering. A commonly used approach for data clustering is to first choose a particular distance-based objective function Φ (e.g., k-median or k-means) and then design a clustering algorithm that (approximately) optimizes this objective function [1, 2, 7]. The implicit hope is that approximately optimizing the objective function will in fact produce a clustering of low clustering error, i.e., a clustering that is pointwise close to the target clustering. Mathematically, the implicit assumption is that the clustering error of any c-approximation to Φ on the data set is bounded by some ε. We will refer to this assumed property as the (c, ε) property for Φ.
This work was done in part while the authors were at Microsoft Research, New England. Supported by a fellowship within the Postdoc-Program of the German Academic Exchange Service (DAAD).
Balcan, Blum, and Gupta [3] have shown that by making this implicit assumption explicit, one can efficiently compute a low-error clustering even in cases when the approximation problem of the objective function is NP-hard. In particular, they show that for any c = 1 + α > 1, if data satisfies the (c, ε) property for the k-median or the k-means objective, then one can produce a clustering that is O(ε)-close to the target, even for values c for which obtaining a c-approximation is NP-hard. However, the (c, ε) property is a strong assumption. In real data there may well be some data points for which the (heuristic) distance measure does not reflect cluster membership well, causing the (c, ε) property to be violated. A more realistic assumption is that the data satisfies the (c, ε) property only after some number of outliers or ill-behaved data points, i.e., a ν fraction of the data points, have been removed. We will refer to this property as the (ν, c, ε) property. While the (c, ε) property leads to the situation that all plausible clusterings (i.e., all the clusterings satisfying the (c, ε) property) are O(ε)-close to each other, two different sets of outliers could result in two different clusterings satisfying the (ν, c, ε) property. We therefore analyze the clustering complexity of this property [4], i.e., the size of the smallest ensemble of clusterings such that any clustering satisfying the (ν, c, ε) property is close to a clustering in the ensemble; we provide tight upper and lower bounds on this quantity for several interesting cases, as well as efficient algorithms for outputting a list such that any clustering satisfying the property is close to one of those in the list.

Perspective. The clustering framework we analyze in this paper is related in spirit to the agnostic learning model in the supervised learning setting [6]. In the Probably Approximately Correct (or PAC) learning model of Valiant [8], also known as the realizable setting, the assumption is that the data distribution over labeled examples is correctly classified by some fixed but unknown concept in some concept class, e.g., by a linear separator. In the agnostic setting [6], however, the assumption is weakened to the hope that most of the data is correctly classified by some fixed but unknown concept in some concept space, and the goal is to compete with the best concept in the class by an efficient algorithm. Similarly, one can view the (ν, c, ε) property as an agnostic version of the (c, ε) property, since we assume that the (ν, c, ε) property is satisfied if the (c, ε) property is satisfied on most but not all of the points, and moreover the points where the property is not satisfied are adversarially chosen.

Our results. We present several algorithmic and information-theoretic results in this new clustering model. For most of this paper we focus on the k-median objective function. In the case where the target clusters are large (have size Ω((ε/α + ν)n)) we show that the algorithm in [3] can be used in order to output a single clustering that is (ν + ε)-close to the target clustering. We then show that in the more general case there can be multiple significantly different clusterings that can satisfy
the property is close to one of the clusterings in the list. In the case where most of the points come from small clusters, we provide information-theoretic bounds on the clustering complexity of this property. We also show how both the analysis in [3] for the (c, ) property and our analysis for the (ν, 1 + α, ) property can be adapted to the inductive case, where we imagine our given data is only a small random sample of the entire data set. Based on the sample, our algorithm outputs a clustering or a list of clusterings of the full domain set that are evaluated with respect to the underlying distribution. We conclude by discussing how our analysis extends to the k-means objective function as well.
2 The Model
The clustering problems we consider fall into the following general framework: we are given a metric space M = (X, d) with point set X and a distance function d : X × X → R_{≥0} satisfying the triangle inequality — this is the ambient space. We are also given the actual point set S ⊆ X we want to cluster; we use n to denote the cardinality of S. A k-clustering C is a partition of S into k (possibly empty) sets C_1, C_2, . . . , C_k. In this work, we always assume that there is a true or target k-clustering C_T for the point set S.

Commonly used clustering algorithms seek to minimize some objective function or "score". For example, the k-median clustering objective assigns to each cluster C_i a "median" c_i ∈ C_i and seeks to minimize

Φ_1(C) = \sum_{i=1}^{k} \sum_{x ∈ C_i} d(x, c_i).

Another example is the k-means clustering objective, which assigns to each cluster C_i a "center" c_i ∈ X and seeks to minimize

Φ_2(C) = \sum_{i=1}^{k} \sum_{x ∈ C_i} d(x, c_i)².

Given a function Φ and an instance (M, S), let OPT_Φ = min_C Φ(C), where the minimum is over all k-clusterings of S. The notion of distance between two k-clusterings C = {C_1, C_2, . . . , C_k} and C' = {C'_1, C'_2, . . . , C'_k} that we use throughout the paper is the fraction of points on which they disagree under the optimal matching of clusters in C to clusters in C'; we denote that as dist(C, C'). Formally,

dist(C, C') = min_{σ ∈ S_k} \frac{1}{n} \sum_{i=1}^{k} |C_i − C'_{σ(i)}|,

where S_k is the set of bijections σ : {1, . . . , k} → {1, . . . , k}. We say that two clusterings C and C' are ε-close if dist(C, C') ≤ ε, and we say that a clustering has error ε if it is ε-close to the target.

The (1 + α, ε)-property. The following notion, originally introduced in [3] and later studied in [5], is central to our discussion:

Definition 1. Given an objective function Φ (such as k-median or k-means), we say that instance (S, d) satisfies the (1 + α, ε)-property for Φ with respect to the target clustering C_T if all clusterings C with Φ(C) ≤ (1 + α) · OPT_Φ are ε-close to the target clustering C_T for (S, d).
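The distance dist(C, C') defined above can be computed exactly via an optimal assignment over cluster labels; the following is a small sketch of ours (not from the paper) using SciPy's Hungarian solver, with hypothetical names.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# dist(C, C'): fraction of points misassigned under the best bijection
# sigma between the k cluster labels of two clusterings a and b.
def cluster_dist(a, b, k):
    n = len(a)
    cost = np.zeros((k, k))
    for i in range(k):
        for j in range(k):
            # |C_i - C'_j|: points in cluster i of a but not in cluster j of b
            cost[i, j] = np.sum((a == i) & (b != j))
    rows, cols = linear_sum_assignment(cost)   # optimal matching sigma
    return cost[rows, cols].sum() / n

a = np.array([0, 0, 1, 1, 2, 2])
b = np.array([1, 1, 0, 0, 2, 0])
print(cluster_dist(a, b, 3))   # 1/6: one point disagrees under the best sigma
```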
The (ν, 1 + α, ε)-property. In this paper, we study the following more robust variation of Definition 1:

Definition 2. Given an objective function Φ (such as k-median or k-means), we say that instance (S, d) satisfies the (ν, 1 + α, ε)-property for Φ with respect to the target clustering C_T if there exists a set of points S' ⊆ S of size at least (1 − ν)n such that (S', d) satisfies the (1 + α, ε)-property for Φ with respect to the clustering C_T ∩ S' induced by the target clustering on S'.

In other words, our hope is that the (1 + α, ε)-property for objective Φ is satisfied only after outliers or ill-behaved data points have been removed. Note that unlike the case ν = 0, in general the (ν, 1 + α, ε)-property could be satisfied with respect to multiple significantly different clusterings, since we allow the set of outliers or ill-behaved data points to be arbitrary. As a consequence we will be interested in the size of the smallest list any algorithm could hope to output that guarantees that at least one clustering in the list has small error. Given the instance (S, d), we say that a given clustering C is consistent with the (ν, 1 + α, ε)-property for Φ if (S, d) satisfies the (ν, 1 + α, ε)-property for Φ with respect to C. The following notion, originally introduced in [4], provides a formal measure of the inherent usefulness of a given property.

Definition 3. Given an instance (S, d) and the (ν, 1 + α, ε)-property for Φ, we define the (γ, k)-clustering complexity of the instance (S, d) with respect to the (ν, 1 + α, ε)-property for Φ to be the length of the shortest list of clusterings h_1, . . . , h_t such that any consistent k-clustering is γ-close to some clustering in the list. The (γ, k) clustering complexity of the (ν, 1 + α, ε)-property for Φ is the maximum of this quantity over all instances (S, d).

Ideally, the (ν, 1 + α, ε) property should have (γ, k) clustering complexity polynomial in k, 1/ε, 1/ν, 1/α, and 1/γ. Sometimes we analyze the clustering complexity of our property restricted to some family of interesting clusterings. We define this analogously:

Definition 4. Given an instance (S, d) and the (ν, 1 + α, ε)-property for Φ, we define the (γ, k)-restricted clustering complexity of the instance (S, d) with respect to the (ν, 1 + α, ε)-property for Φ and with respect to some family of clusterings F to be the length of the shortest list of clusterings h_1, . . . , h_t such that any consistent k-clustering in the family F is γ-close to some clustering in the list. The (γ, k) restricted clustering complexity of the (ν, 1 + α, ε)-property for Φ and F is the maximum of this quantity over all instances (S, d).

For example, we will analyze the (ν, 1 + α, ε)-property restricted to clusterings in which every cluster has size Ω((ε/α + ν)n) or to the case where the average cluster size is at least Ω((ε/α + ν)n). Throughout the paper we use the following notations: For n ∈ N, we denote by [n] the set {1, . . . , n}. Furthermore, log denotes the logarithm to base 2. We say that a list C_1, C_2, C_3, . . . of clusterings is laminar if C_{i+1} can be obtained from C_i by merging some of the clusters of C_i.
3 k-Median Based Clustering: The (1 + α, ε)-Property
We start by summarizing in Section 3.1 consequences of the (1 + α, ε)-property that are critical for the new results we present in this paper. We also describe the algorithm presented in [3] for the case that all clusters in the target clustering are large. Then in Section 3.2 we show how this algorithm can be extended to and analyzed in the inductive case.

3.1 Key Properties of the (1 + α, ε)-Property
Given an instance of k-median specified by a metric space M = (X, d) and a set of points S ⊆ X, fix an optimal k-median clustering C* = {C*_1, . . . , C*_k}, and let c*_i be the center point for C*_i. For x ∈ S, let w(x) = min_i d(x, c*_i) be the contribution of x to the k-median objective in C* (i.e., x's "weight"), and let w_2(x) be x's distance to the second-closest center point among {c*_1, c*_2, . . . , c*_k}. Also, let w = (1/n) Σ_{x∈S} w(x) = OPT/n be the average weight of the points. Finally, let ε* = dist(C_T, C*); so, from the (1 + α, ε)-property we have ε* < ε.

Lemma 5 ([3]). If the k-median instance (M, S) satisfies the (1 + α, ε)-property with respect to C_T, then
(a) less than 6εn points x ∈ S have w_2(x) − w(x) < αw/(2ε),
(b) if each cluster in C_T has size at least 2εn, less than (ε − ε*)n points x ∈ S on which C_T and C* agree have w_2(x) − w(x) < αw/ε, and
(c) for every z ≥ 1, at most εzn/α points x ∈ S have w(x) ≥ αw/(εz).

Algorithm 1. k-median, the case of large target clusters
Input: τ, b.
Step 1. Construct the graph G_τ = (S, E_τ) by connecting all pairs {x, y} of points in S with d(x, y) ≤ τ.
Step 2. Create a new graph H_{τ,b} where we connect two points by an edge if they share more than bn neighbors in common in G_τ.
Step 3. Let C' be any clustering obtained by taking the largest k components in H_{τ,b}, adding the vertices of all other smaller components to any of these.
Step 4. For each point x ∈ S and each cluster C'_j, compute the median distance d_med(x, j) between x and all points in C'_j. Insert x into the cluster C'_i for i = argmin_j d_med(x, j).
Output: Clustering C'
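A rough, non-optimized sketch of Algorithm 1 of ours for a precomputed n × n distance matrix; helper and variable names are hypothetical.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def threshold_cluster(d, k, tau, b):
    n = d.shape[0]
    G = (d <= tau).astype(int)
    np.fill_diagonal(G, 0)                   # Step 1: threshold graph G_tau
    H = (G @ G) > b * n                      # Step 2: > bn common neighbors
    _, comp = connected_components(csr_matrix(H), directed=False)
    sizes = np.bincount(comp)
    top_k = np.argsort(sizes)[::-1][:k]      # Step 3: k largest components
    labels = np.full(n, 0)                   # leftovers default to cluster 0
    for c, comp_id in enumerate(top_k):
        labels[comp == comp_id] = c
    # Step 4: reassign every point by median distance to the Step-3 clusters
    old = labels.copy()
    for x in range(n):
        meds = [np.median(d[x, old == j]) for j in range(k)]
        labels[x] = int(np.argmin(meds))
    return labels
```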
Theorem 6 ([3]). Assume that the k-median instance satisfies the (1 + α, ε)-property. If each cluster in C_T has size at least (3 + 10/α)εn + 2, then given w we can efficiently find a clustering that is ε-close to C_T. If each cluster in C_T has size at least (4 + 15/α)εn + 2, then we can efficiently find a clustering that is ε-close to C_T even without being given w.

Since some of the elements of this construction are essential in our subsequent proofs, we summarize in the following the main ideas of this proof.
Main Ideas of the Construction. Assume first that we are given w. We use Algorithm 1 with τ = 2αw/(5ε) and b = ε(1 + 5/α). For the analysis, let us define d_crit = αw/(5ε). We call point x good if both w(x) < d_crit and w_2(x) − w(x) ≥ 5d_crit, else x is called bad; by Lemma 5 and the fact that ε* ≤ ε, if all clusters in the target have size greater than 2εn, then at most an ε(1 + 5/α)-fraction of points is bad. Let X_i be the good points in the optimal cluster C*_i, and let B = S \ ∪X_i be the bad points. For instances satisfying the (1 + α, ε)-property, the threshold graph G_τ defined in Algorithm 1 has the following properties: (i) For all x, y in the same X_i, the edge {x, y} ∈ E(G_τ). (ii) For x ∈ X_i and y ∈ X_j, j ≠ i, the edge {x, y} ∉ E(G_τ). Moreover, such points x, y do not share any neighbors in G_τ (by the triangle inequality). This implies that each X_i is contained in a distinct component of the graph H_{τ,b}; the remaining components of H_{τ,b} contain vertices from the "bad bucket" B. Since the X_i's are larger than B, we get that the clustering C' obtained in Step 3 by taking the largest k components in H and adding the vertices of all other smaller components to one of them differs from the optimal clustering C* only in the bad points, which constitute an O(ε/α) fraction of the total.

To argue that the clustering C' is ε-close to C_T, we call a point x "red" if it satisfies w_2(x) − w(x) < 5d_crit, "yellow" if it is not red but w(x) ≥ d_crit, and "green" otherwise. So, the green points are those in the sets X_i, and we have partitioned the bad set B into red points and yellow points. The clustering C' agrees with C* on the green points, so without loss of generality we may assume X_i ⊆ C'_i. Since each cluster in C' has a strict majority of green points all of which are clustered as in C*, this means that for a non-red point x, the median distance to points in its correct cluster with respect to C* is less than the median distance to points in any incorrect cluster. Thus, C' agrees with C* on all non-red points. Since there are at most (ε − ε*)n red points on which C_T and C* agree by Lemma 5 — and C' and C_T might disagree on all these points — this implies dist(C', C_T) ≤ (ε − ε*) + ε* = ε, as desired.

The "unknown w" Case. If we are not given the value w, and every target cluster has size at least (4 + 15/α)εn + 2, we instead run Algorithm 1 (with τ = 2αw/(5ε) and b = ε(1 + 5/α)) repeatedly for different values of w, starting with w = 0 (so the graph G_τ is empty) and at each step increasing w to the next value such that G_τ contains at least one new edge. We say that a point is missed if it does not belong to the k largest components of H_{τ,b}. The number of missed points decreases with increasing w, and we stop with the smallest w for which we miss at most bn = ε(1 + 5/α)n points and each of the k largest components contains more than 2bn points. Clearly, for the correct value of w, we miss at most bn points because we miss only bad points. Additionally, every X_i contains more than 2bn points. This implies that our guess for w can only be smaller than the correct w, and the resulting graphs G_τ and H_{τ,b} can only have fewer edges than the corresponding graphs for the correct w. However, since we miss at most bn points and every set X_i contains more than bn points, there must be good points from every good set X_i that are not missed. Hence, each of the k largest components corresponds to a distinct cluster C*_i. We
might misclassify all bad points and at most bn good points (those not in the k largest components), but this nonetheless guarantees that each Ci contains at least |Xi | − bn ≥ bn + 2 correctly clustered green points (with respect to C ∗ ) and at most bn misclassified points. Therefore, as shown above for the case of known w, the resulting clustering C will correctly cluster all non-red points as in C ∗ and so is at distance at most from CT . 3.2
3.2 The Inductive Case
In this section we consider an inductive model in which the set S is merely a small random subset of points of size n from a much larger abstract instance space X, |X| = N, N ≫ n, and the clustering we output is represented implicitly through a hypothesis h : X → Y.

Algorithm 2. Inductive k-median
Input: (S, d), ε ≤ 1, α > 0, k, n.
Training Phase:
Step 1. Set w = min{d(x, y) | x, y ∈ S} and τ = 2αw/(5ε).
Step 2. Apply Steps 1, 2 and 3 of Algorithm 1 with parameters τ and b = 2(1 + 5/α)ε to generate a clustering C1, . . . , Ck of the sample S.
Step 3. If the total number of points in C1, . . . , Ck is at least (1 − b)n and each |Ci| ≥ 2bn, then terminate the training phase. Else increase τ to the smallest τ′ > τ for which Gτ′ ≠ Gτ and go to Step 2.
Testing Phase: When a new point z arrives, compute for every cluster Ci the median distance of z to all sample points in Ci. Assign z to the cluster that minimizes this median distance.
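The testing phase is a plain median-of-distances rule; the following is a minimal sketch (the helper name and the array representation are ours, not from the paper):

    import numpy as np

    def assign_new_point(dists_to_sample, sample_labels, k):
        """Testing phase of Algorithm 2 (sketch): given the distances from
        a new point z to all sample points and the training-phase cluster
        label of each sample point, return the index of the cluster whose
        sample points have the smallest median distance to z."""
        medians = [np.median(dists_to_sample[sample_labels == i]) for i in range(k)]
        return int(np.argmin(medians))

As the remark at the end of this subsection stresses, the medians are always taken over the original sample points of each Ci, not over points inserted later.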
Our main result in this section is the following:

Theorem 7. Assume that the k-median instance (X, d) satisfies the (1 + α, ε)-property and that each cluster in CT has size at least (6 + 30/α)εN + 2. If we draw a sample S of size n = Θ((1/ε) ln(k/δ)), then we can use Algorithm 2 to produce a clustering that is ε-close to the target with probability at least 1 − δ.

Proof. Let the good and bad points be defined as in Theorem 6, but over the whole instance space X. In particular, if w is the average weight of the points in the optimal k-median solution over the whole instance space, we call a point x good if both w(x) < dcrit and w2(x) − w(x) ≥ 5dcrit; else x is called bad. Let Xi be the good points in the optimal cluster Ci∗, and let B = X \ ∪Xi be the bad points. Since each cluster in CT has size at least (6 + 30/α)εN + 2, we can show, using reasoning similar to that in Theorem 6, that |Xi| > 5|B|. Also, since our sample is large enough, n = Θ((1/ε) ln(k/δ)), by Chernoff bounds, with probability at least 1 − δ over the sample we have |B ∩ S| < 2(1 + 5/α)εn and |Xi ∩ S| ≥ 4(1 + 5/α)εn, and so |Xi ∩ S| > 2|B ∩ S| for all i. This then ensures that if we apply Steps 1, 2 and 3 of Algorithm 1 with parameters τ = 2αw/(5ε) and b = 2(1 + 5/α)ε, we generate
a clustering C1, . . . , Ck of the sample S that is O(b)-close to the target on the sample. In particular, all good points in the sample that are in the same cluster form cliques in the graph Hτ,b, and good points from different clusters are in different connected components of this graph. So, taking the largest connected components of this graph gives us a clustering that is O(b)-close to the target clustering restricted to the sample S.
If we do not know w, then we use the same approach as in Theorem 6. That is, we start by setting w = 0 and increase it until the k largest components in the corresponding graph Hτ,b cover a large fraction of the points. The key point is that the correctness of this approach followed from the fact that the number of good points in every cluster is more than twice the total number of bad points. As we have argued above, this is satisfied with probability at least 1 − δ for the sample S as well; hence, arguments similar to the ones in Theorem 6 imply that we cluster the whole space with error at most ε.
Note that one can speed up Algorithm 2 as follows. Instead of repeatedly calling Algorithm 1 from scratch, we can store the graphs G and H and only add new edges to them in every iteration of Algorithm 2. Note also that in the test phase, when a new point z arrives, we compute for every cluster Ci the median distance of z to all the sample points in Ci (and not to all the points added to Ci so far), and assign z to the cluster that minimizes this median distance. Note also that a natural approach which does not work (due to the bad points) is to compute a centroid/median for each Ci and then insert new points based on the resulting Voronoi diagram.
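A minimal sketch of this speed-up, under the same assumed definition of Hτ,b as in the earlier sketch (an H-edge appears once two points share more than bn common Gτ-neighbors; the data structures and names are ours): since Gτ only gains edges as w grows, common-neighbor counts only grow, so the components of H only ever merge, and a union-find structure suffices.

    import numpy as np

    class UnionFind:
        """Standard union-find with path halving and union by size."""
        def __init__(self, n):
            self.parent = list(range(n))
            self.size = [1] * n
        def find(self, x):
            while self.parent[x] != x:
                self.parent[x] = self.parent[self.parent[x]]
                x = self.parent[x]
            return x
        def union(self, a, b):
            ra, rb = self.find(a), self.find(b)
            if ra == rb:
                return
            if self.size[ra] < self.size[rb]:
                ra, rb = rb, ra
            self.parent[rb] = ra
            self.size[ra] += self.size[rb]

    def incremental_h_components(D, b_count):
        """Process pairwise distances in increasing order, i.e., add one
        G-edge at a time; update common-neighbor counts locally and merge
        H-components as soon as a pair exceeds b_count common neighbors."""
        n = D.shape[0]
        adj = [{i} for i in range(n)]            # G-neighborhoods, self included
        common = np.zeros((n, n), dtype=int)     # common G-neighbor counts
        uf = UnionFind(n)
        edges = sorted((D[i, j], i, j) for i in range(n) for j in range(i + 1, n))
        for _, x, y in edges:
            # the new G-edge {x,y} adds y to N(x): every z with y in N(z)
            # gains one common neighbor with x (and symmetrically for x)
            for z in adj[y]:
                common[x, z] += 1
                common[z, x] += 1
                if common[x, z] > b_count:
                    uf.union(x, z)
            for z in adj[x]:
                common[y, z] += 1
                common[z, y] += 1
                if common[y, z] > b_count:
                    uf.union(y, z)
            adj[x].add(y)
            adj[y].add(x)
            # here the H-components for the current threshold can be read
            # off from uf (e.g., component sizes via uf.size at the roots)
        return uf

Each iteration touches only the neighborhoods of the endpoints of the new G-edge, instead of recomputing G and H from scratch as in the naive loop.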
4 k-Median Based Clustering: The (ν, 1 + α, ε)-Property
We now study k-median clustering under the (ν, 1 + α, ε)-property. If C is an arbitrary clustering consistent with the property and its set of outliers, or ill-behaved data points, is S \ S′, we will refer to w = OPT/n as the value of C, or the value of S′, where OPT is the value of the optimal k-clustering of the set S′.
We start with the simple observation that if we are given a value w corresponding to a consistent clustering CT on a subset S′, then we can efficiently find a clustering that is (ν + ε)-close to CT if all clusters in CT are large.

Proposition 8. Assume that the target CT is consistent with the (ν, 1 + α, ε)-property for k-median, and that each target cluster has size at least (3 + 10/α)εn + 2 + 2νn. Let S′ ⊆ S with |S′| ≥ (1 − ν)n be its corresponding set of non-outliers. If we are given the value of S′, then we can efficiently find a clustering that is (ν + ε)-close to CT.

Proof. We can use the same argument as in Theorem 6, with the modification that we treat the outliers, i.e., the ill-behaved data points S \ S′, as additional red bad points. To prove correctness, observe that the only property we used about red bad points is that in the graph Gτ none of them connects to points from two different sets Xi and Xj. Due to the triangle inequality, this is also satisfied for the outliers. The proof then proceeds as in Theorem 6 above.
4.1 Large Target Clusters
We now show that the (ν + ε, k)-clustering complexity of the (ν, 1 + α, ε)-property is 1 in the "large clusters" case. Specifically:

Theorem 9. Let F be the family of clusterings with the property that every cluster has size at least (4 + 15/α)εn + 2 + 3νn. Then the (ν + ε, k) restricted clustering complexity of the (ν, 1 + α, ε)-property with respect to F is 1, and we can efficiently find a clustering that is (ν + ε)-close to any clustering in F that is consistent with the (ν, 1 + α, ε)-property; in particular, this clustering is (ν + ε)-close to the target CT.

Proof. Let C1 be an arbitrary clustering consistent with the (ν, 1 + α, ε)-property of minimal value w. Let C2 be any other consistent clustering. By definition we know that there exist sets of points S1′ and S2′ of size at least (1 − ν)n such that (Si′, d) satisfies the (1 + α, ε)-property with respect to the induced clustering Ci ∩ Si′, for i = 1, 2. Let w and w′ denote the values of the clusterings C1 and C2 on the sets S1′ and S2′, respectively; by assumption we have w ≤ w′. Furthermore, let C1∗ and C2∗ denote the optimal k-clusterings on the sets S1′ and S2′, respectively. We set τ = 2αw/(5ε) and τ′ = 2αw′/(5ε), and b = (1 + 5/α)ε + ν, and consider the graphs Hτ,b and Hτ′,b. Let K1, . . . , Kk be the k largest connected components in the graph Hτ,b, and let K1′, . . . , Kk′ be the k largest connected components in the graph Hτ′,b.
For j ∈ [2], let Bj = (Sj′ \ ∪i Xi^j) ∪ (S \ Sj′) denote the bad set of the clustering Cj∗. As in Theorem 6, we can show that |Bj| ≤ ((1 + 5/α)ε + ν)n. For i ∈ [k], we denote by Xi^1 the intersection of Ki with the good set of the clustering C1∗, and by Xi^2 the intersection of Ki′ with the good set of the clustering C2∗. By the assumption that the size of the target clusters is more than three times the size of the bad set, we have |Xi^j| ≥ 2|Bj| for all i ∈ [k] and j ∈ [2]. As Hτ,b ⊆ Hτ′,b, this implies that (up to reordering) Ki ⊆ Ki′ for every i. This is because otherwise, if we end up merging two components Ki and Kj before reaching w′, then one of the clusters Kl′ must be a subset of B1, and so it must be strictly smaller than (4 + 15/α)εn + 2 + 3νn. This implies that the clusterings C1∗ and C2∗ are O(ε/α + ν)-close to each other, since they can only differ on the bad set B1 ∪ B2. By Proposition 8, this implies that the clusterings C1 and C2 are also O(ε/α + ν)-close to each other. Moreover, since |Xi^j| ≥ 2|Bj| for all i ∈ [k] and j ∈ [2], an argument similar to the one in Theorem 6 yields that the clusterings Cw and Cw′ obtained by running Algorithm 1 with w and w′, respectively, are identical; moreover, this clustering is (ν + ε)-close to both C1 and C2. This follows as the outliers in the sets S \ S1′ and S \ S2′ can be treated as additional red bad points, as described in Proposition 8 above.
Since C1 is an arbitrary clustering consistent with the (ν, 1 + α, ε)-property with a minimal value of w and C2 is any other consistent clustering, we obtain that the (ν + ε, k)-clustering complexity is 1. By the same arguments, we can also use the algorithm for unknown w, described after Theorem 6, to get (ν + ε)-close to any consistent clustering when we do not know the value of w beforehand.
4.2 Target Clusters That Are Large on Average
We show here that if we allow some of the target clusters to be small, then the (γ, k)-clustering complexity of the (ν, 1 + α, ε)-property is larger than one; it can be as large as k even for γ = 1/k. Specifically:

Theorem 10. For k ≤ νn and γ ≤ (1 − ν)/k, the (γ, k)-clustering complexity of the (ν, 1 + α, ε)-property is Ω(k).

Proof Sketch. Let A1, . . . , Ak be sets of size n(1 − ν)/k and let x1, . . . , xk be additional points not belonging to any of the sets A1, . . . , Ak, such that the optimal k-median solution on the set A1 ∪ . . . ∪ Ak is the clustering C = {A1, . . . , Ak} and the instance (A1 ∪ . . . ∪ Ak, d) satisfies the (1 + α, ε)-property. We assume that S ⊆ N and that every set Ai consists of n(1 − ν)/k points at exactly the same position ai ∈ N. In our construction, we will have a1 < . . . < ak. By placing the point x1 very far away from all the sets Ai and by placing A1 and A2 much closer together than any other pair of sets, we can achieve that the optimal k-median solution on the set A1 ∪ . . . ∪ Ak ∪ {x1} is the clustering {A1 ∪ A2, A3, . . . , Ak, {x1}} and that the instance (A1 ∪ . . . ∪ Ak ∪ {x1}, d) satisfies the (1 + α, ε)-property. We can continue analogously and place x2 very far away from all the sets Ai and from x1. Then the optimal k-median clustering on the set A1 ∪ . . . ∪ Ak ∪ {x1, x2} will be {A1 ∪ A2 ∪ A3, A4, . . . , Ak, {x1, x2}} if A2 and A3 are much closer together than Ai and Ai+1 for i ≥ 3; this instance also satisfies the (1 + α, ε)-property. This way, each of the clusterings {A1 ∪ . . . ∪ Ai, Ai+1, . . . , Ak, {x1}, {x2}, . . . , {xi−1}} is a consistent target clustering, and the distance between any two of them is at least γ.

Note that in the example in Theorem 10, all the clusterings that satisfy the (ν, 1 + α, ε)-property have the feature that the total number of points that come from large clusters (of size at least n(1 − ν)/k) is at least (1 − ν)n. We show that in such cases we also have an upper bound of k on the clustering complexity.

Theorem 11. Let b = (6 + 10/α)ε + ν. Let F be the family of clusterings with the property that the total number of points that come from clusters of size at least 2bn is at least (1 − β)n. Then the (2b + β, k) restricted clustering complexity of the (ν, 1 + α, ε)-property with respect to F is at most k, and we can efficiently construct a list of length at most k such that any clustering in F that is consistent with the (ν, 1 + α, ε)-property is (2b + β)-close to one of the clusterings in the list.

Proof. The main idea of the proof is to use the structure of the graphs H to show that the clusterings consistent with the (ν, 1 + α, ε)-property are almost laminar with respect to each other. Note that for all w < w′ we have Gw ⊆ Gw′ and Hw ⊆ Hw′. Here we use Gw and Hw as abbreviations for Gτ and Hτ,b with τ = 2αw/(5ε). In the following, we say that a cluster is large if it contains at least 2bn elements.
To find a list of clusterings that "covers" all the relevant clusterings, we use the following algorithm. We keep increasing the value of w until we reach a value w1 such that the following is satisfied: let K1, . . . , Kk denote the k largest
connected components of the graph Hw1, and assume |K1| ≥ |K2| ≥ . . . ≥ |Kk|. We set k1 = max{i ∈ [k] : |Ki| ≥ bn} and stop at the smallest w1 for which the clusters K1, . . . , Kk1 together cover a significant fraction of the space, namely a (1 − (b + β)) fraction. Let S̃ = K1 ∪ . . . ∪ Kk1. The first clustering we add to the list contains a cluster for each of the components K1, . . . , Kk1, and it assigns the points in S \ S̃ arbitrarily to these clusters. Now we increase the value of w, and each time an edge appears in Hw between two points in different components Ki and Kj, we merge the corresponding clusters to obtain a new clustering with at least one cluster fewer. We add this clustering to our list and continue until only one cluster is left. As the number of clusters decreases by at least one in every step, the list of clusterings produced this way has length at most k1 ≤ k (a schematic sketch of this construction is given after Corollary 12 below). Let w1, w2, . . . denote the values of w at which clusterings are added to the list.
To complete the proof, we show that any clustering C satisfying the property is (2b + β)-close to one of the clusterings in the list we constructed. Let wC denote the value corresponding to C. First we notice that wC ≥ w1. This follows easily from the structure of the graph HwC: it has one connected component for every large cluster in C, and each of these components must contain at least bn points, as every large cluster contains at least 2bn points and the bad set contains at most bn points. Also, by definition and the fact that the size of the bad set is bounded by bn, these components together cover at least a (1 − (b + β)) fraction of the points. This proves wC ≥ w1 by the definition of w1.
Now let i be maximal such that wi ≤ wC. We show that the clustering we output at wi is (2b + β)-close to the clustering C. Let K1′, . . . , Kk′ denote the components in Hwi that evolved from the Ki, and let K1′′, . . . , Kk′′ denote the evolved components in HwC. As wC < wi+1, we can assume (up to reordering) that Ki′ = Ki′′ on the set S̃. As all points in S̃ that are not in the bad set for wi are clustered in C according to the components K1′′, . . . , Kk′′, the clusterings corresponding to wi and wC can only differ on S \ S̃ and the bad set for wi. Using the fact that |S \ S̃| ≤ (b + β)n and that the size of the bad set is bounded by bn, we get that the clustering we output at wi is (2b + β)-close to the clustering C, as desired.
Moreover, if every large cluster has size at least (12 + 20/α)εn + 2νn + 2βn, then, since for w1 the size of the missed set is at most (6 + 10/α)εn + νn + βn, the intersection of the good set with every large cluster is larger than the missed set for wi for any i. This then implies that if we apply the median argument from Step 4 of Algorithm 1, the clustering we get for wi is (ν + ε + β)-close to the clustering C if i is chosen as in the previous proof. Together with Theorem 11, this implies the following corollary.

Corollary 12. Let b = (6 + 10/α)ε + ν. Let F be the family of clusterings with the property that the average cluster size n/k is at least 2bn/(1 − β). Then the (ν + ε + β, k) restricted clustering complexity of the (ν, 1 + α, ε)-property with respect to F is at most k, and we can efficiently construct a list of length at most k such that any clustering in F that is consistent with the (ν, 1 + α, ε)-property is (ν + ε + β)-close to one of the clusterings in the list.
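Returning to the list construction in the proof of Theorem 11, the following is a schematic sketch (an illustration only; the merge-event stream and the naming are ours, and such a stream can be produced, e.g., by the incremental union-find sketch in Section 3.2):

    def clustering_list(initial_labels, h_edge_events):
        """Sketch of the list construction in the proof of Theorem 11: start
        from the clustering given by the tracked components K1, ..., Kk1
        (points outside S-tilde pre-assigned arbitrarily), then record a
        new, coarser clustering whenever an H-edge joins two different
        tracked clusters. h_edge_events lists H-edges in order of
        increasing w."""
        labels = list(initial_labels)                       # cluster id per point
        out = [list(labels)]
        remaining = len(set(labels))
        for x, y in h_edge_events:
            a, b = labels[x], labels[y]
            if a == b:
                continue
            labels = [a if c == b else c for c in labels]   # merge cluster b into a
            remaining -= 1
            out.append(list(labels))
            if remaining == 1:
                break
        return out                                          # length at most k1

Each emitted clustering has at least one cluster fewer than its predecessor, which is exactly why the list has length at most k1 ≤ k.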
The Inductive Case. We show here how the algorithm in Theorem 11 can be extended to the inductive setting.

Theorem 13. Let b = (6 + 10/α)ε + ν. Let F be the family of clusterings with the property that the total number of points that come from clusters of size at least 2bn is at least (1 − β)n. If we draw a sample S of size n = O((1/ε) ln(k/δ)), then we can efficiently produce a list of length at most k such that any clustering in the family F that is consistent with the (ν, 1 + α, ε)-property is 3(2b + β)-close to one of the clusterings in the list with probability at least 1 − δ.

Proof Sketch. In the training phase, we run the algorithm of Theorem 11 over the sample to get a list L of clusterings. Then we run an independent "test phase" for each clustering in this list. Let C be one such clustering in the list L with clusters C1, . . . , Ck, and let S̃ be the set of relevant points defined in Theorem 11. In the test phase, when a new point x comes in, we compute for each cluster Ci the median distance of x to Ci ∩ S̃, and insert x into the cluster Ci to which it has the smallest median distance.
To prove correctness we use the fact that, as shown in Theorem 11, the (2b + β, k)-clustering complexity of the (ν, 1 + α, ε)-property is at most k when restricted to clusterings in which the total number of points coming from clusters of size at least 2bn is at least (1 − β)n. Let L′ be a list of k1 ≤ k clusterings such that any consistent clustering is (2b + β)-close to one of them. Now the argument is similar to the one in Theorem 7. In the proof of that theorem, we used a Chernoff bound to argue that, with probability at least 1 − δ, the good set of any cluster that is contained in the sample is more than twice as large as the total bad set in the sample. Here we additionally apply a union bound over the at most k clusterings in the list L′ to ensure this property for each of the clusterings. From that point on, the arguments are analogous to those in Theorem 7.
4.3 Small Target Clusters
We now consider the general case, where the target clusters can be arbitrarily small. We start with a proposition showing that if we are willing to relax the notion of closeness significantly, then the clustering complexity is still upper bounded by k even in this general case. With a more careful analysis, we then show a better upper bound on the clustering complexity in this general case.

Proposition 14. Let b = (6 + 10/α)ε + ν. Then the ((k + 4)b, k)-clustering complexity of the (ν, 1 + α, ε)-property is at most k.

Proof. Let us consider a clustering C = (C1, . . . , Ck) and a set S′ ⊆ S with |S′| ≥ (1 − ν)n such that (S′, d) satisfies the (1 + α, ε)-property with respect to the induced target clustering C ∩ S′. Let us first have a look at the graph Gw. There exists a bad set B of size at most bn such that, for every cluster i, the points in Xi = Ci \ B form cliques in Gw. There are no edges between Xi and Xj for i ≠ j, and there is no point x ∈ B that is simultaneously connected to Xi and Xj for i ≠ j.
If there are two different consistent clusterings C^1 and C^2 that have the same value w, then, by the properties of Gw, all points in S \ (B1 ∪ B2) are identically clustered. Hence, dist(C^1, C^2) ≤ (|B1| + |B2|)/n ≤ 2b. This implies that we do not lose too much by choosing, for every value w with multiple consistent clusterings, one of them as representative. To be precise, let w1 < w2 < · · · < wt be a list of all values for which a consistent clustering exists and, for every wi, let C^i denote a consistent clustering with value wi. We construct a sparsified list L of clusterings as follows: insert C^1 into L; if the last clustering added to L is C^i, add C^j for the smallest j > i for which dist(C^i, C^j) ≥ (k + 2)b. This way, every consistent clustering is (k + 4)b-close to at least one of the clusterings in L.
It remains to bound the length s of the list L. Let us assume for contradiction that s ≥ k + 1. According to the properties of the graphs Gwi, the clusterings induced by the first k + 1 clusterings in L on the set S \ (B1 ∪ . . . ∪ Bk+1) are laminar, where Bi denotes the bad set of the i-th clustering in L. Furthermore, as the bad set B1 ∪ . . . ∪ Bk+1 has size at most (k + 1)bn, two consecutive clusterings in the list must differ on the set S \ (B1 ∪ . . . ∪ Bk+1), which, together with the laminarity, implies that two clusters must have merged. This can happen at most k − 1 times, contradicting the assumption that s ≥ k + 1.

We will improve the result in the above proposition by requiring that consecutive clusterings in the list L in the above proof are significantly different in the laminar part. In particular, we will make use of the following lemma, which shows that if we have a laminar list of clusterings, then the sum of the pairwise distances between consecutive clusterings cannot be too big; this implies that if the pairwise distances between consecutive clusterings are all large, then the list must be short.

Lemma 15. Let C^1, . . . , C^s be a laminar list of clusterings, let k ≥ 2 denote the number of clusters in C^1, and let β ∈ (0, 1). If dist(C^i, C^(i+1)) ≥ β for every i ∈ [s − 1], then s ≤ min{9 log(k/β)/β, k}.

Proof. When going from C^i to C^(i+1), clusters contained in the clustering C^i merge into bigger clusters contained in C^(i+1). Merging the clusters K1, . . . , Kl ∈ C^i with |K1| ≥ |K2| ≥ · · · ≥ |Kl| into a cluster K ∈ C^(i+1) contributes (|K2| + · · · + |Kl|)/n to the distance between C^i and C^(i+1). When going from C^i to C^(i+1), multiple such merges can occur, and we know that their total contribution to the distance must be at least β. We view a single merge in which the pieces K1, . . . , Kl ∈ C^i merge into K ∈ C^(i+1) virtually as l − 1 merges and associate with each of them a type. We say that the merge corresponding to Ki, i = 2, . . . , l, has type j ∈ N if |Ki| ∈ [n/2^(j+1), n/2^j). If Ki has type j, we say that the data points contained in Ki participate in a merge of type j. For the step from C^i to C^(i+1), let x_ij denote the total number of virtual merges of type j that occur. The number of merges of type j that can occur during the whole sequence from C^1 to C^s is bounded from above by 2^(j+1), as each of the n data points can participate at most once in a merge of type j. This follows
because once a data point has participated in a merge of type j, it is contained in a piece of size at least n/2^j. We are only interested in types j ≤ L = ⌈log(k/β)⌉ + 1. As there can be at most k − 1 merges from C^i to C^(i+1), the total contribution to the distance between C^i and C^(i+1) coming from larger types can be at most k/2^(L+1) ≤ β/2. Hence, for every i ∈ [s − 1], the total contribution of types j ≤ L must be at least β/2. In terms of the x_ij, these conditions can be expressed as

for all j ∈ [L]: Σ_{i=1}^{s−1} x_ij / 2^(j+1) ≤ 1,   and   for all i ∈ [s − 1]: Σ_{j=1}^{L} x_ij / 2^j ≥ β/2.
This yields

(s − 1)β/4 ≤ Σ_{i=1}^{s−1} Σ_{j=1}^{L} x_ij / 2^(j+1) ≤ L,

and hence s ≤ 4L/β + 1 ≤ (4⌈log(k/β)⌉ + 4)/β + 1 ≤ 9 log(k/β)/β. As in every step at least two clusters must merge, we also have s ≤ k, and the lemma follows.

We can now show the following upper bound on the clustering complexity.

Theorem 16. Let b = (6 + 10/α)ε + ν. Then the (9b log(k/b), k)-clustering complexity of the (ν, 1 + α, ε)-property is at most 4 log(k/b)/b.

Proof. We use the same arguments as in Proposition 14. We construct L in the same way, but with 7b log(k/b) instead of (k + 2)b as the bound on the distance of consecutive clusterings. We assume for contradiction that s ≥ t := 4 log(k/b)/b and apply Lemma 15 with β = 7b log(k/b) − sb ≥ 3b log(k/b) to the induced clusterings on S \ (B1 ∪ . . . ∪ Bt). This yields s < t, contradicting the assumption that s ≥ t.
5 Discussion and Open Questions
In this work we extend the results of Balcan, Blum, and Gupta [3] on finding low-error clusterings to the agnostic setting, where we make the weaker assumption that the data satisfies the (c, ε)-property only after some outliers have been removed. While we have focused in this paper on the (ν, c, ε)-property for k-median, most of our results extend directly to the k-means objective as well. In particular, for the k-means objective one can prove an analog of Lemma 5 with different constants, which can then be propagated through the main results of this paper.
It is worth noting that we have implicitly assumed throughout the paper that the fraction ν of outliers, or a good upper bound on it, is known to the algorithm. In the most general case, where no good upper bound on ν is known, i.e., in the purely agnostic setting, we can run our algorithms 1/ε times, once for each integer multiple of ε, thus incurring only a 1/ε multiplicative factor increase in the clustering complexity and in the running time.
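A sketch of this outer loop (the callable run_with_nu, standing in for any of the list-producing algorithms above, is our naming):

    def purely_agnostic(run_with_nu, eps):
        """Unknown outlier fraction: try every integer multiple of eps as
        the guess for nu and concatenate the resulting lists of
        clusterings, paying a 1/eps multiplicative factor in list length
        and running time."""
        candidates = []
        for i in range(int(1.0 / eps) + 1):
            candidates.extend(run_with_nu(i * eps))
        return candidates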
Open Questions. The main concrete technical questions left open are whether one can show a better upper bound on the clustering complexity in the case of small target clusters, and whether in this case there is an efficient algorithm for constructing a short list of clusterings such that every consistent clustering is close to one of the clusterings in the list. More generally, it would also be interesting to analyze other natural variations of the (c, ε)-property. For example, a natural direction would be to consider variations expressing the belief that only those c-approximate clusterings that might be returned by natural approximation algorithms are close to the target. In particular, many approximation algorithms for clustering return Voronoi-based clusterings [7]. In this context, a natural relaxation of the (c, ε)-property is to assume that only the Voronoi-based clusterings that are c-approximations to the optimal solution are ε-close to the target. It would be interesting to analyze whether this is sufficient for efficiently finding low-error clusterings, both in the realizable and in the agnostic setting.

Acknowledgements. We thank Avrim Blum and Mark Braverman for a number of helpful discussions.
References

1. Jain, K., Mahdian, M., Saberi, A.: A new greedy approach for facility location problems. In: Proceedings of the Thirty-Fourth Annual ACM Symposium on Theory of Computing (2002)
2. Charikar, M., Guha, S., Tardos, E., Shmoys, D.B.: A constant-factor approximation algorithm for the k-median problem. In: Proceedings of the Thirty-First Annual ACM Symposium on Theory of Computing (1999)
3. Balcan, M.F., Blum, A., Gupta, A.: Approximate clustering without the approximation. In: Proceedings of the ACM-SIAM Symposium on Discrete Algorithms (2009)
4. Balcan, M.F., Blum, A., Vempala, S.: A discriminative framework for clustering via similarity functions. In: Proceedings of the 40th ACM Symposium on Theory of Computing (2008)
5. Balcan, M.F., Braverman, M.: Finding low error clusterings. In: Proceedings of the 22nd Annual Conference on Learning Theory (2009)
6. Kearns, M.J., Schapire, R.E., Sellie, L.M.: Toward efficient agnostic learning. Machine Learning (1994)
7. Kanungo, T., Mount, D.M., Netanyahu, N.S., Piatko, C.D., Silverman, R., Wu, A.Y.: A local search approximation algorithm for k-means clustering. In: Proceedings of the Eighteenth Annual Symposium on Computational Geometry (2002)
8. Valiant, L.: A theory of the learnable. Communications of the ACM 27(11), 1134-1142 (1984)
Author Index

Akutsu, Tatsuya 126
Angluin, Dana 171
Arias, Marta 156
Balcan, Maria Florina 384
Balcázar, José L. 156
Banerjee, Arindam 368
Becerra-Bonache, Leonor 171
Beygelzimer, Alina 247
Bilmes, Jeff 141
Bshouty, Nader H. 97
Bubeck, Sébastien 23
Carlucci, Lorenzo 323
Case, John 263
Cesa-Bianchi, Nicolò 110
Chernov, Alexey 8
Clémençon, Stéphan 216
Dasgupta, Sanjoy 1
Dębowski, Łukasz 53
Dediu, Adrian Horia 171
Gavaldà, Ricard 201
Geffner, Hector 2
Gentile, Claudio 110
Guillory, Andrew 141
Györfi, László 83
Han, Jiawei 3
Horimoto, Katsuhisa 126
Jain, Sanjay 293, 308, 338
Jegelka, Stefanie 368
Kevei, Péter 83
Kinber, Efim 308
Kötzing, Timo 263
Langford, John 247
Luo, Qinglong 293
Maillard, Odalric-Ambrym 232
Mansour, Yishay 4
Mazzawi, Hanna 97
Munos, Rémi 23
Perchet, Vianney 68
Pereira, Fernando C.N. 7
Ravikumar, Pradeep 247
Reyzin, Lev 171
Röglin, Heiko 384
Semukhin, Pavel 293
Simon, Hans Ulrich 353
Sra, Suvrit 368
Stephan, Frank 293, 338
Stoltz, Gilles 23
Szörényi, Balázs 186
Tamura, Takeyuki 126
Teng, Shang-Hua 384
Thérien, Denis 201
Vayatis, Nicolas 216, 232
Vitale, Fabio 110
Vovk, Vladimir 8
V'yugin, Vladimir V. 38
Ye, Nan 338
Yoshinaka, Ryo 278