Lecture Notes in Artificial Intelligence Subseries of Lecture Notes in Computer Science Edited by J. G. Carbonell and J. Siekmann
Lecture Notes in Computer Science Edited by G. Goos, J. Hartmanis and J. van Leeuwen
1720
Berlin Heidelberg New York Barcelona Hong Kong London Milan Paris Singapore Tokyo
Osamu Watanabe Takashi Yokomori (Eds.)
Algorithmic Learning Theory 10th International Conference, ALT’99 Tokyo, Japan, December 6-8, 1999 Proceedings
Series Editors Jaime G. Carbonell, Carnegie Mellon University, Pittsburgh, PA, USA Jörg Siekmann, University of Saarland, Saarbrücken, Germany Volume Editors Osamu Watanabe Tokyo Institute of Technology Department of Mathematical and Computing Sciences Tokyo 152-8552, Japan E-mail:
[email protected] Takashi Yokomori Waseda University Department of Mathematics, School of Education 1-6-1 Nishiwaseda, Shinjuku-ku, Tokyo 169-8050, Japan E-mail:
[email protected]
Cataloging-in-Publication data applied for
Die Deutsche Bibliothek - CIP-Einheitsaufnahme Algorithmic learning theory : 10th international conference ; proceedings / ALT '99, Tokyo, Japan, December 6-8, 1999. Osamu Watanabe ; Takashi Yokomori (ed.). - Berlin ; Heidelberg ; New York ; Barcelona ; Hong Kong ; London ; Milan ; Paris ; Singapore ; Tokyo : Springer, 1999 (Lecture notes in computer science ; Vol. 1720 : Lecture notes in artificial intelligence) ISBN 3-540-66748-2
CR Subject Classification (1998): I.2.6, I.2.3, F.4.1, I.7 ISBN 3-540-66748-2 Springer-Verlag Berlin Heidelberg New York This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer-Verlag. Violations are liable for prosecution under the German Copyright Law. © Springer-Verlag Berlin Heidelberg 1999 Printed in Germany Typesetting: Camera-ready by author SPIN: 10705377 06/3142 – 5 4 3 2 1 0
Printed on acid-free paper
Preface
This volume contains all the papers presented at the International Conference on Algorithmic Learning Theory 1999 (ALT'99), held at the Waseda University International Conference Center, Tokyo, Japan, December 6-8, 1999. The conference was sponsored by the Japanese Society for Artificial Intelligence (JSAI). In response to the call for papers, 51 papers on all aspects of algorithmic learning theory and related areas were submitted, of which 26 were selected for presentation by the program committee based on their originality, quality, and relevance to the theory of machine learning. In addition to these regular papers, this volume contains three papers of invited lectures presented by Katharina Morik of the University of Dortmund, Robert E. Schapire of AT&T Labs, Shannon Lab., and Kenji Yamanishi of NEC, C&C Media Research Lab. ALT'99 is not just another conference in the ALT series: it marks the tenth anniversary of the series, which was launched in Tokyo in October 1990 for the discussion of research topics in all areas related to algorithmic learning theory. The series was renamed last year from "ALT workshop" to "ALT conference", expressing its wider goal of providing an ideal forum to bring together researchers from both the theoretical and practical learning communities, producing novel concepts and criteria that would benefit both. This movement was reflected in the papers presented at ALT'99, several of which were motivated by application-oriented problems such as noise, data precision, etc. Furthermore, ALT'99 benefited from being held jointly with the 2nd International Conference on Discovery Science (DS'99), the conference for discussing, among other things, more applied aspects of machine learning. Also, we could celebrate the tenth anniversary of the ALT series with researchers from both theoretical and practical communities.
This year we started the E. Mark Gold Award for the most outstanding paper by a student author, selected by the program committee of the conference. This year's award was given to Yuri Kalnishkan for his paper "General Linear Relations among Different Types of Predictive Complexity". We wish to thank all who made this conference possible, first of all the authors for submitting papers and the three invited speakers for their excellent presentations and their contributions of papers to this volume. We are indebted to all members of the program committee: Nader Bshouty (Technion, Israel), Satoshi Kobayashi (Tokyo Denki Univ., Japan), Gábor Lugosi (Pompeu Fabra Univ., Spain), Masayuki Numao (Tokyo Inst. of Tech., Japan), Robert Schapire (AT&T Shannon Lab., USA), Arun Sharma (New South Wales, Australia), John Shawe-Taylor (Univ. of London, UK), Ayumi Shinohara (Kyushu Univ., Japan), Prasad Tadepalli (Oregon State Univ., USA), Junichi Takeuchi (NEC C&C Media Research Lab., Japan), Akihiro Yamamoto (Hokkaido Univ., Japan), Rolf Wiehagen (Univ. Kaiserslautern, Germany), and Thomas Zeugmann (Kyushu Univ., Japan). They and the subreferees (listed separately) put a huge amount of work into reviewing the submissions and judging their importance and significance. We also gratefully acknowledge the work of all those who did important jobs behind the scenes to make this volume as well as the conference possible. We thank Akira Maruoka for providing valuable suggestions, Shigeki Goto for the initial arrangement of a conference place, Naoki Abe for arranging an invited speaker, Shinichi Shimozono for producing the ALT'99 logo, Isao Saito for drawing the ALT'99 posters, and Springer-Verlag for their excellent support in preparing this volume. Last but not least, we are very grateful to all the members of the local arrangement committee: Taisuke Sato (chair), Satoru Miyano, and Ayumi Shinohara, without whose efforts this conference would not have been successful.
Tokyo, August 1999
Osamu Watanabe Takashi Yokomori
List of Referees
Hiroki Arimura Sanghamitra Bandyopadhyay Jonathan Baxter Nicolo Cesa-Bianchi Nello Cristianini Stuart Flockton Ryutaro Ichise Kimihito Ito Michael Kearns Pascal Koiran Eric Martin Llew Mason Ujjwal Maulik
Eric McCreath Andrew Mitchell Koichi Moriyama Atsuyoshi Nakamura Thomas R. Amoth Mark Reid Yasubumi Sakakibara Hiroshi Sakamoto Bernhard Schoelkopf Rich Sutton Noriyuki Tanida Chris Watkins Kenji Yamanishi Takashi Yokomori
Table of Contents
INVITED LECTURES Tailoring Representations to Different Requirements . . . . . . . . . . . . . . . . . . . . 1 Katharina Morik
Theoretical Views of Boosting and Applications . . . . . . . . . . . . . . . . . . . . . . . . 13 Robert E. Schapire Extended Stochastic Complexity and Minimax Relative Loss Analysis . . . . 26 Kenji Yamanishi
REGULAR CONTRIBUTIONS Neural Networks Algebraic Analysis for Singular Statistical Estimation . . . . . . . . . . . . . . . . . . . 39 Sumio Watanabe Generalization Error of Linear Neural Networks in Unidentifiable Cases . . . 51 Kenji Fukumizu The Computational Limits to the Cognitive Power of the Neuroidal Tabula Rasa . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 Jiří Wiedermann
Learning Dimension The Consistency Dimension and Distribution-Dependent Learning from Queries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77 José L. Balcázar, Jorge Castro, David Guijarro, and Hans-Ulrich Simon The VC-Dimension of Subclasses of Pattern Languages . . . . . . . . . . . . . . . . . 93 Andrew Mitchell, Tobias Scheffer, Arun Sharma, and Frank Stephan On the Vγ Dimension for Regression in Reproducing Kernel Hilbert Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106 Theodoros Evgeniou and Massimiliano Pontil
Inductive Inference On the Strength of Incremental Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118 Steffen Lange and Gunter Grieser Learning from Random Text . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132 Peter Rossmanith Inductive Learning with Corroboration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145 Phil Watson
Inductive Logic Programming Flattening and Implication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157 Kouichi Hirata Induction of Logic Programs Based on ψ-Terms . . . . . . . . . . . . . . . . . . . . . . . . 169 Yutaka Sasaki Complexity in the Case Against Accuracy: When Building One Function-Free Horn Clause Is as Hard as Any . . . . . . . . . . . . . . . . . . . . . . . . . 182 Richard Nock A Method of Similarity-Driven Knowledge Revision for Type Specifications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194 Nobuhiro Morita, Makoto Haraguchi, and Yoshiaki Okubo
PAC Learning PAC Learning with Nasty Noise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206 Nader H. Bshouty, Nadav Eiron, and Eyal Kushilevitz Positive and Unlabeled Examples Help Learning . . . . . . . . . . . . . . . . . . . . . . . 219 Francesco De Comité, François Denis, Rémi Gilleron, and Fabien Letouzey Learning Real Polynomials with a Turing Machine . . . . . . . . . . . . . . . . . . . . . 231 Dennis Cheung
Mathematical Tools for Learning Faster Near-Optimal Reinforcement Learning: Adding Adaptiveness to the E3 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241 Carlos Domingo A Note on Support Vector Machine Degeneracy . . . . . . . . . . . . . . . . . . . . . . . . 252 Ryan Rifkin, Massimiliano Pontil, and Alessandro Verri
Learning Recursive Functions Learnability of Enumerable Classes of Recursive Functions from “Typical” Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 264 Jochen Nessel On the Uniform Learnability of Approximations to Non-recursive Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 276 Frank Stephan and Thomas Zeugmann
Query Learning Learning Minimal Covers of Functional Dependencies with Queries . . . . . . . 291 Montserrat Hermo and Víctor Lavín Boolean Formulas Are Hard to Learn for Most Gate Bases . . . . . . . . . . . . . . 301 Víctor Dalmau Finding Relevant Variables in PAC Model with Membership Queries . . . . . 313 David Guijarro, Jun Tarui, and Tatsuie Tsukiji
On-Line Learning General Linear Relations among Different Types of Predictive Complexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 323 Yuri Kalnishkan Predicting Nearly as Well as the Best Pruning of a Planar Decision Graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 335 Eiji Takimoto and Manfred K. Warmuth On Learning Unions of Pattern Languages and Tree Patterns . . . . . . . . . . . . 347 Sally A. Goldman and Stephen S. Kwek
Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 365
Tailoring Representations to Different Requirements

Katharina Morik

Univ. Dortmund, Computer Science VIII, D-44221 Dortmund, Germany
[email protected]
http://www-ai.cs.uni-dortmund.de

Abstract. Designing the representation languages for the input and output of a learning algorithm is the hardest task within machine learning applications. Transforming the given representation of observations into a well-suited language LE may ease learning such that a simple and efficient learning algorithm can solve the learning problem. Learnability is defined with respect to the representation of the output of learning, LH. If predictive accuracy is the only criterion for the success of learning, the choice of LH means finding the hypothesis space with the most easily learnable concepts which contains the solution. Additional criteria for the success of learning, such as comprehensibility and embeddedness, may ask for transformations of LH such that users can easily interpret and other systems can easily exploit the learning results. Designing a language LH that is optimal with respect to all the criteria is too difficult a task. Instead, we design families of representations, where each family member is well suited for a particular set of requirements, and implement transformations between the representations. In this paper, we discuss a representation family of Horn logic. Work on tailoring representations is illustrated by a robot application.
1 Introduction
Machine learning has focused on the particular learning task of concept learning. Investigating this task has led to sound theoretical results as well as to efficient learning systems. However, some aspects of learning have not yet received the attention they deserve. Theoretical results are missing where they are urgently needed for successful applications of machine learning. If we contrast the theoretically analyzed setting of concept learning with the one found in real-world applications, we encounter quite a number of open research questions. Let me draw your attention to those questions that are related to tailoring representations. The setting of concept learning that has been well investigated can be summarized as follows.
O. Watanabe, T. Yokomori (Eds.): ALT'99, LNAI 1720, pp. 1-12, 1999. © Springer-Verlag Berlin Heidelberg 1999
Concept learning: Given a set of examples in a representation language LE, drawn according to some distribution, and background knowledge in a representation language LB,
learn a hypothesis in a representation language LH that classifies further examples properly according to a given criterion. Different paradigms vary their assumptions about the distribution and about the existence of background knowledge. In the paradigm of probably approximately correct (PAC) learning, for instance, an unknown but fixed distribution is assumed and background knowledge is ignored. In the paradigm of inductive logic programming (ILP), the distribution is ignored, but background knowledge is taken into account. Different approaches vary the particular success criterion. For instance, information gain, minimal description length, or Bayes-oriented criteria are discussed. Structural risk minimization balances the error rate and the complexity of hypotheses. The variety of criteria can be subsumed under that of accuracy of predictions (i.e., correctness of classifications of new observations). In contrast, applications of machine learning are evaluated with respect to additional criteria:

- comprehensibility: How easily can the learning results be interpreted by human decision makers?
- embeddedness: Most learning results are produced in order to enhance the capabilities of another system (e.g., a problem solving system, natural language system, or robot), the so-called performance system. The criterion here is: How easily can the learning results be used by the performance system?

These criteria are most often ignored in theoretical studies. Comprehensibility and embeddedness further constrain the choice of an appropriate hypothesis language LH. The problem is that they frequently do so in a contradictory manner. A representation that is well suited for a human user may be hard to handle for a performance system. A representation that is well suited for a system to perform a certain task may be hard to understand for a human user. To make things even worse, we also consider the input of the learning algorithm. A learning algorithm A can be characterized by its input (i.e., LE, LB) and output (i.e., LH) formats. Those pairs of input and output languages for which an efficient and effective learning algorithm exists are called admissible languages. In this way, LH is further constrained by the given data in LE. It is hard or even impossible to design a representation LH which is efficiently learnable from data in a given format, easily understood by users, and can be put to direct use by a performance system. Instead of searching for one representation that fulfills at the same time the requirements of the learning algorithm, the given data, the users, and the performance system, we may want to follow a more stepwise procedure. We divide the overall representation problem into several subproblems. Each subproblem consists of two parts, namely determining an appropriate representation and developing a transformation from a given representation into it. In order not to trade in a simple representation language for a complex transformation process,
we design families of representations, where each family member is well suited for a particular set of requirements, and efficient transformations between the family members are possible. The subproblems are:

Learnability: The majority of theoretical analyses shows the learnability of concept classes under certain restrictions on the learning task, the representation languages, and the criterion for accuracy. Hence, this subject does not require justification. However, the transformation from a given representation into the desired one raises questions that have not yet been addressed frequently. The transformation here converts LH into LH′, where LH′ is better suited for learning.

Optimizing input data: The hardest task in machine learning applications is the design of a well-suited representation LE′. On the one hand, the given representation formalism LE must be converted into the one accepted by the learning algorithm as input. On the other hand, the representation language (signature) can be optimized. The formalism remaining the same, the set of features (predicates) is changed. The standard procedure is to run the algorithm on diverse feature sets until one is found that leads to an acceptable accuracy of learning. Most feature construction or selection methods aim at overcoming this trial-and-error approach.

Comprehensibility: Although stated as a primary goal of machine learning from the very beginning, the criterion of comprehensibility never became a hot topic of learning theory.

Embeddedness or program optimization: Learning results are supposed to enhance a procedure. Most often, the procedure is already implemented as a particular performance system. The learning result in LH must be transformed into the representation LH′ of the performance system. Quite often, the performance system imposes restrictions that are not presupposed by the learning algorithm. In this case, the transformation may be turned into an optimization, exploiting the application's restrictions.
In this paper, a family of subsets of Horn logic is discussed and illustrated by a robot application.
2 Learnability of Restricted Horn Logic
The learnability subject does not need any justification here. However, the transformation of one representation LH into another one, LH′, which is better suited for learning, does. An important question is how to automatically recognize those parts of a given representation (or concept class) that cause negative learnability results. The user may then be asked whether the learning task could be weakened, or at least be warned that it may take exponential time to compute a result. For instance, variables that occur in the head of a clause but not in its body prevent the clause from being generative. This can easily be checked automatically. Even suggestions for fixing the clause can be computed on the basis of the predicates defined so far [27]. Alternatively, the learning algorithm
may start with a simple representation and increase the complexity of the hypothesis space. Of course, this alternative is easier than the recognition of the complexity class. Desired are learnability proofs that can be made operational, in the sense that indicators for (non-)learnability are defined that can be recognized automatically. One example of an operational complexity criterion is the one used by the support vector machine [31]. The width of the margin separating positive and negative examples corresponds to the VC dimension of the hypothesis space, but can be calculated much more easily.¹ According to this operational criterion, hypothesis spaces of increasing complexity are tried by the learning algorithm. A rough indication of the complexity of hypotheses in the ILP paradigm is the number of literals in a clause. According to this criterion, the RDT algorithm searches for increasingly complex hypotheses [8], and a similar procedure is followed in [5]. However, the number of literals is merely a heuristic measure of complexity. Let me now illustrate operational proofs by [9]. The difficulty of first-order logic induction originates in first-order logic deduction. Hypothesis testing, saturation of examples with background knowledge, the comparison of competing hypotheses, and the reduction of hypotheses under equivalence all use deductive inference. Since θ-subsumption [25] is a correct inference, and a complete one for non-self-resolving, non-tautological clauses, most ILP systems use it. θ-subsumption: A clause D θ-subsumes a clause C iff there exists a substitution θ such that Dθ ⊆ C. θ-subsumption is NP-complete because of the indeterminism of choosing θ in the general case. Hence, the clauses need to be restricted. Restricting Horn clauses to k-local clauses allows for polynomial learning [3]. William Cohen's proof maps k-local clauses onto monomials, for which learnability proofs exist. However, the proof is not operational.
It just shows where it is worth looking for theoretically well-based algorithms. What we need is an operational proof of the polynomial complexity of θ-subsumption for k-local clauses in order to solve the deductive problem within induction. We then need a learning algorithm exploiting the restriction so that learning k-local clauses is polynomial. Desired is an automatic check of whether a clause is k-local or not.

k-local Horn clause: Let D = D0 ← DDET ∧ DNONDET be a Horn clause, where DDET is the deterministic and DNONDET is the indeterministic part of its body. Let vars be a function computing the set of variables of a clause. LOC ⊆ DNONDET is a local part of D iff (vars(LOC) \ vars(D0 ∧ DDET)) ∩ vars(DNONDET \ LOC) = ∅ and there does not exist LOC′ ⊂ LOC which is also a local part of D. A local part LOC is k-local iff there exists a constant k such that |LOC| ≤ k. A Horn clause is k-local iff each local part of it is k-local.

¹ Determining the VC dimension is a problem in O(n^(O(log n))), where n is the size of a matrix representing the concept class [26].
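The definition of θ-subsumption above can be made concrete with a naive checker. The following is an illustrative sketch only: the tuple representation of literals, the uppercase-variable convention, and all names are my own assumptions, not the paper's notation or Kietz's algorithm. It enumerates every substitution from D's variables to C's terms, so it is exponential in general, which is exactly the NP-completeness that motivates the k-locality restriction.

```python
from itertools import product

def is_var(term):
    # Convention assumed here: variables are uppercase strings, constants lowercase.
    return isinstance(term, str) and term[:1].isupper()

def apply_subst(clause, theta):
    # Apply substitution theta to every argument of every literal in the clause.
    return {(pred, tuple(theta.get(a, a) for a in args)) for (pred, args) in clause}

def theta_subsumes(d, c):
    """Naive check whether clause D theta-subsumes clause C, i.e. D.theta <= C.

    Clauses are sets of literals (pred, (arg, ...)). Tries every mapping of
    D's variables to terms occurring in C; brute force, exponential in the
    number of variables of D.
    """
    d_vars = sorted({a for (_, args) in d for a in args if is_var(a)})
    c_terms = sorted({a for (_, args) in c for a in args})
    for values in product(c_terms, repeat=len(d_vars)):
        theta = dict(zip(d_vars, values))
        if apply_subst(d, theta) <= c:
            return True
    return False
```

For example, p(X, Y) θ-subsumes the clause containing p(a, b) via θ = {X/a, Y/b}, while p(X, X) does not, since no single term can fill both argument positions.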
Jörg-Uwe Kietz presents a fast subsumption algorithm for D = D0 ← DDET ∧ LOC1 ∧ ... ∧ LOCn and any Horn clause C [9]. For the deterministic part DDET of D, there exists exactly one substitution θ. An efficient subsumption algorithm for deterministic Horn clauses is presented in [11]. In the indeterministic part of D, the algorithm looks for local parts, i.e., sets of literals that share no indeterministic variable with the remaining literals. This is done by merging sets of literals which have a variable in common. In this way, the algorithm checks whether a given clause D is k-local or not. This can be done polynomially. If each local part is bounded by some k, its subsumption is in O(n · (k^k · |C|)). If, however, DBody consists of one large local part, the subsumption is exponential. The proof of polynomial complexity in the number of literals of the clauses D and C corresponds directly to the fast subsumption algorithm. The algorithm guarantees efficient subsumption for k-local D, but can be applied to any Horn clause. The least general generalization (LGG) of Gordon Plotkin [21] is then adjusted to k-local clauses, and it is shown that the size of the LGG of k-local clauses does not increase exponentially (as does the LGG of unrestricted Horn clauses) and that reduction of k-local clauses is polynomial, too [9]. Together with the negative learnability results in [10], k-local clauses can be considered the borderline between learnable and non-learnable indeterministic clauses. This concludes the illustration of what I mean by operational proofs.
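The merging step for finding local parts can be sketched as follows. This is an illustrative reconstruction under my own assumptions (literals as (predicate, args) tuples, variables as uppercase strings, the determinate variables given as input), not the published algorithm: literals linked, possibly transitively, by an indeterministic variable end up in the same part, and the clause is k-local iff every part has at most k literals.

```python
def local_parts(body_literals, determinate_vars):
    """Partition an indeterministic clause body into its local parts.

    Two literals belong to the same part iff they are connected (possibly
    transitively) by a variable that is not determinate, i.e. not bound by
    the head or the deterministic body part.
    """
    parts = []  # list of (literal list, indeterministic variable set)
    for lit in body_literals:
        lit_vars = {a for a in lit[1] if a[:1].isupper()} - set(determinate_vars)
        merged, merged_vars, rest = [lit], set(lit_vars), []
        for part, pvars in parts:
            if pvars & merged_vars:      # shares an indeterministic variable: merge
                merged.extend(part)
                merged_vars |= pvars
            else:
                rest.append((part, pvars))
        parts = rest + [(merged, merged_vars)]
    return [part for part, _ in parts]

def is_k_local(body_literals, determinate_vars, k):
    # A clause is k-local iff each local part has at most k literals.
    return all(len(p) <= k for p in local_parts(body_literals, determinate_vars))
```

For instance, with X determinate, the body q(X, Y), r(Y, Z), s(W) splits into the parts {q, r} (linked by Y) and {s}, so the clause is 2-local but not 1-local.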
3 Optimizing Input Data
The transformation of the given data within the same representation formalism is called feature construction/extraction if new features are built on the basis of the given ones, and feature selection if the most relevant subset of features is selected. An explicit construction is the invention and definition of new predicates as presented by [20], [29], and [32]. Including the transformation in the learning algorithm is called constructive induction [17], [2]. An implicit addition of new features is performed by the support vector machine [31]. Using a kernel function, the feature space is transformed such that the observations become linearly separable. A more complex input representation allows the algorithm to remain simple. Only because the enriched feature space need not be used for computation (the kernel functions are used instead) do transformation plus learning algorithm remain efficient. It is an open question how to make explicit which of the added features contributed most to a good learning result. Let me now turn to the transformation from one representation formalism to another. A learning algorithm is selected because of both its input and output formalisms (plus other properties). It may turn out, however, that there does not exist an algorithm with the desired input-output pair of languages. In this case, one of the options is to transform the data from the formalism in which they are available into the input format of the selected algorithm. The current interest in feature construction may stem from knowledge discovery in databases (KDD) [15]. The given database representation has to be transformed into one which is accepted by the learning algorithm. Of course, for an ILP learning
algorithm there exists a 1:1 mapping from a database table to a predicate [7]. However, this simple transformation is most often not one that eases learning. Since the arity of predicates cannot be changed through learning, a huge number of irrelevant database attributes is carried along. Therefore, the transformation from database tables into predicates should map parts of the tables to predicates, where the database key is one of the arguments of the predicate. A tool has been developed that gives an ILP learner direct access to a relational database, provides different types of mappings from tables to predicates, and constructs SQL queries for hypothesis testing automatically [18]. However, the open question remains whether there are theoretically well-founded indicators of the optimal mapping for a particular learning task. This is an essential question, since the number of possible mappings is bounded only by the size of the universal database relation. Therefore, it does not help that we may well compute the size of the hypothesis space for each mapping. What is required is a structure on the space of mappings that allows for efficient search. A frequent task when applying an ILP algorithm is to transform numerical measurements into qualitative Horn logic clauses (facts). A mobile robot, for instance, reports its action and perception by multivariate time series. Each sensor and the moving engine deliver a measurement per moment. The ILP learner requires a description of a path and of what has been sensed in terms of time intervals during which some assertions are true. Hence, the signal-to-symbol transformation needs to find adequate time intervals and assertions that summarize the measurements within each time interval. This task is different from time series analysis, where the curves of measurement are approximated. It corresponds to the analysis of event sequences, where events have to be automatically recognized.
As opposed to current approaches which learn about event sequences (e.g., [33], [16]), the robot application asks for a representation that can be produced incrementally on-line [14]. For our robot application we have implemented a simple algorithm which constructs predicates of the form increasing(MissionID, Angle, Sensor, FromTime, ToTime, RelToMove) from measurements [19]. During the time interval [FromTime, ToTime], the sensor perceives increasing distance while being oriented at a particular angle with respect to the global coordinates. The relation between measured distance and moved distance is expressed by the last argument. The algorithm reads in a measurement and compares it with the current summarizing assertion (i.e., a predicate with all arguments bound except for ToTime). Either ToTime is bound, the event has ended, and a new one starts, or the next measurement is read in. This procedure has some parameters (e.g., tolerated variance). These are adjusted on the data so that the transformation itself is adaptive. The language LE constructed has the nice property of fitting general chain rules. Given an appropriate substitution θ, a literal B of a chain rule can be unified with a ground fact f ∈ LE, i.e., Bθ = f. General chain rule: Let S be a literal or a set of literals. Let args(S) be a function that returns the Datalog arguments of S. A normal clause is a general chain rule iff its body literals can be arranged in a sequence
B1, B2, ..., Bk, Bk+1 such that, with B0 the head of the clause, there exist Datalog terms X, Z ∈ args(B0); X, Y1 ∈ args(B1); Y1, Y2 ∈ args(B2); ...; Yk−1, Yk ∈ args(Bk); and Yk, Z ∈ args(Bk+1). Obviously, chain rules are well suited for the representation of event sequences, using the time points as chaining arguments. LE is constructed such that chain rules can be learned from it. In other words, having chain rules in mind for LH, we transform the given data into facts that correspond to literals of LH. Note that this transformation is not motivated by the learning algorithm in order to make learning simpler. The aim is to design LE such that an LH can be learned which can easily be embedded in the robot application.
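The incremental signal-to-symbol step described above can be sketched as follows. This is a minimal illustration, not the implementation of [19]: the trend test, the fact layout, and all names are my own assumptions, and the angle and RelToMove arguments are omitted for brevity. The point is that each fact is emitted as soon as its ToTime becomes bound, so the representation is produced on-line, and consecutive facts share a time point, matching the chaining arguments of the general chain rules just defined.

```python
def summarize(readings, mission_id, sensor, tolerance=0.0):
    """Incrementally turn a distance time series into qualitative interval facts.

    readings: iterable of (time, value) pairs in temporal order.
    Yields facts (trend, mission_id, sensor, from_time, to_time), where trend
    is 'increasing', 'decreasing', or 'constant'; tolerance plays the role of
    the adjustable 'tolerated variance' parameter mentioned in the text.
    """
    def trend(prev, cur):
        if cur - prev > tolerance:
            return "increasing"
        if prev - cur > tolerance:
            return "decreasing"
        return "constant"

    facts = []
    prev_t = prev_v = cur_trend = start_t = None
    for t, v in readings:
        if prev_t is not None:
            step = trend(prev_v, v)
            if cur_trend is None:
                cur_trend, start_t = step, prev_t
            elif step != cur_trend:
                # The event has ended: bind ToTime and start a new assertion.
                facts.append((cur_trend, mission_id, sensor, start_t, prev_t))
                cur_trend, start_t = step, prev_t
        prev_t, prev_v = t, v
    if cur_trend is not None:
        facts.append((cur_trend, mission_id, sensor, start_t, prev_t))
    return facts
```

A rising-then-falling distance series thus yields two facts whose intervals meet at the turning point, which is exactly the chaining property a chain rule over time points can exploit.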
4 Comprehensibility
It is often claimed that learning results are easier to understand than statistical results, which demand a human interpreter who translates the results into conclusions for the customer of the statistical study. In particular, decision trees and rules were found to be easily understandable. However, these statements lack justification. In a user-independent way, proposed criteria for comprehensibility most often refer to the length of a description (the minimal description length (MDL) principle [24]). For decision trees, the number of nodes was used as a guideline for comprehensibility. For logic programs, the number of literals of a clause or the number of variables was used as an operational criterion for the ease of understanding [1]², [4]. Irene Stahl corrected the MDL for ILP by restricting the description length of examples to that of positive examples only [28]. Although presented as an operationalisation of comprehensibility, compression may lead to almost incomprehensible descriptions³. Its value for hypothesis testing and for structuring the hypothesis space notwithstanding, as a means to achieve comprehensible learning results it is questionable. Ryszard Michalski has proposed to use natural language for communicating learning results. The ease of transforming LH into natural language can then be used as a criterion for the naturalness of the representation. Edgar Sommer introduced the notion of extensional redundancy. Removing extensional redundancy compresses a theory, but compressions are not restricted to that. It is known from psychology that some redundancy eases understanding. Extensional redundancy aims at characterizing superfluous parts of a theory [27].

² The reconstruction of the examples and background knowledge from the logic theory and the example encoding by a reference Turing machine leads to a measure of compression: the length of the output tape minus the length of the input tape.
³ Think of compressed text files!

Extensional redundancy: Let G be the set of goal concepts in a theory T. Let Q be the set of instances of goal concepts that are derivable from T. Let C be a clause in T and L be a literal in C.
Katharina Morik
A literal L is extensionally redundant in C with respect to G, iff C is in the derivation of Q and C \ {L} still derives Q. A clause C is extensionally redundant in T with respect to Q, iff T ⊢ Q and T \ {C} ⊢ Q.

Structure is a key to comprehensibility. Structure is achieved by the folding operation, which leaves the minimal Herbrand model of a theory unchanged [30].

fold: Let C, D ∈ T be ordered clauses of the form C = C0 ← C1, ..., Cm, Cm+1, ..., Cm+n and D = D0 ← D1, ..., Dm. Let θ be a substitution satisfying the following conditions:
1. Ci = Di θ for i = 1, ..., m;
2. let X1, ..., Xl be the variables that occur in DBody but not in DHead; each Xj θ (j = 1, ..., l) is a variable that does not occur in CHead nor in Cm+1, ..., Cm+n; if i ≠ j, then Xi θ ≠ Xj θ;
3. D is the only clause in T whose head is unifiable with D0 θ.
If such a substitution exists, then C′ = C0 ← D0 θ, Cm+1, ..., Cm+n and Ti+1 = fold(Ti, C, D) = (Ti \ {C}) ∪ {C′}. Otherwise, T remains unchanged.

Several stratification operators have been developed that structure a theory. The FENDER program folds clauses that define a target concept with an intermediate concept [27]. The intermediate concept is made of common partial premises. A common partial premise is a set of literals which most frequently occur together in the given theory. The literals must share a variable, which will be omitted in the folded clause. However, not all logically unnecessary variables are omitted. This is meant not to hide relevant information from the user, but only to encapsulate internal details. An alternative stratification method named prefix elimination has been developed particularly for chain programs [23], i.e., for better communicating event sequences. As opposed to FENDER, which gathers common partial premises regardless of an ordering of literals, the prefix elimination method only replaces common prefixes of chained literals in a clause's body. Hence, applying prefix elimination to a chain program outputs a chain program. The prefix elimination method looks for common literal sequences in all clauses, not only in ones that are used to define the same concept.
It suppresses all unnecessary variables. This is meant to exhibit the time relations more clearly. An abstracted example from our robot application may illustrate the method.

Example: Let the theory Ti be
C1 = alongWall(S, X, Z) ← stand(V, O, X, Y1), stable(V, O, Y1, Y2), decrPeak(S, O, Y2, Z)
C2 = alongDoor(S, X, Z) ← stand(V, O, X, Y1), stable(V, O, Y1, Y2), incrPeak(S, V, Y2, Z)
Prefix elimination returns Ti+1:
Tailoring Representations to Different Requirements
C1′ = alongWall(S, X, Z) ← standstable(V, O, X, Y2), decrPeak(S, O, Y2, Z)
C2′ = alongDoor(S, X, Z) ← standstable(V, O, X, Y2), incrPeak(S, V, Y2, Z)
D = standstable(V, O, X, Y2) ← stand(V, O, X, Y1), stable(V, O, Y1, Y2)

Two time intervals are summarized, but no information is abstracted away. Compression and stratification are two operational criteria for the comprehensibility of logical theories. Whether they correspond to true needs of users should be studied empirically. Since the representation is to be understood by human users, the answer depends on general cognitive capabilities as well as on user-specific preferences which, in turn, depend on prior knowledge and training. Adjusting a representation to particular preferences is a learning task in its own right. A number of answer-set-equivalent representations could be presented to the user, who selects his favorite one. The selections serve as positive examples, and a profile of the user is learned which guides further presentations. Theoretical analysis of the user profiles could well lead to a refinement of compression which excludes incomprehensibly compact theories.
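To make the operation concrete, here is a small, hypothetical Python sketch of prefix elimination over clauses represented as (head, body) pairs, where each body literal is a (predicate, args) tuple. The data layout, the function name, and the fixed prefix length are illustrative assumptions, not the paper's implementation; the sketch also assumes the shared prefix uses identical variable names in all clauses, as in the example above.

```python
# Illustrative sketch (not the paper's code): replace a common body prefix
# shared by several clauses with a call to a new intermediate predicate.

def prefix_eliminate(clauses, prefix_len=2, new_pred="standstable"):
    """clauses: list of (head, body); body is a list of (pred, args) literals.
    If all clauses share the same leading `prefix_len` predicates, fold that
    prefix into one new clause defining `new_pred`."""
    prefix = [lit[0] for lit in clauses[0][1][:prefix_len]]
    if any([lit[0] for lit in body[:prefix_len]] != prefix for _, body in clauses):
        return clauses, None  # no common prefix: theory unchanged
    # Arguments of the new literal: chain from the first start time to the
    # last end time, dropping the internal chaining variable (here Y1).
    first, last = clauses[0][1][0], clauses[0][1][prefix_len - 1]
    new_args = first[1][:-1] + (last[1][-1],)
    folded = [(head, [(new_pred, new_args)] + body[prefix_len:])
              for head, body in clauses]
    definition = ((new_pred, new_args), clauses[0][1][:prefix_len])
    return folded, definition

clauses = [
    (("alongWall", ("S", "X", "Z")),
     [("stand", ("V", "O", "X", "Y1")), ("stable", ("V", "O", "Y1", "Y2")),
      ("decrPeak", ("S", "O", "Y2", "Z"))]),
    (("alongDoor", ("S", "X", "Z")),
     [("stand", ("V", "O", "X", "Y1")), ("stable", ("V", "O", "Y1", "Y2")),
      ("incrPeak", ("S", "V", "Y2", "Z"))]),
]
folded, definition = prefix_eliminate(clauses)
print(folded[0][1][0])  # ('standstable', ('V', 'O', 'X', 'Y2'))
```

Running the sketch on the abstracted theory reproduces the folded clauses C1′, C2′ and the definition D above: the internal time point Y1 disappears, while the outer chaining arguments X and Y2 are kept.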
5 Embeddedness or Program Optimization
Optimizing learning results for their use by a performance system is easier than optimizing them for human understanding, since the requirements of the system are known. In many cases the requirements of users and performance system are conflicting. However, they can both start from the same learning result, if LH has been chosen carefully. In our robotics application, we used the fact that chain programs correspond to definite finite automata (DFA). Since general chain programs, as opposed to elementary chain rules, are not equivalent to context-free grammars, we cannot apply the transformation presented by [6]. The general chain rules can be translated into a context-free grammar, but a transformed grammar cannot be uniquely translated back into general chain rules. Hence, we may theoretically describe the learning results with reference to context-free grammars, but this analysis cannot be put to use. It is possible, however, to transform general chain rules into a DFA. Starting from a theory which has been stratified by prefix elimination, the mapping is as follows [23]: A clause head becomes a state of the DFA. A clause defining a prefix becomes a transition of the DFA. The clause head of an overall goal with its substitutions becomes an output. The DFA is restricted to successful derivations. Compilation of learned theories consisting of 250 to 470 clauses took between 1 and 5 minutes CPU time [23]. A system using the learned and optimized knowledge was developed by Volker Klingspor [13], [22]. His system SHARC performs object recognition, planning, and plan execution. The low-level steps of object recognition are performed in parallel, each sensor having its own process. Object recognition is performed by forward inference on optimized clauses using a marker-passing strategy. Planning and plan execution is performed by backward inference on optimized clauses. The
general chain rules are learned by GRDT, an ILP learner with declarative bias [12]. GRDT is not restricted to LH being general chain rules. The top goals to be learned were moving along a door and moving through a door. Guiding the mobile robot by a joystick through and along doors in one environment resulted in the training data. The test was that the robot actually moved along or through other doors in a different but similar environment (i.e., the university building). No map was used. Many more experiments have been made using the simulation component of the PIONEER mobile robot. We were told before the project that real-time behavior is impossible on the basis of logic programs, and that neuro-computing or other numerical processing would be mandatory. However, the real time from sending sensor measurements to SHARC until the robot's reaction was in almost all cases below 0.005 seconds and at most 0.006 seconds [13]. This shows that learned clauses can navigate a robot in real time, if they are parallelized and optimized with respect to their use. In contrast, applying the learned clauses in their most comprehensible form is not fast enough for a robot application.
6 Conclusion
This paper tries to show that theory need not be restricted to the central learning step, leaving practitioners alone with the design and optimization of LE and LH. In contrast, developing operational indicators for the quality of a representation asks for theoretical analysis. Concerning the criteria of learnability, comprehensibility, and embeddedness, a brief overview of approaches towards well-based guidelines for the development and transformation of LE and LH is given. The question of how to design an appropriate LEj from given data in LE is particularly stressed. On one hand, the design of LEj is oriented towards making learning easier. On the other hand, the design of LEj is oriented towards an admissible LH which can be optimized with respect to embeddedness. The tailoring of representations is illustrated by a robot application. A general ILP learner with declarative bias, GRDT, outputs general chain clauses, because LE was made of ground chain facts. The original numerical data from the robot were transformed into ground chain facts by a simple procedure which uses parameter adjustment as its learning method. A theory can be optimized concerning comprehensibility by introducing intermediate predicates, which are then used for folding. A theory made of general chain rules can be optimized for real-time deductive inference using the prefix elimination method and the transformation into DFAs.
References

[1] A. Srinivasan, S. Muggleton, and M. Bain. The justification of logical theories based on data compression. Machine Intelligence, 13, 1993.
[2] Eric Bloedorn and Ryszard Michalski. Data-driven constructive induction: Methodology and applications. In Huan Liu and Hiroshi Motoda, editors, Feature Extraction, Construction, and Selection: A Data Mining Perspective, chapter 4, pages 51-68. Kluwer, 1998.
[3] William W. Cohen. Learnability of restricted logic programs. In ILP'93 Workshop, Bled, 1993.
[4] D. Conklin and I. Witten. Complexity-based induction. Machine Learning, 16:203-225, 1994.
[5] Luc De Raedt. Interactive Theory Revision: An Inductive Logic Programming Approach. Academic Press, London, 1992.
[6] G. Dong and S. Ginsburg. On the decomposition of chain datalog programs into p (left-)linear 1-rule components. Logic Programming, 23:203-236, 1995.
[7] Saso Dzeroski. Inductive logic programming and knowledge discovery in databases. In Usama M. Fayyad, Gregory Piatetsky-Shapiro, Padhraic Smyth, and Ramasamy Uthurusamy, editors, Advances in Knowledge Discovery and Data Mining, chapter 1, pages 117-152. AAAI Press/The MIT Press, Menlo Park, California, 1996.
[8] J.-U. Kietz and S. Wrobel. Controlling the complexity of learning in logic through syntactic and task-oriented models. In Stephen Muggleton, editor, Inductive Logic Programming, number 38 in The A.P.I.C. Series, chapter 16, pages 335-360. Academic Press, London, 1992.
[9] Jörg-Uwe Kietz. Induktive Analyse relationaler Daten. PhD thesis, Technische Universität Berlin, Berlin, October 1996.
[10] Jörg-Uwe Kietz and Saso Dzeroski. Inductive logic programming and learnability. SIGART Bulletin, 5(1):22-32, 1994.
[11] Jörg-Uwe Kietz and Marcus Lübbe. An efficient subsumption algorithm for inductive logic programming. In W. Cohen and H. Hirsh, editors, Proceedings of the 11th International Conference on Machine Learning (ML 94), San Francisco, CA, 1994. Morgan Kaufmann.
[12] Volker Klingspor. GRDT: Enhancing model-based learning for its application in robot navigation. In S. Wrobel, editor, Proc. of the Fourth Intern. Workshop on Inductive Logic Programming, GMD-Studien Nr. 237, pages 107-122, St. Augustin, Germany, 1994. GMD.
[13] Volker Klingspor. Reaktives Planen mit gelernten Begriffen. PhD thesis, Univ. Dortmund, 1998.
[14] P. Laird. Identifying and using patterns in sequential data. In K. Jantke, S. Kobayashi, E. Tomita, and T. Yokomori, editors, Procs. of the 4th Workshop on Algorithmic Learning Theory, pages 1-18. Springer, 1993.
[15] H. Liu and H. Motoda. Feature Extraction, Construction, and Selection: A Data Mining Perspective. Kluwer, 1998.
[16] H. Mannila, H. Toivonen, and A. Verkamo. Discovering frequent episodes in sequences. In Procs. of the 1st Int. Conf. on Knowledge Discovery in Databases and Data Mining. AAAI Press, 1995.
[17] Ryszard S. Michalski and Yves Kodratoff. Research in machine learning: recent progress, classification of methods, and future directions. In Yves Kodratoff and Ryszard Michalski, editors, Machine Learning: An Artificial Intelligence Approach, volume III, chapter 1, pages 3-30. Morgan Kaufmann, Los Altos, CA, 1990.
[18] Katharina Morik and Peter Brockhausen. A multistrategy approach to relational knowledge discovery in databases. Machine Learning Journal, 27(3):287-312, June 1997.
[19] Katharina Morik and Stephanie Wessel. Incremental signal to symbol processing. In K. Morik, M. Kaiser, and V. Klingspor, editors, Making Robots Smarter: Combining Sensing and Action through Robot Learning, chapter 11, pages 185-198. Kluwer Academic Publ., 1999.
[20] Stephen Muggleton and Wray Buntine. Machine invention of first-order predicates by inverting resolution. In Proc. Fifth Intern. Conf. on Machine Learning, Los Altos, CA, 1988. Morgan Kaufmann.
[21] Gordon D. Plotkin. A note on inductive generalization. In B. Meltzer and D. Michie, editors, Machine Intelligence, chapter 8, pages 153-163. American Elsevier, 1970.
[22] Anke Rieger and Volker Klingspor. Program optimization for real-time performance. In K. Morik, V. Klingspor, and M. Kaiser, editors, Making Robots Smarter: Combining Sensing and Action through Robot Learning. Kluwer Academic Press, 1999.
[23] Anke D. Rieger. Program Optimization for Temporal Reasoning within a Logic Programming Framework. PhD thesis, Universität Dortmund, Dortmund, Germany, 1998.
[24] J. Rissanen. Modeling by shortest data description. Automatica, 14:465-471, 1978.
[25] J. Robinson. A machine-oriented logic based on the resolution principle. Journal of the ACM, 12(1):23-41, 1965.
[26] A. Shinohara. Complexity of computing Vapnik-Chervonenkis dimension. In K. Jantke, S. Kobayashi, E. Tomita, and T. Yokomori, editors, Procs. of the 4th Workshop on Algorithmic Learning Theory, pages 279-287. Springer, 1993.
[27] Edgar Sommer. Theory Restructuring: A Perspective on Design & Maintenance of Knowledge Based Systems. PhD thesis, University of Dortmund, 1996.
[28] Irene Stahl. Compression measures in ILP. In Luc De Raedt, editor, Advances in Inductive Logic Programming, pages 295-307. IOS Press, 1996.
[29] Irene Stahl. Predicate invention in inductive logic programming. In Luc De Raedt, editor, Advances in Inductive Logic Programming, pages 34-47. IOS Press, 1996.
[30] H. Tamaki and T. Sato. Unfold and fold transformation of logic programs. In Procs. of the 2nd Int. Conf. on Logic Programming, pages 127-138, 1984.
[31] Vladimir N. Vapnik. The Nature of Statistical Learning Theory. Springer, New York, 1995.
[32] Stefan Wrobel. Concept Formation and Knowledge Revision. Kluwer Academic Publishers, Dordrecht, 1994.
[33] Wei Zhang. A region-based approach to discovering temporal structures in data. In Ivan Bratko and Saso Dzeroski, editors, Proc. of the 16th Int. Conf. on Machine Learning, pages 484-492. Morgan Kaufmann, 1999.
Theoretical Views of Boosting and Applications

Robert E. Schapire

AT&T Labs - Research, Shannon Laboratory
180 Park Avenue, Room A279, Florham Park, NJ 07932, USA
www.research.att.com/~schapire
Abstract. Boosting is a general method for improving the accuracy of any given learning algorithm. Focusing primarily on the AdaBoost algorithm, we briefly survey theoretical work on boosting including analyses of AdaBoost's training error and generalization error, connections between boosting and game theory, methods of estimating probabilities using boosting, and extensions of AdaBoost for multiclass classification problems. Some empirical work and applications are also described.
Background

Boosting is a general method which attempts to "boost" the accuracy of any given learning algorithm. Kearns and Valiant [29, 30] were the first to pose the question of whether a "weak" learning algorithm which performs just slightly better than random guessing in Valiant's PAC model [44] can be "boosted" into an arbitrarily accurate "strong" learning algorithm. Schapire [36] came up with the first provable polynomial-time boosting algorithm in 1989. A year later, Freund [16] developed a much more efficient boosting algorithm which, although optimal in a certain sense, nevertheless suffered from certain practical drawbacks. The first experiments with these early boosting algorithms were carried out by Drucker, Schapire and Simard [15] on an OCR task.
AdaBoost

The AdaBoost algorithm, introduced in 1995 by Freund and Schapire [22], solved many of the practical difficulties of the earlier boosting algorithms, and is the focus of this paper. Pseudocode for AdaBoost is given in Fig. 1 in the slightly generalized form given by Schapire and Singer [40]. The algorithm takes as input a training set (x1, y1), ..., (xm, ym) where each xi belongs to some domain or instance space X, and each label yi is in some label set Y. For most of this paper, we assume Y = {−1, +1}; later, we discuss extensions to the multiclass case. AdaBoost calls a given weak or base learning algorithm repeatedly in a series of rounds t = 1, ..., T. One of the main ideas of the algorithm is to maintain a distribution or set of weights over the training set. The weight of this distribution on training example i on round t is denoted Dt(i). Initially, all weights are set equally, but on each round, the weights of incorrectly classified

O. Watanabe, T. Yokomori (Eds.): ALT'99, LNAI 1720, pp. 13-25, 1999.
© Springer-Verlag Berlin Heidelberg 1999
Given: (x1, y1), ..., (xm, ym) where xi ∈ X, yi ∈ Y = {−1, +1}.
Initialize D1(i) = 1/m.
For t = 1, ..., T:
- Train weak learner using distribution Dt.
- Get weak hypothesis ht : X → R.
- Choose αt ∈ R.
- Update:
    Dt+1(i) = Dt(i) exp(−αt yi ht(xi)) / Zt
  where Zt is a normalization factor (chosen so that Dt+1 will be a distribution).
Output the final hypothesis:
    H(x) = sign( Σ_{t=1}^{T} αt ht(x) ).

Fig. 1. The boosting algorithm AdaBoost.
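The pseudocode in Fig. 1 can be turned into a compact, self-contained Python sketch. The choices below are my own illustrative assumptions: binary labels, one-dimensional threshold stumps as the weak learner, exhaustive stump search, αt = ½ ln((1 − εt)/εt), and a made-up toy dataset.

```python
import math

def adaboost(X, y, T=10):
    """AdaBoost (Fig. 1) with 1-D threshold stumps as weak learners.
    X: list of floats, y: list of labels in {-1, +1}."""
    m = len(X)
    D = [1.0 / m] * m                        # initial distribution D_1
    hypotheses = []                          # list of (threshold, sign, alpha)
    for _ in range(T):
        # Weak learner: pick the stump h(x) = s*sign(x - theta) minimizing
        # the weighted error under D (exhaustive search over thresholds).
        best = None
        for theta in X:
            for s in (-1, 1):
                err = sum(D[i] for i in range(m)
                          if s * (1 if X[i] > theta else -1) != y[i])
                if best is None or err < best[0]:
                    best = (err, theta, s)
        eps, theta, s = best
        eps = min(max(eps, 1e-10), 1 - 1e-10)    # guard against division by 0
        alpha = 0.5 * math.log((1 - eps) / eps)  # alpha_t of Eq. (1) below
        hypotheses.append((theta, s, alpha))
        # Reweight: increase weight on misclassified examples, normalize by Z_t.
        h = [s * (1 if x > theta else -1) for x in X]
        D = [D[i] * math.exp(-alpha * y[i] * h[i]) for i in range(m)]
        Z = sum(D)
        D = [d / Z for d in D]
    def H(x):
        f = sum(a * s * (1 if x > theta else -1) for theta, s, a in hypotheses)
        return 1 if f >= 0 else -1
    return H

X = [0.1, 0.4, 0.35, 0.8, 0.9, 0.6]          # toy data, not linearly separable
y = [-1, -1, 1, 1, 1, -1]                    # by any single stump
H = adaboost(X, y, T=3)
train_err = sum(H(x) != yy for x, yy in zip(X, y)) / len(X)
print(train_err)  # 0.0
```

No single stump classifies this toy set perfectly, yet the weighted vote of three stumps does, which is exactly the reweighting effect described in the text.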
examples are increased so that the weak learner is forced to focus on the hard examples in the training set.

The weak learner's job is to find a weak hypothesis ht : X → R appropriate for the distribution Dt. In the simplest case, the range of each ht is binary, i.e., restricted to {−1, +1}; the weak learner's job then is to minimize the error

    εt = Pr_{i∼Dt}[ht(xi) ≠ yi].

Once the weak hypothesis ht has been received, AdaBoost chooses a parameter αt ∈ R which intuitively measures the importance that it assigns to ht. In the figure, we have deliberately left the choice of αt unspecified. For binary ht, we typically set

    αt = (1/2) ln((1 − εt)/εt).   (1)

More on choosing αt follows below. The distribution Dt is then updated using the rule shown in the figure. The final hypothesis H is a weighted majority vote of the T weak hypotheses where αt is the weight assigned to ht.
Analyzing the Training Error

The most basic theoretical property of AdaBoost concerns its ability to reduce the training error. Specifically, Schapire and Singer [40], in generalizing a theorem of Freund and Schapire [22], show that the training error of the final hypothesis is bounded as follows:

    (1/m) |{i : H(xi) ≠ yi}| ≤ (1/m) Σ_i exp(−yi f(xi)) = Π_t Zt   (2)
where f(x) = Σ_t αt ht(x), so that H(x) = sign(f(x)). The inequality follows from the fact that e^{−yi f(xi)} ≥ 1 if yi ≠ H(xi). The equality can be proved straightforwardly by unraveling the recursive definition of Dt.

Eq. (2) suggests that the training error can be reduced most rapidly (in a greedy way) by choosing αt and ht on each round to minimize

    Zt = Σ_i Dt(i) exp(−αt yi ht(xi)).

In the case of binary hypotheses, this leads to the choice of αt given in Eq. (1) and gives a bound on the training error of

    Π_t 2√(εt(1 − εt)) = Π_t √(1 − 4γt²) ≤ exp(−2 Σ_t γt²)

where εt = 1/2 − γt. This bound was first proved by Freund and Schapire [22]. Thus, if each weak hypothesis is slightly better than random so that γt is bounded away from zero, then the training error drops exponentially fast. This bound, combined with the bounds on generalization error given below, proves that AdaBoost is indeed a boosting algorithm in the sense that it can efficiently convert a weak learning algorithm (which can always generate a hypothesis with a weak edge for any distribution) into a strong learning algorithm (which can generate a hypothesis with an arbitrarily low error rate, given sufficient data).

Eq. (2) points to the fact that, at heart, AdaBoost is a procedure for finding a linear combination f of weak hypotheses which attempts to minimize

    Σ_i exp(−yi f(xi)) = Σ_i exp(−yi Σ_t αt ht(xi)).   (3)

Essentially, on each round, AdaBoost chooses ht (by calling the weak learner) and then sets αt to add one more term to the accumulating weighted sum of weak hypotheses in such a way that the sum of exponentials above will be maximally reduced. In other words, AdaBoost is doing a kind of steepest descent search to minimize Eq. (3) where the search is constrained at each step to follow coordinate directions (where we identify coordinates with the weights assigned to weak hypotheses). Schapire and Singer [40] discuss the choice of αt and ht in the case that ht is real-valued (rather than binary). In this case, ht(x) can be interpreted as a confidence-rated prediction in which the sign of ht(x) is the predicted label, while the magnitude |ht(x)| gives a measure of confidence.
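As a quick numeric illustration (my own, with made-up error rates εt, not data from the paper), the inequality Π_t 2√(εt(1 − εt)) ≤ exp(−2 Σ_t γt²) can be checked directly:

```python
import math

# Hypothetical weak-hypothesis error rates, each slightly better than random.
eps = [0.30, 0.35, 0.40, 0.42, 0.45]
gammas = [0.5 - e for e in eps]            # edges gamma_t = 1/2 - eps_t

# With the alpha_t of Eq. (1): Z_t = 2*sqrt(eps_t*(1-eps_t)) = sqrt(1-4*gamma_t^2).
Z = [2 * math.sqrt(e * (1 - e)) for e in eps]
product_bound = math.prod(Z)               # training-error bound of Eq. (2)
exp_bound = math.exp(-2 * sum(g * g for g in gammas))

print(product_bound, exp_bound)
assert product_bound <= exp_bound          # the inequality stated in the text
```

Even with edges this small, the product of the Zt already falls below 0.85 after five rounds; with any fixed positive edge it decays exponentially in T, as the text states. (`math.prod` requires Python 3.8 or later.)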
Generalization Error

Freund and Schapire [22] showed how to bound the generalization error of the final hypothesis in terms of its training error, the size m of the sample, the VC-dimension d of the weak hypothesis space and the number of rounds T of boosting. Specifically, they used techniques from Baum and Haussler [4] to show that the generalization error, with high probability, is at most

    P̂r[H(x) ≠ y] + Õ(√(Td/m))

where P̂r[·] denotes empirical probability on the training sample. This bound suggests that boosting will overfit if run for too many rounds, i.e., as T becomes large. In fact, this sometimes does happen. However, in early experiments, several authors [8, 14, 34] observed empirically that boosting often does not overfit, even when run for thousands of rounds. Moreover, it was observed that AdaBoost would sometimes continue to drive down the generalization error long after the training error had reached zero, clearly contradicting the spirit of the bound above. For instance, the left side of Fig. 2 shows the training and test curves of running boosting on top of Quinlan's C4.5 decision-tree learning algorithm [35] on the "letter" dataset.

In response to these empirical findings, Schapire et al. [39], following the work of Bartlett [2], gave an alternative analysis in terms of the margins of the training examples. The margin of example (x, y) is defined to be

    margin(x, y) = y Σ_t αt ht(x) / Σ_t |αt|.

It is a number in [−1, +1] which is positive if and only if H correctly classifies the example. Moreover, as before, the magnitude of the margin can be interpreted as a measure of confidence in the prediction. Schapire et al. proved that larger margins on the training set translate into a superior upper bound on the generalization error. Specifically, the generalization error is at most

    P̂r[margin_f(x, y) ≤ θ] + Õ(√(d/(mθ²)))

for any θ > 0 with high probability. Note that this bound is entirely independent of T, the number of rounds of boosting. In addition, Schapire et al. proved that boosting is particularly aggressive at reducing the margin (in a quantifiable sense) since it concentrates on the examples with the smallest margins (whether positive or negative). Boosting's effect on the margins can be seen empirically, for instance, on the right side of Fig. 2 which shows the cumulative distribution of margins of the training examples on the "letter" dataset. In this case, even
Fig. 2. Error curves and the margin distribution graph for boosting C4.5 on the "letter" dataset as reported by Schapire et al. [39]. Left: the training and test error curves (lower and upper curves, respectively) of the combined classifier as a function of the number of rounds of boosting. The horizontal lines indicate the test error rate of the base classifier as well as the test error of the final combined classifier. Right: The cumulative distribution of margins of the training examples after 5, 100 and 1000 iterations, indicated by short-dashed, long-dashed (mostly hidden) and solid curves, respectively.

after the training error reaches zero, boosting continues to increase the margins of the training examples, effecting a corresponding drop in the test error. Attempts (not always successful) to use the insights gleaned from the theory of margins have been made by several authors [6, 26, 32]. In addition, the margin theory points to a strong connection between boosting and the support-vector machines of Vapnik and others [5, 11, 45] which explicitly attempt to maximize the minimum margin.
A Connection to Game Theory

The behavior of AdaBoost can also be understood in a game-theoretic setting as explored by Freund and Schapire [21, 23] (see also Grove and Schuurmans [26] and Breiman [7]). In classical game theory, it is possible to put any two-person, zero-sum game in the form of a matrix M. To play the game, one player chooses a row i and the other player chooses a column j. The loss to the row player (which is the same as the payoff to the column player) is Mij. More generally, the two sides may play randomly, choosing distributions P and Q over rows or columns, respectively. The expected loss then is P^T M Q.

Boosting can be viewed as repeated play of a particular game matrix. Assume that the weak hypotheses are binary, and let H = {h1, ..., hn} be the entire weak hypothesis space (which we assume for now to be finite). For a fixed training set (x1, y1), ..., (xm, ym), the game matrix M has m rows and n columns where

    Mij = 1 if hj(xi) = yi, and 0 otherwise.
The row player now is the boosting algorithm, and the column player is the weak learner. The boosting algorithm's choice of a distribution Dt over training examples becomes a distribution P over rows of M, while the weak learner's choice of a weak hypothesis ht becomes the choice of a column j of M.

As an example of the connection between boosting and game theory, consider von Neumann's famous minmax theorem which states that

    max_P min_Q P^T M Q = min_Q max_P P^T M Q

for any matrix M. When applied to the matrix just defined and reinterpreted in the boosting setting, this can be shown to have the following meaning: If, for any distribution over examples, there exists a weak hypothesis with error at most 1/2 − γ, then there exists a convex combination of weak hypotheses with a margin of at least 2γ on all training examples. AdaBoost seeks to find such a final hypothesis with high margin on all examples by combining many weak hypotheses; so in a sense, the minmax theorem tells us that AdaBoost at least has the potential for success since, given a good weak learner, there must exist a good combination of weak hypotheses.

Going much further, AdaBoost can be shown to be a special case of a more general algorithm for playing repeated games, or for approximately solving matrix games. This shows that, asymptotically, the distribution over training examples as well as the weights over weak hypotheses in the final hypothesis have game-theoretic interpretations as approximate minmax or maxmin strategies.
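The repeated-game view can be illustrated with a generic multiplicative-weights sketch for approximately solving a small matrix game. This is my own illustration of the general idea (a row player updating weights against a best-responding column player), not the specific repeated-games algorithm of Freund and Schapire.

```python
import math

def approx_solve(M, T=2000, eta=0.1):
    """Row player plays multiplicative weights against a best-response
    column player; the average loss approximates the value of the game."""
    m, n = len(M), len(M[0])
    w = [1.0] * m
    avg_loss = 0.0
    for _ in range(T):
        total = sum(w)
        P = [wi / total for wi in w]
        # Column player best-responds: pick the column maximizing row's loss.
        j = max(range(n), key=lambda c: sum(P[i] * M[i][c] for i in range(m)))
        avg_loss += sum(P[i] * M[i][j] for i in range(m)) / T
        # Row player downweights rows that suffered loss in column j.
        w = [w[i] * math.exp(-eta * M[i][j]) for i in range(m)]
    return avg_loss

# Matching pennies: the value of this game is 0.5.
M = [[1.0, 0.0],
     [0.0, 1.0]]
print(approx_solve(M))  # close to 0.5
```

In the boosting analogy, the row weights play the role of the distribution over training examples and the column best-response plays the role of the weak learner, which is why the averaged strategies acquire the approximate minmax/maxmin interpretation mentioned above.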
Estimating Probabilities

Classification generally is the problem of predicting the label y of an example x with the intention of minimizing the probability of an incorrect prediction. However, it is often useful to estimate the probability of a particular label. Recently, Friedman, Hastie and Tibshirani [24] suggested a method for using the output of AdaBoost to make reasonable estimates of such probabilities. Specifically, they suggest using a logistic function, and estimating

    Pr_f[y = +1 | x] = e^{f(x)} / (e^{f(x)} + e^{−f(x)})   (4)

where, as usual, f(x) is the weighted average of weak hypotheses produced by AdaBoost. The rationale for this choice is the close connection between the log loss (negative log likelihood) of such a model, namely,

    Σ_i ln(1 + e^{−2 yi f(xi)})   (5)

and the function which, we have already noted, AdaBoost attempts to minimize:

    Σ_i e^{−yi f(xi)}.   (6)
Specifically, it can be verified that Eq. (5) is upper bounded by Eq. (6). In addition, if we add the constant 1 − ln 2 to Eq. (5) (which does not affect its minimization), then it can be verified that the resulting function and the one in Eq. (6) have identical Taylor expansions around zero up to second order; thus, their behavior near zero is very similar. Finally, it can be shown that, for any distribution over pairs (x, y), the expectations

    E[ln(1 + e^{−2 y f(x)})]   and   E[e^{−y f(x)}]

are minimized by the same function f, namely,

    f(x) = (1/2) ln(Pr[y = +1 | x] / Pr[y = −1 | x]).

Thus, for all these reasons, minimizing Eq. (6), as is done by AdaBoost, can be viewed as a method of approximately minimizing the negative log likelihood given in Eq. (5). Therefore, we may expect Eq. (4) to give a reasonable probability estimate. Friedman, Hastie and Tibshirani also make other connections between AdaBoost, logistic regression and additive models.
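These relations are easy to check numerically. The sketch below (my own, scanning a few arbitrary values of the margin z = y·f(x)) evaluates the logistic estimate of Eq. (4) and verifies that the per-example log loss of Eq. (5) never exceeds the per-example exponential loss of Eq. (6):

```python
import math

def prob_plus(f_x):
    """Eq. (4): logistic estimate of Pr[y = +1 | x] from AdaBoost's f(x)."""
    return math.exp(f_x) / (math.exp(f_x) + math.exp(-f_x))

for z in [-2.0, -0.5, 0.0, 0.5, 2.0]:          # z stands for y * f(x)
    log_loss = math.log(1 + math.exp(-2 * z))  # per-example term of Eq. (5)
    exp_loss = math.exp(-z)                    # per-example term of Eq. (6)
    assert log_loss <= exp_loss + 1e-12        # Eq. (5) <= Eq. (6), pointwise

print(prob_plus(0.0))  # 0.5: a zero score gives an even probability estimate
```

Note that Eq. (4) simplifies to 1/(1 + e^{−2f(x)}), i.e., a logistic function of twice the boosting score, which is where the factor 2 in Eq. (5) comes from.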
Multiclass Classification

There are several methods of extending AdaBoost to the multiclass case. The most straightforward generalization [22], called AdaBoost.M1, is adequate when the weak learner is strong enough to achieve reasonably high accuracy, even on the hard distributions created by AdaBoost. However, this method fails if the weak learner cannot achieve at least 50% accuracy when run on these hard distributions. For the latter case, several more sophisticated methods have been developed. These generally work by reducing the multiclass problem to a larger binary problem. Schapire and Singer's [40] algorithm AdaBoost.MH works by creating a set of binary problems, for each example x and each possible label y, of the form: "For example x, is the correct label y or is it one of the other labels?" Freund and Schapire's [22] algorithm AdaBoost.M2 (which is a special case of Schapire and Singer's [40] AdaBoost.MR algorithm) instead creates binary problems, for each example x with correct label y and each incorrect label y′, of the form: "For example x, is the correct label y or y′?" These methods require additional effort in the design of the weak learning algorithm. A different technique [37], which incorporates Dietterich and Bakiri's [13] method of error-correcting output codes, achieves similar provable bounds to those of AdaBoost.MH and AdaBoost.M2, but can be used with any weak learner which can handle simple, binary labeled data. Schapire and Singer [40] give yet another method of combining boosting with error-correcting output codes.
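The one-against-all reduction behind AdaBoost.MH can be sketched in a few lines. The (example, label) pair encoding below is my own illustrative layout, not the paper's notation:

```python
# Illustrative sketch of the AdaBoost.MH-style reduction: each multiclass
# example (x, y) becomes one binary example per possible label, asking
# "for x, is this the correct label?"

def mh_reduction(examples, labels):
    """examples: list of (x, y) with y in labels.
    Returns binary examples ((x, label), +1/-1): +1 iff label is correct."""
    return [((x, lab), 1 if lab == y else -1)
            for (x, y) in examples
            for lab in labels]

examples = [("img1", "cat"), ("img2", "dog")]
labels = ["cat", "dog", "bird"]
binary = mh_reduction(examples, labels)
print(len(binary))  # 6: two examples times three labels
```

Binary AdaBoost can then be run on these (x, label) pairs; the cost is that the derived training set grows by a factor of the number of classes.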
Fig. 3. Comparison of C4.5 versus boosting stumps and boosting C4.5 on a set of 27 benchmark problems as reported by Freund and Schapire [20]. Each point in each scatterplot shows the test error rate of the two competing algorithms on a single benchmark. The y-coordinate of each point gives the test error rate (in percent) of C4.5 on the given benchmark, and the x-coordinate gives the error rate of boosting stumps (left plot) or boosting C4.5 (right plot). All error rates have been averaged over multiple runs.
Experiments and Applications

Practically, AdaBoost has many advantages. It is fast, simple and easy to program. It has no parameters to tune (except for the number of rounds T). It requires no prior knowledge about the weak learner and so can be flexibly combined with any method for finding weak hypotheses. Finally, it comes with a set of theoretical guarantees given sufficient data and a weak learner that can reliably provide only moderately accurate weak hypotheses. This is a shift in mind set for the learning-system designer: instead of trying to design a learning algorithm that is accurate over the entire space, we can instead focus on finding weak learning algorithms that only need to be better than random.

On the other hand, some caveats are certainly in order. The actual performance of boosting on a particular problem is clearly dependent on the data and the weak learner. Consistent with theory, boosting can fail to perform well given insufficient data, overly complex weak hypotheses or weak hypotheses which are too weak. Boosting seems to be especially susceptible to noise [12] (more on this later).

AdaBoost has been tested empirically by many researchers, including [3, 12, 14, 28, 31, 34, 43]. For instance, Freund and Schapire [20] tested AdaBoost on a set of UCI benchmark datasets [33] using C4.5 [35] as a weak learning algorithm, as well as an algorithm which finds the best decision stump or
Fig. 4. Comparison of error rates for AdaBoost and four other text categorization methods (naive Bayes, probabilistic TF-IDF, Rocchio and sleeping-experts) as reported by Schapire and Singer [41]. The algorithms were tested on two text corpora, Reuters newswire articles (left) and AP newswire headlines (right), and with varying numbers of class labels as indicated on the x-axis of each figure.
single-test decision tree. Some of the results of these experiments are shown in Fig. 3. As can be seen from this figure, even boosting the weak decision stumps can usually give as good results as C4.5, while boosting C4.5 generally gives the decision-tree algorithm a significant improvement in performance.

In another set of experiments, Schapire and Singer [41] used boosting for text categorization tasks. For this work, weak hypotheses were used which test on the presence or absence of a word or phrase. Some results of these experiments comparing AdaBoost to four other methods are shown in Fig. 4. In nearly all of these experiments and for all of the performance measures tested, boosting performed as well or significantly better than the other methods tested. Boosting has also been applied to text filtering [42], ranking problems [18] and classification problems arising in natural language processing [1, 27].

The final hypothesis produced by AdaBoost when used, for instance, with a decision-tree weak learning algorithm, can be extremely complex and difficult to comprehend. With greater care, a more human-understandable final hypothesis can be obtained using boosting. Cohen and Singer [10] showed how to design a weak learning algorithm which, when combined with AdaBoost, results in a final hypothesis consisting of a relatively small set of rules similar to those generated by systems like RIPPER [9], IREP [25] and C4.5rules [35]. Cohen and Singer's system, called SLIPPER, is fast, accurate and produces quite compact rule sets. In other work, Freund and Mason [19] showed how to apply boosting to learn a generalization of decision trees called alternating trees. Their algorithm produces a single alternating tree rather than an ensemble of trees as would be obtained by running AdaBoost on top of a decision-tree learning algorithm. On the other hand, their learning algorithm achieves error rates comparable to those of a whole ensemble of trees.
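The reweighting behavior described above (concentrating weight on hard examples, then taking a weighted vote of weak hypotheses) can be sketched in a few lines. The following is a toy illustration of boosting decision stumps, not the experimental code of Freund and Schapire [20]; the dataset and all names are invented for the example.

```python
import numpy as np

def adaboost_stumps(X, y, T):
    """Toy AdaBoost with decision stumps (labels y in {-1, +1})."""
    m = len(y)
    D = np.full(m, 1.0 / m)                    # example weights
    hyps = []
    for _ in range(T):
        best = None
        # weak learner: exhaustively pick the stump minimizing weighted error
        for j in range(X.shape[1]):
            for thr in np.unique(X[:, j]):
                for s in (1, -1):
                    pred = s * np.where(X[:, j] <= thr, 1, -1)
                    err = D[pred != y].sum()
                    if best is None or err < best[0]:
                        best = (err, j, thr, s, pred)
        err, j, thr, s, pred = best
        err = min(max(err, 1e-10), 1 - 1e-10)  # guard the logarithm
        alpha = 0.5 * np.log((1 - err) / err)  # vote of this weak hypothesis
        D = D * np.exp(-alpha * y * pred)
        D = D / D.sum()                        # misclassified examples gain weight
        hyps.append((j, thr, s, alpha))

    def combined(Xq):                          # weighted-majority final hypothesis
        score = sum(a * s * np.where(Xq[:, j] <= thr, 1, -1)
                    for j, thr, s, a in hyps)
        return np.sign(score)
    return combined

# tiny sanity run on a 1-D threshold concept
X = np.array([[0.1], [0.3], [0.7], [0.9]])
y = np.array([1, 1, -1, -1])
H = adaboost_stumps(X, y, T=10)
```

After training, `H` classifies the training points by the sign of the weighted vote.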
Robert E. Schapire
[Figure 5: three rows of six handwritten-digit images each, annotated with lines such as 4:1/0.27,4/0.17 and 9:7/0.19,9/0.19.]
Fig. 5. A sample of the examples that have the largest weight on an OCR task as reported by Freund and Schapire [20]. These examples were chosen after 4 rounds of boosting (top line), 12 rounds (middle) and 25 rounds (bottom). Underneath each image is a line of the form d:l1/w1,l2/w2, where d is the label of the example, l1 and l2 are the labels that get the highest and second highest vote from the combined hypothesis at that point in the run of the algorithm, and w1, w2 are the corresponding normalized scores.
A nice property of AdaBoost is its ability to identify outliers, i.e., examples that are either mislabeled in the training data, or which are inherently ambiguous and hard to categorize. Because AdaBoost focuses its weight on the hardest examples, the examples with the highest weight often turn out to be outliers. An example of this phenomenon can be seen in Fig. 5, taken from an OCR experiment conducted by Freund and Schapire [20]. When the number of outliers is very large, the emphasis placed on the hard examples can become detrimental to the performance of AdaBoost. This was demonstrated very convincingly by Dietterich [12]. Friedman et al. [24] suggested a variant of AdaBoost, called Gentle AdaBoost, which puts less emphasis on outliers. In recent work, Freund [17] suggested another algorithm, called BrownBoost, which takes a more radical approach that de-emphasizes outliers when it seems clear that they are too hard to classify correctly. This algorithm is an adaptive version of Freund's [16] boost-by-majority algorithm. This work, together with Schapire's [38] work on drifting games, reveals some interesting new relationships between boosting, Brownian motion and repeated games, while raising many new open problems and directions for future research.
References

[1] Steven Abney, Robert E. Schapire, and Yoram Singer. Boosting applied to tagging and PP attachment. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, 1999.
[2] Peter L. Bartlett. The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network. IEEE Transactions on Information Theory, 44(2):525-536, March 1998.
[3] Eric Bauer and Ron Kohavi. An empirical comparison of voting classification algorithms: Bagging, boosting, and variants. Machine Learning, to appear.
[4] Eric B. Baum and David Haussler. What size net gives valid generalization? Neural Computation, 1(1):151-160, 1989.
[5] Bernhard E. Boser, Isabelle M. Guyon, and Vladimir N. Vapnik. A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual ACM Workshop on Computational Learning Theory, pages 144-152, 1992.
[6] Leo Breiman. Arcing the edge. Technical Report 486, Statistics Department, University of California at Berkeley, 1997.
[7] Leo Breiman. Prediction games and arcing classifiers. Technical Report 504, Statistics Department, University of California at Berkeley, 1997.
[8] Leo Breiman. Arcing classifiers. The Annals of Statistics, 26(3):801-849, 1998.
[9] William Cohen. Fast effective rule induction. In Proceedings of the Twelfth International Conference on Machine Learning, pages 115-123, 1995.
[10] William W. Cohen and Yoram Singer. A simple, fast, and effective rule learner. In Proceedings of the Sixteenth National Conference on Artificial Intelligence, 1999.
[11] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20(3):273-297, September 1995.
[12] Thomas G. Dietterich. An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization. Machine Learning, to appear.
[13] Thomas G. Dietterich and Ghulum Bakiri. Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research, 2:263-286, January 1995.
[14] Harris Drucker and Corinna Cortes. Boosting decision trees. In Advances in Neural Information Processing Systems 8, pages 479-485, 1996.
[15] Harris Drucker, Robert Schapire, and Patrice Simard. Boosting performance in neural networks. International Journal of Pattern Recognition and Artificial Intelligence, 7(4):705-719, 1993.
[16] Yoav Freund. Boosting a weak learning algorithm by majority. Information and Computation, 121(2):256-285, 1995.
[17] Yoav Freund. An adaptive version of the boost by majority algorithm. In Proceedings of the Twelfth Annual Conference on Computational Learning Theory, 1999.
[18] Yoav Freund, Raj Iyer, Robert E. Schapire, and Yoram Singer. An efficient boosting algorithm for combining preferences. In Machine Learning: Proceedings of the Fifteenth International Conference, 1998.
[19] Yoav Freund and Llew Mason. The alternating decision tree learning algorithm. In Machine Learning: Proceedings of the Sixteenth International Conference, 1999. To appear.
[20] Yoav Freund and Robert E. Schapire. Experiments with a new boosting algorithm. In Machine Learning: Proceedings of the Thirteenth International Conference, pages 148-156, 1996.
[21] Yoav Freund and Robert E. Schapire. Game theory, on-line prediction and boosting. In Proceedings of the Ninth Annual Conference on Computational Learning Theory, pages 325-332, 1996.
[22] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119-139, August 1997.
[23] Yoav Freund and Robert E. Schapire. Adaptive game playing using multiplicative weights. Games and Economic Behavior, to appear.
[24] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Additive logistic regression: a statistical view of boosting. Technical Report, 1998.
[25] Johannes Fürnkranz and Gerhard Widmer. Incremental reduced error pruning. In Machine Learning: Proceedings of the Eleventh International Conference, pages 70-77, 1994.
[26] Adam J. Grove and Dale Schuurmans. Boosting in the limit: Maximizing the margin of learned ensembles. In Proceedings of the Fifteenth National Conference on Artificial Intelligence, 1998.
[27] Masahiko Haruno, Satoshi Shirai, and Yoshifumi Ooyama. Using decision trees to construct a practical parser. Machine Learning, 34:131-149, 1999.
[28] Jeffrey C. Jackson and Mark W. Craven. Learning sparse perceptrons. In Advances in Neural Information Processing Systems 8, pages 654-660, 1996.
[29] Michael Kearns and Leslie G. Valiant. Learning Boolean formulae or finite automata is as hard as factoring. Technical Report TR-14-88, Harvard University Aiken Computation Laboratory, August 1988.
[30] Michael Kearns and Leslie G. Valiant. Cryptographic limitations on learning Boolean formulae and finite automata. Journal of the Association for Computing Machinery, 41(1):67-95, January 1994.
[31] Richard Maclin and David Opitz. An empirical evaluation of bagging and boosting. In Proceedings of the Fourteenth National Conference on Artificial Intelligence, pages 546-551, 1997.
[32] Llew Mason, Peter Bartlett, and Jonathan Baxter. Direct optimization of margins improves generalization in combined classifiers. Technical report, Department of Systems Engineering, Australian National University, 1998.
[33] C. J. Merz and P. M. Murphy. UCI repository of machine learning databases, 1999. www.ics.uci.edu/~mlearn/MLRepository.html.
[34] J. R. Quinlan. Bagging, boosting, and C4.5. In Proceedings of the Thirteenth National Conference on Artificial Intelligence, pages 725-730, 1996.
[35] J. Ross Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
[36] Robert E. Schapire. The strength of weak learnability. Machine Learning, 5(2):197-227, 1990.
[37] Robert E. Schapire. Using output codes to boost multiclass learning problems. In Machine Learning: Proceedings of the Fourteenth International Conference, pages 313-321, 1997.
[38] Robert E. Schapire. Drifting games. In Proceedings of the Twelfth Annual Conference on Computational Learning Theory, 1999.
[39] Robert E. Schapire, Yoav Freund, Peter Bartlett, and Wee Sun Lee. Boosting the margin: A new explanation for the effectiveness of voting methods. The Annals of Statistics, 26(5):1651-1686, October 1998.
[40] Robert E. Schapire and Yoram Singer. Improved boosting algorithms using confidence-rated predictions. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, pages 80-91, 1998. To appear, Machine Learning.
[41] Robert E. Schapire and Yoram Singer. BoosTexter: A boosting-based system for text categorization. Machine Learning, to appear.
[42] Robert E. Schapire, Yoram Singer, and Amit Singhal. Boosting and Rocchio applied to text filtering. In SIGIR '98: Proceedings of the 21st Annual International Conference on Research and Development in Information Retrieval, 1998.
[43] Holger Schwenk and Yoshua Bengio. Training methods for adaptive boosting of neural networks. In Advances in Neural Information Processing Systems 10, pages 647-653, 1998.
[44] L. G. Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134-1142, November 1984.
[45] Vladimir N. Vapnik. The Nature of Statistical Learning Theory. Springer, 1995.
Extended Stochastic Complexity and Minimax Relative Loss Analysis

Kenji Yamanishi

Theory NEC Laboratory, Real World Computing Partnership, C&C Media Research Laboratories, NEC Corporation, 4-1-1 Miyazaki, Miyamae-ku, Kawasaki, Kanagawa 216-8555, Japan, yaman s @ccm.cl.nec.co.jp
Abstract. We are concerned with the problem of sequential prediction using a given hypothesis class of continuously-many prediction strategies. An effective performance measure is the minimax relative cumulative loss (RCL), which is the minimum of the worst-case difference between the cumulative loss for any prediction algorithm and that for the best assignment in a given hypothesis class. The purpose of this paper is to evaluate the minimax RCL for general continuous hypothesis classes under general losses. We first derive asymptotical upper and lower bounds on the minimax RCL to show that they match (k/2c) ln m within error of o(ln m), where k is the dimension of parameters for the hypothesis class, m is the sample size, and c is the constant depending on the loss function. We thereby show that the cumulative loss attaining the minimax RCL asymptotically coincides with the extended stochastic complexity (ESC), which is an extension of Rissanen's stochastic complexity (SC) into the decision-theoretic scenario. We further derive non-asymptotical upper bounds on the minimax RCL both for parametric and nonparametric hypothesis classes. We apply the analysis to the regression problem to derive the least worst-case cumulative loss bounds to date.
1 Introduction

1.1 Minimax Regret

We start with the minimax regret analysis for the sequential stochastic prediction problem. Let Y be an alphabet, which can be either discrete or continuous. We first consider the simplest case where Y is finite. Observe a sequence y1 y2 ..., where each yt (t = 1, 2, ...) takes a value in Y. A stochastic prediction algorithm A performs as follows: at each round t = 1, 2, ..., n, A assigns a probability mass function over Y based on the past sequence y^{t-1} = y1 ... y_{t-1}. The probability mass function can be written as a conditional probability P(.|y^{t-1}). After the assignment, A receives an outcome y_t and suffers a logarithmic loss defined by -ln P(y_t|y^{t-1}). This process goes on sequentially. Note that A is specified by a sequence of conditional probabilities {P(.|y^{t-1}) : t = 1, 2, ...}. After observing a sequence y^m = y1 ... ym of length m, A suffers a cumulative logarithmic loss Σ_{t=1}^m -ln P(y_t|y^{t-1}), where P(.|y^0) = P_0(.) is given.

O. Watanabe, T. Yokomori (Eds.): ALT'99, LNAI 1720, pp. 26-38, 1999.
© Springer-Verlag Berlin Heidelberg 1999
The goal of stochastic prediction is to make the cumulative loss as small as possible. We introduce a reference set of prediction algorithms, which we call a hypothesis class, and then evaluate the cumulative loss for any algorithm relative to it. For sample size m, we define the worst-case regret for A relative to a hypothesis class H by

  R_m(A : H) = sup_{y^m} ( Σ_{t=1}^m -ln P(y_t|y^{t-1}) - inf_{f∈H} Σ_{t=1}^m -ln f(y_t|y^{t-1}) )

which is the worst-case difference between the cumulative logarithmic loss for A and that for the best assignment of a single hypothesis in H. Further, we define the minimax regret for sample size m by

  R_m(H) = inf_A R_m(A : H)

where the infimum is taken over all stochastic prediction algorithms. Note that in the minimax regret analysis we require no statistical assumption on the data-generation mechanism, but consider the worst case with respect to sequences. Notice here that for any m, a stochastic prediction algorithm specifies a joint probability mass function by

  P(y^m) = Π_{t=1}^m P(y_t|y^{t-1})   (1)

Thus the minimax regret can be rewritten as

  R_m(H) = inf_P sup_{y^m} ln ( sup_{f∈H} f(y^m) / P(y^m) )

Shtarkov [9] showed that the minimax regret is attained by the normalized maximum likelihood distribution, defined as follows:

  P(y^m) = sup_{f∈H} f(y^m) / Σ_{y^m} sup_{f∈H} f(y^m)

and then the minimax regret amounts to

  R_m(H) = ln Σ_{y^m} sup_{f∈H} f(y^m)   (2)

Specifically, consider the case where the joint distribution is given by a product of probability mass functions belonging to a parametric hypothesis class Hk = {P(.|θ) : θ ∈ Θ}, where Θ is a k-dimensional compact set in R^k. Let θ̂ be the maximum likelihood estimator (m.l.e.) of θ from y^m (i.e., θ̂ = argmax_{θ∈Θ} P(y^m|θ), where P(y^m|θ) = Π_{t=1}^m P(y_t|θ)). Rissanen [8] proved that, under the condition that the central limit theorem holds for the m.l.e. uniformly over Θ, R_m(Hk) is asymptotically expanded as follows:

  R_m(Hk) = (k/2) ln(m/2π) + ln ∫ √|I(θ)| dθ + o(1)   (3)
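Shtarkov's formula (2) and the expansion (3) can be checked numerically in the simplest parametric case, the Bernoulli class, where k = 1 and the Fisher-information integral is ∫_0^1 dθ/√(θ(1-θ)) = π. The sketch below is my own illustration (function names invented), not code from the paper.

```python
import math

def nml_regret_bernoulli(m):
    # Exact Shtarkov sum for the Bernoulli class:
    # ln sum_{k=0}^m C(m,k) (k/m)^k ((m-k)/m)^(m-k), computed in log-space.
    logs = []
    for k in range(m + 1):
        lc = math.lgamma(m + 1) - math.lgamma(k + 1) - math.lgamma(m - k + 1)
        ll = (k * math.log(k / m) if k > 0 else 0.0) \
           + ((m - k) * math.log((m - k) / m) if k < m else 0.0)
        logs.append(lc + ll)
    mx = max(logs)
    return mx + math.log(sum(math.exp(v - mx) for v in logs))

def rissanen_approx(m):
    # (k/2) ln(m/2pi) + ln \int sqrt(I(theta)) dtheta, with k = 1 and,
    # for the Bernoulli class, the integral equal to pi.
    return 0.5 * math.log(m / (2 * math.pi)) + math.log(math.pi)
```

For m = 100 the exact Shtarkov sum and the asymptotic approximation already differ by well under 0.1, illustrating the o(1) term in (3).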
where I(θ) denotes the Fisher information matrix, defined by I(θ) = (E_θ[-∂² ln P(y|θ)/∂θ_i ∂θ_j])_{i,j}, and o(1) goes to zero uniformly with respect to y^m as m goes to infinity.

For a given sequence y^m and Hk, the negative log-likelihood for y^m under the joint probability mass function that attains the minimax regret is called the stochastic complexity (SC) of y^m (relative to Hk) [6],[7],[8], which we denote as SC(y^m). That is,

  SC(y^m) = -ln P(y^m|θ̂) + (k/2) ln(m/2π) + ln ∫ √|I(θ)| dθ + o(1)   (4)

The SC of y^m can be thought of as an extension of Shannon's information in the sense that the latter measures the information of a data sequence relative to a single hypothesis while the former does so relative to a class of hypotheses.

1.2 Purpose of This Paper and Overview of Results
We extend the minimax regret analysis in two ways. One is to extend it into a general decision-theoretic scenario in which the hypothesis class is a class of real-valued functions rather than a class of probability mass functions, and the prediction loss is measured in terms of general loss functions (e.g., square loss, Hellinger loss) rather than the logarithmic loss. This is motivated by the fact that in real problems such as on-line regression and pattern recognition, prediction should be made deterministically and a variety of loss functions should be used as a distortion measure for prediction. We analyze the minimax relative cumulative loss (RCL), which is an extension of the minimax regret, under general losses.

The minimax RCL has been investigated in the community of computational learning theory, but most of the work is restricted to specific cases: 1) the case where general loss functions are used but the hypothesis class is finite (e.g., [10],[3]), and 2) the case where the hypothesis class is continuous but only specific loss functions such as the square loss and the logarithmic loss are used (e.g., [1],[4],[5],[2],[11]), or only specific hypothesis classes such as the Bernoulli model and the linear regression model (e.g., [2],[11]) are used. This paper offers a universal method of minimax RCL analysis relative to general continuous hypothesis classes under general loss functions. We first derive asymptotical upper and lower bounds on the minimax RCL to show that they match (k/2c) ln m within error of o(ln m), where k is the dimension of parameters for the hypothesis class, m is the sample size, and c is the constant depending on the loss function. According to [12], we introduce the extended stochastic complexity (ESC) to show that the ESC is an approximation to the minimax solution to RCL. The relation between ESC and the minimax RCL is an analogue of that between SC and the minimax regret. This gives a unifying view of the minimax RCL/minimax regret analysis.

The other way of extension is to introduce a method of non-asymptotical analysis, since (3) is an asymptotical result, that is, it is effective only when the sample size is sufficiently large. Since sample size is not necessarily large enough in real situations, theoretical bounds that hold for any sample size would be practically useful.
Non-asymptotical bounds on the minimax regret and RCL were derived in [1],[5],[2],[11] for continuous hypothesis classes under specific losses. We take a new approach to derive non-asymptotical bounds on the minimax RCL both for parametric and non-parametric hypothesis classes under general losses.

The rest of this paper is organized as follows: Section 2 gives a formal definition of the minimax RCL. Section 3 overviews results on asymptotical analysis of the minimax RCL. Section 4 gives non-asymptotical analysis of the minimax RCL. Section 5 shows an application of our analysis to the regression problem.
2 Minimax RCL

For a positive integer n, let X be a subset of R^n, which we call the domain. Let Y = {0,1} or Y = [0,1], which we call the range. Let Z = [0,1] or let Z be a set of probability mass functions over Y. We call Z the decision space. We set D = X × Y. We write an element in D as D = (x, y). Let L : Y × Z → R_{≥0} be a loss function. A sequential prediction algorithm A performs as follows: at each round t = 1, 2, ..., A receives x_t ∈ X, then outputs a predicted result z_t ∈ Z on the basis of D^{t-1} = D_1 ... D_{t-1}, where D_i = (x_i, y_i) (i = 1, ..., t-1). Then A receives the correct outcome y_t ∈ Y and suffers a loss L(y_t, z_t). Hence A defines a sequence of maps {f_t : t = 1, 2, ...}, where f_t(x^t) = z_t. A hypothesis class H is a set of sequential prediction algorithms.

Definition 1. For sample size m and a hypothesis class H, let D^m(H) be a subset of D^m depending on H. For any sequential prediction algorithm A, we define the worst-case relative cumulative loss (RCL) by

  R_m(A : H) = sup_{D^m ∈ D^m(H)} ( Σ_{t=1}^m L(y_t, z_t) - min_{f∈H} Σ_{t=1}^m L(y_t, f_t(x_t)) )

where z_t is the output of A at the t-th round. We define the minimax RCL by

  R_m(H) = inf_A R_m(A : H)   (5)

where the infimum is taken over all sequential prediction algorithms. We assume here that any prediction algorithm knows the sample size m in advance.

Consider the special case where X = Y = {0,1}, Z is the set of probability mass functions over Y, and the loss function is the logarithmic loss L(y, P) = -ln P(y). (Throughout this paper we call this case the probabilistic case.) We can easily check that in the probabilistic case the minimax RCL (5) is equivalent to the minimax regret (3). Hereafter, we consider only the case where Z = [0,1], that is, the prediction is made deterministically. Below we give examples of loss functions for this case.
  L(y, z) = (y - z)²   (square loss)
  L(y, z) = y ln(y/z) + (1 - y) ln((1 - y)/(1 - z))   (entropic loss)
  L(y, z) = (1/2) [ (√y - √z)² + (√(1 - y) - √(1 - z))² ]   (Hellinger loss)
  L(y, z) = (1/2) ( -(2y - 1)(2z - 1) + ln(e^{2z-1} + e^{-2z+1}) + B )   (logistic loss)

where B = ln(1 + e^{-2}).

For a loss function L, we define L0(z) and L1(z) by L0(z) = L(0, z) and L1(z) = L(1, z), respectively. We make the following assumption for L.

Assumption 1 The loss function L satisfies:
1) L0(z) and L1(z) are twice continuously differentiable with respect to z; L0(0) = L1(1) = 0; and for any 0 < z < 1, L0'(z) > 0 and L1'(z) < 0.
2) Define λ by

  λ = [ sup_z ( L0'(z)L1'(z)² - L1'(z)L0'(z)² ) / ( L0'(z)L1''(z) - L1'(z)L0''(z) ) ]^{-1}   (6)

Then 0 < λ < ∞.
3) Let G(y, z, w) = λ(L(y, z) - L(y, w)). For any y, z, w ∈ [0, 1],

  ∂²G(y, z, w)/∂y² + (∂G(y, z, w)/∂y)² ≤ 0.

For example, λ = 1 for the entropic loss, λ = 2 for the square loss, and λ = √2 for the Hellinger loss. In the case of Y = {0, 1} instead of Y = [0, 1], Condition 3) is not necessarily required.
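The constant λ in (6) can be evaluated numerically from the derivatives of L0 and L1. The sketch below is my own check (with closed-form derivatives I worked out for each loss); it recovers the three values quoted above.

```python
import math

# First/second derivatives of L0(z) = L(0,z) and L1(z) = L(1,z),
# worked out by hand for each loss (auxiliary table for this check only).
DERIVS = {
    "square":    (lambda z: 2 * z,                  lambda z: 2.0,
                  lambda z: -2 * (1 - z),           lambda z: 2.0),
    "entropic":  (lambda z: 1 / (1 - z),            lambda z: 1 / (1 - z) ** 2,
                  lambda z: -1 / z,                 lambda z: 1 / z ** 2),
    "hellinger": (lambda z: 0.5 / math.sqrt(1 - z), lambda z: 0.25 / (1 - z) ** 1.5,
                  lambda z: -0.5 / math.sqrt(z),    lambda z: 0.25 / z ** 1.5),
}

def lam(loss, grid=10000):
    """Approximate (6): the reciprocal of the supremum over z of the ratio
    (L0' L1'^2 - L1' L0'^2) / (L0' L1'' - L1' L0''), on an interior grid."""
    L0p, L0pp, L1p, L1pp = DERIVS[loss]
    sup = 0.0
    for i in range(1, grid):
        z = i / grid
        num = L0p(z) * L1p(z) ** 2 - L1p(z) * L0p(z) ** 2
        den = L0p(z) * L1pp(z) - L1p(z) * L0pp(z)
        sup = max(sup, num / den)
    return 1.0 / sup
```

The supremum is attained at z = 1/2 for all three losses, which the grid hits exactly.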
3 Asymptotical Results

According to [12], we introduce the notion of ESC in order to derive upper bounds on the minimax RCL.

Definition 2. Let π be a probability measure on a hypothesis class H. For a given loss function L and a sequence D^m ∈ D^m, we define the extended stochastic complexity (ESC) of D^m relative to H by

  I(D^m : H) = -(1/λ) ln ∫ e^{-λ Σ_{t=1}^m L(y_t, f(x_t))} π(df)   (7)

While SC was defined as the cumulative logarithmic loss that attains the minimax regret, ESC is not defined as the cumulative loss that attains the minimax RCL. This is because it is not possible to get an explicit form of the minimax solution to RCL for general losses. It will turn out in this section, however, that the ESC is a tight upper bound on the cumulative loss that attains the minimax RCL within error of o(ln m).
Lemma 1. Under Assumption 1, there exists a sequential prediction algorithm such that for any D^m, its cumulative loss is upper-bounded by I(D^m : H).

Lemma 1 can be proven using Vovk's aggregating algorithm [10] with its analysis in [3]. In order to make a connection between ESC and the minimax RCL, we focus on the specific case where H is a parametric class such that any prediction algorithm in H is written as a sequence of functions each of which belongs to a parametric class Hk = {f(.|θ) : θ ∈ Θ}, where Θ is a k-dimensional compact set in R^k. In this case the ESC of D^m relative to Hk is written as

  I(D^m : Hk) = -(1/λ) ln ∫ dθ π(θ) e^{-λ Σ_{t=1}^m L(y_t, f(x_t|θ))}   (8)

where π(θ) is a prior probability density over Θ. We make the following assumption for L, Hk and π.

Assumption 2 The following conditions hold for L, Hk and π:
1) Define a matrix J(θ) by

  J(θ) = (1/m) ( ∂² Σ_{t=1}^m L(y_t, f(x_t|θ)) / ∂θ_i ∂θ_j )_{i,j=1,...,k}

and let μ(D^m : θ) be the largest eigenvalue of J(θ). Let d = d_m be a sequence such that d_m > k, lim_{m→∞} d_m = ∞, and lim_{m→∞} (d_m/m) = 0. Let θ̂ be the minimum loss estimator of θ, defined by θ̂ = argmin_{θ∈Θ} Σ_{t=1}^m L(y_t, f(x_t|θ)). Then for some 0 < μ < ∞,

  sup_{D^m} sup_{θ: |θ-θ̂| < (d_m/m)^{1/2}} μ(D^m : θ) ≤ μ

2) Let N_m = {θ ∈ Θ : |θ - θ̂| ≤ √(d_m/m)} and N'_m = {θ ∈ R^k : |θ - θ̂| ≤ √(d_m/m)}. Then for some 0 < r < 1, for all sufficiently large m, vol(N_m) ≥ r vol(N'_m), where vol(S) is the Lebesgue volume of S.
3) For some c > 0, π(θ) ≥ c for any θ ∈ Θ.

The following lemma gives an asymptotical upper bound on the worst-case ESC.

Lemma 2. [12] Let D^m(Θ) be a subset of D^m such that Σ_{t=1}^m ∂L(y_t, f(x_t|θ̂))/∂θ = 0. Under Assumption 2, for all D^m ∈ D^m(Θ) we have

  I(D^m : Hk) ≤ min_{θ∈Θ} Σ_{t=1}^m L(y_t, f(x_t|θ)) + (k/2λ) ln(λμm/2π) + (1/λ) ln(1/rc) + o(1)   (9)

where lim_{m→∞} o(1) = 0 uniformly over D^m(Θ).
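Lemma 1's guarantee can be exercised on a toy instance of the aggregating algorithm: a finite class of constant predictors under the square loss (λ = 2) with a uniform prior, for which the cumulative loss should stay within (1/λ) ln N of the best hypothesis in the class. This is an illustrative sketch under my reading of the substitution step, not the authors' implementation; the expert set and data are invented.

```python
import math, random

def aggregate(experts, outcomes, lam=2.0):
    """Aggregating algorithm for square loss on [0,1], uniform prior over a
    finite set of constant predictors (a toy stand-in for a hypothesis class)."""
    n = len(experts)
    logw = [math.log(1.0 / n)] * n          # log-weights: prior times exp(-lam*loss)
    total = 0.0
    for y in outcomes:
        def g(v):
            # generalized prediction: -(1/lam) ln sum_i w_i e^{-lam L(v, f_i)}
            a = [logw[i] - lam * (experts[i] - v) ** 2 for i in range(n)]
            mx = max(a)
            return -(mx + math.log(sum(math.exp(x - mx) for x in a))) / lam
        pred = min(1.0, max(0.0, 0.5 - (g(1.0) - g(0.0)) / 2))  # substitution fn
        total += (pred - y) ** 2
        for i in range(n):
            logw[i] -= lam * (experts[i] - y) ** 2
    best = min(sum((e - y) ** 2 for y in outcomes) for e in experts)
    return total, best, math.log(n) / lam   # loss, best-in-class loss, (1/lam) ln N

random.seed(0)
ys = [random.random() for _ in range(200)]
total, best, slack = aggregate([i / 10 for i in range(11)], ys)
```

The returned `total` never exceeds `best + slack`, the finite-class analogue of the ESC bound.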
Note that in the probabilistic case λ = 1, hence the right-hand side of (9) coincides with the SC of (4) within error of O(1). The main technique used to prove (9) is Laplace's method, which approximates an integral by a Gaussian integral in the neighborhood of the minimum loss estimator. The proof is sketched as follows:

Proof. For sample size m, let δ_m = (d_m/m)^{1/2} for d_m as in Assumption 2. For a minimum loss estimator θ̂ of θ from D^m, let N_m = {θ ∈ Θ : |θ - θ̂| ≤ δ_m} be a neighborhood of θ̂. We denote Σ_{t=1}^m L(y_t, f(x_t|θ)) as L(D^m : f_θ). Then for all D^m ∈ D^m(Θ), letting θ̄ be the point in N_m that attains the minimum of π(θ),

  I(D^m : Hk) ≤ -(1/λ) ln ∫_{N_m} dθ π(θ) exp(-λ L(D^m : f_θ))
  ≤ L(D^m : f_θ̂) - (1/λ) ln ∫_{N_m} π(θ) exp(-(λm/2)(θ - θ̂)ᵀ J(θ)(θ - θ̂)) dθ
  ≤ L(D^m : f_θ̂) - (1/λ) ln π(θ̄) - (1/λ) ln ∫_{N_m} exp(-(λμm/2) Σ_{i=1}^k (θ_i - θ̂_i)²) dθ   (10)

The notations J(θ) and μ follow Assumption 2. Let N'_m = {θ ∈ R^k : |θ - θ̂| ≤ δ_m}. Then

  ∫_{N_m} exp(-(λμm/2) Σ_{i=1}^k (θ_i - θ̂_i)²) dθ ≥ r ∫_{N'_m} exp(-(λμm/2) Σ_{i=1}^k (θ_i - θ̂_i)²) dθ ≥ r (2π/(λμm))^{k/2} (1 - o(1))   (11)

See [12] for the last inequality. Plugging (11) into (10) and letting d_m go to infinity yields (9).
Combining Lemmas 1 and 2 leads to the following asymptotical upper bound on the minimax RCL.

Theorem 1. Under Assumptions 1 and 2,

  R_m(Hk) ≤ (k/2λ) ln(λμm/2π) + (1/λ) ln(1/rc) + o(1)   (12)

where D^m(H) as in Definition 1 is set to D^m(Θ).
In order to investigate how tight (12) is, we derive an asymptotical lower bound on the minimax RCL.

Theorem 2. [13] When L is the entropic loss or the square loss, under some regularity condition for H,

  R_m(Hk) ≥ ( k/2λ - o(1) ) ln m   (13)

Furthermore, under some regularity conditions for Hk and a general L, in addition to Assumptions 1 and 2, (13) holds. (Note: the current forms of the regularity conditions required for general losses are very complicated [13] and remain to be simplified.)

By Theorems 1 and 2, we have the following corollary relating the ESC to the minimax regret. It is formally summarized as follows:

Corollary 1. Let

  I_m(Hk) = sup_{D^m ∈ D^m(Θ)} ( I(D^m : Hk) - min_{θ∈Θ} Σ_{t=1}^m L(y_t, f(x_t|θ)) )

Then under the conditions as in Theorems 1 and 2,

  lim_{m→∞} ( I_m(Hk) - R_m(Hk) ) / ln m = 0

Corollary 1 shows that the ESC can be thought of as the cumulative loss that attains the minimax RCL within error of o(ln m). It corresponds to the fact that SC is the cumulative logarithmic loss that attains the minimax regret. This gives a rationale that ESC is a natural extension of SC.
4 Non-asymptotical Results

4.1 Log-Loss Case

Bounds (4), (12) and (13) are asymptotical in the sense that they are effective only when the sample size m is sufficiently large. This section derives non-asymptotical upper bounds on the minimax RCL, which might not be tight, but hold for any sample size.

First, we overview the results by Cesa-Bianchi and Lugosi [1] for the probabilistic case under the logarithmic loss. Let (S, d) be a metric space where S is a class of joint probability mass functions each of which is decomposed as in (1) and d is the metric such that

  d(f, g) = ( Σ_{t=1}^m d_t²(f, g) )^{1/2}

where d_t(f, g) = sup_y |ln f(y_t|y^{t-1}) - ln g(y_t|y^{t-1})|.
For T ⊆ S and ε > 0, let N(T, ε) be the cardinality of the smallest subset T' ⊆ S such that for all f ∈ T, d(f, g) ≤ ε for some g ∈ T'. The following theorem shows a non-asymptotical upper bound on the minimax regret.

Theorem 3. [1] For any class H,

  R_m(H) ≤ inf_{ε>0} ( ln N(H, ε) + 12 ∫_0^ε (ln N(H, δ))^{1/2} dδ )

In deriving the above bound we require no regularity condition for H such as the central-limit-theorem condition required for (3) to be satisfied. It actually holds regardless of whether H is parametric or non-parametric. Theorem 3 leads, as a special case, to the following non-asymptotical upper bound on the minimax regret for parametric hypothesis classes.

Corollary 2. [1] Consider a class H such that for some positive constants k and c, for all ε > 0,

  ln N(H, ε) ≤ k ln(c√m/ε)   (14)

Then for m ≥ (288 ln(c√m))/(kc²), we have

  R_m(H) ≤ (k/2) ln m + (k/2) ln( c² ln(c√m) / k ) + 5k   (15)
Note that Condition (14) holds for most classes parametrized smoothly by a bounded subset of R^k.

4.2 General-Loss Cases
Next, for the decision-theoretic case under general losses other than the logarithmic loss, we derive non-asymptotical upper bounds on the minimax RCL in a different manner from Cesa-Bianchi and Lugosi's. First we investigate the case where the hypothesis class is parametric.

Theorem 4. [13] Under Assumptions 1 and 2,

  R_m(Hk) ≤ (k/2λ) ln m + (k/2λ) ln( λ e F² V^{2/k} )   (16)

where D^m(H) as in Definition 1 is set to D^m(Θ).

Proof. The main technique used to prove (16) is to discretize the hypothesis class with an appropriate size and then to apply the aggregating algorithm over the discretized hypothesis class. The most important issue is how to choose the number of discrete points.

For F ≥ 1, let Θ' be a subset of Θ whose size is N and for which the discretization scale is at most F√k (V/N)^{1/k}; that is, sup_{θ∈Θ} min_{θ'∈Θ'} |θ - θ'| ≤ F√k (V/N)^{1/k}. This means that the discretization scale for each component in Θ is roughly uniform within a constant factor. It is a quite natural requirement for the discretization.

Let θ̂ = argmin_{θ∈Θ} Σ_{t=1}^m L(y_t, f(x_t|θ)) and θ̄ = argmin_{θ∈Θ'} Σ_{t=1}^m L(y_t, f(x_t|θ)). We write the relative cumulative loss for any algorithm A as R(A : D^m). Letting ŷ_t be the output of A at the t-th round, we see

  R(A : D^m) = ( Σ_{t=1}^m L(y_t, ŷ_t) - Σ_{t=1}^m L(y_t, f(x_t|θ̄)) ) + ( Σ_{t=1}^m L(y_t, f(x_t|θ̄)) - Σ_{t=1}^m L(y_t, f(x_t|θ̂)) )   (17)
Letting H'_k = {f(.|θ) : θ ∈ Θ'} and A be the aggregating algorithm AG using H'_k, by Lemma 1, we see

  Σ_{t=1}^m L(y_t, ŷ_t) ≤ I(D^m : H'_k)
  = -(1/λ) ln ( (1/N) Σ_{θ∈Θ'} e^{-λ Σ_{t=1}^m L(y_t, f(x_t|θ))} )
  ≤ Σ_{t=1}^m L(y_t, f(x_t|θ̄)) + (1/λ) ln N

This leads to:

  sup_{D^m ∈ D^m(Θ)} ( Σ_{t=1}^m L(y_t, ŷ_t) - Σ_{t=1}^m L(y_t, f(x_t|θ̄)) ) ≤ (1/λ) ln N   (18)

By a Taylor expansion argument, for all D^m ∈ D^m(Θ),

  Σ_{t=1}^m L(y_t, f(x_t|θ̄)) - Σ_{t=1}^m L(y_t, f(x_t|θ̂)) ≤ (mF²k/2)(V/N)^{2/k}   (19)

Plugging (18) and (19) into (17) yields

  sup_{D^m ∈ D^m(Θ)} R(AG : D^m) ≤ (1/λ) ln N + (mF²k/2)(V/N)^{2/k}   (20)

The minimum of (20) w.r.t. N is attained by N = (λmF²)^{k/2} V. Plugging this optimal size into (20) and choosing Θ' so that F is smallest yields (16).

Next we investigate the case where the hypothesis class is nonparametric, but can be approximated by a sequence of parametric hypothesis classes.
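The tradeoff behind the choice of N in the proof can be checked numerically: minimizing a bound of the form (1/λ) ln N + (mF²k/2)(V/N)^{2/k} over N, which is my reading of the garbled display (20), and comparing its minimum with the right-hand side of (16). The constants below are arbitrary test values, not from the paper.

```python
import math

# Arbitrary test constants: lam, m, F, k, V (V plays the role of vol(Theta)).
lam, m, F, k, V = 2.0, 10_000, 1.5, 3, 2.0

def bound(N):
    # aggregation cost (grows with N) plus discretization cost (shrinks with N)
    return math.log(N) / lam + (m * F**2 * k / 2) * (V / N) ** (2 / k)

N_star = (lam * m * F**2) ** (k / 2) * V       # claimed minimizer of (20)
rhs16 = (k / (2 * lam)) * math.log(m) \
      + (k / (2 * lam)) * math.log(lam * math.e * F**2 * V ** (2 / k))
```

At `N_star` the two terms balance (the second term equals k/2λ) and the bound collapses to the right-hand side of (16).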
Theorem 5. Let {Hk : k = 1, 2, ...} be a sequence of classes such that H1 ⊆ H2 ⊆ ..., where Hk is a k-dimensional parametric hypothesis class. Let F be a hypothesis class such that for some C > 0 and α > 0,

  sup_{f∈F} inf_{h∈Hk} sup_{D^m} (1/m) ( L(D^m : f) - L(D^m : h) ) ≤ C/k^α   (21)

where L(D^m : f) = Σ_{t=1}^m L(y_t, f(x_t)). Then under Assumptions 1 and 2, for some A, B > 0 depending on C and λ,

  R_m(F) ≤ A m^{1/(α+1)} (ln m)^{α/(α+1)} + B ( m^{1/(α+1)} / (ln m)^{1/(α+1)} ) ln( λ e F² V^{2/k} )   (22)

Proof. Fix k. By (16) and (21), we have

  R_m(F) ≤ R_m(Hk) + m sup_{f∈F} inf_{h∈Hk} sup_{D^m} (1/m) ( L(D^m : f) - L(D^m : h) )
  ≤ (k/2λ) ln m + (k/2λ) ln( λ e F² V^{2/k} ) + mC/k^α

Setting k = (2λαCm/ln m)^{1/(α+1)} in the last inequality yields (22).

Condition (21) means that the worst-case approximation error of Hk to F is O(1/k^α). Eq. (16) is regarded as a special case of (22) where α is infinite.
5 Minimax RCL for Regression
We apply the results in Sections 3 and 4 to the regression problem. We consider the case where the hypothesis class is a class of linear functions of a feature vector and the distortion measure is the square loss. Such a case has been extensively investigated in the linear regression (LR) scenario in statistics. Our analysis is different from conventional ones in the following regards: 1) Although it is assumed in the classical LR that noise is additive and is generated according to a Gaussian distribution, we don't make any probabilistic assumption either for a target distribution or a hypothesis class, but instead perform worst-case analysis in terms of the worst-case RCL. Additionally, we emphasize that we consider the regression problem in the on-line prediction scenario rather than in the batch-learning scenario as in the classical LR. 2) While most algorithms investigated in the classical LR take linear forms of a feature vector, we don't restrict ourselves to them, but rather consider nonlinear prediction algorithms using a hypothesis class of linear functions.

Let X = { x = (x^(1), ..., x^(k)) ∈ [0,1]^k : (x^(1))² + ··· + (x^(k))² ≤ 1 } and Θ = { θ = (θ_1, ..., θ_k) ∈ [0,1]^k : θ_1² + ··· + θ_k² ≤ 1 }. Let Y = Z = [0,1]. Let a hypothesis class be

H_k = { f_θ(x) = θ^T x : θ ∈ Θ, x ∈ X }.

This is known as a class of linear predictors. Let L be the square loss function. Then λ = 2. For ζ > 0, let π(θ) = e^{−ζ θ^T θ} / V(ζ), where V(ζ) = ∫ e^{−ζ θ^T θ} dθ.
Extended Stochastic Complexity and Minimax Relative Loss Analysis
Below we describe the aggregating algorithm [10],[12] using H_k, denoted as AG. At the t-th round, AG takes as input D^{t−1} and x_t, then outputs ŷ_t s.t.

ŷ_t = (1/2) [ ( −(1/2) ln ∫ p(θ | D^{t−1}) e^{−2(θ^T x_t)²} dθ )^{1/2} + 1 − ( −(1/2) ln ∫ p(θ | D^{t−1}) e^{−2(1−θ^T x_t)²} dθ )^{1/2} ],

where

p(θ | D^{t−1}) = e^{−2(θ−μ_t)^T Σ_t (θ−μ_t)} / ∫ e^{−2(θ−μ_t)^T Σ_t (θ−μ_t)} dθ,
Σ_t^{(pq)} = ζ δ_{pq} + Σ_{j=1}^{t−1} x_j^{(p)} x_j^{(q)},   μ_t = Σ_t^{−1} A_t,   A_t^{(p)} = Σ_{j=1}^{t−1} y_j x_j^{(p)},

where Σ^{(pq)}, x_j^{(p)}, and A_t^{(p)} denote the (p,q)-th component of Σ, the p-th component of x_j, and the p-th component of A_t, respectively (p, q = 1, ..., k), and δ_{pq} = 1 if p = q and δ_{pq} = 0 if p ≠ q. We can set λ = 2, r = 1/(2√k), and c = e^{−ζ} V(ζ). By (1) we have the following upper bound on the worst-case RCL for AG:

R_m(AG : H_k) ≤ (k/4) ln(2mk) + (1/2) ln( 2^k e^ζ / V(ζ) ) + o(1).   (23)

Note that the parameters in Theorem 4 are: V = π^{k/2} / (2^k Γ(1 + k/2)) and F = 1. By (16), we have

R_m(H_k) ≤ (k/4) ln m + (k/4) ln( πek / Γ(1 + k/2)^{2/k} ).   (24)

Vovk [11] derived a similar non-asymptotic upper bound on the worst-case RCL for the aggregating algorithm using linear predictors. His bound matches (23) up to the (k/4) ln m term. Kivinen and Warmuth [4] proposed the gradient descent algorithm (GD) and the exponentiated gradient algorithm (EG) as sequential prediction algorithms using the linear predictors. Notice here that the outputs of both EG and GD are linear in x, whereas that of AG is not linear in x. They showed

sup_{D^m} R(GD : D^m) = O(√m),   sup_{D^m} R(EG : D^m) = O(√(km ln k)).

Eq. (23) shows that the upper bound on the worst-case RCL for AG is O(k ln m), which is smaller than those for GD and EG when m is sufficiently large and the parameter size k is fixed.
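Since L is the square loss and the posterior p(θ | D^{t−1}) is Gaussian, every integral in AG's prediction has a closed form: if θ^T x_t has posterior mean m and variance v, then ∫ N(s; m, v) e^{−2(c−s)²} ds = (1+4v)^{−1/2} exp(−2(c−m)²/(1+4v)). The sketch below is illustrative only (the function name and data are made up, not the paper's code); it maintains Σ_t and A_t as defined above and evaluates the two integrals in closed form:

```python
import numpy as np

def ag_linear_square_loss(X, y, zeta=1.0):
    """On-line aggregating (AG) predictions for linear hypotheses under
    square loss (lambda = 2), with the Gaussian posterior p(theta|D^{t-1})
    maintained through Sigma_t and A_t as in the text."""
    m, k = X.shape
    Sigma = zeta * np.eye(k)   # Sigma_t^{(pq)} = zeta*delta_pq + sum_j x_j^(p) x_j^(q)
    A = np.zeros(k)            # A_t^{(p)} = sum_j y_j x_j^(p)
    preds = np.empty(m)
    for t in range(m):
        x = X[t]
        mu = np.linalg.solve(Sigma, A)                  # mu_t = Sigma_t^{-1} A_t
        mean = float(mu @ x)                            # posterior mean of theta^T x_t
        v = float(x @ np.linalg.solve(Sigma, x)) / 4.0  # posterior variance of theta^T x_t
        # g(c) = -(1/2) ln int p(theta|D^{t-1}) exp(-2 (c - theta^T x_t)^2) dtheta
        def g(c):
            return 0.25 * np.log1p(4.0 * v) + (c - mean) ** 2 / (1.0 + 4.0 * v)
        preds[t] = 0.5 * (np.sqrt(g(0.0)) + 1.0 - np.sqrt(g(1.0)))
        Sigma += np.outer(x, x)                         # incorporate (x_t, y_t)
        A += y[t] * x
    return preds
```

As v → 0 the prediction reduces to the plug-in value θ^T x_t, so later predictions track the best linear predictor.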
References
1. Cesa-Bianchi, N., and Lugosi, G., Minimax regret under log loss for general classes of experts, in Proc. of COLT'99, (1999).
2. Freund, Y., Predicting a binary sequence almost as well as the optimal biased coin, in Proc. of COLT'96, 89-98 (1996).
3. Haussler, D., Kivinen, J., and Warmuth, M., Tight worst-case loss bounds for predicting with expert advice, Computational Learning Theory: EuroCOLT'95, Springer, 69-83 (1995).
4. Kivinen, J., and Warmuth, M., Exponentiated gradient versus gradient descent for linear predictors, UCSC-CRL-94-16, 1994.
5. Opper, M., and Haussler, D., Worst case prediction over sequences under log loss, in Proc. of IMA Workshop in Information, Coding, and Distribution, Springer, 1997.
6. Rissanen, J., Stochastic complexity, J. R. Statist. Soc. B, vol.49, 3, 223-239 (1987).
7. Rissanen, J., Stochastic Complexity in Statistical Inquiry, World Scientific, Singapore, 1989.
8. Rissanen, J., Fisher information and stochastic complexity, IEEE Trans. on Inf. Theory, vol.IT-42, 1, 40-47 (1996).
9. Shtarkov, Y.M., Universal sequential coding of single messages, Probl. Inf. Transmission, 23(3):3-17 (1987).
10. Vovk, V.G., Aggregating strategies, in Proc. of COLT'90, Morgan Kaufmann, 371-386 (1990).
11. Vovk, V.G., Competitive on-line linear regression, in Proc. of Advances in NIPS'98, MIT Press, 364-370 (1998).
12. Yamanishi, K., A decision-theoretic extension of stochastic complexity and its applications to learning, IEEE Trans. on Inf. Theory, IT-44, 1424-1439 (1998).
13. Yamanishi, K., Minimax relative loss analysis for sequential prediction algorithms using parametric hypotheses, in Proc. of COLT'98, ACM Press, pp:32-43 (1998).
Algebraic Analysis for Singular Statistical Estimation
Sumio Watanabe
P&I Lab., Tokyo Institute of Technology, 4259 Nagatsuta, Midori-ku, Yokohama, 226-8503 Japan
swatanab@pi.titech.ac.jp
http://watanabe-www.pi.titech.ac.jp/index.html
Abstract. This paper clarifies the learning efficiency of a non-regular parametric model such as a neural network whose true parameter set is an analytic variety with singular points. By using Sato's b-function we rigorously prove that the free energy, or the Bayesian stochastic complexity, is asymptotically equal to λ₁ log n − (m₁ − 1) log log n + constant, where λ₁ is a rational number, m₁ is a natural number, and n is the number of training samples. Also we show an algorithm to calculate λ₁ and m₁ based on the resolution of singularity. In regular models, 2λ₁ is equal to the number of parameters and m₁ = 1, whereas in non-regular models such as neural networks, 2λ₁ is smaller than the number of parameters and m₁ ≥ 1.
1 Introduction
From the statistical point of view, layered learning machines such as neural networks are not regular models. In a regular statistical model, the set of true parameters consists of only one point and is identifiable even if the learning model is larger than necessary to attain the true distribution (over-realizable case). On the other hand, if a neural network is in the over-realizable case, the set of true parameters is not one point or a manifold, but an analytic variety with singular points. For such non-regular and non-identifiable learning machines, the maximum likelihood estimator does not exist or is not subject to the asymptotically normal distribution, with the result that their learning efficiency is not yet clarified [1][2][3][4][5]. However, analysis for the over-realizable case is necessary for selecting the optimal model which balances the function approximation error with the statistical estimation error [3]. In this paper, by employing algebraic analysis, we prove that the free energy F(n) (also called the Bayesian stochastic complexity or the average Bayesian factor) has the asymptotic form

F(n) = λ₁ log n − (m₁ − 1) log log n + O(1),

where n is the number of empirical samples. We also show an algorithm to calculate the positive rational number λ₁ and the natural number m₁ using
O. Watanabe, T. Yokomori (Eds.): ALT'99, LNAI 1720, pp. 39-50, 1999. © Springer-Verlag Berlin Heidelberg 1999
Hironaka's resolution of singularities, and that 2λ₁ is smaller than the number of parameters. Since the increase of the free energy F(n + 1) − F(n) is equal to the generalization error defined by the average Kullback distance of the estimated probability density from the true one, our result claims that non-regular statistical models such as neural networks are better learning machines than the regular models if the Bayesian estimation is applied in learning.
2 Main Results
Let p(y | x, w) be a conditional probability density from an input vector x ∈ R^M to an output vector y ∈ R^N, with a parameter w ∈ R^d, which represents probabilistic inference of a learning machine. Let π(w) be a probability density function on the parameter space R^d, whose support is denoted by W = supp π ⊂ R^d. We assume that training or empirical sample pairs (x_i, y_i); i = 1, 2, ..., n are independently taken from q(y | x) q(x), where q(x) and q(y | x) represent the true input probability and the true inference probability, respectively. In the Bayesian estimation, the estimated inference p_n(y | x) is the average of the a posteriori ensemble,

p_n(y | x) = ∫ p(y | x, w) ρ_n(w) dw,   ρ_n(w) = (1/Z_n) π(w) Π_{i=1}^n p(y_i | x_i, w),

where Z_n is a constant which ensures ∫ ρ_n(w) dw = 1. The learning efficiency or the generalization error is defined by the average Kullback distance of the estimated probability density p_n(y | x) from the true one q(y | x),

K(n) = E_n [ ∫∫ log( q(y | x) / p_n(y | x) ) q(x, y) dx dy ],

where E_n[·] shows the expectation value over all sets of training sample pairs. In this paper we mainly consider the statistical estimation error and assume that the model can attain the true inference; in other words, there exists a parameter w₀ ∈ W such that p(y | x, w₀) = q(y | x). Let us define the average and empirical loss functions:

f(w) = ∫∫ log( p(y | x, w₀) / p(y | x, w) ) q(y | x) q(x) dx dy,
f_n(w) = (1/n) Σ_{i=1}^n log( p(y_i | x_i, w₀) / p(y_i | x_i, w) ).

Note that f(w) ≥ 0. By the assumption, the set of the true parameters W₀ = { w ∈ W ; f(w) = 0 } is not an empty set. W₀ is called an analytic set or analytic variety of the analytic function f(w). Note that W₀ is not a manifold in general, since it has singular points.
From these definitions, it is immediately proven [4] that the average Kullback distance K(n) is equal to the increase of the free energy F(n):

K(n) = F(n + 1) − F(n),   F(n) = −E_n [ log ∫ exp(−n f_n(w)) π(w) dw ],

where F(n) is sometimes called the Bayesian stochastic complexity or the Bayesian factor. Theorems 1 and 2 are the main results of this paper. Let C₀^∞ be the set of all compact-support and C^∞-class functions on R^d.

Theorem 1 Assume that f(w) is an analytic function and π(w) is a probability density function on R^d. Then, there exists a real constant C₁ such that for any natural number n,

F(n) ≤ λ₁ log n − (m₁ − 1) log log n + C₁,   (1)

where the rational number −λ₁ (λ₁ > 0) and the natural number m₁ are the largest pole and its multiplicity of the meromorphic function that is analytically continued from

J(λ) = ∫_{f(w)<ε} f(w)^λ ψ(w) dw   (Re(λ) > 0),

where ε > 0 is a sufficiently small constant, and ψ(w) is an arbitrary nonzero C₀^∞-class function that satisfies 0 ≤ ψ(w) ≤ π(w).

Theorem 2 Let σ > 0 be a constant value. Assume that

p(y | x, w) = (1/√(2πσ²)) exp( − ||y − φ(x, w)||² / (2σ²) ),

where φ(x, w) is an analytic function for w ∈ R^d and a continuous function for x ∈ R^M. Also assume that π(w) is a C₀^∞-class probability density function. Then, there exists a constant C₂ > 0 such that for any natural number n,

F(n) − λ₁ log n + (m₁ − 1) log log n ≥ −C₂,

where the rational number −λ₁ (λ₁ > 0) and the natural number m₁ are the largest pole and its multiplicity of the meromorphic function that is analytically continued from

J(λ) = ∫_{f(w)<ε} f(w)^λ π(w) dw   (Re(λ) > 0),

where ε > 0 is a sufficiently small constant.

From Theorem 2, if the average Kullback distance K(n) has an asymptotic expansion, then

K(n) = λ₁/n − (m₁ − 1)/(n log n) + o( 1/(n log n) ).

As is well known, regular statistical models have λ₁ = d/2 and m₁ = 1. However, non-regular models such as neural networks have smaller λ₁ and larger m₁ in general [3].
3 Proof of Theorem 1
Lemma 1 The upper bound of the free energy is given by

F(n) ≤ −log ∫ exp(−n f(w)) π(w) dw.

[Proof of Lemma 1] From Jensen's inequality and E_n(f_n − f) = 0, Lemma 1 is obtained. (Q.E.D.)

For a given ε > 0, a set of parameters is defined by

W_ε = { w ∈ W = supp π ; f(w) < ε }.

Theorem 3 (Sato, Bernstein, Björk, Kashiwara) Assume that there exists ε₀ > 0 such that f(w) is an analytic function in W_{ε₀}. Then there exists a set (ε, P, b), where (1) ε < ε₀ is a positive constant, (2) P = P(λ, w, ∂_w) is a differential operator which is a polynomial in λ, and (3) b(λ) is a polynomial, such that

P(λ, w, ∂_w) f(w)^{λ+1} = b(λ) f(w)^λ   (∀w ∈ W_ε, ∀λ ∈ C).

The zeros of the equation b(λ) = 0 are real, rational, and negative numbers.

[Explanation of Theorem] This theorem is proven based on the algebraic property of the ring of partial differential operators. See references [6][7][8]. The rationality of the zeros of b(λ) = 0 is shown based on the resolution of singularity [9][13]. The smallest order polynomial b(λ) that satisfies the above relation is called a Sato-Bernstein polynomial or a b-function.

Hereafter ε > 0 is taken smaller than that in this theorem. We can assume that ε < 1 without loss of generality. For a given analytic function f(w), let us define a complex function J(λ) of λ ∈ C by

J(λ) = ∫_{W_ε} f(w)^λ ψ(w) dw.

Lemma 2 Assume that ψ(w) is a C₀^∞-class function. Then, J(λ) can be analytically continued to a meromorphic function on the entire complex plane; in other words, J(λ) has only poles in λ ∈ C. Moreover J(λ) satisfies the following conditions. (1) The poles of J(λ) are rational, real, and negative numbers. (2) For an arbitrary a ∈ R, J(λ + a√−1) → 0, and J(a + t√−1) → 0 as |t| → ∞.
[Proof of Lemma 2] J(λ) is an analytic function in the region Re(λ) > 0. J(λ + a√−1) → 0 is shown by Lebesgue's convergence theorem. For a > 0, J(a + t√−1) → 0 is shown by the Riemann-Lebesgue theorem. Let P* be the adjoint operator of P = P(λ, w, ∂_w). Then, by Theorem 3,

J(λ) = (1/b(λ)) ∫_{W_ε} P f(w)^{λ+1} ψ(w) dw = (1/b(λ)) ∫_{W_ε} f(w)^{λ+1} P* ψ(w) dw.

Because P* ψ ∈ C₀^∞, the right-hand side is analytic for Re(λ) > −1, so J(λ) can be analytically continued to J(λ − 1) wherever b(λ) ≠ 0, and inductively to the entire complex plane. By analytic continuation, then even for a < 0, J(a + t√−1) → 0. If b(λ) = 0, then such λ is a pole, which is on the negative part of the real axis. (Q.E.D.)

Definition The poles of the function J(λ) are on the negative part of the real axis and contained in the set { μ − m ; m = 0, 1, 2, ..., b(μ) = 0 }. They are ordered from the bigger to the smaller as −λ₁ > −λ₂ > −λ₃ > ··· (λ_k > 0 is a rational number), and the multiplicity of −λ_k is denoted by m_k.

We define a function I(t) from R to R by

I(t) = ∫_{W_ε} δ(t − f(w)) ψ(w) dw,   (2)

where ε > 0 is taken as above. By definition, if t > ε or t < 0 then I(t) = 0.

Lemma 3 Assume that ψ(w) is a C₀^∞-class function. Then I(t) has an asymptotic expansion for t → 0. (The notation ≅ shows that the right-hand side is an asymptotic expansion.)

I(t) ≅ Σ_{k=1}^∞ Σ_{m=0}^{m_k−1} c_{k,m+1} t^{λ_k−1} (−log t)^m,   (3)

where m! c_{k,m+1} is the coefficient of the (m+1)-th order term in the Laurent expansion of J(λ) at λ = −λ_k.

[Proof of Lemma 3] A special case of this lemma is shown in [10]. Let I_K(t) be the sum in I(t) restricted from k = 1 to k = K. It is sufficient to show that, for an arbitrary fixed K,

lim_{t→0} ( I(t) − I_K(t) ) t^β = 0   (∀β > −λ_{K+1} + 1).   (4)

From the definition of J(λ), J(λ) = ∫₀^1 I(t) t^λ dt. The simple calculation

∫₀^1 t^{λ + λ_k − 1} (−log t)^m dt = m! / (λ + λ_k)^{m+1}

shows

∫₀^1 ( I(t) − I_K(t) ) t^λ dt = J(λ) − Σ_{k=1}^K Σ_{m=0}^{m_k−1} m! c_{k,m+1} / (λ + λ_k)^{m+1}.
By putting t = e^{−x} and by using the inverse Laplace transform and the previous Lemma 2, the complex integral path can be moved by regularity, which leads to eq.(4). (Q.E.D.)

[Proof of Theorem 1] By combining the above results, we have

F(n) ≤ −log ∫_W exp(−n f(w)) π(w) dw ≤ −log ∫_{W_ε} exp(−n f(w)) ψ(w) dw
     = −log ∫₀^1 e^{−nt} I(t) dt = −log [ (1/n) ∫₀^n e^{−t} I(t/n) dt ]
     = −log [ Σ_{k=1}^∞ Σ_{m=0}^{m_k−1} Σ_{j=0}^m c_{k,m+1} (mCj) (log n)^j n^{−λ_k} ∫₀^n e^{−t} t^{λ_k−1} (−log t)^{m−j} dt ]
     = λ₁ log n − (m₁ − 1) log log n + O(1),

where I(t) is defined by eq.(2). (Q.E.D.)
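The leading behavior just derived can be checked numerically in the simplest regular case (an illustrative sketch under stated assumptions, not from the paper): for d = 1, f(w) = w², and an unnormalized uniform prior on [−1, 1], the quantity −log ∫ e^{−n f(w)} dw should grow like λ₁ log n = (1/2) log n, with m₁ = 1.

```python
import numpy as np

# Regular one-dimensional case: f(w) = w^2 with an (unnormalized) uniform
# prior on [-1, 1].  Then -log int exp(-n w^2) dw ~ (1/2) log n + O(1),
# i.e. lambda_1 = d/2 = 1/2 and m_1 = 1 (no log log n term).
w = np.linspace(-1.0, 1.0, 200001)
dw = w[1] - w[0]

def neg_log_evidence(n):
    # Riemann-sum approximation of -log int_{-1}^{1} exp(-n w^2) dw
    return -np.log(np.sum(np.exp(-n * w ** 2)) * dw)

slope = (neg_log_evidence(1e4) - neg_log_evidence(1e2)) / np.log(1e2)
```

The measured slope in log n is 1/2, i.e. λ₁ = d/2 for a regular model, as stated after Theorem 2.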
4 Proof of Theorem 2
Hereafter, we assume that the model is given by

p(y | x, w) = (1/√(2π)) exp( −(1/2)(y − φ(x, w))² ).

It is easy to generalize the result to a general standard deviation σ (σ > 0) case and a general output dimension (N > 1) case. For this model,

f(w) = (1/2) ∫ ( φ(x, w) − φ(x, w₀) )² q(x) dx,
f_n(w) = (1/2n) Σ_{i=1}^n ( φ(x_i, w) − φ(x_i, w₀) )² − (1/n) Σ_{i=1}^n η_i ( φ(x_i, w) − φ(x_i, w₀) ),

where η_i ≡ y_i − φ(x_i, w₀) are independent samples from the standard normal distribution.

Lemma 4 Let {(x_i, η_i)}_{i=1}^n be a set of independent samples taken from q(x) q₀(y), where q(x) is a compact-support and continuous probability density and q₀(y) is the standard normal distribution. Assume that the function φ(x, w) is analytic for w and continuous for x, and that the Taylor expansion of φ(x, w) in w absolutely converges in the region T = { w ; |w_j − w̄_j| < r_j }. For a given constant 0 < a < 1, we define the region T_a = { w ; |w_j − w̄_j| < a r_j }. Then the following statements hold.

(1) If ∫ φ(x, w) q(x) dx = 0, there exists a constant c₀ such that for an arbitrary n,

A_n ≡ E_n [ sup_{w∈T_a} | (1/√n) Σ_{i=1}^n φ(x_i, w) |² ] < c₀ < ∞.
(2) There exists a constant c₀′ such that for an arbitrary n,

B_n ≡ E_n [ sup_{w∈T_a} | (1/√n) Σ_{i=1}^n η_i φ(x_i, w) |² ] < c₀′ < ∞.

[Proof of Lemma 4] We show (1). The statement (2) can be proven by the same method. This lemma needs proof because sup_{w∈T_a} is inside the expectation. We denote k = (k₁, k₂, ..., k_d) and

φ(x, w) = Σ_k a_k(x) (w − w̄)^k = Σ_{k₁} ··· Σ_{k_d} a_{k₁···k_d}(x) (w₁ − w̄₁)^{k₁} ··· (w_d − w̄_d)^{k_d}.

Let K = supp q(x) be a compact set. Since φ(x, w) is analytic for w, by Cauchy's integral formula for several complex variables, there exists δ > 0 such that

|a_k(x)| ≤ M Π_{j=1}^d (r_j − δ)^{−k_j},   M ≡ max_{x∈K, w∈T} |φ(x, w)|,

and ∫ a_k(x) q(x) dx = 0. Thus

E_n [ | (1/√n) Σ_{i=1}^n a_k(x_i) |² ]^{1/2} ≤ ( ∫ |a_k(x)|² q(x) dx )^{1/2} ≤ M Π_j (r_j − δ)^{−k_j}.

Therefore,

A_n^{1/2} = E_n [ sup_{w∈T_a} | (1/√n) Σ_i φ(x_i, w) |² ]^{1/2}
         ≤ Σ_k E_n [ sup_{w∈T_a} | (1/√n) Σ_i a_k(x_i) (w − w̄)^k |² ]^{1/2}
         ≤ M Σ_k Π_{j=1}^d ( a r_j / (r_j − δ) )^{k_j} < ∞,

where δ is taken so that a r_j < r_j − δ (j = 1, 2, ..., d). (Q.E.D.)

The function σ_n(w) is defined as follows:

σ_n(w) = √n ( f(w) − f_n(w) ) / √f(w).

Note that σ_n(w) is a holomorphic function of w except on W₀.

Theorem 4 Assume that φ(x, w) is analytic for w ∈ R^d and continuous for x ∈ R^M. Also assume that q(x) is a compact-support and continuous function. Then, there exists a constant C₃ such that for arbitrary n,

E_n [ sup_{w ∈ W∖W₀} |σ_n(w)|² ] < C₃,

where W ⊂ R^d is a compact set.
[Proof of Theorem 4] Outside of a neighborhood of W₀, this theorem can be proven by the previous lemma. We assume that W = W_ε. By compactness of W_ε, W_ε is covered by a union of finitely many small open sets. Thus we can assume that w is in the neighborhood of w₀ ∈ W₀ ∩ W_ε. By inductively applying the Weierstrass preparation theorem [14] to the holomorphic function φ(x, w) − φ(x, w₀), there exists a finite set of functions {g_j, h_j}_{j=1}^J, where g_j(w) is a holomorphic function and h_j(x, w) is a continuous function for x and a holomorphic function for w, such that

φ(x, w) − φ(x, w₀) = Σ_{j=1}^J g_j(w) h_j(x, w),   (5)

where the matrix

M_{jk} ≡ ∫ h_j(x, w₀) h_k(x, w₀) q(x) dx

is positive definite. Let ε′ > 0 be taken smaller than the minimum eigenvalue of the matrix {M_{jk}}. By the definition,

f(w) = (1/2) Σ_{j,k=1}^J g_j(w) g_k(w) ∫ h_j(x, w) h_k(x, w) q(x) dx

is bounded below by f(w) ≥ (ε′/2) Σ_{j=1}^J |g_j(w)|², by taking the neighborhood small. We define

A(w) = (1/2) Σ_{j,k=1}^J g_j(w) g_k(w) (1/n) Σ_{i=1}^n a_{jk}(x_i, w),
a_{jk}(x_i, w) = ∫ h_j(x, w) h_k(x, w) q(x) dx − h_j(x_i, w) h_k(x_i, w),
B(w) = Σ_{j=1}^J g_j(w) (1/n) Σ_{i=1}^n η_i h_j(x_i, w).

Then A(w) + B(w) = f(w) − f_n(w). By the Cauchy-Schwarz inequality,

E_n [ sup |σ_n(w)|² ] = E_n [ sup n |f(w) − f_n(w)|² / f(w) ] ≤ E_n [ sup 2n ( |A(w)|² + |B(w)|² ) / f(w) ] ≤ Const.
For the last inequality, we applied the previous Lemma 4. (Q.E.D.)

[Proof of Theorem 2] We define

σ_n ≡ sup_{w ∈ W∖W₀} |σ_n(w)|.

Then, by Theorem 4, E_n[σ_n²] < C₃. The free energy or the Bayesian stochastic complexity satisfies

F(n) = −E_n log ∫_W exp(−n f_n(w)) π(w) dw
     = −E_n log ∫_W exp( −n f(w) + √(n f(w)) σ_n(w) ) π(w) dw
     ≥ −E_n log ∫_W exp( −n f(w) + √(n f(w)) σ_n ) π(w) dw.

Let us define Z_i(n) (i = 1, 2) by

Z_i(n) = ∫_{W^(i)} exp( −n f(w) + √(n f(w)) σ_n ) π(w) dw,

where W^(1) = W_ε and W^(2) = W∖W_ε. Then

F(n) ≥ −E_n log( Z₁(n) + Z₂(n) ) = −E_n log Z₁(n) − E_n log( 1 + Z₂(n)/Z₁(n) ).   (6)

Let F₁(n) and F₂(n) be the first and the second terms of eq.(6), respectively. For F₁(n), the same procedure as for the upper bound can be applied:

F₁(n) = −E_n [ log Σ_k Σ_{m=0}^{m_k−1} Σ_{j=0}^m c_{k,m+1} (mCj) (log n)^j n^{−λ_k} ∫₀^n e^{−t + √t σ_n} t^{λ_k−1} (−log t)^{m−j} dt ]
      = λ₁ log n − (m₁ − 1) log log n − E_n log ∫₀^n e^{−t + √t σ_n} t^{λ₁−1} dt + O(1)
      = λ₁ log n − (m₁ − 1) log log n + O(1),

where we used √t σ_n ≤ (1/2)(t + σ_n²). The term F₂(n) is evaluated by using

Z₁(n) ≥ ∫_{W_ε} exp(−n f(w)) π(w) dw ≥ c_{1,m₁} (log n)^{m₁−1} / n^{λ₁}

and

Z₂(n) ≤ ∫_{W∖W_ε} exp( −n f(w)/2 + σ_n²/2 ) π(w) dw ≤ ( 1 − π(W_ε) ) exp( σ_n²/2 − nε/2 ),

so that Z₂/Z₁ ≤ exp(σ_n²/2) for sufficiently large n. Hence

F₂(n) = −E_n log( 1 + Z₂/Z₁ ) ≥ −E_n log( 1 + exp(σ_n²/2) )
      = −E_n [ σ_n²/2 ] − E_n log( ( 1 + exp(σ_n²/2) ) / exp(σ_n²/2) ) ≥ −E_n[σ_n²/2] − log 2 > −∞.

In the last inequality, we used Theorem 4. (Q.E.D.)
5 Algorithm to Calculate the Learning Efficiency
The important values λ₁ and m₁ can be calculated by resolution of singularities. Atiyah showed [13] that the following theorem is directly proven from Hironaka's theorem.

Theorem 5 (Hironaka) Let f(w) be a real analytic function defined in a neighborhood of 0 ∈ R^d. Then there exist an open set U ∋ 0, a real analytic manifold U′, and a proper analytic map g : U′ → U such that
(1) g : U′∖A′ → U∖A is an isomorphism, where A = f^{−1}(0) and A′ = g^{−1}(A),
(2) for each P ∈ U′ there are local analytic coordinates (u₁, ..., u_d) centered at P so that, locally near P, we have

f(g(u₁, ..., u_d)) = h(u₁, ..., u_d) u₁^{k₁} u₂^{k₂} ··· u_d^{k_d},

where h is an invertible analytic function and k_i ≥ 0.

This theorem shows that the singularity of f can be locally resolved. The following is an algorithm to calculate λ₁ and m₁.

Algorithm to calculate the singular learning efficiency
(1) Cover the analytic variety W₀ = { w ∈ supp π ; f(w) = 0 } by a finite union of open neighborhoods U_α.
(2) For each neighborhood U_α, find the analytic map g_α by using blowing-ups.
(3) For each neighborhood, the function J_α(λ) is calculated:

J_α(λ) = ∫_{U_α} f(w)^λ π(w) dw
       = ∫_{g_α^{−1}(U_α)} f(g_α(u))^λ π(g_α(u)) |g_α′(u)| du
       = ∫_{g_α^{−1}(U_α)} ( Π_{i=1}^d u_i^{k_i} )^λ h(u)^λ π(g_α(u)) |g_α′(u)| du,

where g_α′ is the Jacobian. The last integration can be done for each variable u_i, and the poles of J_α(λ) and their multiplicities are obtained.
(4) By J(λ) = Σ_α J_α(λ), the poles of J(λ) and their multiplicities can be calculated.

Example 1 (Regular models) For the regular statistical models, by using an appropriate coordinate (w₁, ..., w_d), the average loss function f(w) can be locally written as

f(w) = Σ_{i=1}^d w_i².

Blowing up the singularity, we find a map g : (u₁, ..., u_d) → (w₁, ..., w_d),

w₁ = u₁,   w_i = u₁ u_i   (2 ≤ i ≤ d).

Then the function J(λ) is

J(λ) = ∫ u₁^{2λ} ( 1 + Σ_{i=2}^d u_i² )^λ u₁^{d−1} π(u₁, u₁u₂, u₁u₃, ..., u₁u_d) du₁ du₂ ··· du_d.

This function has its largest pole at λ = −d/2 with multiplicity m₁ = 1. Therefore, the free energy is

F(n) = (d/2) log n + O(1).

In this case, we can also calculate the Sato b-function [11].
Example 2 If the model

p(y | x, a, b) = (1/√(2π)) exp( −(1/2)(y − a tanh(bx))² )

is trained using samples from p(y | x, 0, 0), then

f(a, b) = a² ∫ tanh(bx)² q(x) dx.

In this case, the deepest singularity is the origin, and in the neighborhood of the origin, f(a, b) behaves like a²b². From this fact, it immediately follows that λ₁ = 1/2 and m₁ = 2, resulting in

F(n) = (1/2) log n − log log n + O(1).
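Step (3) of the algorithm can be carried out symbolically for the local form f(a, b) = a²b² appearing in Example 2 above. In this sketch (illustrative, not from the paper), a uniform prior on [0, 1]² stands in for π(w):

```python
from sympy import symbols, integrate, simplify

lam, a, b = symbols('lam a b', positive=True)
# J(lam) = int_{[0,1]^2} (a^2 b^2)^lam da db for the local model f(a,b) = a^2 b^2,
# with a uniform (constant) prior standing in for pi(w).
J = integrate(a ** (2 * lam) * b ** (2 * lam), (a, 0, 1), (b, 0, 1))
```

The analytic continuation is the rational function J(λ) = (2λ + 1)^{−2}, whose single pole λ = −1/2 has multiplicity 2: λ₁ = 1/2, m₁ = 2, in agreement with Example 2.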
Example 3 Let us consider a neural network

p(y | x, a, b, c, d) = (1/√(2π)) exp( −(1/2)(y − φ(x; a, b, c, d))² ),
φ(x; a, b, c, d) = a tanh(bx) + c tanh(dx).

Assume that the true regression function is φ(x; 0, 0, 0, 0). Then the deepest singularity of f(a, b, c, d) is (0, 0, 0, 0), and in the neighborhood of the origin,

f(a, b, c, d) behaves like (ab + cd)² + (ab³ + cd³)²,

since the higher order terms can be bounded by the above two terms (see [12]). By using blowing-up twice, we can find a map g : (x, y, z, w) → (a, b, c, d),

a = x,   b = y³w − yz,   c = zx,   d = y.

By using this transform, we obtain

f(g(x, y, z, w)) = x² y⁶ [ w² + ((y²w − z)³ + z)² ],   g′(x, y, z, w) = xy³ (up to sign),

resulting in λ₁ = 2/3 and m₁ = 1, and F(n) = (2/3) log n + O(1).

For the more general cases, some inequalities were obtained [3][12]. It is shown in [11] that, for all cases, 2λ₁ ≤ d.
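The two computational claims in Example 3 — the pulled-back form of f and the Jacobian — can be verified symbolically; the determinant comes out as −xy³, i.e. xy³ up to sign. An illustrative check (not from the paper):

```python
from sympy import symbols, expand, simplify, Matrix

x, y, z, w = symbols('x y z w')
a, b, c, d = x, y**3 * w - y * z, z * x, y      # the map g of Example 3
f = (a * b + c * d) ** 2 + (a * b ** 3 + c * d ** 3) ** 2
pullback = x**2 * y**6 * (w**2 + ((y**2 * w - z) ** 3 + z) ** 2)
# Jacobian determinant of g with respect to (x, y, z, w)
jac_det = Matrix([a, b, c, d]).jacobian(Matrix([x, y, z, w])).det()
```

The largest pole of ∫ x^{2λ} y^{6λ} · xy³ dx dy ... then comes from the y-integration, y^{6λ+3}, at λ = −2/3, giving λ₁ = 2/3 as stated.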
6 Conclusion
A mathematical foundation for singular learning machines such as neural networks is established based on algebraic analysis. The free energy or the stochastic complexity is asymptotically given by λ₁ log n − (m₁ − 1) log log n + const., where λ₁ and m₁ are calculated by resolution of singularities. Analysis for the maximum likelihood case or the zero temperature limit is an important problem for the future. We expect that algebraic analysis also plays an important role in such analysis.
References
1. Hagiwara, K., Toda, N., Usui, S.: On the problem of applying AIC to determine the structure of a layered feed-forward neural network. Proc. of IJCNN Nagoya Japan. (1993) 2263-2266
2. Fukumizu, K.: Generalization error of linear neural networks in unidentifiable cases. In this issue.
3. Watanabe, S.: Inequalities of generalization errors for layered neural networks in Bayesian learning. Proc. of ICONIP'98 (1998) 59-62
4. Levin, E., Tishby, N., Solla, S.A.: A statistical approach to learning and generalization in layered neural networks. Proc. of IEEE 78(10) (1990) 1568-1574
5. Amari, S., Fujita, N., Shinomoto, S.: Four types of learning curves. Neural Computation 4(4) (1992) 608-618
6. Sato, M., Shintani, T.: On zeta functions associated with prehomogeneous vector spaces. Annals of Math. 100 (1974) 131-170
7. Bernstein, I.N.: The analytic continuation of generalized functions with respect to a parameter. Functional Anal. Appl. 6 (1972) 26-40
8. Björk, J.E.: Rings of differential operators. North-Holland (1979)
9. Kashiwara, M.: B-functions and holonomic systems. Inventiones Math. 38 (1976) 33-53
10. Gel'fand, I.M., Shilov, G.E.: Generalized functions. Academic Press (1964)
11. Watanabe, S.: Algebraic analysis for neural network learning. Proc. of IEEE SMC Symp., 1999, to appear.
12. Watanabe, S.: On the generalization error by a layered statistical model with Bayesian estimation. IEICE Trans. J81-A (1998) 1442-1452. (The English version is to appear in Elect. and Comm. in Japan, John Wiley and Sons)
13. Atiyah, M.F.: Resolution of singularities and division of distributions. Comm. Pure and Appl. Math. 23 (1970) 145-150
14. Hörmander, L.: An introduction to complex analysis in several variables. Van Nostrand (1966)
Generalization Error of Linear Neural Networks in Unidentifiable Cases
Kenji Fukumizu
RIKEN Brain Science Institute, Wako, Saitama 351-0198, Japan
[email protected]
http://www.islab.brain.riken.go.jp/~fuku

Abstract. The statistical asymptotic theory is often used in theoretical results in computational and statistical learning theory. It describes the limiting distribution of the maximum likelihood estimator (MLE) as a normal distribution. However, in layered models such as neural networks, the regularity condition of the asymptotic theory is not necessarily satisfied. The true parameter is not identifiable if the target function can be realized by a network of smaller size than the size of the model. Little has been known on the behavior of the MLE in these cases of neural networks. In this paper, we analyze the expectation of the generalization error of three-layer linear neural networks, and elucidate a strange behavior in unidentifiable cases. We show that the expectation of the generalization error in the unidentifiable cases is larger than what is given by the usual asymptotic theory, and dependent on the rank of the target function.
1 Introduction
This paper discusses a non-regular property of multilayer network models, caused by their structural characteristics. It is well known that learning in neural networks can be described as parametric estimation from the viewpoint of statistics. Under the assumption of Gaussian noise in the output, the least square error estimator is equal to the maximum likelihood estimator (MLE), whose statistical behavior is known in detail. Therefore, many researchers have believed that the behavior of neural networks is perfectly described within the framework of the well-known statistical theory, and have applied theoretical methodologies to neural networks. It has been clarified recently that the usual statistical asymptotic theory on the MLE does not necessarily hold in neural networks ([1],[2]). This always happens if we consider the model selection problem in neural networks. Assume that we have a neural network model with H hidden units as a hypothesis space, and that the target function can be realized by a network with a smaller number of hidden units than H. In this case, as we explain in Section 2, the true parameter in the hypothesis class, which realizes the target function, is high-dimensional and not identifiable. The distribution of the MLE is not subject to
O. Watanabe, T. Yokomori (Eds.): ALT'99, LNAI 1720, pp. 51-62, 1999. © Springer-Verlag Berlin Heidelberg 1999
the ordinary asymptotic theory in this case. We cannot apply methods such as AIC and MDL, which are based on the asymptotic theory. In this paper, we discuss the MLE of linear neural networks, as the simplest multilayer model. Also in this simple model, the true parameter loses identifiability if and only if the target is realized by a network with a smaller number of hidden units. As the first step to investigate the behavior of a learning machine in unidentifiable cases, we calculate the expectation of the generalization error of linear neural networks in asymptotic situations, and derive an approximate formula for large-scale networks. From these results, we see that the generalization error in unidentifiable cases is larger than what is derived from the usual asymptotic theory. While the ordinary asymptotic theory asserts that the expectation of the generalization error depends only on the number of parameters, the generalization error in linear neural networks depends on the rank of the target function.
2 Neural Networks and Identifiability
2.1 Neural Networks and Identifiability of Parameter
A neural network model can be described as a parametric family of functions { f(·; θ) : R^L → R^M }, where θ is a parameter vector. A three-layer neural network with H hidden units is defined by

f_i(x; θ) = Σ_{j=1}^H w_ij s( Σ_{k=1}^L u_jk x_k + ζ_j ) + η_i   (1 ≤ i ≤ M),   (1)

where θ = (w_ij, η_i, u_jk, ζ_j) summarizes all the parameters. The function s(t) is called an activation function. In the case of a multilayer perceptron, a bounded and non-decreasing function like tanh(t) is often used.

We consider regression problems, assuming that an output of the target system is observed with noise. An observed sample (x, y) ∈ R^L × R^M satisfies

y = f(x) + z,   (2)

where f(x) is the target function, which is unknown to a learner, and z is a random vector representing noise, subject to N(0, σ²I_M), where N(μ, Σ) is a normal distribution with mean μ and variance-covariance matrix Σ. We use I_M for the M × M unit matrix. An input vector x is generated randomly with its probability density q(x). A set of N training data, {(x^(ν), y^(ν))}_{ν=1}^N, is an independent sample from the joint probability p(y|x)q(x)dxdy, where p(y|x) is defined by eq.(2). We discuss the maximum likelihood estimator (MLE), denoted by θ̂, assuming the statistical model

p(y | x; θ) = (1/(2πσ²)^{M/2}) exp( − ||y − f(x; θ)||² / (2σ²) ),   (3)
Fig. 1. Unidentifiable cases in neural networks
which has the same noise model as the target. Under these assumptions, it is easy to see that the MLE is equivalent to the least square error estimator, which minimizes the following empirical error:

E_emp = Σ_{ν=1}^N || y^(ν) − f(x^(ν); θ) ||².   (4)

We evaluate the accuracy of the estimation using the expectation of the generalization error:

E_gen ≡ E_{{(x^(ν), y^(ν))}} [ ∫ || f(x; θ̂) − f(x) ||² q(x) dx ].   (5)

We sometimes call it simply the generalization error if not confusing. It is easy to see that the expected log likelihood is directly related to E_gen as

E_{{(x^(ν), y^(ν))}} [ ∫∫ p(y|x) q(x) ( −log p(y | x; θ̂) ) dy dx ] = (1/(2σ²)) E_gen + const.   (6)

Throughout this paper, we assume that the target function is realized by the model; that is, there exists a true parameter θ₀ such that f(x; θ₀) = f(x). One of the special properties of neural networks is that, if the target can be realized by a network with a smaller number of hidden units than the model, the set of true parameters that realize the target function is not a point but a union of high-dimensional manifolds (Fig.1). Indeed, the target can be realized in the parameter set where w_i1 = 0 (∀i) holds and u_1k takes an arbitrary value, and also in the set where u_1k = 0 (∀k) and w_i1 takes an arbitrary value, assuming s(0) = 0. We say that the true parameter is unidentifiable if the set of true parameters is a union of manifolds whose dimensionality is more than one. The usual asymptotic theory cannot be applied if the true parameter is unidentifiable. The Fisher information matrix is singular in such a case ([3]). In the presence of noise in the output data, the MLE is located a little apart from the high-dimensional set.
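The singularity of the Fisher information can be seen concretely in a minimal one-hidden-unit model f(x; w, u) = w·tanh(ux) (a made-up illustration, not the paper's setup): at a true parameter with w = 0, the derivative with respect to u vanishes identically, so E[∇f ∇f^T] is rank-deficient.

```python
import numpy as np

# Model f(x; w, u) = w * tanh(u*x).  The target is realized with w0 = 0 and
# arbitrary u0, i.e. a point on the high-dimensional set of true parameters.
rng = np.random.default_rng(0)
xs = rng.uniform(-1.0, 1.0, 10000)
w0, u0 = 0.0, 0.7
grads = np.stack([np.tanh(u0 * xs),                   # d f / d w
                  w0 * xs / np.cosh(u0 * xs) ** 2])   # d f / d u (zero, since w0 = 0)
fisher = grads @ grads.T / xs.size                    # empirical E[grad f grad f^T]
```

The second row of the gradient is identically zero, so the 2x2 Fisher matrix has rank 1 and zero determinant, which is why the ordinary asymptotic expansion breaks down.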
Kenji Fukumizu

2.2 Linear Neural Networks
We focus on linear neural networks hereafter, as the simplest multilayer model. A linear neural network (LNN) with $H$ hidden units is defined by

$$f(x; A, B) = BAx, \qquad (7)$$

where $A$ is an $H \times L$ matrix and $B$ is an $M \times H$ matrix. We assume that $H \le M \le L$ throughout this paper. Although $f(x; A, B)$ is just a linear map, the model is not equal to the set of all linear maps from $\mathbb{R}^L$ to $\mathbb{R}^M$, but is the set of linear maps of rank not greater than $H$. Then, the model is not equivalent to the linear regression model $\{Cx \mid C : M \times L \text{ matrix}\}$. This model is known as reduced rank regression in statistics ([4]). The parameterization in eq.(7) has trivial redundancy. The transform $(A, B) \mapsto (GA, BG^{-1})$ does not change the map for any non-singular matrix $G$. Given a linear map of rank $H$, the set of parameters that realize the map consists of an $H^2$-dimensional manifold. However, we can easily eliminate this redundancy if we restrict the parameterization so that the first $H$ rows of $A$ make the unit matrix. Therefore, the essential number of parameters is $H(L+M-H)$. In other words, we can regard $BA$ as a point in an $H(L+M-H)$-dimensional space. More essential redundancy arises when the rank of a map is less than $H$. Even if we use the above restriction, the set of parameters that realize such a map is still high-dimensional. Then, in LNN, the parameter $BA$ is identifiable if and only if the rank of the target is equal to $H$. In this sense, we can regard linear neural networks as a multilayer model, because it preserves the unidentifiability explained in Section 2.1. If the rank of the target is $H$, the usual asymptotic theory holds. It is well known that $E_{\mathrm{gen}}$ is given by

$$E_{\mathrm{gen}} = \frac{\sigma^2}{N}\, H(L+M-H) + O(N^{-3/2}), \qquad (8)$$

using the number of parameters.
3 Generalization Error of Linear Neural Networks

3.1 Exact Results

It is known that the MLE of a LNN can be solved exactly. We introduce the following notations:

$$X = (x^{(1)}, \ldots, x^{(N)})^T, \quad Y = (y^{(1)}, \ldots, y^{(N)})^T, \quad \text{and} \quad Z = (z^{(1)}, \ldots, z^{(N)})^T. \qquad (9)$$
Proposition 1 ([5]). Let $V_H$ be an $M \times H$ matrix whose $i$-th column is the eigenvector corresponding to the $i$-th largest eigenvalue of $Y^T X (X^T X)^{-1} X^T Y$. Then, the MLE of a linear neural network is given by

$$\hat B \hat A = V_H V_H^T\, Y^T X (X^T X)^{-1}. \qquad (10)$$
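Proposition 1 is easy to check numerically. The following sketch (not part of the paper; the function name and test data are my own) computes the reduced-rank MLE of eq.(10) with plain numpy:

```python
import numpy as np

def lnn_mle(X, Y, H):
    """MLE of a linear neural network (reduced rank regression), eq.(10):
    project the ordinary least squares solution onto the span of the
    top-H eigenvectors of Y^T X (X^T X)^{-1} X^T Y."""
    G = np.linalg.inv(X.T @ X)        # (X^T X)^{-1}, L x L
    C_ols = Y.T @ X @ G               # unconstrained least squares, M x L
    S = C_ols @ X.T @ Y               # Y^T X (X^T X)^{-1} X^T Y, M x M symmetric
    eigvec = np.linalg.eigh(S)[1]     # eigenvectors, ascending eigenvalue order
    V_H = eigvec[:, -H:]              # top-H eigenvectors
    return V_H @ V_H.T @ C_ols        # rank <= H estimator B.A
```

For $H = M$ this reduces to ordinary least squares; for smaller $H$ it is a best rank-$H$ fit, which is the reduced-rank-regression view of the model ([4]).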
Note that the MLE is unique even when the target is not identifiable, because the statistical data include noise. It distributes along the set of true parameters. The expectation of the generalization error is given by the following theorem.

Theorem 1. Assume that the rank of the target is $r$ ($\le H$), and the variance-covariance matrix of the input $x$ is positive definite. Then, the expectation of the generalization error of a linear neural network is

$$E_{\mathrm{gen}} = \frac{\sigma^2}{N}\Big\{\, r(L+M-r) + \psi(M-r,\, L-r,\, H-r) \Big\} + O(N^{-3/2}), \qquad (11)$$
where $\psi(p, n, q)$ is the expectation of the sum of the $q$ largest eigenvalues of a random matrix following the Wishart distribution $W_p(n; I_p)$. (The proof is given in the Appendix.) The density function of the eigenvalues $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p \ge 0$ of $W_p(n; I_p)$ is known as

$$\frac{1}{Z_n} \exp\Big(-\frac{1}{2}\sum_{i=1}^{p} \lambda_i\Big) \prod_{i=1}^{p} \lambda_i^{(n-p-1)/2} \prod_{1 \le i < j \le p} (\lambda_i - \lambda_j), \qquad (12)$$
where $Z_n$ is a normalizing constant. However, the explicit formula of $\psi(p, n, q)$ is not known in general. In the following, we derive an exact formula in a simple case and an approximation for large-scale networks. We can exactly calculate $\psi(2, n, 1)$ as follows. Since the expectation of the trace of a matrix from the distribution $W_2(n; I_2)$ is equal to $2n$, we have only to calculate $E[\lambda_1 - \lambda_2]$. By transforming the variables as $r = \frac{\lambda_1+\lambda_2}{2}$ and $\theta = \cos^{-1}\big(\frac{\lambda_1-\lambda_2}{\lambda_1+\lambda_2}\big)$, we can derive

$$E[\lambda_1 - \lambda_2] = \frac{1}{4\Gamma(n-1)} \int_0^{\infty}\!\!\int_0^{\pi/2} e^{-r}\, r^{n-3} \sin^{n-3}\theta\, (2r\cos\theta)^2\, 2r\sin\theta\, d\theta\, dr = 2\sqrt{\pi}\,\frac{\Gamma(\frac{n+1}{2})}{\Gamma(\frac{n}{2})}. \qquad (13)$$

Then, we obtain

$$\psi(2,n,1) = E[\lambda_1] = n + \sqrt{\pi}\,\frac{\Gamma(\frac{n+1}{2})}{\Gamma(\frac{n}{2})}. \qquad (14)$$
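Eq.(14) can be verified by simulation. The sketch below (my own check, not from the paper) draws $W_2(n; I_2)$ matrices directly as $G^T G$ with $G$ an $n \times 2$ standard normal matrix:

```python
import numpy as np
from math import gamma, sqrt, pi

def largest_eigval_mean(n, trials=30000, seed=0):
    """Monte Carlo estimate of psi(2, n, 1) = E[lambda_1] for the Wishart
    distribution W_2(n; I_2): average the largest eigenvalue of G^T G,
    where G is an n x 2 matrix of independent standard normals."""
    rng = np.random.default_rng(seed)
    G = rng.standard_normal((trials, n, 2))
    S = np.swapaxes(G, 1, 2) @ G            # a batch of 2 x 2 Wishart matrices
    return np.linalg.eigvalsh(S)[:, -1].mean()

n = 6
exact = n + sqrt(pi) * gamma((n + 1) / 2) / gamma(n / 2)   # eq.(14)
print(largest_eigval_mean(n), exact)       # the two values should be close
```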
From this fact, we can calculate the expectation of the generalization error in the case $H = M-1$ and $r = H-1$.
Theorem 2. Assume $H = M-1$. Then, the expectation of the generalization error in the cases $r = H-1$ and $r = H$ is given by

$$E_{\mathrm{gen}} = \begin{cases} \dfrac{\sigma^2}{N}\Big\{(M-1)(L+1) - 1 + \sqrt{\pi}\,\dfrac{\Gamma(\frac{L-M+3}{2})}{\Gamma(\frac{L-M+2}{2})}\Big\} & \text{if } r = H-1 \text{ (unidentifiable)}, \\[2mm] \dfrac{\sigma^2}{N}\,(M-1)(L+1) & \text{if } r = H \text{ (identifiable)}. \end{cases} \qquad (15)$$

The interesting point is that the generalization error changes depending on the identifiability of the true parameter. Since $\sqrt{\pi}\,\Gamma(\frac{n+1}{2})/\Gamma(\frac{n}{2}) > 1$ for $n \ge 2$, $E_{\mathrm{gen}}$ in the unidentifiable case is larger than $E_{\mathrm{gen}}$ in the identifiable case. If the number of input units is very large, from Stirling's formula, the difference between these errors is approximated by $\frac{\sigma^2}{N}\sqrt{\pi L/2}$, which reveals much worse generalization in unidentifiable cases.

3.2 Generalization Error of Large Scale Networks
We analyze the generalization error of a large scale network in the limit when $L$, $M$, and $H$ go to infinity in the same order. Let $S \sim W_p(n; I_p)$ be a random matrix, and $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p \ge 0$ be the eigenvalues of $n^{-1}S$. The empirical eigenvalue distribution of $n^{-1}S$ is defined by

$$P_n := \frac{1}{p}\big(\delta(\lambda_1) + \delta(\lambda_2) + \cdots + \delta(\lambda_p)\big), \qquad (16)$$

where $\delta(\lambda)$ is the Dirac measure at $\lambda$. The strong limit of $P_n$ is given by

Proposition 2 ([6]). Let $0 < \alpha \le 1$. If $n \to \infty$, $p \to \infty$ and $p/n \to \alpha$, then $P_n$ converges almost everywhere to the distribution

$$\rho(u)\,du = \frac{1}{2\pi\alpha}\,\frac{\sqrt{(u-u_m)(u_M-u)}}{u}\,\chi_{[u_m,u_M]}(u)\,du, \qquad (17)$$

where $u_m = (\sqrt{\alpha}-1)^2$, $u_M = (\sqrt{\alpha}+1)^2$, and $\chi_{[u_m,u_M]}(u)$ denotes the characteristic function of $[u_m, u_M]$.
Figure 2 shows the graph of $\rho(u)$ for $\alpha = 0.5$. We define $u_\beta$ as the $\beta$-percentile point of $\rho(u)$; that is, $\int_{u_\beta}^{u_M} \rho(u)\,du = \beta$. If we transform the variable as $t = \big(u - \frac{u_m+u_M}{2}\big)/(2\sqrt{\alpha})$, the density of $t$ is

$$\tilde\rho(t) = \frac{2}{\pi}\,\frac{\sqrt{1-t^2}}{1+\alpha+2\sqrt{\alpha}\,t}, \qquad (18)$$

and the $\beta$-percentile point $t_\beta$ is given by $\int_{t_\beta}^{1} \tilde\rho(t)\,dt = \beta$. Then, we can calculate

$$\lim_{\substack{n,p\to\infty \\ p/n\to\alpha}} \frac{1}{np}\,\psi(p,\, n,\, \beta p) = \int_{u_\beta}^{u_M} u\,\rho(u)\,du = \frac{1}{\pi}\Big\{\cos^{-1}(t_\beta) - t_\beta\sqrt{1-t_\beta^2}\Big\}. \qquad (19)$$

Combining this result with Theorem 1, we obtain
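The limit in eq.(19) is straightforward to evaluate numerically. The sketch below (my own illustration; numpy only, with a simple bisection and trapezoid quadrature of my choosing) finds $t_\beta$ from the transformed density of eq.(18) and returns the right-hand side of eq.(19):

```python
import numpy as np

def psi_limit(alpha, beta):
    """Evaluate lim psi(p, n, beta*p)/(np) for p/n -> alpha, via eq.(19).
    t_beta solves int_t^1 rho~(t) dt = beta, with the transformed density
    rho~(t) = (2/pi) sqrt(1-t^2) / (1 + alpha + 2 sqrt(alpha) t) of eq.(18)."""
    def tail(t):                          # integral of rho~ from t to 1
        ts = np.linspace(t, 1.0, 4001)
        dens = (2/np.pi) * np.sqrt(np.clip(1 - ts**2, 0, None)) \
               / (1 + alpha + 2*np.sqrt(alpha)*ts)
        h = (1.0 - t) / (len(ts) - 1)
        return h * (dens.sum() - 0.5*(dens[0] + dens[-1]))   # trapezoid rule
    lo, hi = -1.0, 1.0                    # tail(-1) ~ 1 >= beta, tail(1) = 0
    for _ in range(60):                   # bisection for the percentile point
        mid = 0.5*(lo + hi)
        lo, hi = (mid, hi) if tail(mid) > beta else (lo, mid)
    t_b = 0.5*(lo + hi)
    return (np.arccos(t_b) - t_b*np.sqrt(1 - t_b**2)) / np.pi
```

As $\beta \to 1$ the value tends to $1$, matching $\psi(p,n,p)/(np) = E[\mathrm{Tr}\,S]/(np) = 1$.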
Fig. 2. Density of eigenvalues $\rho(u)$ for $\alpha = 0.5$

Theorem 3. Let $r$ ($r \le H$) be the rank of the target. Then, we have

$$E_{\mathrm{gen}} = \frac{\sigma^2}{N}\Big[\, r(L+M-r) + (L-r)(M-r)\,\frac{1}{\pi}\Big\{\cos^{-1}(t_\beta) - t_\beta\sqrt{1-t_\beta^2}\Big\} \Big] + O(N^{-3/2}) \qquad (20)$$

when $L, M, H, r \to \infty$ with $\frac{M-r}{L-r} \to \alpha \le 1$ and $\frac{H-r}{M-r} \to \beta < 1$.

From elementary calculus, we can prove

$$\frac{1}{\pi}\Big\{\cos^{-1}(t_\beta) - t_\beta\sqrt{1-t_\beta^2}\Big\} \ge \beta\big(1 + \alpha(1-\beta)\big)$$

for $0 < \alpha \le 1$ and $0 \le \beta \le 1$. Therefore, we see that in unidentifiable cases (i.e. $r < H$), $E_{\mathrm{gen}}$ is greater than $\frac{\sigma^2}{N}H(L+M-H)$. Also in these results, $E_{\mathrm{gen}}$ depends on the rank of the target. This shows a clear difference from the usual discussion of identifiable cases, in which $E_{\mathrm{gen}}$ does not depend even on the model but only depends on the number of parameters (eq.(8)).

3.3 Numerical Simulations
First, we make experiments using an LNN with 50 input, 20 hidden, and 30 output units. We prepare target functions of rank from 0 to 20. The generalization error of the MLE with respect to 10000 training data is calculated. The left graph of Fig. 3 shows the average of the generalization errors over 100 data sets and the theoretical results given in Theorem 3. We see that the experimental and theoretical results coincide very well. Next, we investigate the generalization error in an almost unidentifiable case, where the true parameter is identifiable but has a very small singular value. We prepare an LNN with 2 input, 1 hidden, and 2 output units. The target function is $f(x;\theta_0) = \mathrm{diag}(\varepsilon, 0)\,x$, where $\varepsilon$ is a small positive number. The target is identifiable for a non-zero $\varepsilon$. The right graph of Fig. 3 shows the average of the generalization errors for 1000 training data. Surprisingly, while using 1000 training data for only 3 parameters, the result shows that the generalization errors for small $\varepsilon$ are much larger than what is given by the usual asymptotic theory. They are rather close to the $E_{\mathrm{gen}}$ of the unidentifiable case.
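The almost-unidentifiable experiment can be sketched as follows (my own reproduction, not the paper's code; the noise level $\sigma$, the trial count, and the function name are chosen for illustration). It computes the MLE of eq.(10) for the 2-1-2 network and averages the squared error $\mathrm{Tr}[(\hat B\hat A - C_0)(\hat B\hat A - C_0)^T]$, which equals the generalization error of eq.(5) when the input covariance is the identity:

```python
import numpy as np

def gen_error_mle(eps, N=1000, sigma=0.01, trials=400, seed=0):
    """Average generalization error of the MLE for a 2-1-2 linear neural
    network (H = 1) with target C0 = diag(eps, 0), inputs x ~ N(0, I_2)."""
    rng = np.random.default_rng(seed)
    C0 = np.diag([eps, 0.0])
    errs = []
    for _ in range(trials):
        X = rng.standard_normal((N, 2))
        Y = X @ C0.T + sigma * rng.standard_normal((N, 2))
        C_ols = Y.T @ X @ np.linalg.inv(X.T @ X)  # unconstrained least squares
        S = C_ols @ X.T @ Y                       # Y^T X (X^T X)^{-1} X^T Y
        v = np.linalg.eigh(S)[1][:, -1:]          # top eigenvector (H = 1)
        BA = v @ v.T @ C_ols                      # rank-1 MLE, eq.(10)
        errs.append(np.sum((BA - C0) ** 2))       # Frobenius error = E_gen here
    return float(np.mean(errs))

# identifiable (eps = 1): Theorem 2 gives 3*sigma^2/N; at eps = 0 it gives
# (2 + pi/2)*sigma^2/N, which is larger
print(gen_error_mle(1.0), gen_error_mle(0.0))
```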
Fig. 3. Experimental results. Left: generalization error vs. the rank of the target ($L=50$, $M=30$, $H=20$, $N=20000$), experimental and theoretical. Right: generalization error vs. $\varepsilon$, compared with the asymptotic theory in identifiable cases and the theoretical result in the unidentifiable case.
4 Conclusion

This paper discussed the behavior of the MLE in unidentifiable cases of multilayer neural networks. The ordinary methods based on the asymptotic theory cannot be applied to neural networks if the target is realized by a smaller number of hidden units than the model. As the first step to clarifying the correct behavior of multilayer models, we elucidated the theoretical expression of the expectation of the generalization error for linear neural networks, and derived an approximate formula for large scale networks. From these results, we see that the generalization error in unidentifiable cases is larger than the generalization error in identifiable cases, and depends on the rank of the target. This shows a clear difference from ordinary models, which always give a unique true parameter.
Appendix A
Proof of Theorem 1
Let $C_0 = B_0 A_0$ be the coefficient of the target function, and $\Sigma$ be the variance-covariance matrix of the input vector $x$. From the assumption of the theorem, $\Sigma$ is positive definite. The expectation of the generalization error is given by

$$E_{\mathrm{gen}} = E_{X,Y}\big[\mathrm{Tr}[(\hat B\hat A - C_0)\,\Sigma\,(\hat B\hat A - C_0)^T]\big]. \qquad (21)$$

We define an $M \times L$ random matrix $W$ by

$$W = Z^T X (X^T X)^{-1/2}. \qquad (22)$$

Note that all the elements of $W$ are subject to the normal distribution $N(0, \sigma^2)$, mutually independent, and independent of $X$. From Proposition 1, we have

$$\hat B\hat A - C_0 = (V_H V_H^T - I_M)\,C_0 + V_H V_H^T\, W (X^T X)^{-1/2}. \qquad (23)$$
This leads to the following decomposition:

$$E_{\mathrm{gen}} = E_{X,W}\big[\mathrm{Tr}[C_0 \Sigma C_0^T (I_M - V_H V_H^T)]\big] + E_{X,W}\big[\mathrm{Tr}[V_H V_H^T\, W (X^T X)^{-1/2}\,\Sigma\,(X^T X)^{-1/2} W^T]\big]. \qquad (24)$$

We expand $(X^T X)^{1/2}$ and $X^T X$ as

$$(X^T X)^{1/2} = N^{1/2}\,\Sigma^{1/2} + F, \qquad X^T X = N\,\Sigma + N^{1/2} K. \qquad (25)$$

Then, the matrices $F$ and $K$ are of the order $O(1)$ as $N$ goes to infinity. We write $\varepsilon = \frac{1}{\sqrt N}$ hereafter for notational simplicity, and obtain the expansion of $\frac{1}{N} Y^T X (X^T X)^{-1} X^T Y$ as

$$T(\varepsilon) := \frac{1}{N}\, Y^T X (X^T X)^{-1} X^T Y = T^{(0)} + \varepsilon\, T^{(1)} + \varepsilon^2\, T^{(2)}, \qquad (26)$$

where

$$T^{(0)} = C_0 \Sigma C_0^T, \qquad T^{(1)} = C_0 K C_0^T + C_0 \Sigma^{1/2} W^T + W \Sigma^{1/2} C_0^T, \qquad T^{(2)} = W W^T + W F C_0^T + C_0 F W^T. \qquad (27)$$

Since the column vectors of $V_H$ are the eigenvectors of $T(\varepsilon)$, they are obtained by the perturbation of the eigenvectors of $T^{(0)}$. Following the method of Kato ([7], Section II), we will calculate the projection $P_j(\varepsilon)$ onto the eigenspace corresponding to the eigenvalue $\mu_j(\varepsilon)$ of $T(\varepsilon)$. We call $P_j(\varepsilon)$ an eigenprojection. Let $\mu_1 \ge \cdots \ge \mu_r > 0$ be the positive eigenvalues of $T^{(0)} = C_0 \Sigma C_0^T$, $P_i$ ($1 \le i \le r$) be the corresponding eigenprojections, and $P_0$ be the eigenprojection corresponding to the eigenvalue $0$ of $T^{(0)}$. Then, from the singular value decomposition of $C_0 \Sigma^{1/2}$, we see that there exist projections $Q_i$ ($1 \le i \le r$) of $\mathbb{R}^L$ such that their images are mutually orthogonal 1-dimensional subspaces and the equalities

$$\Sigma^{1/2} C_0^T P_i C_0 \Sigma^{1/2} = \mu_i Q_i \qquad (28)$$

hold for all $i$. We define the total projection $Q$ by

$$Q = \sum_{i=1}^{r} Q_i. \qquad (29)$$

First, let $\mu_i(\varepsilon)$ ($1 \le i \le r$) be the eigenvalue obtained by the perturbation of $\mu_i$, and $P_i(\varepsilon)$ be the eigenprojection corresponding to $\mu_i(\varepsilon)$. Clearly, the equality

$$P_i(\varepsilon) = P_i + O(\varepsilon) \qquad (30)$$

holds.

Next, we consider the perturbation of $P_0$. Generally, by the perturbation of eq.(26), the eigenvalue $0$ of $T^{(0)}$ splits into several eigenvalues. Since the perturbation is caused by a positive definite random matrix, these eigenvalues are positive and different from each other almost surely. Let $\mu_{r+1}(\varepsilon) > \cdots > \mu_M(\varepsilon) > 0$ be these eigenvalues, and $P_{r+j}(\varepsilon)$ be the corresponding eigenprojections. We define the total projection of the eigenvalues $\mu_{r+j}$ by

$$P_0(\varepsilon) = \sum_{j=1}^{M-r} P_{r+j}(\varepsilon). \qquad (31)$$

The non-zero eigenvalues of $T(\varepsilon)P_0(\varepsilon)$ are $\mu_{r+j}(\varepsilon)$ ($1 \le j \le M-r$). To obtain the expansion of $P_{r+j}(\varepsilon)$, we expand $T(\varepsilon)P_0(\varepsilon)$ as

$$T(\varepsilon)P_0(\varepsilon) = \sum_{n=1}^{\infty} \varepsilon^n\, \tilde T^{(n)}. \qquad (32)$$

Then, from Kato ([7], (2.20)), we see that the coefficient matrices of eq.(32) are in general given by

$$\tilde T^{(1)} = P_0 T^{(1)} P_0,$$
$$\tilde T^{(2)} = P_0 T^{(2)} P_0 - P_0 T^{(1)} P_0 T^{(1)} S - P_0 T^{(1)} S T^{(1)} P_0 - S T^{(1)} P_0 T^{(1)} P_0,$$
$$\tilde T^{(3)} = -P_0 T^{(1)} P_0 T^{(2)} S - P_0 T^{(2)} P_0 T^{(1)} S - P_0 T^{(1)} S T^{(2)} P_0 - P_0 T^{(2)} S T^{(1)} P_0 - S T^{(1)} P_0 T^{(2)} P_0 - S T^{(2)} P_0 T^{(1)} P_0$$
$$\quad + P_0 T^{(1)} P_0 T^{(1)} S T^{(1)} S + P_0 T^{(1)} S T^{(1)} P_0 T^{(1)} S + P_0 T^{(1)} S T^{(1)} S T^{(1)} P_0 + S T^{(1)} P_0 T^{(1)} P_0 T^{(1)} S + S T^{(1)} P_0 T^{(1)} S T^{(1)} P_0 + S T^{(1)} S T^{(1)} P_0 T^{(1)} P_0$$
$$\quad - P_0 T^{(1)} P_0 T^{(1)} P_0 T^{(1)} S^2 - P_0 T^{(1)} P_0 S^2 T^{(1)} P_0 - P_0 T^{(1)} S^2 T^{(1)} P_0 T^{(1)} P_0 - S^2 T^{(1)} P_0 T^{(1)} P_0 T^{(1)} P_0, \qquad (33)$$

where $S$ is defined by

$$S = \sum_{i=1}^{r} \frac{1}{\mu_i}\, P_i, \qquad (34)$$

which is the inverse of $T^{(0)}$ in the image of $I - P_0$. Note that from eqs.(28), (29), and (34), the equality

$$\Sigma^{1/2} C_0^T S\, C_0 \Sigma^{1/2} = Q \qquad (35)$$

holds. From the fact $T^{(0)} P_0 = 0$, we have

$$P_0 C_0 = 0. \qquad (36)$$

Using eq.(35) and eq.(36), we can derive

$$\tilde T^{(1)} = 0, \qquad \tilde T^{(2)} = P_0\,(W W^T - W Q W^T)\,P_0. \qquad (37)$$

In particular, $P_{r+j}(\varepsilon)$ is the eigenprojection of

$$\varepsilon^{-2}\, T(\varepsilon)P_0(\varepsilon) = \tilde T^{(2)} + \varepsilon\, \tilde T^{(3)} + \varepsilon^2\, \tilde T^{(4)} + \cdots. \qquad (38)$$
In the leading term $\tilde T^{(2)} = P_0 W (I_L - Q) W^T P_0$, the multiplication by $P_0$ and by $I_L - Q$ projects $W$ orthogonally onto an $(M-r)$-dimensional subspace in the range and onto an $(L-r)$-dimensional subspace in the domain, respectively. Thus, the distribution of the random matrix $\tilde T^{(2)}$ is equal to the Wishart distribution $W_{M-r}(L-r;\, \sigma^2 I_{M-r})$. We expand $P_{r+j}(\varepsilon)$ as

$$P_{r+j}(\varepsilon) = P_{r+j} + \varepsilon\, P^{(1)}_{r+j} + \varepsilon^2\, P^{(2)}_{r+j} + O(\varepsilon^3). \qquad (39)$$

From Kato ([7], (2.14)), the coefficients are in general given by

$$P^{(1)}_{r+j} = -P_{r+j}\tilde T^{(3)} S_j - S_j \tilde T^{(3)} P_{r+j},$$
$$P^{(2)}_{r+j} = -P_{r+j}\tilde T^{(4)} S_j - S_j \tilde T^{(4)} P_{r+j} + P_{r+j}\tilde T^{(3)} S_j \tilde T^{(3)} S_j + S_j \tilde T^{(3)} P_{r+j} \tilde T^{(3)} S_j + S_j \tilde T^{(3)} S_j \tilde T^{(3)} P_{r+j} - P_{r+j}\tilde T^{(3)} P_{r+j} \tilde T^{(3)} S_j^2 - P_{r+j}\tilde T^{(3)} S_j^2 \tilde T^{(3)} P_{r+j} - S_j^2 \tilde T^{(3)} P_{r+j} \tilde T^{(3)} P_{r+j}, \qquad (40)$$

where $S_j$ is defined by

$$S_j = -\sum_{\substack{1 \le k \le M-r \\ k \ne j}} \frac{1}{\kappa_j - \kappa_k}\, P_{r+k} - \frac{1}{\kappa_j}\,(I - P_0). \qquad (41)$$

Here $\kappa_1 \ge \cdots \ge \kappa_{M-r}$ are the non-zero eigenvalues of $\tilde T^{(2)}$. The matrix $S_j$ is equal to the inverse of $\tilde T^{(2)} - \kappa_j I_M$ in the image of $I - P_{r+j}$. Using the expansions obtained above, we will calculate the generalization error. The first term of eq.(24) can be rewritten as

$$\sum_{j=H-r+1}^{M-r} E_{X,W}\big[\mathrm{Tr}[C_0 \Sigma C_0^T\, P_{r+j}(\varepsilon)]\big]. \qquad (42)$$

Having $P_{r+j} C_0 = 0$ in mind, we obtain

$$\mathrm{Tr}[C_0 \Sigma C_0^T P^{(1)}_{r+j}] = 0, \qquad \mathrm{Tr}[C_0 \Sigma C_0^T P^{(2)}_{r+j}] = \frac{1}{\kappa_j^2}\,\mathrm{Tr}[C_0 \Sigma C_0^T (I - P_0)\,\tilde T^{(3)} P_{r+j} \tilde T^{(3)}\,(I - P_0)]. \qquad (43)$$

Using eq.(36), eq.(37), and the fact that $C_0 \Sigma C_0^T S = I - P_0$, we can derive from eq.(33)

$$\mathrm{Tr}[C_0 \Sigma C_0^T P^{(2)}_{r+j}] = \frac{1}{\kappa_j^2}\,\mathrm{Tr}\big[(T^{(1)} P_0 T^{(2)} - T^{(1)} P_0 T^{(1)} S T^{(1)})\, P_{r+j}\, (T^{(2)} P_0 T^{(1)} - T^{(1)} S T^{(1)} P_0 T^{(1)})\, S\big]. \qquad (44)$$

Furthermore, eq.(27) leads to

$$(T^{(1)} P_0 T^{(2)} - T^{(1)} P_0 T^{(1)} S T^{(1)})\, P_{r+j} = C_0 \Sigma^{1/2} W^T P_0\,(W W^T - W Q W^T)\,P_{r+j} = \kappa_j\, C_0 \Sigma^{1/2} W^T P_{r+j}. \qquad (45)$$
Finally, we obtain

$$\mathrm{Tr}[C_0 \Sigma C_0^T P^{(2)}_{r+j}] = \mathrm{Tr}[C_0 \Sigma^{1/2} W^T P_{r+j} W \Sigma^{1/2} C_0^T S] = \mathrm{Tr}[P_{r+j} W Q W^T]. \qquad (46)$$

The random matrices $P_{r+j}$ and $W Q W^T$ are independent, because $WQ$ and $W(I_L - Q)$ are independent. Therefore,

$$E_{X,W}\big[\mathrm{Tr}[C_0 \Sigma C_0^T (I_M - V_H V_H^T)]\big] = \varepsilon^2 \sum_{j=H-r+1}^{M-r} E_{X,W}\big[\mathrm{Tr}[P_{r+j} W Q W^T]\big] + O(\varepsilon^3) = \varepsilon^2 \sigma^2\, r(M-H) + O(\varepsilon^3) \qquad (47)$$

is obtained. Using eqs.(25) and (30), the second term of eq.(24) is rewritten as

$$E_{X,W}\big[\mathrm{Tr}[V_H V_H^T W (X^T X)^{-1/2} \Sigma (X^T X)^{-1/2} W^T]\big] = \varepsilon^2\, E_{X,W}\Big[\sum_{i=1}^{r} \mathrm{Tr}[P_i W W^T] + \sum_{j=1}^{H-r} \mathrm{Tr}[P_{r+j} W W^T]\Big] + O(\varepsilon^3). \qquad (48)$$

Because $\sum_{i=1}^{r} P_i$ is a non-random orthogonal projection onto an $r$-dimensional subspace and the distribution of each element of $W$ is $N(0, \sigma^2)$, we have

$$E_{X,W}\Big[\mathrm{Tr}\Big[\sum_{i=1}^{r} P_i\, W W^T\Big]\Big] = \sigma^2 r L. \qquad (49)$$

We can calculate the second part of the right hand side of eq.(48) as

$$\mathrm{Tr}[P_{r+j} W W^T] = \mathrm{Tr}[P_{r+j} W Q W^T] + \mathrm{Tr}[P_{r+j}(W W^T - W Q W^T)] = \mathrm{Tr}[P_{r+j} W Q W^T] + \kappa_j. \qquad (50)$$

Because $\kappa_j$ is the $j$-th largest eigenvalue of a random matrix from the Wishart distribution $W_{M-r}(L-r;\, \sigma^2 I_{M-r})$, we obtain

$$E_{X,W}\Big[\sum_{j=1}^{H-r} \mathrm{Tr}[P_{r+j} W W^T]\Big] = \sigma^2 \big\{\, r(H-r) + \psi(M-r,\, L-r,\, H-r) \big\}. \qquad (51)$$

From eqs.(47), (49), and (51), we prove the theorem.
References

1. Hagiwara, K., Toda, N., Usui, S.: On the problem of applying AIC to determine the structure of a layered feed-forward neural network. Proc. 1993 Intern. Joint Conf. Neural Networks (1993) 2263-2266
2. Fukumizu, K.: Special statistical properties of neural network learning. Proc. 1997 Intern. Symp. Nonlinear Theory and its Applications (NOLTA'97) (1997) 747-750
3. Fukumizu, K.: A regularity condition of the information matrix of a multilayer perceptron network. Neural Networks 9 (1996) 871-879
4. Reinsel, G.C., Velu, R.P.: Multivariate Reduced Rank Regression. Springer-Verlag, Berlin Heidelberg New York (1998)
5. Baldi, P.F., Hornik, K.: Learning in linear neural networks: a survey. IEEE Trans. Neural Networks 6 (1995) 837-858
6. Wachter, K.W.: The strong limits of random matrix spectra for sample matrices of independent elements. Ann. Prob. 6 (1978) 1-18
7. Kato, T.: Perturbation Theory for Linear Operators (2nd ed). Springer-Verlag, Berlin Heidelberg New York (1976)
The Computational Limits to the Cognitive Power of the Neuroidal Tabula Rasa

Jiří Wiedermann

Institute of Computer Science, Academy of Sciences of the Czech Republic, Pod vodárenskou věží 2, 182 07 Prague 8, Czech Republic
wieder@uivt.cas.cz
Abstract. The neuroidal tabula rasa (NTR) as a hypothetical device which is capable of performing tasks related to cognitive processes in the brain was introduced by L.G. Valiant in 1994. Neuroidal nets represent a computational model of the NTR. Their basic computational element is a kind of a programmable neuron called a neuroid. Essentially it is a combination of a standard threshold element with a mechanism that allows modification of the neuroid's computational behaviour. This is done by changing its state and the settings of its weights and of its threshold in the course of computation. The computational power of an NTR crucially depends both on the functional properties of the underlying update mechanism that allows changing of neuroidal parameters and on the universe of allowable weights. We will define instances of neuroids for which the computational power of the respective finite-size NTR ranges from that of finite automata, through Turing machines, up to that of a certain restricted type of BSS machines that possess super-Turing computational power. The latter two results are surprising since similar results were known to hold only for certain kinds of analog neural networks.
1 Introduction
Nowadays, we are witnessing a steadily increasing interest towards understanding the algorithmic principles of cognition. The respective branch of computer science has recently been appropriately named cognitive computing. This notion, coined by L.G. Valiant [8], denotes any computation whose computational mechanism is based on our ideas about brain computational mechanisms and whose goal is to model cognitive abilities of living organisms. There is no surprise that most of the corresponding computational models are based on formal models of neural nets. Numerous variants of neural nets have been proposed and studied. They differ in the computational properties of their basic building elements, viz. neurons. Usually, two basic kinds of neurons are distinguished: discrete ones that compute

This research was supported by GA ČR Grant No. 201/98/0717.

O. Watanabe, T. Yokomori (Eds.): ALT'99, LNAI 1720, pp. 63-76, 1999. © Springer-Verlag Berlin Heidelberg 1999
with Boolean values, and analog (or continuous) ones that compute with any real or rational number between 0 and 1. As far as the computational power of the respective neural nets is concerned, it is known that finite nets consisting of discrete neurons are computationally equivalent to finite automata (cf. [5]). On the other hand, finite nets of analog neurons with rational weights, computing in discrete steps with rational values, are computationally equivalent to Turing machines (cf. [3]). If weights and computations with real values are allowed then the respective analog nets possess even super-Turing computational abilities [4]. No types of finite discrete neural nets are known that would be more powerful than the finite automata. An important aspect of all interesting cognitive computations is learning. Neural nets learn by adjusting the weights on neural interconnections according to a certain learning algorithm. This algorithm and the corresponding mechanism of weight adjustment are not considered as part of the network. Inspired by real biological neurons, Valiant suggested in 1988 [6] a special kind of programmable discrete neurons, called neuroids, in order to make the learning mechanism a part of neural nets. Based on its current state and current excitation from firings of the neighboring neuroids, a neuroid can change in the next step all its computational parameters (i.e., it can change its state, threshold, and weights). In his monograph [7] Valiant introduced the notion of a neuroidal tabula rasa (NTR). It is a hypothetical device which is capable of performing tasks related to cognitive processes. Neuroidal nets serve as a computational model of the NTR. Valiant described a number of neuroidal learning algorithms demonstrating the viability of neuroidal nets to model the NTR. Nevertheless, insufficient attention has been paid to the computational power of the respective nets.
Without pursuing this idea any further, Valiant merely mentioned that the computational power of neuroids depends on the restrictions put upon their possibilities to self-modify their computational parameters. It is clear that by identifying the computational power of any learning device we get an upper qualitative limit on its learning or cognitive abilities. Depending on this limit, we can make conclusions concerning the efficiency of the device at hand and those related to its appropriateness to serve as a realistic model of its real, biological counterpart. In this paper we will study the computational power of the neuroidal tabula rasa which is represented by neuroidal nets. The computational limits will be studied w.r.t. the various restrictions on the update abilities of neuroidal computational parameters. In Section 2 we will describe a broad class of neuroidal networks as introduced by Valiant in [7]. Next, in Section 3, three restricted classes of neuroidal nets will be introduced. They will include nets with a finite, infinite countable (i.e., integer), and uncountable (i.e., real) universe of weights, respectively. Section 4 will briefly sketch the equivalence of the most restricted version of finite neuroidal nets, namely those with a finite set of parameters, with the finite automata.
In Section 5 we further show the computational equivalence of the latter neuroidal nets with the standard neural nets. The next variant of neuroidal nets, viz. those with integer weights, will be considered in Section 6. We will prove that finite neuroidal nets with weights of size S(n) which allow a simple arithmetic over their weights (i.e., adding or subtracting of the weights) are computationally equivalent to computations of any S(n) space bounded Turing machine. In Section 7 we will increase the computational power of the previously considered model of neuroidal nets by allowing their weights to be real numbers. The resulting model will turn out to be computationally equivalent to the so-called additive BSS machine ([1]). This machine model is known for its ability to solve some undecidable problems. Finally, in the conclusions we will discuss the merits of the results presented.
2 Neuroidal Nets
In what follows we will define neuroidal nets making use of the original Valiant's proposal [7], essentially including his notation.

Definition 2.1 A neuroidal net N is a quintuple N = (G, W, X, δ, λ), where

G = (V, E) is the directed graph describing the topology of the network; V is a finite set of N nodes called neuroids, labeled by distinct integers 1, 2, ..., N, and E is a set of M directed edges between the nodes. The edge (i, j) for i, j ∈ {1, ..., N} is an edge directed from node i to node j.

W is the set of numbers called weights. To each edge (i, j) ∈ E there is a value w_ij ∈ W assigned at each instant of time.

X is the finite set of the modes of neuroids which a neuroid can be in at each instant. Each mode is specified as a pair (q, p) of values, where q is a member of a finite set Q of states, and p is an integer from a finite set T called the set of thresholds of the neuroid. Q consists of two kinds of states called firing and quiescent states. To each node i there is also a Boolean variable f_i having value one or zero depending on whether the node is in a firing state or not.

δ is the recursive mode update function of form δ : X × W → X. Let w_i be the sum of those weights w_ki of neuroid i that are on edges (k, i) coming from neuroids which are currently firing, i.e., formally w_i = Σ_{(k,i)∈E, k firing} w_ki = Σ_{(j,i)∈E} f_j w_ji. The value of w_i is called the excitation of i at that time. The mode update function defines for each combination (s_i, w_i) holding at time t the mode s'_i ∈ X that neuroid i will transit to at time t+1: δ(s_i, w_i) = s'_i.

λ is the recursive weight update function of form λ : X × W × W × {0, 1} → W. It defines for each weight w_ji at time t the weight w'_ji to which it will transit at time t+1, where the new weight can depend on the values of each s_i, w_i, w_ji, and f_j at time t: λ(s_i, w_i, w_ji, f_j) = w'_ji.
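As an illustration only (not Valiant's formalism verbatim; the dictionary-based representation and all names are my own), one computational step of Definition 2.1, including the stimulus forced by peripherals, can be sketched as:

```python
def step(modes, weights, delta, lam, firing_states, stimulus):
    """One computational step of a neuroidal net (cf. Definition 2.1).

    modes:    {i: (q, p)}  current mode (state, threshold) of each neuroid
    weights:  {(j, i): w}  weight assigned to each directed edge
    delta:    (mode, excitation) -> new mode
    lam:      (mode, excitation, w_ji, f_j) -> new weight
    firing_states: set of states q that count as firing
    stimulus: {i: q} states forced by peripherals; absent key = "don't care"
    """
    # peripherals first force states; unaffected neuroids keep their state
    modes = {i: (stimulus.get(i, q), p) for i, (q, p) in modes.items()}
    f = {i: int(q in firing_states) for i, (q, _) in modes.items()}
    # excitation w_i: sum of weights on edges from currently firing neuroids
    exc = {i: sum(w for (j, k), w in weights.items() if k == i and f[j])
           for i in modes}
    # mode and weight updates are applied for all neuroids in parallel
    new_modes = {i: delta(modes[i], exc[i]) for i in modes}
    new_weights = {(j, i): lam(modes[i], exc[i], w, f[j])
                   for (j, i), w in weights.items()}
    action = {i: new_modes[i][0] for i in modes}   # the N-tuple of states
    return new_modes, new_weights, action
```

A standard threshold neuron is the special case where delta fires exactly when the excitation reaches the threshold and lam returns w_ji unchanged.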
The elements of the sets Q, T, W, and the f_i's are called parameters of net N. A configuration of N at time t is a list of modes of all neuroids followed by a list of weights of all edges in N at that time. The respective lists of parameters are pertinent to neuroids ordered by their labels and to edges ordered lexicographically w.r.t. the pair of labels (of neuroids) that identify the edge at hand. Thus at any time a configuration is an element from X^N × W^M. The computation of a neuroidal network is determined by the initial conditions and the input sequence. The initial conditions specify the initial values of weights and modes of the neuroids. These are represented by the initial configuration. The input sequence is an infinite sequence of inputs or stimuli which specifies for each t = 0, 1, 2, ... a set of neuroids along with the states into which these neuroids are forced to enter (and hence forced to fire or prevented from firing) at that time by mechanisms outside the net (by peripherals). Formally, each stimulus is an N-tuple from the set (Q ∪ {∗})^N. If there is a symbol q ∈ Q at the i-th position in the t-th N-tuple s_t then this denotes the fact that the neuroid i is forced to enter state q at time t. The special symbol ∗ is used as a don't care symbol at positions which are not influenced by peripherals at that time. A computational step of a neuroidal net N which finds itself in a configuration c_t and receives its input s_t at time t is performed as follows. First, neuroids are forced to enter into states as dictated by the current stimuli. Neuroids not influenced by peripherals at that time retain their original state as in configuration c_t. In this way a new configuration c'_t is obtained. The excitation w_i is now computed for this configuration, and the mode and weight updates are realized for each neuroid in parallel, in accordance with the respective functions δ and λ. In this way a new configuration c_{t+1} is entered.
The result of the computation after the t-th step is the N-tuple of states of all neuroids in c_{t+1}. This N-tuple is called the action at time t. Obviously, any action is an element of Q^N. Then the next computational step can begin. The output of the whole computation can be seen as an infinite sequence of actions. From the computational point of view any neuroidal net can be seen as a transducer which reads an infinite sequence of inputs (stimuli) and produces an infinite sequence of outputs (actions). For more details about the model see [7].
3 Variants of Neuroidal Nets
In the previous definition of neuroidal nets we allowed the set W to be any set of numbers and the weight and mode update functions to be arbitrary recursive functions. Intuitively it is clear that by restricting these conditions we will get variants of neural nets differing in their expressiveness as well as in their computing power. In his monograph Valiant [7] discusses this problem and suggests two extreme possibilities.
The first one considers such neuroidal nets where the set of weights of individual neuroids is finite. This is called a simple complexity-theoretic model in Valiant's terminology. We will also call the respective model of a neuroid a finite weight neuroid. Note that in this case the functions δ and λ can both be described by finite tables. The next possibility we will study are neuroidal nets where the universe of allowable weights and thresholds is represented by the infinite set of all integers. In this case it is no longer possible to describe the weight update function by a finite table. What we rather need is a simple recursive function that will allow efficient weight modifications. Therefore we will consider a weight update function which allows setting a weight to some constant value, adding or subtracting weights, and assigning existing weights to other input edges. Such a weight update function will be called a simple arithmetic update function. The respective neuroid will be called an integer weight neuroid. The size of each weight will be given by the number of bits needed to specify the respective weight value. This is essentially the model that is considered in [7] as the counterpart of the previous model. The final variant of neuroidal nets which we will investigate is the variant of the previously mentioned model with real weights. The resulting model will be called an additive real neuroidal net.
4 Finite Weight Neuroidal Nets and Finite Automata
It is obvious that in the case of neuroidal nets with finite weights there is but a finite number of different configurations a single neuroid can enter. Hence its computational activities, like those of any finite neuroidal net, can be described by a single finite automaton (or more precisely: by a finite transducer). In order to get some insight into the relation between the sizes of the respective devices we will describe the construction of the respective transducer in more detail in the next theorem. In fact this transducer will be a Moore machine (i.e., the type of a finite automaton producing an output after each transition), since there is an output (action) produced by N after each computational move.

Theorem 4.1 Let N be a finite neuroidal net consisting of N neuroids with a finite set of weights. Then there is a constant c > 0 and a finite Moore automaton A of size O(c^N) that simulates N.

Sketch of the proof: We will describe the construction of the Moore automaton A = (I, S, q_0, O, φ). Here I denotes the input alphabet whose elements are N-tuples of the stimuli: I = (Q ∪ {∗})^N. The set S is a set of states consisting of all configurations of N, i.e., S = X^N × W^M. The state q_0 is the initial state and it is equal to the initial configuration of N. The set O denotes the set of outputs of A. It will consist of all possible actions of N: O = Q^N. The transition function φ : I × S → S × O is defined as follows: φ(i, s_1) = (s_2, o) if and only if the neuroidal net N in configuration s_1 and with input i will enter configuration s_2 and produce output o in one computational move. It is clear that the input-output behaviour of both N and A is equivalent. □
Note that the size of the automaton is exponential w.r.t. the size of the neuroidal net. In some cases such a size explosion seems to be unavoidable. For instance, a neuroidal net consisting of N neuroids can implement a binary counter that can count up to c^N, where c ≥ 2 is a constant which depends on the number of states of the respective neuroids. The equivalent finite automaton would then require at least Ω(c^N) states. Thus the advantage of using neuroidal nets instead of finite automata seems to lie in the description economy of the former devices. The reverse simulation of a finite automaton by a finite neuroidal net is trivial. In fact, a single neuroid, with a single input, is enough. During the simulation, this neuroid transits to the same states as the simulated automaton would. There is no need for a neuroid to make use of its threshold mechanism.
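The counting argument can be made concrete with a toy sketch (mine, not Valiant's construction): n two-state cells with a rippling carry realize a binary counter, i.e., a device with 2^n reachable configurations built from n small components, while an equivalent flat automaton must dedicate a state to each count:

```python
def make_counter_net(n):
    """n two-state cells acting as a ripple binary counter: on each tick,
    cell i flips iff all lower cells were 1 (the carry), mimicking how a
    small net of finite-state elements counts up to 2**n."""
    bits = [0] * n                 # one bit of local state per cell
    def tick():
        carry = 1                  # an increment signal enters cell 0
        for i in range(n):
            bits[i], carry = bits[i] ^ carry, bits[i] & carry
    def value():
        return sum(b << i for i, b in enumerate(bits))
    return tick, value

tick, value = make_counter_net(4)
for _ in range(11):
    tick()
# the 4 cells now jointly encode the count 11; a flat automaton covering the
# same behaviour needs all 2**4 = 16 states
```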
5 Simulating Neuroidal Nets by Neural Nets
Neural nets are a restricted kind of neuroidal networks in which the neuroids can modify neither their weights nor their thresholds. The respective set of neuroidal states consists of only two states: a firing and a quiescent state. Moreover, the neurons are forced to fire if and only if the excitation reaches the threshold value. The computational behaviour of neural networks is defined similarly to that of the neuroidal ones. It has been observed by several authors that neural nets are also computationally equivalent to the finite automata (cf. [5]). Thus, we get the following consequence of the previous theorem:

Corollary 5.1 The computational power of neuroidal nets with a finite set of weights is equivalent to that of standard non-programmable neural nets.

In order to better appreciate the relationship between the sizes of the respective neuroidal and neural nets, we will investigate the direct simulation of finite neuroidal nets with finite weights by finite neural nets.

Theorem 5.1 Let N = (G, W, X, δ, λ) be a finite neuroidal net consisting of N neuroids and M edges. Let the set of weights of N be finite. Let L_λ and D_δ, respectively, be the numbers of all different sets of arguments of the corresponding weight and mode update functions. Let S be the set of all possible excitation values, |S| ≤ 2^|W|. Then N can be simulated by a neural network N' consisting of O((|X| + |S| + L_λ + D_δ)N + |W|M) neurons.

Proof: It is enough to show that for any neuroid i of N an equivalent neural network C_i can be constructed. At any time the neuroid i is described by its instantaneous description, viz. its mode and the corresponding set of weights. The idea of the simulation is to construct a neural net for all combinations of parameters that represent a possible instantaneous description of i. The instantaneous value of each parameter will be represented by a special module. There will also be two extra modules to realize the mode and weight update functions. Instead of changing the parameters, the simulating neural
The Computational Limits to the Cognitive Power
Fig. 1. Data flow in module Ci simulating a single neuroid
net will merely switch among the appropriate values representing the parameters of the instantaneous description of the simulated neuroid. The details are as follows. Ci will consist of five different modules. First, there are three modules called mode module, excitation module, and weight module. The purpose of each of these modules is to represent a set of possible values of the respective quantity that it represents: in order of their above enumeration, a possible mode of a neuroid, a possible value of the total excitation coming from the firings of adjacent neurons, and possible weights for all incoming edges. Thus the mode module M consists of |X| neurons. For each pair of form (q, p) ∈ X with q ∈ Q and p ∈ T there is a corresponding neuron in M. Moreover, the neuron corresponding to the current mode of neuroid i is firing, while all the remaining neurons in M are in a quiescent state. The weight module W consists of a two-dimensional array of neurons. To each incoming edge to i there is a row of |W| neurons. Each row contains neurons corresponding to each possible value from the set W. If (j, i) ∈ E is an incoming edge to i carrying the weight wji ∈ W, then the corresponding neuron in the corresponding row of W is firing. The excitation module E consists of O(|S|) neurons. Among them, at each time only one neuron is firing, namely the one that corresponds to the current excitation wi of i. Let wi = Σ{ wji : (j, i) ∈ E and neuroid j is firing } at that time. In order to compute wi we have to add only those weights that occur at the connections from currently firing neuroids. Therefore we shall first check all pairs of form (fj, wji) to see which weight value wji should participate in the computation of the total excitation. This
Jiří Wiedermann
will be done by dedicating special neurons tjk to this task, with k ranging over all weights in W. Each neuron tjk will receive 2 inputs. The first one from a neuron from the j-th row and k-th column in the weight module, which corresponds to some weight w ∈ W. This connection will carry weight w. The other connection will come from Cj and will carry the weight 1. Neuron tjk will fire iff neuroid j is firing and the current weight of connection j is equal to w. In other words, tjk will fire iff its excitation equals exactly w + 1. This calls for implementing an equality test which requires the presence of some additional neurons, but we will skip the respective details. The outcomes from all tjk are then again summed and tested for equality against all possible excitation values. In this way the current value of wi is determined eventually and the respective neurons serve as output neurons of the excitation module. Besides these three modules there are two more modules that represent, and realize, the transition functions δ and λ, respectively. The δ module contains one neuron for each set of arguments of the mode update function δ. The neuron d responsible for the realization of a transition of form δ(s, w) = s′ has its threshold equal to 2. Its incoming edges from each output neuron in M and from each output neuron from E carry the weight equal to 1. Clearly, d fires iff the neurons corresponding to both quantities s and w fire. Firing of d will subsequently inhibit the firing of the neuron corresponding to s and excite the firing of the neuron corresponding to s′. Moreover, if the state corresponding to s′ is a firing state of i, then also a special neuron out in Ci is made to fire. The λ module is constructed in a similar way. It also contains one neuron per each set of arguments of the weight update function λ. The neuron responsible for the realization of a transition of form λ(s, w, wj, fj) = wj′ has the threshold 4.
Its incoming edges of weight 1 connect to it each output neuron in M, each output neuron in E, each neuron from the row of W corresponding to the j-th incoming edge of i, and the output from Cj. Clearly, it fires iff the neurons corresponding to all four quantities s, w, wj, and fj fire. Its firing will subsequently inhibit the firing of the neuron corresponding to wj in W and excite the firing of the neuron corresponding to wj′, also in W. Schematically, the topology of network Ci is sketched in Fig. 1. For simplicity reasons only the flow of data is depicted by arrows. The size of Ci is given by the sum of the sizes of all its modules. The whole net N′ thus contains N mode, excitation, δ, and λ modules, of size |X|, |S|, |D|, and |L|, respectively. Moreover, for each of the M edges of N there is a complete row of |W| neurons. This altogether leads to the size estimation as stated in the statement of the theorem. □
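The mechanics of the construction can be caricatured in a few lines of Python (our own illustrative sketch, not the paper's construction: the mode names, the weight universe, and the δ table below are all invented). Every parameter of the instantaneous description is held as a one-hot choice over its finitely many values, and the update function becomes a table lookup:

```python
MODES = ["q_fire", "q_quiet"]   # illustrative mode set X
WEIGHTS = [-1, 0, 1]            # illustrative finite weight universe W

def excitation(firing, weights):
    """Excitation module: sum the weights on edges from firing neighbours."""
    return sum(w for f, w in zip(firing, weights) if f)

def delta(mode, w_i):
    """Mode update as a finite lookup (one delta-module neuron per entry);
    this particular table is a made-up example."""
    return "q_fire" if w_i >= 1 else mode

def one_hot(value, universe):
    """One indicator neuron per possible value; exactly one of them fires."""
    return [int(value == u) for u in universe]

firing, weights = [1, 0], [1, -1]      # weights drawn from WEIGHTS
w_i = excitation(firing, weights)      # total excitation of the neuroid
new_mode = delta("q_quiet", w_i)       # lookup instead of reprogramming
print(one_hot(new_mode, MODES))        # the net switches which indicator fires
```

The point mirrored here is that the simulating net never changes any weight; it only switches which indicator neuron in each module is currently firing.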
From the previous theorem we can see that the size of a simulating neural network is larger than that of the original neuroidal network. It is linear in both the number of neuroids and the number of edges of the neuroidal network. The constant of proportionality depends linearly on the size of the program of individual neuroids, and exponentially on the size of the universe of weights. However, note that the neural net constructed in the latter theorem is much smaller than that obtained via the direct simulation of the finite automaton corresponding to the simulated neuroidal net. A neural net, simulating the automaton from the proof of Theorem 4.1, would be of size Ω(c^N) for some constant c > 0. To summarize the respective results, we see that when comparing finite neuroidal nets to standard, non-programmable neural nets, the programmability of
the former does not increase their computational power; it merely contributes to their greater descriptive expressiveness.
6
Integer Weight Neuroidal Nets and Turing Machines
Now we will show that in the case of integer weights there exist neuroidal nets of finite size that can simulate any Turing machine. Since we will be interested in space bounded machines, w.l.o.g. we will first consider a single tape Turing machine in place of the simulated machine. In order to extend our results also to sublinear space complexities we will later also consider single tape machines with separate input tapes. First we show that even a single neuroid is enough for the simulation of a single tape Turing machine.

Theorem 6.1 Any single tape Turing machine¹ of time complexity T(n) and of space complexity S(n) can be simulated in time O(T(n)S²(n)) by a single neuroid making use of integer weights of size O(S(n)) and of a simple arithmetic weight update function.

Sketch of the proof: Since we are dealing with space bounded Turing machines (TMs), w.l.o.g. we can consider only single tape machines. Thus in what follows we will describe the simulation of one computational step of a single tape Turing machine M of space complexity S(n) with tape alphabet {0, 1}. It is known (cf. [2]) that the tape of such a machine can be replaced by two stacks, SL and SR, respectively. The first stack holds the contents of M's tape to the left of the current head position while the second stack represents the rest of the tape. The left and the right end of the tape, respectively, find themselves at the bottoms of the respective stacks. Thus we assume that M's head always scans the top of the right stack. For technical reasons we will add an extra symbol 1 to the bottom of each stack. During its computation M updates merely the top, or pushes symbols to, or pops symbols from the top of these stacks. With the help of a neuroid n we will represent machine M in a configuration described by the contents of its two stacks and by the state of the machine's finite state control in the following way. The contents of both stacks will be represented by two integers vL and vR, respectively. Note that both vL, vR ≥ 1 thanks to the 1's at the bottoms of the respective stacks.
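The stack-as-integer representation can be sketched as follows (the function names are ours; Python's built-in integer operations stand in for the addition-only arithmetic the neuroid must actually use, which is developed below):

```python
# Each stack is an integer whose binary digits are the stack symbols,
# with a sentinel 1 at the bottom, so the value is always >= 1.

def top(v: int) -> int:
    """Read the top symbol of the stack: the parity of v."""
    return v % 2

def pop(v: int) -> int:
    """Remove the top symbol: integer halving."""
    return v // 2

def push(v: int, bit: int) -> int:
    """Push a new symbol on top: doubling plus the new digit."""
    return 2 * v + bit

v = 1                  # empty stack: just the bottom sentinel
v = push(v, 0)         # -> 2  (binary 10)
v = push(v, 1)         # -> 5  (binary 101)
print(top(v))          # top symbol is 1
v = pop(v)             # back to 2
print(top(v))          # top symbol is 0
```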
The instantaneous state of M is stored in the states of n. To simulate M we merely have to manipulate the above mentioned two stacks in a way that corresponds to the actions of the simulated machine. Thus, the net has to be able to read the top element of a stack, to delete it (to pop the stack), and to add (to push) a new element onto the top of the stack. W.r.t. our representation of a stack by an integer v, say, reading the top of a stack asks for determining the parity of v. Popping an element from or pushing it to a stack means computing v/2 and 2v, respectively. All this must be done with the help of additions and subtractions. The idea of the respective algorithm that computes the parity of any v > 0 is as follows. From v we will successively subtract the largest possible power of 2, not
Note that in the case of single tape machines the input size is counted into the space complexity and therefore we have S(n) ≥ n.
greater than v, until the value of v drops to either 0 or 1. Clearly, in the former case the original value of v was even while in the latter case it was odd. More formally, the resulting algorithm looks as follows:

while v > 1 do
  p1 := 1; p2 := 2;
  while p2 ≤ v do p1 := p1 + p1; p2 := p2 + p2 od;
  v := v − p1
od;
if v = 1 then return(odd) else return(even) fi;

By a similar algorithm we can also compute the value of v/2 (i.e., the value of v shifted by one position to the right, thus losing its rightmost digit). This value is equal to the sum of the halves of the respective powers (as long as they are greater than 1) computed in the course of the previous algorithm. The time complexity of both algorithms is O(S²(n)). The neuroidal implementation of the previous algorithms looks as follows. The algorithms will have to make use of the values representing both stacks, vL and vR, respectively. Furthermore, they will need access to the auxiliary values p1, p2, v and to the constant 1. All the previously mentioned values will be stored as weights of n. For technical reasons imposed by the functionality restrictions of neuroids, which will become clearer later, we will need to store also the inverse values of all previously mentioned variables. These will also be stored in the weights of n. Henceforth, the neuroid n will have 12 inputs. These inputs are connected to n via 12 connections. Making use of the previously introduced notation, the first six will hold the weights w1 = vL, w2 = vR, w3 = v, w4 = p1, w5 = p2, and w6 = 1. The remaining six will carry the same but inverse values. The output of n is connected to all 12 inputs. The neuroid simulates each move of M in a series of steps. Each series performs one run of the previously mentioned (or of a similar) algorithm and therefore consists of O(S²(n)) steps.
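The parity algorithm above, together with the companion halving computation, can be transcribed directly into Python (the transcription and the bookkeeping variable `half` are ours; only additions, subtractions and comparisons are used, as the neuroid requires):

```python
def parity_and_half(v: int):
    """Return (parity of v, v // 2) using only additions, subtractions
    and comparisons, following the algorithm sketched in the proof."""
    assert v >= 1
    half = 0
    while v > 1:
        p1, p2 = 1, 2
        half_p = 0
        while p2 <= v:                  # find the largest power of 2 <= v
            half_p = p1                 # p1 before doubling = (new p1) / 2
            p1, p2 = p1 + p1, p2 + p2   # additions only
        v = v - p1                      # subtract that power of 2
        half = half + half_p            # accumulate the halved powers
    return ("odd" if v == 1 else "even"), half

print(parity_and_half(6))   # -> ('even', 3)
print(parity_and_half(7))   # -> ('odd', 3)
```

The second component accumulates the halves of the subtracted powers, which is exactly the right-shifted value v/2 described in the text.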
At the beginning of the series that will simulate the (t + 1)-st move of M the following invariant is preserved by n for any t > 0: weights w1 and w2 represent the contents of the stacks after the t-th move, and w7 = −w1 and w8 = −w2. The remaining weights are set to zero. At the beginning of the computation, the left stack is empty and the right stack contains the input word of M. We will assume that n will accept its input by entering a designated state. Also, the threshold of n will be set to 0 all the time. Assume that at time t the finite control of M is in state q. Until its change, this state is stored in all forthcoming neuroidal states that n will enter. In order to read the symbol from the top of the right stack the neuroid has to determine the last binary digit of w2 or, in other words, it has to determine the parity of w2. To do so, we first perform all the necessary initialization assignments to the auxiliary variables, and to their counterparts holding the negative values. In order to perform the necessary tests (comparisons), the neuroid must enter a firing state. Due to the neuroidal computational mechanism, and thanks to the connection among the output of n and all its inputs, all its non-zero weights will participate in the subsequent comparison of the total excitation against n's threshold. It is here that we will make proper use of weights with the opposite sign: the weights (i.e., variables) that should not be compared, and should not be forgotten, will participate in the comparison with opposite signs.
For instance, to perform the comparison p2 ≤ v we merely exclude the positive value of p2 and the negative value of v from the comparison by temporarily setting the respective weights w5 and w9 to zero. All the other weight values remain as they were. As a result, after the firing step, n will compare v − p2 against its threshold value (which is permanently set to 0) and will enter a state corresponding to the result of this comparison. After the comparison, the weights set temporarily to zero can be restored to their previous values (by the assignments w5 := −w11 and w9 := −w3). The transition of M into a state as dictated by its transition function is realized by n, after updating the stacks appropriately, by storing the respective machine state into the state of n. The simulation ends by entering the final state. It is clear that the simulation runs in time as stated in the theorem. The size of any stack, and hence of any variable, never exceeds the value S(n) + 1. Hence the size of the weights of n will be bounded by the same value. □
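A toy rendering of the opposite-sign trick (our own; the dictionary keys are illustrative labels, not the paper's notation): each variable is stored twice with opposite signs, so summing all weights yields 0, and temporarily zeroing the +p2 and −v copies makes the total excitation equal v − p2, which the threshold 0 then decides.

```python
def compare_p2_le_v(weights: dict) -> bool:
    """Decide p2 <= v by a single threshold test, the way the proof does."""
    w = dict(weights)                 # work on a copy of the weight vector
    saved = w["+p2"], w["-v"]
    w["+p2"] = w["-v"] = 0            # exclude these copies from the test
    excitation = sum(w.values())      # = v - p2; all other pairs cancel
    w["+p2"], w["-v"] = saved         # restore from the opposite-sign copies
    return excitation >= 0            # threshold is permanently 0

v, p2 = 9, 8
weights = {"+v": v, "-v": -v, "+p2": p2, "-p2": -p2}
print(compare_p2_le_v(weights))       # -> True, since v - p2 = 1 >= 0
```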
Note that a similar construction, still using only one neuroid, would also work in case a multiple tape Turing machine should be simulated. In order to simulate a k-tape machine, the resulting neuroid will represent each tape by 12 weights as it did before. This will lead to a neuroid with 12k incoming edges. Next we will also show that a simulation of an off-line Turing machine by a finite neuroidal network with unbounded weights is possible. This will enable us to prove a similar theorem as before which holds for arbitrary space complexities:

Theorem 6.2 Let M be an off-line multiple tape Turing machine of space complexity S(n) > 0. Then M can be simulated in cubic time by a finite neuroidal net that makes use of integer weights of size O(S(n)) and of a simple arithmetic weight update function.

Sketch of the proof: In order to read the respective inputs the neuroidal net will be equipped with the same input tape as the simulated Turing machine. Except for the neuroid n that takes care of a proper update of the stacks that represent the respective machine tapes, the simulating net will contain also two extra neuroids that implement the control mechanism of the input head movement. For each move direction (left or right) there will be a special so-called move neuroid which will fire if and only if the input head has to move in the respective direction. The symbol read by the input head will represent an additional input to neuroid n simulating the moves of M. The information about the move direction will be inferred by neuroid n. As can be seen from the description of the simulation in the previous theorem, n keeps track of that particular transition of M that should be realized during each series of its steps simulating one move of M. Since n can transmit this information to the respective move neuroids only via firing, and cannot distinguish between the two target neuroids, we will have to implement a (finite) counter in each move neuroid. The counter will count the number of firings of n occurring in an uninterrupted sequence.
Thus at the end of the last step of M's move simulation (see the proof of the previous theorem) n will send two successive firings to denote the left move and three firings for the right move. The respective signals will reach both move neuroids, but with the help of counting they will find which of them is in charge of moving the head. Some care over the synchronization of all three neuroids must be taken. □
It is clear that the computations of finite neuroidal nets with integer weights can be simulated by Turing machines. Therefore the computational power of both devices is the same. In 1995, Siegelmann and Sontag [4] proved that the computational power of certain analog neural nets is equivalent to that of Turing machines. They considered finite neural nets with fixed rational weights. At time t, the output of their analog neuron is a value between 0 and 1 which is determined by applying a so-called piecewise linear activation function σ to the excitation w at that time (see Definition 2.1): σ(w) ∈ [0, 1]. For negative excitation σ takes the value 0, for excitation greater than 1 the value 1, and for excitations between 0 and 1, σ(w) = w. The respective net computes synchronously, in discrete time steps. We will call the respective nets synchronous analog neural nets. Siegelmann and Sontag's analog neural network simulating a universal Turing machine consisted of 883 neurons. This can be compared with the simple construction from Theorem 6.1 requiring but a single neuroid. Nevertheless, the equivalence of both types of networks with Turing machines proves the following corollary:

Corollary 6.1 Finite synchronous analog neural nets are computationally equivalent to finite neuroidal nets with integer weights.
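The activation function σ described above is simple to state explicitly (this is the standard saturated-linear function; the code itself is our sketch):

```python
def sigma(w: float) -> float:
    """Piecewise linear (saturated-linear) activation: clip w into [0, 1]."""
    if w < 0.0:
        return 0.0
    if w > 1.0:
        return 1.0
    return w          # identity on [0, 1]

print(sigma(-0.5), sigma(0.25), sigma(2.0))   # -> 0.0 0.25 1.0
```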
7
Real Weight Neuroidal Nets and the Additive BSS Model
Now we will characterize the computational power of neuroidal nets with real parameters. We will compare their efficiency towards a restricted variant of the BSS model. The BSS model (cf. [1]) is a model similar to a RAM which computes with real numbers under the unit cost model. In doing so, all four basic arithmetic operations of addition, subtraction, multiplication and division are allowed. The additive BSS model allows only for the former two arithmetic operations.

Theorem 7.1 The additive real model of neuroidal nets is computationally equivalent to the additive BSS model working over binary inputs.

Sketch of the proof: The simulation of a finite additive real model of neuroidal net N on the additive BSS model B is a straightforward matter. For the reverse simulation, assume that the binary input is provided to N by a mechanism similar to that from Theorem 6.2. One must first refer to the theorem (Theorem 1 in Chapter 21 in [1]) that shows that by a suitable encoding a computation of any additive machine can be done using a fixed finite amount of memory (in a finite number of registers, each holding a real number) without exponential increase in the running time. The resulting machine F is then simulated by N in the following way: The contents of the finitely many registers of F are represented as (real) weights of a single neuroid r. Addition or subtraction of weights, as necessary, is done directly by the
weight update function. A comparison of weights is done with the help of r's threshold mechanism in a similar way to that in the proof of Theorem 6.1. To single out from the comparison the weights that should not be compared one can use a similar trick as in Theorem 6.1: to each such weight a weight with the opposite sign is maintained.
The power of finite additive neuroidal nets with real weights comes from their ability to simulate oracular or nonuniform computations. For instance, in [1] it is shown that the additive real BSS machines decide all binary sets in exponential time. Their polynomial time coincides with the nonuniform complexity class P/poly.
8
Conclusions
The paper brings a relatively surprising result showing computational equivalence between certain kinds of discrete programmable and analog finite neural nets. This result offers new insights into the nature of computations of neural nets. First, it points to the fact that the ability to change weights is not a condition sine qua non for learning. A similar effect can be achieved by making use of reasonably restricted kinds of analog neural nets. Second, the result showing computational equivalence of the respective nets supports the idea that all reasonable computational models of the brain are equivalent (cf. [9]). Third, for the modeling of cognitive or learning phenomena, the neuroidal nets seem to be preferred over the analog ones, due to the transparency of their computational and learning mechanisms. As far as their appropriateness for the task at hand is concerned, neuroidal nets with a finite set of weights seem to present the maximal functionality that can be achieved by living organisms. As mathematical models also more powerful variants are of interest.
References
1. Blum, L., Cucker, F., Shub, M., Smale, S.: Complexity and Real Computation. Springer, New York, 1997, 453 p.
2. Hopcroft, J.E., Ullman, J.D.: Formal Languages and their Relation to Automata. Addison-Wesley, Reading, Mass., 1969
3. Indyk, P.: Optimal Simulation of Automata by Neural Nets. Proc. of the 12th Annual Symp. on Theoretical Aspects of Computer Science STACS'95, LNCS Vol. 900, pp. 337-348, 1995
4. Siegelmann, H.T., Sontag, E.D.: On Computational Power of Neural Networks. J. Comput. Syst. Sci., Vol. 50, No. 1, 1995, pp. 132-150
5. Šíma, J., Wiedermann, J.: Theory of Neuromata. Journal of the ACM, Vol. 45, No. 1, 1998, pp. 155-178
6. Valiant, L.: Functionality in Neural Nets. Proc. of the 7th Nat. Conf. on Art. Intelligence, AAAI, Morgan Kaufmann, San Mateo, CA, 1988, pp. 629-634
7. Valiant, L.G.: Circuits of the Mind. Oxford University Press, New York, Oxford, 1994, 237 p., ISBN 0-19-508936-X
8. Valiant, L.G.: Cognitive Computation (Extended Abstract). In: Proc. of the 38th IEEE Symp. on Found. of Comp. Sci., IEEE Press, 1995, pp. 2-3
9. Wiedermann, J.: Simulated Cognition: A Gauntlet Thrown to Computer Science. To appear in ACM Computing Surveys, 1999
The Consistency Dimension and Distribution-Dependent Learning from Queries (Extended Abstract)*

José L. Balcázar¹, Jorge Castro¹, David Guijarro¹, and Hans-Ulrich Simon²

¹ Dept. LSI, Universitat Politècnica de Catalunya, Campus Nord, 08034 Barcelona, Spain, {balqui,castro,david}@lsi.upc.es
² Fakultät für Informatik, Lehrstuhl Mathematik und Informatik, Ruhr-Universität Bochum, D-44780 Bochum, simon@lmi.ruhr-uni-bochum.de
Abstract. We prove a new combinatorial characterization of polynomial learnability from equivalence queries, and state some of its consequences relating the learnability of a class with the learnability via equivalence and membership queries of its subclasses obtained by restricting the instance space. Then we propose and study two models of query learning in which there is a probability distribution on the instance space, both as an application of the tools developed from the combinatorial characterization and as models of independent interest.
1
Introduction
The main models of learning via queries were introduced by Angluin [2, 3]. In these models, the learning algorithm obtains information about the target concept by asking queries of a teacher or expert. The algorithm has to output an exact representation of the target concept in polynomial time. Target concepts are formalized as languages over an alphabet. Frequently, it is assumed that the teacher can answer correctly two kinds of questions from the learner: membership queries and equivalence queries¹. Unless otherwise specified, all our discussions are in the proper learning framework where the hypotheses come from the same class as the target concept. A combinatorial notion, called approximate fingerprints, turned out to characterize precisely those concept classes that can be learned from polynomially many equivalence queries of polynomial size [4, 6]. The essential intuition behind that fact is that the existence of queries that shrink the number of possibilities for the target concept by a polynomial factor is not only clearly sufficient, but also necessary to learn: if no such queries are available then adversaries can be designed that force any learner to spend too
* Work supported in part by the EC through the Esprit Program EU BRA program under project 20244 (ALCOM-IT) and the EC Working Group EP27150 (NeuroColt II) and by the Spanish DGES PB95-0787 (Koala).
¹ Such a teacher is sometimes called "minimally adequate".
O. Watanabe, T. Yokomori (Eds.): ALT'99, LNAI 1720, pp. 77-92, 1999. © Springer-Verlag Berlin Heidelberg 1999
José L. Balcázar et al.
many queries in order to identify the target. This intuition can be fully formalized along the lines of the cited works; the formalization can be found in [7]. Hellerstein et al. gave a beautiful characterization of polynomially (EQ,MQ)-learnable representation classes [8]. They introduced the notion of polynomial certificates for a representation class R and proved that R is polynomially learnable from equivalence and membership queries iff it has polynomial certificates. The first main contribution of this paper is to propose a new combinatorial characterization of learnability from equivalence queries, surprisingly close to certificates, and quite different from (and also simpler to handle than) the approximate fingerprints: the strong consistency dimension, which one can see as the analog of the VC dimension for query models. Angluin [2, 3] showed that, when only approximate identification is required, equivalence queries can be substituted by a random sample. Thus, a PAC learning algorithm can be obtained from an exact learning algorithm that makes equivalence queries. In PAC learning, introduced by Valiant [11], one has to learn a target concept with high probability, in polynomial time (and, a fortiori, from a polynomial number of examples), within a certain error, under all probability distributions on the examples. Because of this last requirement, to learn under all distributions, PAC learning is also called distribution-free, or distribution-independent, learning. Distribution-independent learning is a strong requirement, but it can be relaxed to define PAC learning under specific distributions, or families of distributions. Indeed, several concept classes that are not known to be polynomially learnable, or known not to be polynomially learnable if RP ≠ NP, turn out to be polynomially learnable under some fixed distribution or families of distributions. In comparison to PAC learning, one drawback of the query models is that they do not have this added flexibility of relaxing the distribution-free condition.
The standard transformation sets them automatically at the distribution-free level. The second main contribution of this paper is the proposal of two learning models in which counterexamples are not adaptively provided by a (helpful or treacherous) teacher, but instead are nonadaptively sampled according to a probability distribution. We prove that the distribution-free form of one of these models exactly coincides with standard learning from equivalence queries, while the other model is captured by the randomized version of the standard model. This allows us to extend, in a natural way, the query learning model to an explicit distribution-free setting where this restrictive condition can be naturally relaxed. Some of the facts that we prove of these new models make use of the consistency dimension characterization proved earlier as the first contribution of the paper. Our notation and terminology is standard. We assume familiarity with the query-learning model. Most definitions will be given in the same section where they are needed. Generally, let X be a set, called instance space or domain in the sequel. A concept C is a subset of X, where we prefer sometimes to regard C as a function from X to {0, 1}. A concept class is a set C ⊆ 2^X of concepts. An element of X is called an instance. A pair (x, b), where b ∈ {0, 1} is a binary
label, is called an example for concept C if C(x) = b. A sample is a collection of labeled instances. Concept C is said to be consistent with sample S if C(x) = b for all (x, b) ∈ S. A representation class is a four-tuple R = (R, Γ, Σ, μ). Γ and Σ are finite alphabets. Strings of characters in Σ are used to describe elements of the domain X, and strings of characters in Γ are used to encode concepts. R ⊆ Γ* is the set of strings that are concept encodings or representations. Let μ : R → 2^{Σ*} be a function that maps these representations into concepts over Σ. For ease of technical exposition, we assume that, for each r ∈ R, there exists some n ≥ 1 such that μ(r) ⊆ Σⁿ. Thus each concept with a representation in R has a domain of the form Σⁿ (as opposed to domain Σ*).² The set C = {μ(r) : r ∈ R} is the concept class associated with R. We define the size of a concept C : Σⁿ → {0, 1} w.r.t. representation class R as the length of the shortest string r ∈ R such that C = μ(r), or as ∞ if C is not representable within R. This quantity is denoted by ‖C‖_R. With these definitions, C is a "doubly parameterized class", that is, it is partitioned into sets C_{n,m} containing all concepts from C with domain Σⁿ and size at most m. The kind of query-learning considered in this paper is proper in the sense that concepts and hypotheses are picked from the same class C. We will however allow that the size of an hypothesis exceeds the size of the target concept. The number of queries needed in the worst case to obtain an affirmative answer from the teacher, or "learning complexity", given that the target concept belongs to C_{n,m} and that the hypotheses of the learner may be picked from C_{n,M}, is denoted by LC_R^O(n, m, M), where O specifies the allowed query types. In this paper, either O = EQ or O = (EQ, MQ). We speak of polynomial O-learnability if LC_R^O(n, m, M) is polynomially bounded in n, m, M. We close this section with the definition of a version space. At any intermediate stage of a query-learning process, the learner knows (from the teacher's answers received so far) a sample S for the target concept.
The current version space V is he se of all concep s from Cn m which are consis en wi h S. These are all concep s being s ill conceivable as arge concep s.
2
The Strong Consistency Dimension and Its Applications
The proof, as it was given in [8], of the characterization of (EQ,MQ)-learning in terms of polynomial certificates implicitly contains concrete lower and upper bounds on the number of queries needed to learn R. In Subsection 2.1, we make these bounds more explicit by introducing the so-called consistency dimension of R and writing the bounds in terms of this dimension (and some other parameters associated with R). In Subsection 2.2, we define the notions of a strong
² This is a purely technical restriction that allows us to present the main ideas in the most convincing way. It is easy to generalize the results in this paper to the case of domains with strings of varying length.
certificate and of the strong consistency dimension and show that they serve the same purpose for EQ-learning as the former notions did for (EQ,MQ)-learning: we derive lower and upper bounds on the number of EQs needed to learn R in terms of the strong consistency dimension and conclude that R is polynomially EQ-learnable iff it has a polynomial strong certificate. In Subsection 2.3, we prove that the strong consistency dimension of a class equals the maximum of the consistency dimensions taken over all subclasses (induced by a restriction of the domain). This implies that the number of EQs needed to learn a concept class roughly equals the total number of EQs and MQs needed to learn the hardest subclass. For ease of technical exposition, we need the following definitions. A partially defined concept C on domain Σⁿ is a function from Σⁿ to {0, 1, ∗}, where ∗ stands for "undefined". Since partially defined concepts and samples can be identified in the obvious manner, we use the terms "partially defined concept" and "sample" interchangeably in the sequel. The support of C is defined as supp(C) = {x ∈ Σⁿ : C(x) ∈ {0, 1}}. The breadth of C is defined as the cardinality of its support and denoted as |C|. The size of C is defined as the smallest size of a concept that is consistent with C. It is denoted as ‖C‖_R. Note that this definition coincides with the previous definition of size when C has full support Σⁿ. Sample Q is called a subsample of sample C (denoted as Q ⊑ C) if supp(Q) ⊆ supp(C) and Q, C coincide on supp(Q). Throughout this section, R = (R, Γ, Σ, μ) denotes a representation class defining a doubly parameterized concept class C.

2.1
Certificates and Consistency Dimension
R has polynomial certificates if there exist two-variable polynomials p and q such that for all m, n > 0 and for all C : Σⁿ → {0, 1} the following condition is valid:

‖C‖_R > p(n, m) ⟹ (∃Q ⊑ C : |Q| ≤ q(m, n) ∧ ‖Q‖_R > m)   (1)

The consistency dimension of R is the following three-variable function: cdim_R(n, m, M), where M ≥ m > 0 and n > 0, is the smallest number d such that for all C : Σⁿ → {0, 1} the following condition is valid:

‖C‖_R > M ⟹ (∃Q ⊑ C : |Q| ≤ d ∧ ‖Q‖_R > m)   (2)

An obviously equivalent but quite useful reformulation of Condition (2) is

(∀Q ⊑ C : |Q| ≤ d ⟹ ‖Q‖_R ≤ m) ⟹ ‖C‖_R ≤ M   (3)

In words: if each subsample of C : Σⁿ → {0, 1} of breadth at most d has a consistent representation of size at most m, then C has a consistent representation of size at most M. The following result is (more or less) implicit in [8].

Theorem 1 cdim_R(n, m, M) ≤ LC_R^{(EQ,MQ)}(n, m, M) ≤ cdim_R(n, m, M) · log |C_{n,m}|
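The quantifier structure of Conditions (2) and (3) can be checked by brute force on a toy representation class (entirely our own example: concepts are subsets of a 4-point domain, "size" is plain cardinality, and we assume M ≥ m as in the definition; this illustrates the definition, not any real class):

```python
from itertools import combinations, product

DOMAIN = range(4)   # a tiny instance space

def sample_size(Q):
    """Smallest size of a concept consistent with sample Q. With size =
    cardinality, the set of 1-labelled points is itself consistent."""
    return sum(b for _, b in Q)

def cdim(m: int, M: int) -> int:
    """Smallest d such that every total concept C with size > M has a
    subsample Q of breadth <= d with sample size > m (Condition (2))."""
    for d in range(len(DOMAIN) + 1):
        ok = True
        for bits in product([0, 1], repeat=len(DOMAIN)):  # all total concepts
            if sum(bits) <= M:
                continue                    # ||C|| <= M: nothing to certify
            C = list(zip(DOMAIN, bits))
            if not any(sample_size(Q) > m
                       for k in range(d + 1)
                       for Q in combinations(C, k)):
                ok = False
                break
        if ok:
            return d
    raise AssertionError("no dimension found")

print(cdim(m=1, M=2))   # -> 2: two 1-labelled points already rule out size <= 1
```

For this toy class the answer m + 1 is no surprise: m + 1 one-labelled points are exactly what is needed to defeat every concept of size at most m.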
The Consistency Dimension and Distribution-Dependent Learning

Note that the lower and the upper bound are polynomially related because

log|C_{n,m}| ≤ m · log(1 + |A|),   (4)

where A denotes the alphabet over which the representations in R are written.
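To see Conditions (2) and (3) in action, here is a brute-force sketch that computes the consistency dimension of a tiny toy class by exhaustive search. The representation class (monotone conjunctions, with size measured as the number of literals) and the domain size are our own illustrative assumptions, not an example from the paper.

```python
from itertools import combinations, product

# Toy representation class (an illustrative assumption): monotone
# conjunctions over n = 3 Boolean variables; the size of a conjunction
# is its number of literals.
n = 3
DOMAIN = list(product([0, 1], repeat=n))
CONCEPTS = []  # pairs (size, truth table)
for lits in product([0, 1], repeat=n):  # lits[i] = 1: variable i occurs
    table = {x: int(all(x[i] for i in range(n) if lits[i])) for x in DOMAIN}
    CONCEPTS.append((sum(lits), table))

def sample_size(Q):
    """|Q|_R: the smallest size of a concept consistent with the sample
    Q (a dict point -> label); infinity if no concept is consistent."""
    sizes = [s for s, t in CONCEPTS if all(t[x] == b for x, b in Q.items())]
    return min(sizes, default=float("inf"))

def cdim(m, M):
    """Smallest d such that every total C with |C|_R > M has a subsample
    Q of breadth <= d with |Q|_R > m (Condition (2)).  Checking breadth
    exactly d suffices: enlarging a subsample never decreases |Q|_R."""
    hard = [dict(zip(DOMAIN, bits))
            for bits in product([0, 1], repeat=len(DOMAIN))]
    hard = [C for C in hard if sample_size(C) > M]
    for d in range(len(DOMAIN) + 1):
        if all(any(sample_size({x: C[x] for x in sub}) > m
                   for sub in combinations(DOMAIN, d))
               for C in hard):
            return d

print("cdim(1, 1) =", cdim(1, 1))
```

Enumerating all partially defined concepts instead of only the total ones would compute the strong consistency dimension of Subsection 2.2 in exactly the same way.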
Clearly, Theorem 1 implies that R is polynomially (EQ,MQ)-learnable iff it has polynomial certificates. We omit the proof of Theorem 1.

2.2 Strong Certificates and Strong Consistency Dimension
We want to adapt the notions "certificate" and "consistency dimension" to the framework of EQ-learning. Surprisingly, we can use syntactically almost the same notions, except for a subtle but striking difference: the universe of C will be extended from the set of all concepts over domain X_n to the corresponding set of partially defined concepts. This leads to the following definitions.

R has polynomial strong certificates if there exist two-variable polynomials p and q such that for all m, n > 0 and for all C : X_n → {0,1,*}, Condition (1) is valid.

The strong consistency dimension of R is the following three-variable function: scdim_R(n,m,M), where M ≥ m > 0 and n > 0, is the smallest number d such that for all C : X_n → {0,1,*}, Condition (2) is valid. Again, instead of Condition (2), we can use the equivalent Condition (3). In words: if each subsample of C : X_n → {0,1,*} of breadth at most d has a consistent representation of size at most m, then C has a consistent representation of size at most M.

Theorem 2. scdim_R(n,m,M) ≤ LC^EQ_R(n,m,M) ≤ scdim_R(n,m,M) · ⌈ln|C_{n,m}|⌉.
Proof. For brevity reasons, let q = LC^EQ_R(n,m,M) and d = scdim_R(n,m,M).

We prove the first inequality by exhibiting an adversary that forces any learner to spend as many queries as given by the strong consistency dimension. The minimality of d implies that there is a sample C such that |C|_R > M but still (∀Q ⊑ C : |Q| ≤ d−1 ⟹ |Q|_R ≤ m). Thus, any learner, issuing up to d−1 equivalence queries with hypotheses of size at most M, fails to be consistent with C, and a counterexample from C can be provided such that there is still at least one consistent concept of size at most m (a potential target concept). Hence, at least d queries go by until an affirmative answer is obtained.

In order to prove q ≤ d · ⌈ln|C_{n,m}|⌉, we describe an appropriate EQ-learner A. A keeps track of the current version space V (which is C_{n,m} initially). For i ∈ {0,1}, let S^i_V be the set

{x ∈ X_n : the fraction of concepts C ∈ V with C(x) = 1−i is less than 1/d}.

In other words, a very large fraction (at least 1 − 1/d) of the concepts in V votes for output label i on instances from S^i_V.

Jose L. Balcazar et al.

Let C_V be the sample assigning label i to all instances from S^i_V and label * to all remaining instances (those without a so clear majority). Let Q be an arbitrary but fixed subsample of C_V such that |Q| ≤ d. The definition of S^i_V implies (through some easy-to-check counting) that there exists a concept C ∈ V ⊆ C_{n,m} that is consistent with Q. Applying Condition (3), we conclude that |C_V|_R ≤ M, i.e., there exists an H ∈ C_{n,M} that is consistent with C_V. The punchline of this discussion is: if A issues the EQ with hypothesis H, then the next counterexample will shrink the current version space by a factor 1 − 1/d (or by a smaller factor). Since the initial version space contains |C_{n,m}| concepts and since A is done as soon as |V| ≤ 1, a sufficiently large number of EQs is obtained by solving (1 − 1/d)^q · |C_{n,m}| ≤ 1 for q. Clearly, q = d · ⌈ln|C_{n,m}|⌉ is sufficiently large, because then (1 − 1/d)^q · |C_{n,m}| ≤ e^{−q/d} · |C_{n,m}| ≤ 1.
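The learner A in this proof is a soft variant of the halving algorithm. The following sketch is our own illustration of it over an explicitly enumerated finite class; the toy class of singletons and the teacher are assumptions of the sketch, and hypotheses are returned as raw point sets rather than representations from C_{n,M}.

```python
import math

def eq_learn(version_space, domain, d, teacher):
    """Halving-style EQ-learner from the proof of Theorem 2 (sketch).

    version_space: iterable of concepts, each a frozenset of positive
    points; d: the dimension parameter; teacher(H): returns None if H
    is the target, else a counterexample point.
    """
    V = set(version_space)
    queries = 0
    while True:
        # Label x positively iff at least a (1 - 1/d) fraction of V
        # votes 1 on x; this plays the role of the sample C_V (points
        # without a clear majority default to label 0 in this sketch).
        H = frozenset(x for x in domain
                      if sum(x in C for C in V) >= (1 - 1 / d) * len(V))
        x = teacher(H)
        queries += 1
        if x is None:
            return H, queries
        # Keep only the concepts that disagree with H on the
        # counterexample; at least a 1/d fraction of V is removed.
        h_label = x in H
        V = {C for C in V if (x in C) != h_label}

# Toy run: the class of singletons over a 6-point domain.
domain = list(range(6))
V0 = [frozenset({i}) for i in domain]
target = frozenset({2})

def teacher(H):
    diff = (H | target) - (H & target)
    return min(diff) if diff else None

H, q = eq_learn(V0, domain, d=len(V0), teacher=teacher)
assert H == target and q <= 1 + math.ceil(len(V0) * math.log(len(V0)))
```

Each counterexample removes at least a 1/d fraction of the version space, which is exactly where the d · ⌈ln|C_{n,m}|⌉ bound comes from.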
Since the lower and the upper bound in Theorem 2 are polynomially related according to Inequality (4), we obtain:

Corollary 3. R is polynomially EQ-learnable iff it has a polynomial strong certificate.

2.3 EQs Alone versus EQs and MQs
The goal of this subsection is to show that the number of EQs needed to learn a concept class is closely related to the total number of EQs and MQs needed to learn the hardest subclass. The formal statement of the main results requires the following definitions.

Let S = (S_n)_{n≥1} with S_n ⊆ X_n be a family of subdomains. The restriction of a concept C : X_n → {0,1} to S_n is the partially defined concept (sample) with support S_n which coincides with C on its support. The class containing all restrictions of concepts from C to the corresponding subdomain from S is called the subclass of C induced by S and denoted as C|S. The notions "polynomial certificate", "consistency dimension", and "learning complexity" are adapted to the subclass of C induced by S in the obvious way. R|S (in words: R restricted to S) has polynomial certificates if there exist two-variable polynomials p and q such that for all m, n > 0 and for all C : X_n → {0,1,*} such that supp(C) = S_n, Condition (1) is valid. The consistency dimension of R|S is the following three-variable function: cdim_R(S_n,m,M) is the smallest number d such that for all M ≥ m > 0, n > 0, and for all C : X_n → {0,1,*} such that supp(C) = S_n, Condition (2) is valid. Again, instead of Condition (2), we can use the equivalent Condition (3).

Quantity LC^{EQ,MQ}_R(S_n,m,M) is defined as the smallest total number of EQs and MQs needed to learn the class of concepts from C_{n,m} restricted to S_n with hypotheses from C_{n,M} restricted to S_n. Quantity LC^EQ_R(S_n,m,M) is understood analogously. Note that

LC^EQ_R(S_n,m,M) ≤ LC^EQ_R(n,m,M)   (5)
is valid in general, because EQs become more powerful (as opposed to MQs, which become less powerful) when we pass from the full domain to a subdomain (for the obvious reasons). We have the analogous inequality for the strong consistency
dimension, but no such statement can be made for LC^{EQ,MQ}_R or for the consistency dimension. The following result is a straightforward generalization of Theorem 1.
Theorem 4. cdim_R(S_n,m,M) ≤ LC^{EQ,MQ}_R(S_n,m,M) ≤ cdim_R(S_n,m,M) · ⌈log|(C|S)_{n,m}|⌉.
We now turn to the main results of this section. The first one states that the strong consistency dimension of a class is the maximum of the consistency dimensions taken over all induced subclasses:

Theorem 5. scdim_R(n,m,M) = max_{S⊆X_n} cdim_R(S,m,M).
Proof. Let d be the smallest number which makes Condition (2) valid for all C : X_n → {0,1,*}, and let d(S) be the corresponding quantity when C ranges only over the samples with support S. It is evident that d = max_{S⊆X_n} d(S). The theorem now follows, because by definition d = scdim_R(n,m,M) and d(S) = cdim_R(S,m,M).

Corollary 6.
1. A representation class R has a polynomial strong certificate iff all its induced subclasses have a polynomial certificate.
2. A representation class is polynomially EQ-learnable iff all its induced subclasses are polynomially (EQ,MQ)-learnable.

The third result states that the number of EQs needed to learn a class equals roughly the total number of EQs and MQs needed to learn the hardest induced subclass.

Corollary 7. max_{S⊆X_n} LC^{EQ,MQ}_R(S,m,M) ≤ LC^EQ_R(n,m,M) and LC^EQ_R(n,m,M) ≤ ⌈ln|C_{n,m}|⌉ · max_{S⊆X_n} LC^{EQ,MQ}_R(S,m,M).

Remember that the gap ⌈ln|C_{n,m}|⌉ is bounded above by m · ln(1 + |A|), with A as in Inequality (4).

3 Equivalence Queries with a Probability Distribution
Let now D denote a class of probability distributions on X, the instance space for a computational learning framework. The two subsections of this section introduce respective variants of equivalence query learning that somehow take such distributions into account. We briefly describe now the first one. In the ordinary model of EQ-learning C, with hypotheses from H, the counterexamples for incorrect hypotheses are arbitrarily chosen, and we can think of an intelligent adversary making these choices. EQ-learning C from D-teachers (still with hypotheses from H) proceeds as ordinary EQ-learning, except for the following important differences:

1. Each run of the learning algorithm refers to an arbitrary but fixed pair (C,D) such that C ∈ C and D ∈ D, and to a given confidence parameter δ ∈ (0,1).
2. The goal is to learn C from the D-teacher, i.e., C is considered as target concept (as usual), and the counterexample to an incorrect hypothesis H is randomly chosen according to the conditional distribution D(· | C △ H), where △ denotes the symmetric difference of sets. Success is defined when this symmetric difference has zero probability. The learner must achieve a success probability of at least 1 − δ.

Clearly, the more restricted the class D of probability distributions, the easier the task for the learner. In this extended abstract, we focus on the following three choices of D. D_all denotes the class of all probability distributions on X. This is the most general case. D_unif denotes the class of distributions that are uniform on a subdomain S ⊆ X and assign zero probability to instances from X∖S. This case will be relevant in a later section. D = {D} is the most specific case, where D contains only a single probability distribution D. We use it only briefly in the last section.

Loosely speaking, the main results of this section are as follows: The next subsection proves that, for D = D_all, EQ-learning from D-teachers is exactly as hard (same number of queries) as the standard model. (This result is only established for deterministic learners.) Thus, we are not actually introducing yet one more learning model, but characterizing an existing, widely accepted one in a manner that provides the additional flexibility of the probability distribution parameter. Thus we obtain a sensible definition of distribution-dependent equivalence-query learning. In the next section, we introduce a combinatorial quantity, called the sphere number, and show that it represents an information-theoretic barrier in the model of EQ-learning from D_unif-teachers (even for randomized learning algorithms). However, this barrier is overcome for each fixed distribution D in the model of EQ-learning from the D-teacher.

3.1 Random versus Arbitrary Counterexamples
We use the upper index EQ[D] to indicate that the D-teacher for some D ∈ D plays the role of the EQ-oracle.
Theorem 8. LC^{EQ[D_all]}(C,H) = LC^EQ(C,H).

Proof. Let A be an algorithm which EQ-learns C from D-teachers with hypotheses from H. Let l ≤ LC^EQ(C,H) be the largest number of EQs needed by A when we allow an adversary to return arbitrary counterexamples to hypotheses.3

3 For the time being, there is no guarantee that A succeeds at all, because it expects the counterexamples to be given from a D-teacher. We will however see subsequently that there exists a distribution which sort of simulates the adversary.
Since LC^EQ(C) is defined taking all algorithms into account, we lose no generality in assuming that A always queries hypotheses that are consistent with previous counterexamples, so that all the counterexamples received along any run are different. There must exist a concept C ∈ C, hypotheses H_0, …, H_{l−2} ∈ H, and instances x_0, …, x_{l−2} ∈ X, such that the learner issues the l−1 incorrect hypotheses H_i when learning target concept C, and the x_i are the counterexamples returned to these hypotheses by the adversary, respectively. We claim that there exists a distribution D such that, with probability 1 − δ, the D-teacher returns the same counterexamples. This is technically achieved by setting D(x_i) = (1−ε)ε^i for i = 0, …, l−3, and D(x_{l−2}) = ε^{l−2}. An easy computation shows that the probability that the D-teacher presents another sequence of counterexamples than the adversary is at most (l−2)ε. Setting ε = δ/(l−2), the proof is complete.

Therefore, the distribution-free case of our model coincides with standard EQ-learning.

Corollary 9. Let R = (R, |·|) be a representation class defining a doubly parameterized concept class C. Then LC^{EQ[D_all]}_R(n,m,M) = LC^EQ_R(n,m,M) for all M ≥ m > 0 and n > 0.

This obviously implies that learners for the distribution-free equivalence model can be transformed, through the standard EQ model, into distribution-free PAC learners. We note in passing that, applying the standard techniques directly on our model, we can prove the somewhat stronger fact that, for each individual distribution D, a learner from D-teachers can be transformed into an algorithm that PAC-learns over D. We also can assume knowledge of a bound on the size of the target concept, by applying the usual trick of guessing it and increasing the guess whenever necessary.

3.2 EQ-Learning from Random Samples
In this subsection, we discuss another variant of the ordinary EQ-learning model. Given a representation class C, EQ-learning from D-samples of size p and with hypotheses from H proceeds as ordinary EQ-learning, except for the following differences:

1. Each run of the learning algorithm refers to an arbitrary but fixed pair (C,D) such that C ∈ C and D ∈ D, and to a given confidence parameter δ ∈ (0,1).
2. The goal of the learner is to learn C from (ordinary) EQs and a sample P consisting of p examples drawn independently at random according to D and labeled correctly according to C. In other words, instead of EQ-learning C from scratch, the learner gets P as additional input. The learner must obtain an affirmative answer with a probability of success at least 1 − δ.

Again the goal is to output a hypothesis for which the probability of disagreement with the target concept is zero; this time, the information about the distribution does not come from the counterexamples, but rather from the initial additional
sample. We will show in this section that, for certain distributions, this model is strictly weaker than the model of EQ-learning from D-teachers. However, in the distribution-free sense, it corresponds to the randomized version of the model described previously.

We first state (without proof) that each algorithm for EQ-learning from D-samples can be converted into a randomized algorithm for EQ-learning from D-teachers, such as those of the previous section, at the cost of a moderate overhead in the number of queries.

Theorem 10. Let q be the number of EQs needed to learn C from D-samples of size p and with hypotheses from H. It holds that LC^{EQ[D]}(C,H) ≤ (p+1)(p+q).

We show next an example that has an identification learning algorithm in the EQ from D-teachers learning model, but does not have such an algorithm in the EQ learning from D-samples model.

A DNF_n formula is any sum t_1 + t_2 + ⋯ + t_k of monomials, where each monomial t is the product of some literals chosen from x_1, x̄_1, …, x_n, x̄_n. Let DNF = ∪_n DNF_n be the representation class of disjunctive normal form formulas. Let us consider the class D of distributions D defined in the following way.

1. Assume that two different words x_n and y_n have been chosen for each n. Consider the associated distribution D defined by: D(x_n) = (6/π²)(1/n² − 1/2^n), D(y_n) = (6/π²)(1/2^n), and D(z_n) = 0 for any word z_n of length n different from x_n and y_n.

The class D is obtained by letting x_n and y_n run over all pairs of different words of length n.

Let C be now any class able to represent concepts consisting of pairs {x_n, y_n} within a reasonable size; for concreteness, pick DNF formulas consisting of complete minterms. A very easy algorithm learns them in our model of EQ from D-teachers. The algorithm has to do at most two equivalence queries to know the value of the target formula f on x_n and y_n. First, it asks whether f is identically zero. If a counterexample e is given (e must be x_n or y_n), it will make a second query "f = t_e?", where t_e is the monomial that only evaluates to one on e (the minterm).
Thus we find whether either or both of f(x_n) and f(y_n) are 1, and if so we also know x_n and/or y_n themselves. Now the target formula is identified: the value of the formula on other points does not matter, because they have zero probability.

However, it is not difficult to see that there is a distribution D ∈ D such that DNF formulas are not identifiable in the model of learning from EQ and D-samples. Here we refer to learning DNFs of size polynomial in n from polynomially many equivalence queries of polynomial size, and with an extra initial sample of polynomial breadth. First we note that, when sampling according to D ∈ D,
there is a non-negligible probability of obtaining a sample that only contains copies of x_n.

Lemma 11. For any polynomial q and δ ∈ (0,1), there exists an integer k_0 such that for all n ≥ k_0 the probability that a D-sample S of size q(n, 1/δ) does not contain y_n is greater than δ.

Then, the following negative result follows:

Theorem 12. There exists a distribution D in D such that DNF is not EQ-learnable from D-samples.

The essential idea of the proof is that, after an initial sample revealing a single word, the algorithm is left with a task close enough to that of learning DNFs in the standard model with equivalence queries, which is impossible [4].
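Lemma 11 is easy to check numerically. The sketch below uses the normalization constant 6/π² for the weights of D; this constant is our reconstruction of the garbled definition above and thus an assumption of the sketch.

```python
import math

def D_xn(n):
    """Weight of x_n under D; 6/pi^2 normalizes sum(1/n^2) to 1
    (a reconstructed constant, assumed here)."""
    return 6 / math.pi**2 * (1 / n**2 - 1 / 2**n)

def D_yn(n):
    """Weight of y_n under D."""
    return 6 / math.pi**2 * (1 / 2**n)

# The weights form a probability distribution (total mass -> 1):
total = sum(D_xn(n) + D_yn(n) for n in range(1, 100000))
assert 0.999 < total < 1.0

def p_miss_yn(n, q):
    """Probability that q i.i.d. draws from D all avoid y_n."""
    return (1 - D_yn(n)) ** q

# Even a large polynomial sample, q = n^3, almost surely misses y_n
# once n is moderately large:
for n in (10, 20, 40):
    print(n, p_miss_yn(n, n**3))
```

Since D(y_n) decays exponentially in n while any fixed sample-size bound q(n, 1/δ) is only polynomial, the miss probability tends to 1, which is the content of the lemma.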
4 The Sphere Number and Its Applications
The remainder of the paper uses the machinery developed in Section 2 to obtain stronger results relating the models of the previous section, under one more technical condition: that the learning algorithm knows the size of the target concept, and never queries hypotheses longer than that. Some important learning algorithms do not have this property, but there are still quite a few (among the exact learners from equivalence queries only) that work in sort of an incremental fashion that leads to this property. The results become interesting because they lead to a precise characterization of randomized learners from D-teachers. We first rewrite our combinatorial material of the previous section in an extremely useful, geometrically intuitive form (1-spheres), and prove that for m = M these structures capture clearly the strong consistency dimension. Applications follow in the next subsection.

4.1 Strong Consistency Dimension and 1-Spheres
A popular method for getting lower bounds on the number of queries is to show that the class of target concepts contains a basic hard-to-learn combinatorial structure. For instance, if the empty set is not representable but N singletons are, then the number of EQs needed to identify a particular singleton is at least N. In this subsection, we consider a conceptually similarly simple structure: the so-called 1-spheres. They are actually a disguised (read: isomorphic) version of sets of singletons, with the empty set simultaneously forbidden. Then we show that the strong consistency dimension is lower bounded by the size of the largest 1-sphere that can be represented by C. Moreover, for M = m both quantities coincide.

To make the last statements precise, we need several definitions. Let S be a finite set, and S_0 ⊆ S. The 1-sphere with support S around center S_0, denoted as H^1_S(S_0) in the sequel, is the collection of sets S_1 ⊆ S such that |S_0 △ S_1| = 1,
where △ denotes the symmetric difference of sets. In other words, S_1 ⊆ S belongs to H^1_S(S_0) if the Hamming distance between S_0 and S_1 is 1. Thus, it is formed by all the points at distance (radius) 1 from the center in Hamming space.

Let us now assume that S ⊆ X_n. Let S′ be an arbitrary subset of S. The sample C′ : X_n → {0,1,*} which represents S′ (as a subset of S) is the sample with support S that assigns label 1 to all instances from S′, and label 0 to all instances from S∖S′. We say that H^1_S(S_0) is representable by C_n[m:M] if the following two conditions are valid:

(A) Let C_0 be the sample with support S which represents S_0. Then, |C_0|_R > M.
(B) Each sample C_1 with support S which represents a set S_1 ∈ H^1_S(S_0) satisfies |C_1|_R ≤ m.

Thus, for the particular case of M = m, all points in Hamming space on the surface of the sphere are representable within size m but the center is not; just as in the above-mentioned use of singletons, which form the 1-sphere centered on the empty set.

The size of H^1_S(S_0) is defined as |S|. We define the three-variable function sph_R(n,m,M), called the sphere number of R in the sequel, as the size of the largest 1-sphere which is representable by C_n[m:M]. We now turn to the main result of this subsection, which implies that the sphere number is another lower bound on LC^EQ_R(n,m,M).

Theorem 13. sph_R(n,m,M) ≤ scdim_R(n,m,M), with equality for M = m.
Proof. For the sake of brevity, let d = scdim_R(n,m,M) and s = sph_R(n,m,M). Let H^1_S(S_0) be a largest 1-sphere that is representable by C_n[m:M]. Thus, |S| = s.

In order to prove d ≥ s, we assume for the sake of contradiction that d < s. Consider the sample C_0 with support S that represents S_0. By Condition (A), |C_0|_R > M. According to Condition (2) applied to C_0, there exists a subsample Q ⊑ C_0 such that |Q| ≤ d < s and |Q|_R > m. Let S_Q = supp(Q) ⊂ S. Let Q_1 be a sample with support S that totally coincides with Q (and thus with C_0) on S_Q, and coincides with C_0 on S∖S_Q except for one instance. Clearly, Q_1 represents a set S_1 ∈ H^1_S(S_0). By Condition (B), |Q_1|_R ≤ m. Since |Q|_R ≤ |Q_1|_R, we arrived at a contradiction.

Finally, we prove s ≥ d for the special case that M = m. It follows from the minimality of d and Condition (2) that there exists a sample C : X_n → {0,1,*} such that the following holds:

1. |C|_R > m.
2. ∃Q_0 ⊑ C : |Q_0| ≤ d ∧ |Q_0|_R > m.
3. ∀Q ⊑ C : (|Q| ≤ d−1 ⟹ |Q|_R ≤ m).
Let S denote the support of Q_0. Note that |S| = d (because otherwise the last two conditions become contradictory). Let S_0 ⊆ S be the set represented by Q_0. We claim that H^1_S(S_0) is representable by C_n[m:m] (which would conclude the proof). Condition (A) is obvious because |Q_0|_R > m. Condition (B) can be
seen as follows. For each x ∈ S, define Q_x as the subsample of C with support S∖{x}, and Q′_x as the sample with support S that coincides with C on S∖{x} but disagrees on x. Because each Q_x is a subsample of C of breadth d−1, it follows that |Q_x|_R ≤ m for all x ∈ S. We conclude that the same remark applies to the samples Q′_x, since a concept that is consistent with Q_x, but inconsistent with Q_0, must be consistent with Q′_x. Finally note that the samples Q′_x, x ∈ S, are exactly the representations of the sets in H^1_S(S_0), respectively.
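In code, the 1-sphere construction is a one-liner; the following check (our own illustration, not from the paper) confirms the isomorphism with sets of singletons mentioned at the start of this subsection.

```python
def one_sphere(S, S0):
    """H^1_S(S0): all subsets of S at Hamming distance 1 from the
    center S0, obtained by flipping exactly one point of S."""
    return [frozenset(set(S0) ^ {x}) for x in set(S)]

# The 1-sphere around the empty center is exactly the set of singletons:
S = {0, 1, 2, 3}
sphere = one_sphere(S, set())
assert sorted(map(set, sphere), key=min) == [{0}, {1}, {2}, {3}]

# Its size is |S|, and every member is at symmetric-difference
# distance 1 from the center, whatever the center is:
center = {0, 1}
assert len(one_sphere(S, center)) == len(S)
assert all(len(S1 ^ center) == 1 for S1 in one_sphere(S, center))
```

With the empty center forbidden (Condition (A)) and every flip representable (Condition (B)), identifying the target among these |S| candidates is the hard instance used in the lower bounds below.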
4.2 Applications of the Sphere Number
In this subsection, C denotes a concept class. The main results of this section are derived without referring to a representation class R. We will however sometimes apply a general theorem to the special case where the concept class consists of concepts with a representation of size at most m. It will be convenient to adapt some of our notations accordingly. For instance, we say that 1-sphere H^1_S(S_0) is representable by C if S ⊆ X and the following two conditions are valid:

(A) C does not contain a hypothesis H that assigns label 1 to all instances in S_0 and label 0 to all instances in S∖S_0.
(B) For each S′ ∈ H^1_S(S_0), there exists a concept C′ ∈ C that assigns label 1 to all instances in S′ and label 0 to all instances in S∖S′.

The following notation will be used in the sequel. If S = {x_1, …, x_s}, then S_i = S_0 △ {x_i} for i = 1, …, s. Thus, S_1, …, S_s are the sets belonging to H^1_S(S_0). The concept from C which represents S_i in the sense of Condition (B) is denoted as C_i. The sphere number associated with C, denoted as sph(C), is the size of the largest 1-sphere that is representable by C. Similar conventions are made for the learning complexity measure LC.

Theorem 14. Let C = H^1_S(S_0) be a 1-sphere and D an arbitrary but fixed distribution on S. Then, LC^{EQ[D]}(C) ≤ 1 + ⌈log(1/δ)⌉.

Proof. Let S = {x_1, …, x_s}, and let C_1, …, C_s be the concepts from C used to represent S_1, …, S_s ∈ H^1_S(S_0), respectively. Let H_1, …, H_s be a permutation of C_1, …, C_s sorted according to increasing values of D(x_i). Consider the EQ-learner which issues its hypotheses in this order. It follows that, as long as there exist counterexamples of a strictly positive probability, the probability that the teacher returns the counterexample x_j associated with the target concept C_j is at least 1/2 per query. Thus, the probability that the target is not known after ⌈log(1/δ)⌉ EQs is at most δ. Thus, with probability at least 1 − δ, one more query suffices to receive answer YES.

As the number of EQs needed to learn 1-spheres from arbitrary counterexamples equals the size s of the 1-sphere, and the upper bound in Theorem 14
does not depend on s at all, the model of EQ-learning from the D-teacher for a fixed distribution D is, in general, more powerful than the ordinary model. The gap between the number of EQs needed in both models can be made arbitrarily large.

Recall that D_unif denotes the class of distributions that are uniform on a subdomain S ⊆ X and assign zero probability to instances from X∖S.

Theorem 15. The following lower bound even holds for randomized learners:

LC^EQ(C) ≥ LC^{EQ[D_unif]}(C) ≥ (1 − δ) · sph(C)
Proof. The first inequality is trivial. We prove the second one.

Let S = {x_1, …, x_s}, and let C_1, …, C_s be the concepts from C used to represent S_1, …, S_s ∈ H^1_S(S_0), respectively. For j = 1, …, s, let D_j be the probability distribution that assigns zero probability to x_j and is uniform on the remaining instances from S. Clearly, D_j ∈ D_unif.

A learner must receive answer YES with probability of success at least 1 − δ for each pair (C,D), where C ∈ C is the target concept and counterexamples are returned randomly according to D ∈ D. It follows that, if target concept C_j is drawn uniformly at random from C_1, …, C_s, and counterexamples are subsequently returned according to D_j, answer YES is still obtained with probability of success at least 1 − δ. Note that we randomize over the uniform distribution on the 1-sphere (random selection of the target concept), over the drawings of distribution D_j conditioned to the current sets of counterexamples, respectively, and over the internal coin tosses of the learner.

Assume w.l.o.g. that all hypotheses are consistent with the counterexamples received so far. Let C′ be the next hypothesis, and S′ ⊆ S the subset of instances from S being labeled 1 by C′. Because H^1_S(S_0) is representable by C, S′ must differ from S_0 on at least one element of S. If S′ = S_j, then the learner receives answer YES. Otherwise, the set U = (S′ △ S_j)∖{x_j} is not empty. Note that the counterexample x to C′ is picked from U uniformly at random. This leads to the removal of only one candidate C_i (the one with x_i = x) from the current version space V. The punchline of this discussion is that the following holds after the return of q counterexamples:

1. The current version space V contains s−q candidate concepts from C_1, …, C_s. They are (by symmetry) statistically indistinguishable to the learner.
2. The next hypothesis is essentially a random guess in V; that is, the chance to receive answer YES is exactly 1/|V|. The reason is that, from the perspective of the learner, all candidate target concepts in V are equally likely.4
4 This might look unintuitive at first glance, because the learner does not necessarily draw the next hypothesis at random from V according to the uniform distribution. But notice that a random bit cannot be guessed with a probability of success larger than 1/2, no matter which procedure for guessing is applied. This is the kind of argument that we used.
If answer YES is received before s EQs were issued, then it is only because it was guessed within V by chance. We can illustrate this by thinking of two players. Player 1 determines at random a number between 1 and s (the hidden target concept). Player 2 starts random guesses. The probability that the target number was determined after q guesses is exactly q/s. Thus, at least (1−δ)s guesses are required to achieve probability 1−δ of success.

Corollary 16. Let R = (R, |·|) be a representation class defining a doubly parameterized concept class C. The following lower bound holds for all m and n, even for randomized learners:

LC^EQ_R(n,m,m) ≥ LC^{EQ[D_unif]}_R(n,m,m) ≥ (1 − δ) · sph_R(n,m,m)
This means that, considering learning algorithms that do not make queries longer than the size of the target concept, the information-theoretic barrier for EQ-learning from arbitrary counterexamples is still a barrier for EQ-learning from D_unif-teachers. This negative result even holds when the learner is randomized, so it applies as well to the model of EQ-learning from D-samples, which has been proved earlier to be subsumed by randomized learners from D-teachers.

On the other hand, note that the results of this section generalize readily to the case in which the hypotheses queried come from a different class H larger than C, or in particular to learners which query hypotheses of size up to M > m when the target is known to have size at most m. The point is that, for this case, the last corollary no longer has the interpretation that we have described, since the sphere number is no longer guaranteed to coincide with the strong consistency dimension, so the lower bound for the learning complexity that we would obtain is no longer, in principle, the same information-theoretic barrier as for EQ-learning.
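The two-player guessing argument that closes the proof above can be simulated directly (a sketch of our own, not the paper's): a hidden index among s equally likely candidates needs roughly (1 − δ)s guesses before the success probability reaches 1 − δ.

```python
import random

def guesses_needed(s, rng):
    """Player 1 hides a number among s candidates; Player 2 guesses,
    removing one candidate per wrong guess (the best it can do when
    wrong answers carry no further information).  Returns the number
    of guesses used until success."""
    target = rng.randrange(s)
    order = list(range(s))
    rng.shuffle(order)
    return order.index(target) + 1

rng = random.Random(0)
s, trials = 200, 2000
counts = sorted(guesses_needed(s, rng) for _ in range(trials))
q90 = counts[int(0.9 * trials)]  # empirical 90th percentile
# With delta = 0.1, reaching success probability 1 - delta indeed
# takes on the order of (1 - delta) * s guesses:
assert 0.8 * s < q90 <= s
```

The number of guesses is uniform on {1, …, s}, so the q-guess success probability is exactly q/s, matching the text.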
References

[1] D. Angluin. Inference of reversible languages. J. Assoc. Comput. Mach., 29:741-765, 1982.
[2] D. Angluin. Learning regular sets from queries and counterexamples. Information and Computation, 75:87-106, 1987.
[3] D. Angluin. Queries and concept learning. Machine Learning, 2:319-342, 1988.
[4] D. Angluin. Negative results for equivalence queries. Machine Learning, 5:121-150, 1990.
[5] J. Castro and J. L. Balcazar. Simple PAC learning of simple decision lists. In Sixth International Workshop, ALT'95, pages 239-248, Fukuoka, Japan, 1995. LNAI. Springer.
[6] R. Gavalda. On the power of equivalence queries. In EUROCOLT'93, pages 193-203. LNAI. Springer, 1993.
[7] Y. Hayashi, S. Matsumoto, A. Shinohara, and M. Takeda. Uniform characterization of polynomial-query learnabilities. In First International Conference, DS'98, pages 84-92, Fukuoka, Japan, 1998. LNAI. Springer.
[8] L. Hellerstein, K. Pillaipakkamnatt, V. Raghavan, and D. Wilkins. How many queries are needed to learn? Journal of the ACM, 43(5):840-862, 1996.
[9] M. Li and P. Vitanyi. Learning simple concepts under simple distributions. SIAM Journal on Computing, 20(5):911-935, 1991.
[10] H. Simon. Learning decision lists and trees with equivalence queries. In Second European Conference, EuroCOLT'95, pages 322-336, Barcelona, Spain, 1995. Lecture Notes in Artificial Intelligence.
[11] L. Valiant. A theory of the learnable. Comm. ACM, 27:1134-1142, 1984.
The VC-Dimension of Subclasses of Pattern Languages

Andrew Mitchell1, Tobias Scheffer2, Arun Sharma1, and Frank Stephan3

1 University of New South Wales, School of Computer Science and Engineering, Sydney, 2052 NSW, Australia, andrewm,
[email protected]
2 Otto-von-Guericke-Universität Magdeburg, FIN/IWS, Universitätsplatz 2, 39106 Magdeburg, Germany,
[email protected]
3 Mathematisches Institut, Universität Heidelberg, Im Neuenheimer Feld 294, 69120 Heidelberg, Germany,
[email protected]
Abstract. This paper derives the Vapnik-Chervonenkis dimension of several natural subclasses of pattern languages. For classes with unbounded VC-dimension, an attempt is made to quantify the rate of growth of VC-dimension for these classes. This is achieved by computing, for each n, the size of the smallest witness set of n elements that is shattered by the class. The paper considers both erasing (empty substitutions allowed) and nonerasing (empty substitutions not allowed) pattern languages. For erasing pattern languages, optimal bounds for this size within polynomial order are derived for the case of 1 variable occurrence and unary alphabet, for the case where the number of variable occurrences is bounded by a constant, and for the general case of all pattern languages. The extent to which these results hold for nonerasing pattern languages is also investigated. Some results that shed light on efficient learning of subclasses of pattern languages are also given.
1 Introduction
The simple and intuitive notion of pattern languages was formally introduced by Angluin [1] and has been studied extensively, both in the context of formal language theory and computational learning theory. We give a brief overview of the work on learnability of pattern languages to provide a context for the results in this paper. We refer the reader to Salomaa [20, 21] for a review of the work on pattern languages in formal language theory.

In the present paper, we consider both kinds of pattern languages: erasing (when empty substitutions are allowed) and nonerasing (when empty substitutions are not allowed). Angluin [2] showed that the class of nonerasing pattern languages is identifiable in the limit from only positive data in Gold's model [8]. Since its introduction, pattern languages and their variants have been a subject of intense study in the identification in the limit framework (for a review, see Shinohara and Arikawa [24]).

O. Watanabe, T. Yokomori (Eds.): ALT'99, LNAI 1720, pp. 93-105, 1999.
© Springer-Verlag Berlin Heidelberg 1999

Learnability of the class of erasing pattern languages was
first considered by Shinohara [23] in the identification in the limit framework. This class turns out to be very complex, and it is still open whether, for a finite alphabet of size > 1, the class of erasing pattern languages can be identified in the limit from only positive data1.

Since the class of nonerasing pattern languages is identifiable in the limit from only positive data, a natural question is whether there is any gain to be had if negative data is also present. Lange and Zeugmann [14] observed that in the presence of both positive and negative data, the class of nonerasing pattern languages is identifiable with 0 mind changes; that is, there is a learner that after looking at a sufficient number of positive and negative examples comes up with the correct pattern for the language. This restricted one-shot version of identification in the limit is referred to as finite identification. Since finite identification is a batch model, finite learning from both positive and negative data may be viewed as an idealized version of Valiant's [25] PAC model.

In this paper, we show that even the VC-dimension of patterns with one single variable and an alphabet size of 1 is unbounded. This implies that even this restricted class of pattern languages is not learnable in Valiant's sense, even if we omit all polynomial time constraints from Valiant's definition of learning. This result (which holds for both the nonerasing and erasing cases) may appear to be at odds with the observation of Lange and Zeugmann [14] that the class of nonerasing pattern languages can be finitely learned from both positive and negative data. The apparent discrepancy between the two results is due to a subtle difference in the manner in which the two models treat data. In finite identification, the learner has the luxury of waiting for a finite, but unbounded, sample of positive and negative examples before making its conjecture. On the other hand, the learner in the PAC model is required to perform on any fixed sample of an adequate size.
So, clearly the conditions in the PAC setting with respect to data presentation are more strict. Since this restricted class and several other subclasses of pattern languages considered in this paper have unbounded VC-dimension, we make an attempt to quantify the rate of growth of the VC-dimension for these classes. This is achieved by computing, for each n, the size of the smallest witness set of n elements that is shattered by the class. The motivation for computing such a Vapnik-Chervonenkis witness size is as follows. Although classes with unbounded VC-dimension are not PAC-learnable in general, they may become learnable under certain constraints on the distribution. An often used constraint is that the distribution favors short strings. Therefore, an interesting question is: How large is the VC-dimension if only strings of a certain length are considered? Mathematically, it is perhaps more elegant to pose the question: What is the least length m such that n strings of size up to m are shattered? We refer to this value of m as the Vapnik-Chervonenkis dimension witness size for n and express it as a function of n, vcdws(n). Hence, the higher the growth rate of the function vcdws,
¹ See Mitchell [17], where a subclass of erasing pattern languages is shown to be identifiable in the limit from only positive data. This paper also shows that the class of erasing pattern languages is learnable if the alphabet size is 1 or ∞.
The VC-Dimension of Subclasses of Pattern Languages
the smaller is the local VC-dimension for strings up to a fixed length, and the easier it may be to learn the class under a suitably constrained variant of the PAC model. Although the VC-dimension of 1-variable pattern languages is unbounded, we note that if at least one positive example is present, the VC-dimension of nonerasing pattern languages becomes bounded (this can be formally expressed in the terminology of version spaces). Unfortunately, this does not help in the case of erasing pattern languages, as we are able to show that the VC-dimension of 1-variable erasing pattern languages is unbounded even in the presence of a positive example. In k-variable patterns, the bound k is on the number of distinct variables in the pattern and not on the total number of occurrences of all variables. We also consider the case where the number of occurrences of all the variables in a pattern is bounded. We show that the VC-dimension of the class of languages generated by patterns with at most 1 variable occurrence is 2. For variable occurrence count 2, the VC-dimension turns out to be unbounded provided the alphabet size is at least 2 and the pattern has at least two distinct variables. We also consider the case where the only requirement is that each variable occur exactly n times in the pattern (so, there is neither any bound on the number of distinct variables nor any bound on the total number of variable occurrences). We show that the VC-dimension of languages generated by patterns in which each variable occurs exactly once is unbounded. We note that this result also holds for any general n. Having established several VC-dimension results, we turn our attention to issues involved in efficient learning of pattern language subclasses.
One problem with efficient learning of pattern languages is the NP-completeness of the membership decision [1].² This NP-completeness result already implies that pattern languages cannot be learned polynomially in Valiant's sense when the hypotheses are patterns (because Valiant requires that, for a given instance, the output of the hypothesis must be computable in polynomial time). Schapire [22] strengthened this result by showing that pattern languages cannot be polynomially PAC-learned independent of the representation chosen for the hypotheses: computing the output of a hypothesis cannot be done in polynomial time using any coding scheme which is powerful enough for learning pattern languages. Also, Ko, Marron and Tzeng [13] have shown that the problem of finding any pattern consistent with a set of positive and negative examples is NP-hard. Marron and Ko [16] considered necessary and sufficient conditions on a finite positive initial sample that allow exact identification of a target k-variable pattern from the initial sample and from polynomially many membership queries. Later, Marron [15] considered the exact learnability of k-variable patterns with polynomially
² Angluin [1] showed that the class of nonerasing pattern languages is not learnable with polynomially many queries if only equivalence, membership, and subset queries are allowed and as long as any hypothesis space with the same expressive power as the class being learned is considered. However, she gave an algorithm for exactly learning the class with a polynomial number of superset queries.
many membership queries, but where the initial sample consists of only a single positive example. In the PAC setting, Kearns and Pitt [11] showed that k-variable pattern languages can be PAC-learned under product distributions³ from polynomially many strings. At first blush, their result appears to contradict our claim that k-variable patterns have an unbounded VC-dimension. A closer look at their result reveals that they assume an upper bound on the length of substitution strings, which essentially bounds the VC-dimension of the class. When the substitutions of all variables are governed by independent and identical distributions, then k-variable pattern languages can (under a mild additional distributional assumption) even be learned linearly in the length of the target pattern and singly exponentially in k [19]. In this paper we show that in the case of nonerasing pattern languages, the first positive example string contains enough information to bound the necessary sample size without any assumptions on the underlying distribution. (This result holds even for infinite alphabets.) Unfortunately, as already noted, this result does not translate to the case of k-variable erasing pattern languages, as even in the presence of a positive example, the VC-dimension of single-variable erasing pattern languages is unbounded. We finally consider some results in the framework of agnostic learning [9, 12]. Here no prior knowledge about the actual concept class being learned is assumed. The learner is required to approximate the observed distribution on classified instances almost as well as possible in the given hypothesis language (with high probability) in polynomial time. Agnostic learning may be viewed as the branch of learning theory closest to practical applications. Unfortunately, not even conjunctive concepts [12] and half-spaces [10] are agnostically learnable. Shallow decision trees, however, have been shown to be agnostically learnable [3, 6].
2 Preliminaries
The symbol ε denotes the empty string. Let s be a string, word, or pattern. Then the length of s, denoted |s|, is the number of symbols in s. A pattern is a string over elements from the basic alphabet Σ and variables from a variable alphabet; we use lower case Latin letters for elements of Σ and upper case Latin letters for variables. The number of variables is the number of distinct variable symbols occurring in a pattern; the number of occurrences of variables is the total number of occurrences of variable symbols in a pattern. An erasing pattern language contains all words x generated by the pattern π in the sense that every occurrence of a variable A is substituted within the whole word by the same string over Σ; a nonerasing pattern language contains the words where
³ More precisely, they require the positive examples in the sample to be generated according to a product distribution, but allow any arbitrary distribution for the negative examples.
the variables are substituted by non-empty strings only. Thus, in nonerasing pattern languages every word is at least as long as the pattern generating it. For example, if π = aAbbBabAba, then the length of π is 10, the number of variables is 2 and the number of variable occurrences is 3. In the erasing case, π generates the words abbabba (by A = ε and B = ε) and abbaabba (by A = ε and B = a), which it does not generate in the nonerasing case. In both cases, erasing and nonerasing, π generates the word aabbabaababa (by A = a and B = aba). This allows us to define subclasses of pattern languages generated by a pattern with up to k variables or up to l variable occurrences.

Quantifying Unbounded VC-Dimension. The VC-dimension of a class L of languages is the size of the largest set of words S such that L shatters S. The VC-dimension of a class L is unbounded iff, for every n, there are n words x1, x2, ..., xn such that L shatters them. As motivated in the introduction, we introduce the function

vcdws(n) = min { max { |x1|, |x2|, ..., |xn| } : L shatters { x1, x2, ..., xn } }
where vcdws stands for Vapnik-Chervonenkis dimension witness size and returns the size of the smallest witness for the fact that the VC-dimension is at least n. Determining vcdws for several natural classes is one of the main results of the present work.

Version Spaces. Given a set of languages L and a set of positive and negative example strings S, the version space VS(L, S) consists of all languages in L that generate all the positive but none of the negative strings in S. It follows from Theorem 2.1 of Blumer et al. [5] that L can be PAC-learned with the sample S and a finite number of additional examples if VS(L, S) has a finite VC-dimension. In Section 3 we will show that, for certain classes, the VC-dimension of the version space remains infinite after a sample S has been read, while in Section 5 we show that, for other classes, the VC-dimension of the version space can turn from infinite to finite when the first positive example arrives.
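Both semantics can be checked mechanically: a pattern translates into a regular expression with backreferences, where the first occurrence of a variable becomes a capture group ((.*) for the erasing case, (.+) for the nonerasing case) and every later occurrence becomes a backreference. The following sketch (the helper name pattern_to_regex is ours, not from the paper) reproduces the membership claims made above for π = aAbbBabAba:

```python
import re

def pattern_to_regex(pattern, erasing):
    """Translate a pattern (constants in lower case, variables in upper
    case) into a backreference regex deciding membership in its language."""
    atom = "(.*)" if erasing else "(.+)"   # empty substitutions allowed or not
    groups, parts = {}, []
    for sym in pattern:
        if sym.islower():                   # constant symbol
            parts.append(re.escape(sym))
        elif sym in groups:                 # repeated variable: backreference
            parts.append("\\%d" % groups[sym])
        else:                               # first occurrence: capture group
            groups[sym] = len(groups) + 1
            parts.append(atom)
    return re.compile("".join(parts) + r"\Z")

erasing = pattern_to_regex("aAbbBabAba", erasing=True)
nonerasing = pattern_to_regex("aAbbBabAba", erasing=False)

# abbabba and abbaabba need A = empty, so only the erasing language has them
print(bool(erasing.match("abbabba")), bool(nonerasing.match("abbabba")))
# aabbabaababa (A = a, B = aba) is generated in both cases
print(bool(erasing.match("aabbabaababa")), bool(nonerasing.match("aabbabaababa")))
```

Matching with backreferences can take exponential time in the worst case, which is consistent with the NP-completeness of the membership problem discussed in the introduction.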
3 VC-Dimension of Erasing Pattern Languages
Our first result shows that even the very restrictive class of 1-variable erasing pattern languages over the unary alphabet has an unbounded VC-dimension. This special case is the only one for which the exact value of vcdws is known.

Theorem 1. The VC-dimension of the class of erasing 1-variable pattern languages is unbounded. If the size of the alphabet is 1, then one can determine the exact size of the smallest witness by the formula vcdws(n) = p2 · p3 · ... · p_{n-1}, where p_m is the m-th prime number (p1 = 2, p2 = 3, p3 = 5, p4 = 7, ...).

Proof: Let Σ = {a}. For the direction vcdws(n) ≤ p2 · p3 · ... · p_{n-1}, let x_k = a^{m_k} where m_k = p1 · p2 · ... · p_{n-1} / p_k, that is, m_k is the product of the first n - 1 primes except the k-th one. Let x_n = ε. For every subset E ⊆ {x1, x2, ..., xn}, let p_E be the product of those p_k where k < n and x_k ∉ E, and let p_E = 1 if no such k exists. Now the patterns A^{p_E} and a^{p_E} A^{p_E} generate a word a^m with m > 0 iff p_E divides m. Furthermore, the word ε is generated by A^{p_E} but not by a^{p_E} A^{p_E}. So the language generated by A^{p_E} contains exactly the x_k ∈ E in the case x_n ∈ E; the language generated by a^{p_E} A^{p_E} contains exactly the x_k ∈ E in the case x_n ∉ E. Since the longest x_k is x_1, whose length is p2 · p3 · ... · p_{n-1}, one has that vcdws(n) ≤ p2 · p3 · ... · p_{n-1}.

For the converse direction assume that the erasing 1-variable pattern languages shatter E = {a^{m_1}, a^{m_2}, ..., a^{m_n}} where m_1 < m_2 < ... < m_n. Let a^{d_k} A^{e_k} generate all elements in E except a^{m_k}. Now one has, for k' ∈ {2, 3, ..., n} - {k}, that a^{m_1} = a^{d_k + c_1 e_k} and a^{m_{k'}} = a^{d_k + c_{k'} e_k}, so m_{k'} - m_1 = (c_{k'} - c_1) · e_k. On the other hand, m_k - m_1 is not a multiple of e_k. It follows that for every difference m_k - m_1 there is one prime number q_k dividing all other differences m_{k'} - m_1 but not m_k - m_1 itself; these primes are pairwise distinct, and therefore any product of n - 2 of such different prime numbers must divide some difference m_k - m_1. The product q2 · q3 · ... · qn / q_k is a lower bound for m_k and, for the k with the smallest number q_k, m_k is at least the product of the primes p2 · p3 · ... · p_{n-1}. So vcdws(n) ≥ p2 · p3 · ... · p_{n-1}.

As noted, this is the only case where vcdws has been determined exactly. It will be shown that more variables enable smaller values of vcdws(n), while it is unknown whether larger alphabets give smaller values of vcdws(n) in the case of erasing 1-variable pattern languages. The above proof even shows the following: Given any positive example w, there is still no bound on the VC-dimension of the version space of the class with respect to the example set {w}. Since the given w takes the place of x_n in the proof above, one now gets the upper bound |w| + p2 · p3 · ... · p_n instead of p2 · p3 · ... · p_{n-1}, and uses for x_k the words w a^{m_k} with m_k = p1 · p2 · ... · p_n / p_k in the case k < n, and x_k = w in the case k = n.

Theorem 2.
For any positive example w, the VC-dimension of the version space of the class of all erasing 1-variable pattern languages is unbounded, and vcdws(n) ≤ |w| + p2 · p3 · ... · p_n.

However, if two positive examples are present, then in some rare cases it may be possible to bound the number of patterns. For example, let the alphabet be Σ = {a, b}. Then if both strings a and b are in the language and are presented as positive examples, it is immediate that the only pattern language that satisfies this case is Σ*, generated, for example, by the single pattern A.

An alternative to limiting the number of variables is limiting the number of variable occurrences. If the bound is 1, then the class is quite restrictive and has VC-dimension 2. However, as soon as the bound becomes 2, one has unbounded VC-dimension.

Theorem 3. The VC-dimension of the class of erasing pattern languages generated by patterns with at most 1 variable occurrence is 2.

One can generalize the result and show that every 1-variable erasing pattern language with up to k occurrences of this variable has bounded Vapnik-Chervonenkis dimension. This is no longer true for 2-variable erasing pattern languages with up to 2 occurrences.

Theorem 4. The VC-dimension of the class of all erasing pattern languages generated by patterns with at most 2 variable occurrences is unbounded. Furthermore, vcdws(n) ≤ (3n + 2) · 2^n.

Proof: For each k, let x_k be the concatenation of all strings a^n b σ b a^n where σ ∈ {a, b}^n and the k-th character of σ is a b. Now, for every subset E of {x1, x2, ..., xn}, let σ_E be the string of length n such that σ_E(k) = a if x_k ∉ E and σ_E(k) = b if x_k ∈ E. Now the language generated by A b σ_E b B contains x_k iff b σ_E b is a substring of x_k iff the k-th character in σ_E is a b, which by definition is equivalent to x_k ∈ E. So, the set {x1, x2, ..., xn} is shattered.

The corresponding theorem holds also for nonerasing pattern languages since A and B take at least the string a^n or even something longer. A natural question is whether the lower bound is also exponential in n. The next theorem answers this question affirmatively.

Theorem 5. For given k, the class of all pattern languages with up to k variable occurrences satisfies vcdws(n) ≥ 2^{(n-1)/(2k+1)}.

Proof: Given x1, x2, ..., xn, one needs 2^{n-1} patterns which contain x1 and shatter {x2, x3, ..., xn}. Let m = |x1|, which is a lower bound for the size if x1, x2, ..., xn are the shortest words shattered by the considered class. A pattern generating x1 has h ≤ k variable occurrences and, for the l-th variable occurrence in this word, one has a beginning position a_l ≤ m and the length b_l ≤ m of the variable in x1. Knowing x1, each pattern generating x1 has a unique description with the given parameters. So one gets the upper bound 1 + m² + m⁴ + ... + m^{2k} ≤ m^{2k+1} for the number of patterns generating x1, and has m^{2k+1} ≥ 2^{n-1}, that is, m ≥ 2^{(n-1)/(2k+1)}.

The next theorem is about the general case of the class of all erasing pattern languages. Strict lower bounds are vcdws(n) ≥ log(n)/log(|Σ|) for the case of alphabet size 2 or more and vcdws(n) ≥ n - 1 for the case of alphabet size 1. These lower bounds are given by the size of the largest string within a set of n strings. These straightforward bounds are, modulo a linear factor, optimal for the unary and binary alphabets.

Theorem 6. For arbitrary erasing pattern languages:
(a) If Σ = {a} then vcdws(n) ≤ 2n.
(b) If Σ = {a, b} then vcdws(n) ≤ 4 + 2 log(n).

Proof (a): In this case Σ = {a}. Let x1 = a^{n+1}, x2 = a^{n+2}, ..., xn = a^{n+n} and let E be a subset of {x1, x2, ..., xn}. Now let π_E be the concatenation of those A_k^{n+k} with x_k ∈ E. Taking A_k = a and all other variables to ε, the word generated by π_E is x_k. If at least two variables are not empty or one takes a string strictly
longer than 1, then the overall length is at least 2n + 2 and the word generated is outside {x1, x2, ..., xn}. So, π_E generates exactly those words x_k which are in E, and the erasing pattern languages shatter {x1, x2, ..., xn}.

Proof (b): In this case Σ = {a, b}. Let m be the first integer such that the set U_m contains at least n strings, where U_m consists of all words w ∈ {a, b}^m such that a and b occur equally often in w, the first character of w is a, and aa, bb, ab and ba are subwords of w. One can show that m ≤ 4 + 2 log(n). Let x1, x2, ..., xn be different words in U_m. Now, for every k, let π_k be the pattern obtained from x_k by replacing a by A_k and b by B_k, and let π_E be the concatenation of those π_k where x_k ∈ E. Since every variable occurs in π_E m/2 times, one has that either one variable is assigned to some fixed u ∈ {a, b}² and the generated word is u^{m/2}, or there are two variables such that one of them is assigned to a and the other one to b. Since x_k ∉ {u^{m/2}, a^{m/2}b^{m/2}, b^{m/2}a^{m/2}} (the word u^{m/2} does not contain all the subwords aa, ab, ba, bb) and since the first character of x_k is a, one can conclude that A_l = a and B_l = b for some l. Now the word generated by π_E is x_l and x_l = x_k only for l = k. Thus π_E generates exactly those x_k with x_k ∈ E.

The trivial lower bound is the constant 1 for infinite alphabets. But it is impossible to have a constant upper bound for the size of the smallest witness; indeed there is, for every k, an n with vcdws(n) > k.
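For small n, the prime-based witness construction in the proof of Theorem 1 can be verified by brute force: over the unary alphabet, a^m lies in the language of A^{p_E} iff p_E divides m, and in the language of a^{p_E} A^{p_E} iff additionally m > 0. The following check (our own illustration, not part of the paper) confirms that every subset of the witness set is cut out by one of the two patterns:

```python
from itertools import combinations

primes = [2, 3, 5, 7, 11, 13]

def witness_set(n):
    """x_k = a^(m_k) with m_k the product of the first n-1 primes except
    the k-th; x_n is the empty word (coded as None)."""
    prod = 1
    for p in primes[:n - 1]:
        prod *= p
    return [prod // primes[k] for k in range(n - 1)] + [None]

def generated(m, p_E, with_constant):
    """Membership of a^m in L(a^p_E A^p_E) (with_constant) or L(A^p_E)."""
    if m is None:                       # the empty word
        return not with_constant
    return m % p_E == 0

def shatters(n):
    xs = witness_set(n)
    for r in range(n + 1):
        for E in combinations(range(n), r):
            p_E = 1
            for k in range(n - 1):
                if k not in E:
                    p_E *= primes[k]
            with_constant = (n - 1) not in E   # exclude the empty word when x_n is outside E
            cut = {k for k in range(n) if generated(xs[k], p_E, with_constant)}
            if cut != set(E):
                return False
    return True

print(all(shatters(n) for n in range(2, 6)))   # True
```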
4 VC-Dimension of Nonerasing Pattern Languages
Many of the theorems for erasing pattern languages can be adapted to the case of nonerasing pattern languages. In many theorems the upper bound increases by a sublinear factor (measured in the size of the previous value of the function vcdws). Furthermore one gets the following lower bound:

Theorem 7. For any class of nonerasing pattern languages, vcdws(n) ≥ (n - 1) / (2 log(n + 2)).

Proof: Given x1, x2, ..., xn, one needs 2^{n-1} patterns which contain x1 and shatter {x2, x3, ..., xn}. Let m = |x1|, which is a lower bound for the size if x1, x2, ..., xn are the shortest words shattered by the considered class. A pattern generating x1 can be described as follows: To each position one assigns either the value 0 if this position is covered by a constant from the pattern generating it, the value h if it is the first character of some occurrence of the h-th variable (h ≤ m), and m + 1 if it is some subsequent character of the occurrence of some variable. Together with x1 itself, this string either describes uniquely the pattern generating x1 or is invalid if, for example, some variable occurring twice has at each occurrence a different length. So one gets at most (m + 2)^m patterns which generate x1. It follows that (m + 2)^m ≥ 2^{n-1}. This condition only holds if m ≥ (n - 1) / (2 log(n + 2)).
Furthermore, all lower bounds on vcdws carry over from erasing pattern languages to nonerasing pattern languages. For upper bounds, the following bounds can be obtained by adapting the corresponding results for erasing pattern languages. In these three cases, the upper bounds are only slightly larger than those for the erasing pattern languages, but in the general case with an alphabet of size 2 or more, the above lower bound is (n - 1) / (2 log(n + 2)) for nonerasing pattern languages while 4 + 2 log(n) is an upper bound for the erasing pattern languages.

Theorem 8. For 1-variable pattern languages, vcdws(n) ≤ p1 · p2 · ... · pn, where p1, p2, ..., pn are the first n prime numbers. For patterns with exactly two variable occurrences, vcdws(n) ≤ (3n + 2) · 2^n. If the alphabet size is 1, then vcdws(n) ≤ (3n² + 5n)/2 for the class of all nonerasing pattern languages.

Note that in the case the alphabet size is two, one can obtain that vcdws(n) ≤ n, using the well-known fact that nonerasing pattern languages shatter the set of the words x_k = a^{k-1} b a^{n-k}.
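The shattering fact used in the last remark is easy to verify directly: for E ⊆ {1, ..., n}, take the pattern of length n that carries pairwise distinct variables at the positions in E and the constant a elsewhere. Under the nonerasing semantics, matching a word of length n forces every variable to a single character, so the pattern generates exactly those x_k with k ∈ E. A brute-force check (our own sketch):

```python
from itertools import combinations

def generates(E, word, n):
    """Nonerasing pattern with distinct variables at the positions in E and
    the constant a elsewhere; on words of length n every variable takes one
    character, so membership reduces to checking the constant positions."""
    if len(word) != n:          # single-occurrence variables cannot shrink
        return False            # the word below length n on our test words
    return all(word[i] == "a" for i in range(n) if i + 1 not in E)

def shatters(n):
    words = ["a" * (k - 1) + "b" + "a" * (n - k) for k in range(1, n + 1)]
    for r in range(n + 1):
        for E in combinations(range(1, n + 1), r):
            accepted = {k for k in range(1, n + 1)
                        if generates(set(E), words[k - 1], n)}
            if accepted != set(E):
                return False
    return True

print([shatters(n) for n in (2, 3, 4, 5)])   # [True, True, True, True]
```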
5 Learning k-Variable Nonerasing Patterns
Having established several VC-dimension results in the previous section, we now present some PAC-learnability results. The fact that the VC-dimension of k-variable pattern languages is infinite suggests that this class is not learnable. However, we will show that the version space becomes finite after the first positive example has been seen.

Theorem 9. Let ε and δ be given. Let L be a k-variable pattern language and D be an arbitrary distribution on Σ*. Let S be an initial set of positive sentences of size at least one and let l_min = min{|w| : w ∈ S}. Regarding the version space, we can claim that |VS(k-variable pattern languages, S)| ≤ (l_min + k)^{l_min}. Let h be any pattern consistent with a sample of size at least m = (1/ε)(l_min · log(l_min + k) + log(1/δ)). Then P(Err_{L,D}[h] > ε) ≤ δ.

An exhaustive learner can find a consistent hypothesis (if one exists) after enumerating all possible (l_min + k)^{l_min} patterns. In order to decide whether a pattern is consistent with a sample, the learner has to check if x ∈ L(h) for each example x and pattern h. While this problem is NP-complete for general patterns, it can be solved polynomially for any fixed number of variables k.

Theorem 10. Given a sample S of positive and negative strings, a consistent nonerasing k-variable pattern h can be found in O((l_min + k)^{l_min} · (max{|x| : x ∈ S})^k); that is, learning is, as in the case of PAC, polynomial in the parameters 1/ε and 1/δ but depends exponentially on the parameters l_min and k.

An algorithm that learns a k-variable pattern still has a runtime which grows exponentially in l_min. Under an additional assumption on D and on the length of substitution strings, they become efficiently learnable for fixed k [11].
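The exhaustive learner behind Theorems 9 and 10 can be sketched directly: a consistent nonerasing pattern can be no longer than the shortest positive example, so it suffices to enumerate all strings over Σ together with k variable symbols up to length l_min and to test each for consistency, deciding membership with a backreference regular expression. This is only a naive sketch of the enumeration argument (the names are ours); as stated in the text, its running time is exponential in l_min:

```python
import re
from itertools import product

def pattern_to_regex(pattern, alphabet):
    """Nonerasing semantics: first occurrence of a variable -> (.+),
    later occurrences -> backreferences to the same group."""
    groups, parts = {}, []
    for sym in pattern:
        if sym in alphabet:
            parts.append(re.escape(sym))
        elif sym in groups:
            parts.append("\\%d" % groups[sym])
        else:
            groups[sym] = len(groups) + 1
            parts.append("(.+)")
    return re.compile("".join(parts) + r"\Z")

def consistent_pattern(pos, neg, alphabet, k):
    """Exhaustive search for a nonerasing pattern with at most k distinct
    variables that generates every string in pos and none in neg."""
    variables = ["X%d" % i for i in range(1, k + 1)]
    symbols = sorted(alphabet) + variables
    l_min = min(len(w) for w in pos)
    for length in range(1, l_min + 1):   # a consistent pattern is no longer than l_min
        for pattern in product(symbols, repeat=length):
            rx = pattern_to_regex(pattern, alphabet)
            if all(rx.match(w) for w in pos) and not any(rx.match(w) for w in neg):
                return pattern
    return None

print(consistent_pattern(["aba", "abba"], ["bb"], {"a", "b"}, 1))   # ('a', 'X1')
```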
Patterns with k variable occurrences. If we restrict the patterns to have at most k occurrences of any variables, they become even more easily learnable. The number of k-occurrence patterns which are consistent with an initial example x is at most as large as the number of k-variable patterns; that is, the logarithm of the hypothesis space size is polynomial, which makes the required sample size polynomial, too. However, the learner can find a consistent hypothesis much more quickly.

Theorem 11. Pattern languages with up to k occurrences of variables can be learned from a sample in O(l_min^{2k-2} · k^k); that is, polynomially for fixed k.

The idea of the proof is that up to k substrings in the shortest example can be substituted by variables. Hence, we only need to try all start and end positions of the variables and to enumerate all possible identifications of some variables.
6 Length-Bounded Pattern Languages
In this section we show that length-bounded pattern languages are efficiently learnable even in the agnostic framework. Due to lack of space, our treatment here is informal. We assume the alphabet size |Σ| to be finite. There are (|Σ| + k + 1)^k patterns of length at most k (|Σ| constants, up to k variables, and an empty symbol). It follows immediately from Theorem 1 of [9] that P(Err_{L,D}[h*] > Err_{L,D}[ĥ] + ε) ≤ δ when h* minimizes the empirical error, ĥ = inf_{h ∈ H} Err_{L,D}[h] is the truly best approximation of L in H, and the sample size is at least m ≥ (2/ε²) · log((|Σ| + k + 1)^k / δ). In other words, by returning the hypothesis with the least empirical error, a learner returns (with high probability) a hypothesis which is almost as good as the best possible hypothesis in the language. Hence, length-bounded pattern languages are agnostically learnable. In order to find h*, a learner can enumerate the hypothesis space in O((|Σ| + k + 1)^k).

The union of length-bounded patterns is the power set of the set of length-bounded patterns; hence, this hypothesis space can be bounded to at most 2^{(|Σ|+k+1)^k}. This implies that the sample complexity is m ≥ (2/ε²)((|Σ| + k + 1)^k · log(|Σ| + k + 1) + log(1/δ)) (that is, polynomial in 1/ε and 1/δ), but we still need to find an algorithm which finds a consistent hypothesis in polynomial time; together this proves that this class is polynomially learnable [4]. A greedy coverage algorithm which subsequently generates patterns which cover at least one positive and no negative example can be guaranteed to find a consistent hypothesis (if one exists) in O((|Σ| + k)^k · m⁺) where m⁺ is the number of positive examples (that is, polynomially for a fixed k).

In order to learn unions of length-bounded pattern languages agnostically, we would have to construct a polynomial algorithm which finds an empirical-error-minimizing hypothesis. Note that this is much more difficult: The greedy algorithm will find a consistent hypothesis if one exists. It may occur that every positive instance is covered by a distinct pattern. In the agnostic framework,
we would have to find the hypothesis which minimizes the observed error. An enumerative algorithm, however, would have a time complexity of O(2^{(|Σ|+k)^k}).
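The greedy coverage algorithm sketched above can be made concrete as follows: enumerate the patterns of length at most k, discard those that cover a negative example, and repeatedly pick a pattern that covers as many still-uncovered positive examples as possible. The sketch below (our own naming; erasing semantics, with membership again decided via backreferences) returns None when no consistent union exists:

```python
import re
from itertools import product

def pattern_to_regex(pattern, alphabet):
    """Erasing semantics: first occurrence of a variable -> (.*),
    later occurrences -> backreferences."""
    groups, parts = {}, []
    for sym in pattern:
        if sym in alphabet:
            parts.append(re.escape(sym))
        elif sym in groups:
            parts.append("\\%d" % groups[sym])
        else:
            groups[sym] = len(groups) + 1
            parts.append("(.*)")
    return re.compile("".join(parts) + r"\Z")

def greedy_cover(pos, neg, alphabet, k):
    """Greedily build a union of length-<=k patterns covering all positive
    and no negative examples; returns None if no consistent union exists."""
    symbols = sorted(alphabet) + ["X%d" % i for i in range(1, k + 1)]
    candidates = [pattern_to_regex(p, alphabet)
                  for length in range(1, k + 1)
                  for p in product(symbols, repeat=length)]
    safe = [rx for rx in candidates if not any(rx.match(w) for w in neg)]
    uncovered, union = set(pos), []
    while uncovered:
        best = max(safe, key=lambda rx: sum(1 for w in uncovered if rx.match(w)),
                   default=None)
        if best is None or not any(best.match(w) for w in uncovered):
            return None                 # some positive example cannot be covered
        union.append(best.pattern)
        uncovered -= {w for w in uncovered if best.match(w)}
    return union

cover = greedy_cover(["ab", "aab", "ba"], ["bb"], {"a", "b"}, 2)
print(len(cover))   # 2
```

As noted in the text, this greedy procedure only guarantees consistency; it does not minimize the number of patterns or the empirical error, which is the harder agnostic problem.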
7 Conclusion
We studied the VC-dimension of several subclasses of pattern languages. We showed that even single-variable pattern languages have an unbounded VC-dimension. For this and several other classes with unbounded VC-dimension we furthermore quantified the VC-dimension witness size, thus characterizing just how quickly the VC-dimension grows. We showed that the VC-dimension of the class of single-variable pattern languages which are consistent with a positive example is unbounded; by contrast, the class of pattern languages with k variable occurrences which are consistent with a positive example is finite. Hence, after the first positive example has been read, the sample size which is necessary for good generalization can be quantified. This result does seem to vindicate recent attempts by Reischuk and Zeugmann [18] (see also [7]) to study feasible average-case learnability of single-variable pattern languages by placing reasonable restrictions on the class of distributions.

Acknowledgment. We would like to express our gratitude to the reviewers for several helpful comments that have improved the exposition of the paper. Andrew Mitchell is supported by an Australian Postgraduate Award. Tobias Scheffer is supported by grants WY20/1-1 and WY20/1-2 of the German Research Council (DFG), and an Ernst-von-Siemens fellowship. Arun Sharma is supported by the Australian Research Council Grants A49600456 and A49803051. Frank Stephan is supported by the German Research Council (DFG) grant Am 60/9-2.
References

[1] D. Angluin. Finding patterns common to a set of strings. Journal of Computer and System Sciences, 21:46-62, 1980.
[2] D. Angluin. Inductive inference of formal languages from positive data. Information and Control, 45:117-135, 1980.
[3] P. Auer, R. C. Holte, and W. Maass. Theory and applications of agnostic PAC-learning with small decision trees. In Proceedings of the 12th International Conference on Machine Learning, pages 21-29. Morgan Kaufmann, 1995.
[4] A. Blumer, A. Ehrenfeucht, D. Haussler, and M. Warmuth. Occam's razor. Information Processing Letters, 24:377-380, 1987.
[5] A. Blumer, A. Ehrenfeucht, D. Haussler, and M. Warmuth. Learnability and the Vapnik-Chervonenkis dimension. Journal of the ACM, 36(4):929-965, 1989.
[6] D. Dobkin, D. Gunopoulos, and S. Kasif. Computing optimal shallow decision trees. In Proceedings of the International Workshop on Mathematics in Artificial Intelligence, 1996.
[7] T. Erlebach, P. Rossmanith, H. Stadtherr, A. Steger, and T. Zeugmann. Learning one-variable pattern languages very efficiently on average, in parallel, and by asking queries. In Ming Li and Akira Maruoka, editors, Algorithmic Learning Theory: Eighth International Workshop (ALT '97), volume 1316 of Lecture Notes in Artificial Intelligence, pages 260-276, 1997.
[8] E. M. Gold. Language identification in the limit. Information and Control, 10:447-474, 1967.
[9] D. Haussler. Decision theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation, 100(1):78-150, 1992.
[10] K. Hoeffgen, H. Simon, and K. van Horn. Robust trainability of single neurons. Preprint, 1993.
[11] M. Kearns and L. Pitt. A polynomial-time algorithm for learning k-variable pattern languages. In R. Rivest, D. Haussler, and M. Warmuth, editors, Proceedings of the Second Annual Workshop on Computational Learning Theory. Morgan Kaufmann, 1989.
[12] M. Kearns, R. Schapire, and L. Sellie. Towards efficient agnostic learning. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, pages 341-352. ACM Press, 1992.
[13] K.-I. Ko, A. Marron, and W.-G. Tzeng. Learning string patterns and tree patterns from examples. In Machine Learning: Proceedings of the Seventh International Conference, pages 384-391, 1990.
[14] S. Lange and T. Zeugmann. Monotonic versus non-monotonic language learning. In G. Brewka, K. Jantke, and P. H. Schmitt, editors, Proceedings of the Second International Workshop on Nonmonotonic and Inductive Logic, volume 659 of Lecture Notes in Artificial Intelligence, pages 254-269. Springer-Verlag, 1993.
[15] A. Marron. Learning pattern languages from a single initial example and from queries. In D. Haussler and L. Pitt, editors, Proceedings of the First Annual Workshop on Computational Learning Theory, pages 345-358. Morgan Kaufmann, 1988.
[16] A. Marron and K. Ko. Identification of pattern languages from examples and queries. Information and Computation, 74(2), 1987.
[17] A. Mitchell. Learnability of a subclass of extended pattern languages. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory. ACM Press, 1998.
[18] R. Reischuk and T. Zeugmann. Learning one-variable pattern languages in linear average time. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, pages 198-208. ACM Press, 1998.
[19] P. Rossmanith and T. Zeugmann. Learning k-variable pattern languages efficiently stochastically finite on average from positive data. In Proc. 4th International Colloquium on Grammatical Inference (ICGI-98), LNAI 1433, pages 13-24. Springer, 1998.
[20] A. Salomaa. Patterns (The Formal Language Theory Column). EATCS Bulletin, 54:46-62, 1994.
[21] A. Salomaa. Return to patterns (The Formal Language Theory Column). EATCS Bulletin, 55:144-157, 1994.
[22] R. E. Schapire. Pattern languages are not learnable. In M. Fulk and J. Case, editors, Proceedings of the Third Annual Workshop on Computational Learning Theory, pages 122-129. Morgan Kaufmann, 1990.
[23] T. Shinohara. Polynomial time inference of extended regular pattern languages. In RIMS Symposia on Software Science and Engineering, Kyoto, Japan, volume 147 of Lecture Notes in Computer Science, pages 115-127. Springer-Verlag, 1982.
[24] T. Shinohara and S. Arikawa. Pattern inference. In Klaus P. Jantke and Steffen Lange, editors, Algorithmic Learning for Knowledge-Based Systems, volume 961 of Lecture Notes in Artificial Intelligence, pages 259-291. Springer-Verlag, 1995.
[25] L. Valiant. A theory of the learnable. Communications of the ACM, 27:1134-1142, 1984.
On the Vγ Dimension for Regression in Reproducing Kernel Hilbert Spaces

Theodoros Evgeniou and Massimiliano Pontil

Center for Biological and Computational Learning, MIT
45 Carleton Street E25-201, Cambridge, MA 02142, USA
{theos,pontil}@ai.mit.edu
Abstract. This paper presents a computation of the Vγ dimension for regression in bounded subspaces of Reproducing Kernel Hilbert Spaces (RKHS) for the Support Vector Machine (SVM) regression ε-insensitive loss function Lε, and general Lp loss functions. Finiteness of the Vγ dimension is shown, which also proves uniform convergence in probability for regression machines in RKHS subspaces that use the Lε or general Lp loss functions. This paper presents a novel proof of this result. It also presents a computation of an upper bound of the Vγ dimension under some conditions, that leads to an approach for the estimation of the empirical Vγ dimension given a set of training data.
1 Introduction
The Vγ dimension, a variation of the VC-dimension [11], is important for the study of learning machines [1,5]. In this paper we present a computation of the Vγ dimension of the real-valued functions L(y, f(x)) = |y - f(x)|^p and L(y, f(x)) = |y - f(x)|_ε (Vapnik's ε-insensitive loss function Lε [11]), with f in a bounded sphere in a Reproducing Kernel Hilbert Space (RKHS). We show that the Vγ dimension is finite for these loss functions, and compute an upper bound on it. We also present a second computation of the Vγ dimension in a special case of infinite dimensional RKHS, which is often the type of hypothesis space considered in the literature (i.e. Radial Basis Functions [9,6]). It also holds for the case when a bias is added to the functions, that is, with f being of the form f = f0 + b, where b ∈ R and f0 is in a sphere in an infinite dimensional RKHS. This computation leads to an approach for computing the empirical Vγ dimension (or random entropy of a hypothesis space [11]) given a set of training data, an issue that we discuss at the end of the paper. Our result applies to standard regression learning machines such as Regularization Networks (RN) and Support Vector Machines (SVM). For a regression learning problem using L as a loss function, it is known [1] that finiteness of the Vγ dimension for all γ > 0 is a necessary and sufficient condition for uniform convergence in probability [11]. So the results of this paper have implications for uniform convergence both for RN and for SVM regression [5].
O. Watanabe, T. Yokomori (Eds.): ALT'99, LNAI 1720, pp. 106-117, 1999.
© Springer-Verlag Berlin Heidelberg 1999
On the Vγ Dimension for Regression in Reproducing Kernel Hilbert Spaces
Previous related work addressed the problem of pattern recognition, where L is an indicator function [3,7]. The fat-shattering dimension [1] was considered instead of the Vγ one. A different approach to proving uniform convergence for RN and SVM is given in [13], where covering number arguments using entropy numbers of operators are presented. In both cases, regression as well as the case of non-zero bias b were marginally considered. The paper is organized as follows. Section 2 outlines the background and motivation of this work. The reader familiar with statistical learning theory and RKHS can skip this section. Section 3 presents a proof of the results as well as an upper bound on the Vγ dimension. Section 4 presents a second computation of the Vγ dimension in a special case of infinite dimensional RKHS, also when the hypothesis space consists of functions of the form f = f₀ + b, where b ∈ ℝ and f₀ is in a sphere in a RKHS. Finally, Section 5 discusses possible extensions of this work.
2 Background and Motivation
We consider the problem of learning from examples as it is viewed in the framework of statistical learning theory [11]. We are given a set of l examples (x₁, y₁), …, (x_l, y_l) generated by randomly sampling from a space X × Y according to an unknown probability distribution P(x, y), with X ⊆ ℝ^d and Y ⊆ ℝ. Throughout the paper we assume that X and Y are bounded. Using this set of examples, the problem of learning consists of finding a function f : X → Y that can be used, given any new point x ∈ X, to predict the corresponding value y. The problem of learning from examples is known to be ill-posed [11,10]. A classical way to solve it is to perform Empirical Risk Minimization (ERM) with respect to a certain loss function, while restricting the solution to the problem to be in a small hypothesis space [11]. Formally this means minimizing the empirical risk I_emp[f] = (1/l) Σ_{i=1}^l L(y_i, f(x_i)) with f ∈ H, where L is the loss function measuring the error when we predict f(x) while the actual value is y, and H is a given hypothesis space. In this paper, we consider hypothesis spaces of functions which are hyperplanes in some feature space:

f(x) = Σ_{n=1}^∞ w_n φ_n(x)   (1)

with:

Σ_{n=1}^∞ w_n²/λ_n < ∞   (2)

where {φ_n(x)} is a set of given, linearly independent basis functions, and the λ_n are given non-negative constants such that Σ_{n=1}^∞ λ_n² < ∞. Spaces of functions of the form (1) can also be seen as Reproducing Kernel Hilbert Spaces (RKHS) [2,12] with kernel K given by:
Theodoros Evgeniou and Massimiliano Pontil

K(x, y) = Σ_{n=1}^∞ λ_n φ_n(x) φ_n(y)   (3)
For any function f as in (1), the quantity (2) is called the RKHS norm of f, ||f||²_K, while the number D of features φ_n (which can be finite, in which case all sums above are finite) is the dimensionality of the RKHS. If we restrict the hypothesis space to consist of functions in a RKHS with norm less than a constant A, the general setting of learning discussed above becomes:

Minimize:   (1/l) Σ_{i=1}^l L(y_i, f(x_i))
subject to:  ||f||²_K ≤ A²   (4)
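The constrained ERM problem (4) can be made concrete numerically. The sketch below is illustrative and not from the paper: the Gaussian kernel, the sample data, and the ridge-style choice of coefficients are all assumptions. For a candidate of the form f(x) = Σ_i α_i K(x_i, x), it evaluates both the empirical L1 risk and the RKHS norm α^T G α that the constraint ||f||²_K ≤ A² bounds:

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    # K(x, y) = exp(-||x - y||^2 / (2 sigma^2)); the kernel choice is an assumption
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))

def gram_matrix(X, kernel):
    n = len(X)
    return np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(10, 2))   # l = 10 examples x_i
y = rng.uniform(-1, 1, size=10)        # bounded targets y_i

G = gram_matrix(X, gaussian_kernel)

# one candidate function of the form f(x) = sum_i alpha_i K(x_i, x);
# the small ridge term only keeps the solve well conditioned
alpha = np.linalg.solve(G + 0.1 * np.eye(10), y)

f_train = G @ alpha                        # f evaluated at the training points
emp_risk = np.mean(np.abs(y - f_train))    # empirical risk (1/l) sum |y_i - f(x_i)|
rkhs_norm_sq = alpha @ G @ alpha           # ||f||_K^2 for this candidate

A2 = 10.0                                  # the constraint of (4): ||f||_K^2 <= A^2
feasible = rkhs_norm_sq <= A2
```

For functions of this representer form, the constraint of (4) reduces to the quadratic check α^T G α ≤ A² shown on the last lines.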
An important question for any learning machine of the type (4) is whether it is consistent: as the number of examples (x_i, y_i) goes to infinity, the expected error of the solution of the machine should converge in probability to the minimum expected error in the hypothesis space [11,4]. In the case of learning machines performing ERM in a hypothesis space (4), consistency is shown to be related to uniform convergence in probability [11], and necessary and sufficient conditions for uniform convergence are given in terms of the Vγ dimension (also known as the level fat-shattering dimension) of the hypothesis space considered [1,8], which is a measure of the complexity of the space. In statistical learning theory the measure of complexity typically used is the VC-dimension. However, as we show below, the VC-dimension in the above learning setting in the case of infinite dimensional RKHS is infinite both for Lp and Lε, so it cannot be used to study learning machines of the form (4). Instead one needs to consider other measures of complexity, such as the Vγ dimension, in order to prove uniform convergence in infinite dimensional RKHS. We now present some background on the Vγ dimension [1]. The Vγ dimension of a set of real-valued functions is defined as follows:

Definition 1. Let C ≤ L(y, f(x)) ≤ B, f ∈ H, with C and B finite. The Vγ dimension of L in H (of the set {L(y, f(x)) : f ∈ H}) is defined as the maximum number h of vectors (x₁, y₁), …, (x_h, y_h) that can be separated into two classes in all 2^h possible ways using rules:

class 1 if:  L(y_i, f(x_i)) ≥ s + γ
class −1 if:  L(y_i, f(x_i)) ≤ s − γ

for f ∈ H and some s with C + γ ≤ s ≤ B − γ. If, for any number N, it is possible to find N points (x₁, y₁), …, (x_N, y_N) that can be separated in all the 2^N possible ways, we will say that the Vγ dimension of L in H is infinite.

For γ = 0 and for s being free to change value for each separation of the data, this becomes the VC dimension of the set of functions [11]. In the case of hyperplanes (1), the Vγ dimension has also been referred to in the literature [11] as the VC dimension of hyperplanes with margin. In order to avoid confusion with names, we call the VC dimension of hyperplanes with margin the
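The separation rules of Definition 1 can be stated as a small predicate. The helper below is an illustrative sketch (the function name and numbers are ours, not the paper's): given the loss values of the points under some fixed f, a level s, and a margin γ, it reports on which side of the band (s − γ, s + γ) each point falls:

```python
def vgamma_separation(losses, s, gamma):
    """Apply the rules of Definition 1 at level s with margin gamma:
    +1 if L >= s + gamma, -1 if L <= s - gamma, and None if the loss
    falls strictly inside the band (s - gamma, s + gamma), i.e. the
    point is not assigned to either class by this (f, s) pair."""
    labels = []
    for L in losses:
        if L >= s + gamma:
            labels.append(+1)
        elif L <= s - gamma:
            labels.append(-1)
        else:
            labels.append(None)
    return labels

# losses L(y_i, f(x_i)) of three points under some fixed f (made-up numbers)
print(vgamma_separation([0.1, 0.9, 0.5], s=0.5, gamma=0.2))  # [-1, 1, None]
```

A set of h points is then Vγ-shattered when, over all 2^h target dichotomies, some f ∈ H and some admissible s realize each dichotomy with no point left inside the band.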
Vγ dimension of hyperplanes (for appropriate γ depending on the margin, as discussed below). The Vγ dimension can be used to bound the covering numbers of a set of functions [1], which are in turn related to the generalization performance of learning machines. Typically the fat-shattering dimension [1] is used for this purpose, but a close relation between that and the Vγ dimension [1] makes the two equivalent for the purpose of bounding covering numbers and hence studying the statistical properties of a machine. The VC dimension has been used to bound the growth function G_H(l). This function measures the maximum number of ways we can separate l points using functions from hypothesis space H. If h is the VC dimension, then G_H(l) is 2^l if l ≤ h, and (el/h)^h otherwise [11] (where e is the base of the natural logarithm). In Section 3 we will use the growth function of hyperplanes with margin to bound their VC dimension, which, as discussed above, is the Vγ dimension we are interested in. Using the Vγ dimension, Alon et al. [1] gave necessary and sufficient conditions for uniform convergence in probability to take place in a hypothesis space H. In particular they proved the following important theorem:

Theorem 1. (Alon et al., 1997) Let C ≤ L(y, f(x)) ≤ B, f ∈ H, with H a set of bounded functions. The ERM method uniformly converges (in probability) if and only if the Vγ dimension of L in H is finite for every γ > 0.

It is clear that if for learning machines of the form (4) the Vγ dimension of the loss function L in the hypothesis space defined is finite for all γ > 0, then uniform convergence takes place. In the next section we present a proof of the finiteness of the Vγ dimension, as well as an upper bound on it.

2.1 Why Not Use the VC-Dimension
Consider first the case of Lp loss functions. Consider an infinite dimensional RKHS, and the set of functions with norm ||f||²_K ≤ A². If for any N we can find N points that we can shatter using functions of our set according to the rule:

class 1 if:  |y − f(x)|^p ≥ s
class −1 if:  |y − f(x)|^p ≤ s
then clearly the VC dimension is infinite. Consider N distinct points (x_i, y_i) with y_i = 0 for all i, and let the smallest eigenvalue of the matrix G with G_ij = K(x_i, x_j) be λ. Since we are in an infinite dimensional RKHS, matrix G is always invertible [12], so λ > 0 since G is positive definite and finite dimensional (λ may decrease as N increases, but for any finite N it is well defined and ≠ 0). For any separation of the points, we consider a function f of the form f(x) = Σ_{i=1}^N α_i K(x_i, x), which is a function of the form (1). We need to show that we can find coefficients α_i such that the RKHS norm of the function is ≤ A². Notice that the norm of a function of this form is α^T G α, where (α)_i = α_i (throughout the paper bold letters are used for denoting vectors). Consider the set of linear equations:

class 1 (x_j):  Σ_{i=1}^N α_i G_ij = s^{1/p} + η,  η > 0
class −1 (x_j):  Σ_{i=1}^N α_i G_ij = s^{1/p} − η,  η > 0
Let s = 0. If we can find a solution α to this system of equations such that α^T G α ≤ A², we can perform this separation, and since this is any separation we can shatter the N points. Notice that the solution to the system of equations is α = G^{−1}η, where η is the vector whose components are (η)_i = η when x_i is in class 1, and −η otherwise. So we need (G^{−1}η)^T G (G^{−1}η) ≤ A², that is, η^T G^{−1} η ≤ A². Since the smallest eigenvalue of G is λ > 0, we have that η^T G^{−1} η ≤ (η^T η)/λ. Moreover η^T η = Nη². So if we choose η small enough such that Nη²/λ ≤ A², the norm of the solution is less than A², which completes the proof.

For the case of the Lε loss function the argument above can be repeated with y_i = ε to prove again that the VC dimension is infinite in an infinite dimensional RKHS. Finally, notice that the same proof can be repeated for finite dimensional RKHS to show that the VC dimension is never less than the dimensionality D of the RKHS, since it is possible to find D points for which matrix G is invertible and repeat the proof above. As a consequence the VC dimension cannot be controlled by A². This is also discussed in [13].
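The construction in the proof above can be checked numerically. The sketch below is illustrative (the Gaussian kernel, sizes, and labelling are our assumptions): it builds the Gram matrix G for N points, picks one dichotomy, solves α = G^{−1}t with targets ±η, and verifies that the resulting f separates the points while its RKHS norm η^T G^{−1} η ≤ Nη²/λ stays below A²:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 8
X = rng.uniform(-2, 2, size=(N, 3))                 # N distinct points, with y_i = 0
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
G = np.exp(-sq_dists)                               # Gaussian Gram matrix G_ij = K(x_i, x_j)

lam = np.linalg.eigvalsh(G)[0]                      # smallest eigenvalue; > 0 for distinct points
A2 = 1.0
eta = np.sqrt(A2 * lam / N) / 2.0                   # small enough that N eta^2 / lam < A^2

labels = np.array([1, -1, 1, 1, -1, -1, 1, -1])     # one arbitrary dichotomy (s = 0)
t = eta * labels                                    # targets f(x_i) = +eta / -eta
alpha = np.linalg.solve(G, t)                       # f(x) = sum_i alpha_i K(x_i, x) interpolates t

norm_sq = alpha @ G @ alpha                         # equals t^T G^{-1} t <= N eta^2 / lam
```

Because η can be made arbitrarily small, every one of the 2^N dichotomies is realized this way within the norm budget A², which is exactly why the VC dimension is not controlled by A².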
3 An Upper Bound on the Vγ Dimension
Below we always assume that the data X are within a sphere of radius R in the feature space defined by the kernel K of the RKHS. Without loss of generality, we also assume that y is bounded between −1 and 1. Under these assumptions the following theorem holds:

Theorem 2. The Vγ dimension h for regression using Lp (1 ≤ p < ∞) or Lε loss functions, for hypothesis spaces H_A = {f(x) = Σ_{n=1}^∞ w_n φ_n(x) : Σ_{n=1}^∞ w_n²/λ_n ≤ A²} and y bounded, is finite for all γ > 0. If D is the dimensionality of the RKHS, then h ≤ O(min(D, (R² + 1)(A² + 1)/γ²)).

Proof. Let us consider first the case of the L1 loss function. Let B be the upper bound on the loss function. From Definition 1 we can decompose the rules for separating points as follows:

class 1 if:  y_i − f(x_i) ≥ s + γ  or  y_i − f(x_i) ≤ −(s + γ)
class −1 if:  y_i − f(x_i) ≤ s − γ  and  y_i − f(x_i) ≥ −(s − γ)   (5)
for some s with γ ≤ s ≤ B − γ. For any N points, the number of separations of the points we can get using rules (5) is not more than the number of separations we can get using the product of two indicator functions with margin (of hyperplanes with margin):

function (a):  class 1 if y_i − f₁(x_i) ≥ s₁ + γ;  class −1 if y_i − f₁(x_i) ≤ s₁ − γ
function (b):  class 1 if y_i − f₂(x_i) ≥ −(s₂ − γ);  class −1 if y_i − f₂(x_i) ≤ −(s₂ + γ)   (6)
where f₁ and f₂ are in H_A and γ ≤ s₁, s₂ ≤ B − γ. This is shown as follows. Clearly the product of the two indicator functions (6) has less separating power when we add the constraints s₁ = s₂ = s and f₁ = f₂ = f. Furthermore, even with these constraints we still have more separating power than we have using rules (5): any separation realized using (5) can also be realized using the product of the two indicator functions (6) under the constraints s₁ = s₂ = s and f₁ = f₂ = f. For example, if y − f(x) ≥ s + γ then indicator function (a) will give +1, indicator function (b) will also give +1, so their product will give +1, which is what we get if we follow (5). Similarly for all other cases. As mentioned in the previous section, for any N points the number of ways we can separate them is bounded by the growth function. Moreover, for products of indicator functions it is known [11] that the growth function is bounded by the product of the growth functions of the indicator functions. Furthermore, the indicator functions in (6) are hyperplanes with margin in the D + 1 dimensional space of vectors (φ_n(x), y), where the radius of the data is R² + 1, the norm of the hyperplane is bounded by A² + 1 (where in both cases we add 1 because of y), and the margin is at least γ²/(A² + 1). The Vγ dimension h_γ of these hyperplanes is known [11,3] to be bounded by h_γ ≤ min((D + 1) + 1, (R² + 1)(A² + 1)/γ²). So the growth function of the separating rules (5) is bounded by the product of the growth functions (el/h_γ)^{h_γ}, that is, G(l) ≤ (el/h_γ)^{2h_γ} whenever l ≥ h_γ. If h_reg is the Vγ dimension, then h_reg cannot be larger than the largest number l for which 2^l ≤ (el/h_γ)^{2h_γ} holds. From this, after some algebraic manipulations (take the log of both sides), we get that l ≤ 5h_γ, therefore h_reg ≤ 5 min(D + 2, (R² + 1)(A² + 1)/γ²),
which proves the theorem for the case of the L1 loss function. For general Lp loss functions we can follow the same proof, where (5) now needs to be rewritten as:

class 1 if:  y_i − f(x_i) ≥ (s + γ)^{1/p}  or  f(x_i) − y_i ≥ (s + γ)^{1/p}
class −1 if:  y_i − f(x_i) ≤ (s − γ)^{1/p}  and  f(x_i) − y_i ≤ (s − γ)^{1/p}   (7)

Moreover, for 1 < p < ∞, (s + γ)^{1/p} ≥ s^{1/p} + γ/(pB^{(p−1)/p}), since γ = ((s + γ)^{1/p})^p − (s^{1/p})^p = ((s + γ)^{1/p} − s^{1/p}) · (((s + γ)^{1/p})^{p−1} + ⋯ + (s^{1/p})^{p−1}) ≤ ((s + γ)^{1/p} − s^{1/p})(B^{(p−1)/p} + ⋯ + B^{(p−1)/p}) = ((s + γ)^{1/p} − s^{1/p})(pB^{(p−1)/p}), and (s − γ)^{1/p} ≤ s^{1/p} − γ/(pB^{(p−1)/p}) (similarly). Repeating the same
argument as above, we get that the Vγ dimension is bounded by 5 min(D + 2, (pB)²(R² + 1)(A² + 1)/γ²). Finally, for the Lε loss function, (5) can be rewritten as:

class 1 if:  y_i − f(x_i) ≥ s + γ + ε  or  f(x_i) − y_i ≥ s + γ + ε
class −1 if:  y_i − f(x_i) ≤ s − γ + ε  and  f(x_i) − y_i ≤ s − γ + ε   (8)

where, calling s₀ = s + ε, we can simply repeat the proof above and get the same upper bound on the Vγ dimension as in the case of the L1 loss function. (Notice that the constraint γ ≤ s ≤ B − γ is not taken into account. Taking this into account may slightly change the Vγ dimension for Lε. Since it is a constraint, it can only decrease, or not change, the Vγ dimension.)

These results imply that in the case of infinite dimensional RKHS the Vγ dimension is still finite and is bounded by 5(R² + 1)(A² + 1)/γ². In the next section we present a different upper bound on the Vγ dimension in a special case of infinite dimensional RKHS.
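The resulting bound is a one-line computation. The helper below (our notation, for illustration) evaluates h ≤ 5 min(D + 2, (R² + 1)(A² + 1)/γ²) and shows that the capacity term is independent of the dimensionality D, so it remains finite in an infinite dimensional RKHS:

```python
def vgamma_upper_bound(D, R2, A2, gamma):
    """Theorem 2 bound: h <= 5 * min(D + 2, (R^2 + 1)(A^2 + 1) / gamma^2)."""
    return 5 * min(D + 2, (R2 + 1) * (A2 + 1) / gamma ** 2)

# finite dimensional RKHS: the dimension term can dominate
finite_case = vgamma_upper_bound(D=3, R2=1.0, A2=10.0, gamma=0.5)                # 25

# infinite dimensional RKHS: only the margin-dependent term matters
infinite_case = vgamma_upper_bound(D=float("inf"), R2=1.0, A2=10.0, gamma=0.5)   # 440.0
```

Note how the bound degrades as γ → 0, consistent with Theorem 1: finiteness is needed for every γ > 0, but the value itself grows like 1/γ².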
4 The Vγ Dimension in a Special Case
Below we assume that the data x are restricted so that for any finite dimensional matrix G with entries G_ij = K(x_i, x_j) (where K is, as mentioned in the previous section, the kernel of the RKHS considered, and x_i ≠ x_j for i ≠ j) the largest eigenvalue of G is always ≤ M² for a given constant M. We consider only the case that the RKHS is infinite dimensional. We denote by B the upper bound of L(y, f(x)). Under these assumptions we can show that:

Theorem 3. The Vγ dimension for regression using the L1 loss function and for hypothesis space H_A = {f(x) = Σ_{n=1}^∞ w_n φ_n(x) + b : Σ_{n=1}^∞ w_n²/λ_n ≤ A²} is finite for all γ > 0. In particular:

1. If b is constrained to be zero, then Vγ ≤ M²A²/γ².
2. If b is a free parameter, Vγ ≤ 4M²A²/γ².

Proof of part 1. Suppose we can find N > M²A²/γ² points (x₁, y₁), …, (x_N, y_N) that we can shatter. Let s ∈ [γ, B − γ] be the value of the parameter used to shatter the points. Consider the following separation¹: if y_i < s, then (x_i, y_i) belongs in class 1. All other points belong in class −1. For this separation we need:

|y_i − f(x_i)| ≥ s + γ  if y_i < s
|y_i − f(x_i)| ≤ s − γ  if y_i ≥ s   (9)
¹ Notice that this separation might be a trivial one, in the sense that we may want all the points to be +1 or all to be −1, i.e. when all y_i ≥ s or when all y_i < s, respectively.
This means that: for points in class 1, f takes values either y_i + s + γ + δ_i or y_i − s − γ − δ_i, with δ_i ≥ 0. For points in the second class, f takes values either y_i + s − γ − δ_i or y_i − s + γ + δ_i, with δ_i ∈ [0, s − γ]. So (9) can be seen as a system of linear equations:

Σ_{n=1}^∞ w_n φ_n(x_i) = t_i   (10)

with t_i being y_i + s + γ + δ_i, or y_i − s − γ − δ_i, or y_i + s − γ − δ_i, or y_i − s + γ + δ_i, depending on i. We first use Lemma 1 to show that for any solution (so the t_i are fixed now) there is another solution with no larger norm that is of the form Σ_{i=1}^N α_i K(x_i, x).

Lemma 1. Among all the solutions of a system of equations (10), the solution with the minimum RKHS norm is of the form Σ_{i=1}^N α_i K(x_i, x) with α = G^{−1}t.

For a proof see the Appendix. Given this lemma, we consider only functions of the form Σ_{i=1}^N α_i K(x_i, x). We show that the function of this form that solves the system of equations (10) has norm larger than A². Therefore any other solution has norm larger than A², which implies we cannot shatter N points using functions of our hypothesis space. The solution α = G^{−1}t needs to satisfy the constraint:

α^T G α = t^T G^{−1} t ≤ A²

Let λ_max be the largest eigenvalue of matrix G. Then t^T G^{−1} t ≥ (t^T t)/λ_max. Since λ_max ≤ M², t^T G^{−1} t ≥ (t^T t)/M². Moreover, because of the choice of the separation, t^T t ≥ Nγ² (for example, a point in class 1 contributes to t^T t an amount equal to (y_i + s + γ + δ_i)²: since y_i + s > 0 and γ + δ_i ≥ γ > 0, (y_i + s + γ + δ_i)² ≥ γ². Similarly each of the other points contributes at least γ² to t^T t, so t^T t ≥ Nγ²). So:

t^T G^{−1} t ≥ (t^T t)/λ_max ≥ Nγ²/M² > A²
since we assumed that N > M²A²/γ². This is a contradiction, so we conclude that we cannot get this particular separation.

Proof of part 2. Consider N points that can be shattered. This means that for any separation, for points in the first class there are δ_i ≥ 0 such that |f(x_i) + b − y_i| = s + γ + δ_i. For points in the second class there are δ_i ∈ [0, s − γ] such that |f(x_i) + b − y_i| = s − γ − δ_i. As in the case b = 0, we can remove the absolute values by considering for each class two types of points (we call them type 1 and type 2). For class 1, type 1 are points for which f(x_i) = y_i + s + γ + δ_i − b = t_i − b. Type 2 are points for which f(x_i) = y_i − s − γ − δ_i − b = t_i − b. For class 2, type 1
are points for which f(x_i) = y_i + s − γ − δ_i − b = t_i − b. Type 2 are points for which f(x_i) = y_i − s + γ + δ_i − b = t_i − b. The variables t_i are as in the case b = 0. Let S₁₁, S₁₂, S₋₁₁, S₋₁₂ denote the four sets of points (S_ij are the points of class i, type j). Using Lemma 1, we only need to consider functions of the form f(x) = Σ_{i=1}^N α_i K(x_i, x). The coefficients α_i are given by α = G^{−1}(t − b), where b is a vector of b's. As in the case b = 0, the RKHS norm of this function is at least

(1/M²)(t − b)^T(t − b)   (11)

The b that minimizes (11) is (1/N)(Σ_{i=1}^N t_i). So (11) is at least as large as (after replacing b and doing some simple calculations) (1/(2NM²)) Σ_{i,j=1}^N (t_i − t_j)². We now consider a particular separation. Without loss of generality assume that y₁ ≤ y₂ ≤ ⋯ ≤ y_N and that N is even (if odd, consider N − 1 points). Consider the separation where class 1 consists only of the even points {2, 4, …, N}. The following lemma is shown in the appendix:

Lemma 2. For the separation considered, Σ_{i,j=1}^N (t_i − t_j)² is at least as large as γ²(N² − 4)/2.

Using Lemma 2 we get that the norm of the solution for the considered separation is at least as large as γ²(N² − 4)/(4NM²). Since this has to be ≤ A², we get that N − 4/N ≤ 4M²A²/γ², which completes the proof (assume N > 4 and ignore additive constants less than 1 for simplicity of notation).
In the case of Lp loss functions, using the same argument as in the previous section, we get that the Vγ dimension in infinite dimensional RKHS is bounded by (pB)²M²A²/γ² in the first case of Theorem 3, and by 4(pB)²M²A²/γ² in the second case of Theorem 3. Finally, for the Lε loss function the bound on the Vγ dimension is the same as that for the L1 loss function, again using the argument of the previous section.

4.1 Empirical Vγ Dimension
Above we assumed a bound on the eigenvalues of any finite dimensional matrix G. However such a bound may not be known a priori, or it may not even exist, in which case the computation is not valid. In practice we can still use the method presented above to measure the empirical Vγ dimension given a set of l training points. This can provide an upper bound on the random entropy of our hypothesis space [11]. More precisely, given a set of l training points we build the l × l matrix G as before, and compute its largest eigenvalue λ_max. We can then substitute M² with λ_max in the computation above to get an upper bound on what we call the empirical Vγ dimension. This can be used directly to get bounds on the random entropy (or the number of ways that the l training points can be separated using rules (5)) of our hypothesis space. Finally, the statistical properties of our
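The procedure just described is straightforward to implement. The sketch below is illustrative (the kernel and data are assumptions): it builds the l × l Gram matrix of the training points, computes its largest eigenvalue λ_max, and substitutes it for M² in the part-1 bound λ_max A²/γ² of Theorem 3:

```python
import numpy as np

def empirical_vgamma_bound(X, kernel, A2, gamma):
    """Part-1 bound of Theorem 3 with M^2 replaced by the largest
    eigenvalue of the l x l Gram matrix of the l training points."""
    l = len(X)
    G = np.array([[kernel(X[i], X[j]) for j in range(l)] for i in range(l)])
    lam_max = np.linalg.eigvalsh(G)[-1]   # eigvalsh returns eigenvalues in ascending order
    return lam_max * A2 / gamma ** 2

rbf = lambda x, y: np.exp(-np.sum((x - y) ** 2))   # illustrative kernel choice
rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, size=(20, 2))               # l = 20 training points

bound = empirical_vgamma_bound(X, rbf, A2=1.0, gamma=0.1)
```

Unlike the a priori constant M², the quantity λ_max is always computable from the sample, which is what makes the empirical version usable even when no uniform eigenvalue bound exists.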
learning machine can be studied using the estimated empirical Vγ dimension (or the random entropy), in a way similar in spirit to [13].
5 Conclusion
We presented a novel approach for computing the Vγ dimension of RKHS for Lp and Lε loss functions. We conclude with a few remarks. First notice that in the computations we did not take ε into account in the case of the Lε loss function. Taking ε into account may lead to better bounds. For example, considering |f(x) − y|^p_ε, p > 1, as the loss function, it is clear from the proofs presented that the Vγ dimension is bounded by p²(B − γ)²M²A²/γ². However the influence of ε seems to be minor (given that ε ≪ B). An interesting observation is that the eigenvalues of the matrix G appear in the computation of the Vγ dimension. In the second computation we took into account only the largest and smallest eigenvalues. If the computation is made to upper bound the number of separations for a given set of points (random entropy or empirical Vγ dimension), as discussed in Section 4.1, then it may be possible that all the eigenvalues of G are taken into account. This can lead to interesting relations with the work in [13].

Acknowledgments. We would like to thank S. Mukherjee, T. Poggio, R. Rifkin, and A. Verri for useful discussions and comments.
References
1. N. Alon, S. Ben-David, N. Cesa-Bianchi, and D. Haussler. Scale-sensitive dimensions, uniform convergence, and learnability. Journal of the ACM, 44(4):615-631, 1997.
2. N. Aronszajn. Theory of reproducing kernels. Transactions of the American Mathematical Society, 68:337-404, 1950.
3. P. Bartlett and J. Shawe-Taylor. Generalization performance of support vector machines and other pattern classifiers. In B. Schölkopf and C. Burges, editors, Advances in Kernel Methods - Support Vector Learning. MIT Press, 1998.
4. L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Number 31 in Applications of Mathematics. Springer, New York, 1996.
5. T. Evgeniou, M. Pontil, and T. Poggio. A unified framework for regularization networks and support vector machines. A.I. Memo No. 1654, Artificial Intelligence Laboratory, Massachusetts Institute of Technology, 1999.
6. F. Girosi, M. Jones, and T. Poggio. Regularization theory and neural networks architectures. Neural Computation, 7:219-269, 1995.
7. L. Gurvits. A note on a scale-sensitive dimension of linear bounded functionals in Banach spaces. In Proceedings of Algorithmic Learning Theory, 1997.
8. M. Kearns and R. E. Schapire. Efficient distribution-free learning of probabilistic concepts. Journal of Computer and System Sciences, 48(3):464-497, 1994.
9. M. J. D. Powell. The theory of radial basis function approximation in 1990. In W. A. Light, editor, Advances in Numerical Analysis Volume II: Wavelets, Subdivision Algorithms and Radial Basis Functions, pages 105-210. Oxford University Press, 1992.
10. A. N. Tikhonov and V. Y. Arsenin. Solutions of Ill-posed Problems. W. H. Winston, Washington, D.C., 1977.
11. V. N. Vapnik. Statistical Learning Theory. Wiley, New York, 1998.
12. G. Wahba. Spline Models for Observational Data. Series in Applied Mathematics, Vol. 59, SIAM, Philadelphia, 1990.
13. R. Williamson, A. Smola, and B. Schölkopf. Generalization performance of regularization networks and support vector machines via entropy numbers. Technical Report NC-TR-98-019, Royal Holloway College, University of London, 1998.
Appendix

Proof of Lemma 1. We introduce the N × ∞ matrix A with A_in = √λ_n φ_n(x_i) and the new variable z_n = w_n/√λ_n. We can write system (10) as follows:

Az = t   (12)

Notice that the solution of the system of equations (10) with minimum RKHS norm is equivalent to the least squares (LS) solution of equation (12). Let us denote by z⁰ the LS solution of system (12). We have:

z⁰ = (A^T A)⁺ A^T t   (13)

where ⁺ denotes the pseudoinverse. To see what this solution looks like we use Singular Value Decomposition techniques:

A = U Σ V^T,  A^T = V Σ U^T,

from which A^T A = V Σ² V^T and (A^T A)⁺ = V_N Σ_N^{−2} V_N^T, where Σ_N^{−1} denotes the N × N matrix whose elements are the inverses of the nonzero eigenvalues. After some computations, equation (13) can be written as:

z⁰ = V Σ_N^{−1} U_N^T t = (V Σ_N U_N^T)(U_N Σ_N^{−2} U_N^T) t = A^T G^{−1} t   (14)

Using the definition of z⁰ we have that

Σ_{n=1}^∞ w_n⁰ φ_n(x) = Σ_{n=1}^∞ Σ_{i=1}^N √λ_n φ_n(x) A_in α_i

with α = G^{−1}t. Finally, using the definition of A_in we get:

Σ_{n=1}^∞ w_n⁰ φ_n(x) = Σ_{i=1}^N α_i K(x, x_i)   (15)

which completes the proof.
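Lemma 1 can be verified numerically in a truncated feature space. The sketch below is illustrative only: the features φ_n(x) = cos(nx) and eigenvalues λ_n = 1/n² are assumptions made purely for the demonstration. It compares the minimum-norm least-squares solution of Az = t with the representation z = A^T G^{−1} t of equation (14):

```python
import numpy as np

# Truncated feature space (assumptions for the demonstration only):
# phi_n(x) = cos(n x) with eigenvalues lambda_n = 1 / n^2, n = 1..D
D, N = 50, 5
lam = 1.0 / np.arange(1, D + 1) ** 2
x_pts = np.linspace(0.1, 2.0, N)
Phi = np.cos(np.outer(np.arange(1, D + 1), x_pts))   # Phi[n, i] = phi_n(x_i)

A = (np.sqrt(lam)[:, None] * Phi).T                  # A[i, n] = sqrt(lambda_n) phi_n(x_i)
G = A @ A.T                                          # G_ij = sum_n lambda_n phi_n(x_i) phi_n(x_j) = K(x_i, x_j)
t = np.array([1.0, -0.5, 0.3, 0.7, -0.2])            # arbitrary right-hand side of (10)

z0 = np.linalg.pinv(A) @ t                           # minimum-norm least-squares solution of A z = t
z_lemma = A.T @ np.linalg.solve(G, t)                # z = A^T G^{-1} t, i.e. alpha = G^{-1} t as in (14)
```

The two vectors coincide, and the squared norm of z equals t^T G^{−1} t, which is the quantity bounded in the proofs of Theorem 3.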
Proof of Lemma 2. Consider a point (x_i, y_i) in S₁₁ and a point (x_j, y_j) in S₋₁₁ such that y_i ≥ y_j (if such a pair does not exist we can consider another pair from the cases listed below). For these points, (t_i − t_j)² = (y_i + s + γ + δ_i − y_j − s + γ + δ_j)² = ((y_i − y_j) + 2γ + δ_i + δ_j)² ≥ 4γ². In a similar way (taking into account the constraints on the δ_i's and on s), the inequality (t_i − t_j)² ≥ 4γ² can be shown to hold in the following two cases:

(x_i, y_i) ∈ S₁₁, (x_j, y_j) ∈ S₋₁₁ ∪ S₋₁₂, y_i ≥ y_j
(x_i, y_i) ∈ S₁₂, (x_j, y_j) ∈ S₋₁₁ ∪ S₋₁₂, y_i ≤ y_j   (16)

Moreover,

Σ_{i,j=1}^N (t_i − t_j)² ≥ 2 Σ_{i∈S₁₁} Σ_{j∈S₋₁₁∪S₋₁₂ : y_i ≥ y_j} (t_i − t_j)² + 2 Σ_{i∈S₁₂} Σ_{j∈S₋₁₁∪S₋₁₂ : y_i ≤ y_j} (t_i − t_j)²   (17)

since on the right hand side we excluded some of the terms of the left hand side. Using the fact that for the cases considered (t_i − t_j)² ≥ 4γ², the right hand side is at least

8γ² Σ_{i∈S₁₁} (number of points j in class −1 with y_j ≤ y_i) + 8γ² Σ_{i∈S₁₂} (number of points j in class −1 with y_j ≥ y_i)   (18)

Let I₁ and I₂ be the cardinalities of S₁₁ and S₁₂, respectively. Because of the choice of the separation it is clear that (18) is at least

8γ²((1 + 2 + ⋯ + I₁) + (1 + 2 + ⋯ + (I₂ − 1)))

(for example, if I₁ = 2, in the worst case points 2 and 4 are in S₁₁, in which case the first part of (18) is exactly 1 + 2). Finally, since I₁ + I₂ = N/2, (18) is at least 8γ²(N² − 4)/16 = γ²(N² − 4)/2, which proves the lemma.
On the Strength of Incremental Learning

Steffen Lange¹ and Gunter Grieser²

¹ Universität Leipzig, Institut für Informatik, Augustusplatz 10-11, 04109 Leipzig, Germany
[email protected]
² Technische Universität Darmstadt, Fachbereich Informatik, Alexanderstraße 10, 64283 Darmstadt, Germany
[email protected]
Abstract. This paper provides a systematic study of incremental learning from noise-free and from noisy data, thereby distinguishing between learning from only positive data and from both positive and negative data. Our study relies on the notion of noisy data introduced in [22]. The basic scenario, named iterative learning, is as follows. In every learning stage, an algorithmic learner takes as input one element of an information sequence for a target concept and its previously made hypothesis, and outputs a new hypothesis. The sequence of hypotheses has to converge to a hypothesis describing the target concept correctly. We study the following refinements of this scenario. Bounded example-memory inference generalizes iterative inference by allowing an iterative learner to additionally store an a priori bounded number of carefully chosen data elements, while feedback learning generalizes it by allowing the iterative learner to additionally ask whether or not a particular data element did already appear in the data seen so far.

For the case of learning from noise-free data, we show that, where both positive and negative data are available, restrictions on the accessibility of the input data do not limit the learning capabilities if and only if the relevant iterative learners are allowed to query the history of the learning process or to store at least one carefully selected data element. This insight nicely contrasts with the fact that, in case only positive data are available, restrictions on the accessibility of the input data seriously affect the capabilities of all types of incremental learning (cf. [18]).

For the case of learning from noisy data, we present characterizations of all kinds of incremental learning in terms independent of learning theory. The relevant conditions are purely structural ones. Surprisingly, when learning from only noisy positive data and from both noisy positive and negative data, iterative learners are already exactly as powerful as unconstrained learning devices.
1 Introduction
The theoretical investigations in the present paper derive their motivation to a certain extent from the rapidly developing field of knowledge discovery in
O. Watanabe, T. Yokomori (Eds.): ALT'99, LNAI 1720, pp. 118-131, 1999.
© Springer-Verlag Berlin Heidelberg 1999
databases (abbr. KDD). KDD mainly combines techniques originating from machine learning, knowledge acquisition and knowledge representation, artificial intelligence, pattern recognition, statistics, data visualization, and databases to automatically extract new interrelations, knowledge, patterns and the like from huge collections of data (cf. [7] for a recent overview). Among the different parts of the KDD process, like data presentation, data selection, incorporating prior knowledge, and defining the semantics of the results obtained, we are mainly interested in the particular subprocess of applying specific algorithms for learning something useful from the data. This subprocess is usually named data mining. There is one problem when invoking machine learning techniques to do data mining. Almost all machine learning algorithms are in-memory algorithms, i.e., they require the whole data set to be present in main memory when extracting the concepts hidden in the data. However, if huge data sets are around, no learning algorithm can use all the data, or even large portions of it, simultaneously for computing hypotheses. Different methods have been proposed for overcoming the difficulties caused by huge data sets. For example, instead of doing the discovery process on all the data, one starts with significantly smaller samples, finds the regularities in them, and uses different portions of the overall data to verify what one has found. Looking at data mining from this perspective, it becomes a true limiting process. That means, the actual hypothesis generated by the data mining algorithm is tested against parts of the remaining data. Then, if the current hypothesis is not acceptable, the sample may be enlarged or replaced, and the data mining algorithm will be restarted. Thus, from a theoretical point of view, it is appropriate to look at the data mining process as an ongoing, incremental one. For the purpose of motivation and discussion of our research, we next introduce some basic notions.
By X we denote any learning domain. Any collection C of sets c ⊆ X is called a concept class; moreover, c is referred to as a concept. An algorithmic learner, henceforth called an inductive inference machine (abbr. IIM), takes as input initial segments of an information sequence and outputs, once in a while, a hypothesis about the target concept. The set H of all admissible hypotheses is called the hypothesis space. The sequence of hypotheses has to converge to a hypothesis describing the target concept correctly. If there is an IIM that learns a concept c from all admissible information sequences for it, then c is said to be learnable in the limit with respect to H (cf. [10]). Gold's [10] model of learning in the limit relies on the unrealistic assumption that the learner has access to samples of growing size. Therefore, we investigate variations of the general approach that restrict the accessibility of the input data considerably. We deal with iterative learning, k-bounded example-memory inference, and feedback identification of indexable concept classes. All these models formalize incremental learning, a topic attracting more and more attention in the machine learning community (cf., e.g., [6,9,20,24]). An iterative learner is required to produce its actual guess exclusively from its previous one and the next element in the information sequence presented. Iterative learning has been introduced in [26] and has further been studied by
Steffen Lange and Gunter Grieser
various authors (cf., e.g., [3,8,13,14,15,18,19]). Alternatively, we consider learners that are allowed to store up to k carefully chosen data elements seen so far, where k is a priori fixed (k-bounded example-memory inference). Bounded example-memory learning has its origins in [18]. Furthermore, we study feedback identification. The idea of feedback learning goes back to [26], too. In this setting, the iterative learner is additionally allowed to ask whether or not a particular data element did already appear in the data seen so far. In the first part of the present paper, we investigate incremental learning from noise-free data. As usual, we distinguish the case of learning from only positive data and learning from both positive and negative data, synonymously learning from text and informant, respectively. A text for a concept c is an infinite sequence that eventually contains all and only the elements of c. Alternatively, an informant for c is an infinite sequence of all elements of X that are classified according to their membership in c. Former theoretical studies mostly dealt with incremental concept learning from only positive data (cf. [3,8,18]). It has been proved that (i) all defined models of incremental learning are strictly less powerful than conservative inference (which itself is strictly less powerful than learning in the limit), (ii) feedback learning and bounded example-memory inference outperform iterative learning, and (iii) feedback learning and bounded example-memory inference extend the learning capabilities of iterative learners in different directions. In particular, it has been shown that any additional data element an iterative learner may store buys more learning power. As we shall show, the situation changes considerably in case positive and negative data are available. Now, it is sufficient to store one carefully selected data element in the example-memory in order to achieve the whole learning power of unconstrained learning machines.
As a kind of side-effect, the infinite hierarchy of more and more powerful bounded example-memory learners which has been observed in the text case collapses. Furthermore, also feedback learners are exactly as powerful as unconstrained learning devices. In contrast, similarly to the case of learning from positive data, the learning capabilities of iterative learners are again seriously affected. In the second part of the present paper, we study incremental learning from noisy data. This topic is of interest, since, in real-world applications, one rarely receives perfect data. There are a lot of attempts to give a precise notion of what the term noisy data means (cf., e.g., [2,12,19]). In our study, we adopt the notion from [22] (see also [23]) which seems to have become standard when studying Gold-style learning (cf. [2,4,5]). This notion has the advantage that noisy data about a target concept nonetheless uniquely specify that concept. Roughly speaking, correct data elements occur infinitely often whereas incorrect data elements occur only finitely often. Generally, the model of noisy environments introduced in [22] aims to grasp situations in which, due to better simulation techniques or better technical equipment, the experimental data which a learner receives about an unknown phenomenon become better and better over time until they reflect the reality sufficiently well.
On the Strength of Incremental Learning
Surprisingly, where learning from noisy data is considered, iterative learners are exactly as powerful as unconstrained learning machines, and thus iterative learners are able to fully compensate the limitations in the accessibility of the input data. This nicely contrasts the fact that, when learning from noise-free text and noise-free informant, iterative learning is strictly less powerful than learning in the limit. Moreover, it immediately implies that all different models of incremental learning introduced above coincide. Furthermore, we characterize iterative learning from noisy data in terms that are independent of learning theory. We show that an indexable class can be iteratively learned from noisy text if and only if it is inclusion-free. Alternatively, it is iteratively learnable from noisy informant if and only if it is discrete.
2 Preliminaries
Let ℕ = {0, 1, 2, ...} be the set of all natural numbers. By ⟨·,·⟩: ℕ × ℕ → ℕ we denote Cantor's pairing function. We write A # B to indicate that two sets A and B are incomparable, i.e., A \ B ≠ ∅ and B \ A ≠ ∅. Any recursively enumerable set X is called a learning domain. By ℘(X) we denote the power set of X. Let C ⊆ ℘(X) and let c ∈ C. We refer to C and c as to a concept class and a concept. Sometimes, we will identify a concept c with its characteristic function, i.e., we let c(x) = +, if x ∈ c, and c(x) = −, otherwise. We deal with the learnability of indexable concept classes with uniformly decidable membership defined as follows (cf. [1]). A class of non-empty concepts C is said to be an indexable concept class with uniformly decidable membership if there are an effective enumeration (c_j)_{j∈ℕ} of all and only the concepts in C and a recursive function f such that, for all j ∈ ℕ and all x ∈ X, it holds that f(j, x) = +, if x ∈ c_j, and f(j, x) = −, otherwise. We refer to indexable concept classes with uniformly decidable membership as indexable classes, for short. Next, we describe some well-known examples of indexable classes. First, let Σ denote any fixed finite alphabet of symbols and let Σ* be the free monoid over Σ. Moreover, let X = Σ* be the learning domain. We refer to subsets L ⊆ Σ* as languages (instead of concepts). Then, the set of all context-sensitive languages, context-free languages, regular languages, and of all pattern languages form indexable classes (cf. [1,11]). Second, let X_n = {0,1}^n be the set of all n-bit Boolean vectors. We consider X = ⋃_{n≥1} X_n as learning domain. Then, the set of all concepts expressible as a monomial, a k-CNF, a k-DNF, and a k-decision list constitute indexable classes (cf. [21,25]). Finally, we define some useful properties of indexable classes. Let X be the underlying learning domain and let C be an indexable class. Then, C is said to be inclusion-free iff c # c′ for all distinct concepts c, c′ ∈ C.
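For concreteness, the standard Cantor pairing function ⟨x, y⟩ = (x + y)(x + y + 1)/2 + y and the incomparability predicate A # B can be written out as follows. This is a sketch for illustration; the paper itself does not fix a particular pairing formula.

```python
def cantor_pair(x, y):
    """Cantor's bijection ℕ × ℕ → ℕ: walk the diagonals of the grid."""
    return (x + y) * (x + y + 1) // 2 + y

def cantor_unpair(z):
    """Invert the pairing: find the diagonal w with w(w+1)/2 <= z, then read off y."""
    w = int(((8 * z + 1) ** 0.5 - 1) // 2)
    y = z - w * (w + 1) // 2
    return (w - y, y)

def incomparable(a, b):
    """A # B iff A \\ B ≠ ∅ and B \\ A ≠ ∅."""
    return bool(a - b) and bool(b - a)
```

The pairing is used later to encode two-component hypotheses ⟨j, k⟩ as single natural numbers.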
Let (w_j)_{j∈ℕ} be the lexicographically ordered enumeration of all elements in X. For all c ⊆ X, by i_c we denote the lexicographically ordered informant of c, i.e., the infinite sequence ((w_j, c(w_j)))_{j∈ℕ}. Then, C is said to be discrete iff, for every c ∈ C, there is an initial segment of c's lexicographically ordered informant i_c, say i_c[x],
that separates c from all other concepts c′ ∈ C. More precisely speaking, for all c′ ∈ C, if c ≠ c′, then i_c[x] ≠ i_{c′}[x].

3 Formalizing Incremental Learning

3.1 Learning from Noise-Free Data
Let X be the underlying learning domain, let c ⊆ X be a concept, and let t = (x_n)_{n∈ℕ} be an infinite sequence of elements from c such that {x_n | n ∈ ℕ} = c. Then, t is said to be a text for c. By Text(c) we denote the set of all texts for c. Alternatively, let i = ((x_n, b_n))_{n∈ℕ} be an infinite sequence of elements from X × {+, −} such that {x_n | n ∈ ℕ} = X, {x_n | n ∈ ℕ, b_n = +} = c, and {x_n | n ∈ ℕ, b_n = −} = co-c = X \ c. Then, we refer to i as an informant for c. By Info(c) we denote the set of all informants for c. Moreover, let t be a text, let i be an informant, and let y be a number. Then, t_y and i_y denote the initial segments of t and i of length y + 1. Furthermore, we set t_y^+ = {x_n | n ≤ y}, i_y^+ = {x_n | n ≤ y, b_n = +}, and i_y^− = {x_n | n ≤ y, b_n = −}. As in [10], we define an inductive inference machine (abbr. IIM) to be an algorithmic mapping from initial segments of texts (informants) to ℕ ∪ {?}. Thus, an IIM either outputs a hypothesis, i.e., a number encoding a certain computer program, or it outputs ?, a special symbol representing the case that the machine outputs no conjecture. Note that an IIM, when learning some target class C, is required to produce an output when processing any initial segment of any admissible information sequence, i.e., any initial segment of any text (informant) for any c ∈ C. The numbers output by an IIM are interpreted with respect to a suitably chosen hypothesis space H = (h_j)_{j∈ℕ}. Since we exclusively deal with indexable classes C, we always assume that H is also an indexing of some possibly larger class of non-empty concepts. Hence, membership is uniformly decidable in H, too. Formally speaking, we deal with class comprising learning (cf. [27]). When an IIM outputs some number j, we interpret it to mean that it hypothesizes h_j. In all what follows, a data sequence σ = (d_n)_{n∈ℕ} for a target concept c is either a text t = (x_n)_{n∈ℕ} or an informant i = ((x_n, b_n))_{n∈ℕ} for c. By convention, for all y ∈ ℕ, σ_y denotes the initial segment t_y or i_y. We define convergence of IIMs as usual.
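The finite objects just introduced are easy to compute from a prefix of a text or an informant. The helper names below are hypothetical, used only to make the notation t_y^+, i_y^+, i_y^− concrete:

```python
def text_positive(t, y):
    """t_y^+ : the set of elements appearing in the initial segment of length y+1."""
    return set(t[: y + 1])

def informant_sets(i, y):
    """(i_y^+, i_y^-) for an informant prefix: elements labeled + and -, respectively."""
    prefix = i[: y + 1]
    pos = {x for (x, b) in prefix if b == '+'}
    neg = {x for (x, b) in prefix if b == '-'}
    return pos, neg
```

For instance, for the informant prefix ((0,+), (1,−), (2,+)) of a concept containing 0 and 2 but not 1, the call `informant_sets(i, 2)` yields ({0, 2}, {1}).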
Let σ be given and let M be an IIM. The sequence (M(σ_y))_{y∈ℕ} of M's hypotheses converges to a number j iff all but finitely many terms of it are equal to j. Now, we are ready to define learning in the limit.

Definition 1 ([10]) Let C be an indexable class, let c be a concept, and let H = (h_j)_{j∈ℕ} be a hypothesis space. An IIM M LimTxt_H [LimInf_H] identifies c iff, for every data sequence σ with σ ∈ Text(c) [σ ∈ Info(c)], there is a j ∈ ℕ with h_j = c such that the sequence (M(σ_y))_{y∈ℕ} converges to j. Then, M LimTxt_H [LimInf_H] identifies C iff, for all c′ ∈ C, M LimTxt_H [LimInf_H] identifies c′.
Finally, LimTxt [LimInf] denotes the collection of all indexable classes C′ for which there are a hypothesis space H′ = (h′_j)_{j∈ℕ} and an IIM M such that M LimTxt_{H′} [LimInf_{H′}] identifies C′. In the above definition, Lim stands for "limit". Suppose an IIM identifies some concept c. That means, after having seen only finitely many data of c the IIM reaches its (unknown) point of convergence and it computes a correct and finite description of the target concept. Hence, some form of learning must have taken place. In general, it is not decidable whether or not an IIM has already converged on a text t (an informant i) for the target concept c. Adding this requirement to the above definition results in finite learning (cf. [10]). The corresponding learning types are denoted by FinTxt and FinInf. Next, we define conservative IIMs. Intuitively speaking, conservative IIMs maintain their actual hypothesis at least as long as they have not seen data contradicting it.

Definition 2 ([1]) Let C be an indexable class, let c be a concept, and let H = (h_j)_{j∈ℕ} be a hypothesis space. An IIM M ConsvTxt_H [ConsvInf_H] identifies c iff M LimTxt_H [LimInf_H] identifies c and, for every data sequence σ with σ ∈ Text(c) [σ ∈ Info(c)] and for any two consecutive hypotheses k = M(σ_y) and j = M(σ_{y+1}), if k ∈ ℕ and k ≠ j, then h_k is not consistent with σ_{y+1}.¹ M ConsvTxt_H [ConsvInf_H] identifies C iff, for all c′ ∈ C, M ConsvTxt_H [ConsvInf_H] identifies c′.

The learning types ConsvTxt and ConsvInf are defined analogously as above. The next theorem summarizes the known results concerning the relations between the standard learning models defined so far.

Theorem 1 ([10], [16])
(1) For all indexable classes C, we have C ∈ LimInf.
(2) FinTxt ⊂ FinInf ⊂ ConsvTxt ⊂ LimTxt ⊂ ConsvInf = LimInf.

Now, we formally define the different models of incremental learning. An ordinary IIM M has always access to the whole history of the learning process, i.e., it computes its actual guess on the basis of all data seen so far.
In contrast, an iterative IIM is only allowed to use its last guess and the next data element in σ. Conceptually, an iterative IIM M defines a sequence (M_n)_{n∈ℕ} of machines each of which takes as its input the output of its predecessor.

Definition 3 ([26]) Let C be an indexable class, let c be a concept, and let H = (h_j)_{j∈ℕ} be a hypothesis space. An IIM M ItTxt_H [ItInf_H] identifies c iff, for every data sequence σ = (d_n)_{n∈ℕ} with σ ∈ Text(c) [σ ∈ Info(c)], the following conditions are fulfilled: (1) for all n ∈ ℕ, M_n(σ) is defined, where M_0(σ) = M(d_0) as well as M_{n+1}(σ) = M(M_n(σ), d_{n+1}). (2) the sequence (M_n(σ))_{n∈ℕ} converges to a number j with h_j = c.
¹ In the text case, t^+_{y+1} ⊄ h_k; in the informant case, i^+_{y+1} ⊄ h_k or i^−_{y+1} ⊄ co-h_k.
Finally, M ItTxt_H [ItInf_H] identifies C iff, for each c′ ∈ C, M ItTxt_H [ItInf_H] identifies c′.
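The update scheme of Definition 3 (use only the previous guess and the current datum) can be sketched in a few lines. The learner below is a hypothetical toy: its hypothesis n encodes the concept {0, ..., n}, so it iteratively learns the class of such initial segments from text.

```python
def run_iterative(M, data):
    """Compute M_0, M_1, ... on a finite data prefix; return the last guess."""
    guess = M(None, data[0])            # M_0(sigma) = M(d_0)
    for d in data[1:]:
        guess = M(guess, d)             # M_{n+1}(sigma) = M(M_n(sigma), d_{n+1})
    return guess

def max_learner(prev, d):
    """Toy iterative learner: guess the maximum element seen so far."""
    return d if prev is None else max(prev, d)
```

Note that `run_iterative` never looks back at earlier data: the whole history is compressed into the single value `guess`, which is exactly the restriction the definition imposes.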
The learning types ItTxt and ItInf are defined analogously to Definition 1. Next, we consider a natural relaxation of iterative learning, named k-bounded example-memory inference. Now, an IIM M is allowed to memorize at most k of the data elements which it has already seen in the learning process, where k ∈ ℕ is a priori fixed. Again, M defines a sequence (M_n)_{n∈ℕ} of machines each of which takes as input the output of its predecessor. Clearly, a k-bounded example-memory IIM outputs a hypothesis along with the set of memorized data elements.

Definition 4 ([18]) Let C be an indexable class, let c be a concept, and let H = (h_j)_{j∈ℕ} be a hypothesis space. Moreover, let k ∈ ℕ. An IIM M Bem_k Txt_H [Bem_k Inf_H] identifies c iff, for every data sequence σ = (d_n)_{n∈ℕ} with σ ∈ Text(c) [σ ∈ Info(c)], the following conditions are satisfied: (1) for all n ∈ ℕ, M_n(σ) is defined, where M_0(σ) = M(d_0) = ⟨j_0, S_0⟩ such that S_0 ⊆ {d_0} and card(S_0) ≤ k, as well as M_{n+1}(σ) = M(M_n(σ), d_{n+1}) = ⟨j_{n+1}, S_{n+1}⟩ such that S_{n+1} ⊆ S_n ∪ {d_{n+1}} and card(S_{n+1}) ≤ k. (2) the j_n in the sequence (⟨j_n, S_n⟩)_{n∈ℕ} of M's guesses converge to a number j with h_j = c. Finally, M Bem_k Txt_H [Bem_k Inf_H] identifies C iff, for each c′ ∈ C, M Bem_k Txt_H [Bem_k Inf_H] identifies c′.
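The memory discipline of Definition 4 (S_{n+1} ⊆ S_n ∪ {d_{n+1}}, card(S_{n+1}) ≤ k) can be sketched as below. The memory policy shown (keep the k largest elements, for k ≥ 1) is purely illustrative and not from the paper.

```python
def run_bem(M, data, k):
    """Run a k-bounded example-memory learner over a finite data prefix."""
    guess, memory = M(None, set(), data[0], k)
    for d in data[1:]:
        guess, memory = M(guess, memory, d, k)
        assert len(memory) <= k          # the cardinality bound of Definition 4
    return guess, memory

def keep_largest(prev, memory, d, k):
    """Hypothetical learner (k >= 1): memorize the k largest elements seen, guess their max."""
    # New memory is drawn only from the old memory plus the current datum.
    new_memory = set(sorted(memory | {d}, reverse=True)[:k])
    return max(new_memory), new_memory
```

The key constraint is visible in `keep_largest`: the new memory is chosen from `memory | {d}` only, so a discarded element can never be recovered later.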
For every k ∈ ℕ, the learning types Bem_k Txt and Bem_k Inf are defined analogously as above. By definition, Bem_0 Txt = ItTxt and Bem_0 Inf = ItInf. Next, we define learning by feedback IIMs. Informally speaking, a feedback IIM M is an iterative IIM that is additionally allowed to make a particular type of queries. In each learning stage n+1, M has access to the actual input d_{n+1} and its previous guess j_n. M is additionally allowed to compute a query from d_{n+1} and j_n which concerns the history of the learning process. That is, the feedback learner computes a data element d and gets a YES/NO answer A(d) such that A(d) = 1, if d appears in the initial segment σ_n, and A(d) = 0, otherwise. Hence, M can just ask whether or not the particular data element d has already been presented in previous learning stages.

Definition 5 ([26]) Let C be an indexable class, let c be a concept, and let H = (h_j)_{j∈ℕ} be a hypothesis space. Let Q: ℕ × X → X [Q: ℕ × X̃ → X̃, where X̃ = X × {+, −}] be a total computable function. An IIM M, with a query-asking function Q, FbTxt_H [FbInf_H] identifies c iff, for every data sequence σ = (d_n)_{n∈ℕ} with σ ∈ Text(c) [σ ∈ Info(c)], the following conditions are satisfied: (1) for all n ∈ ℕ, M_n(σ) is defined, where M_0(σ) = M(d_0) as well as M_{n+1}(σ) = M(M_n(σ), A(Q(M_n(σ), d_{n+1})), d_{n+1}). (2) the sequence (M_n(σ))_{n∈ℕ} converges to a number j with h_j = c, provided A truthfully answers the questions computed by Q. Finally, M FbTxt_H [FbInf_H] identifies C iff, for each c′ ∈ C, M FbTxt_H [FbInf_H] identifies c′.
The learning types FbTxt and FbInf are defined analogously as above.
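A feedback learner per Definition 5 can be sketched as follows. The environment (not the learner) remembers the history in order to answer the query truthfully; the query policy `query_same` and the learner `count_distinct` are hypothetical toys, the latter iteratively counting distinct elements by asking whether the current datum was seen before.

```python
def run_feedback(M, Q, data):
    """Run a feedback learner; `history` simulates the answering oracle A."""
    history = {data[0]}
    guess = M(None, None, data[0])          # M_0(sigma) = M(d_0)
    for d in data[1:]:
        answer = Q(guess, d) in history     # A(Q(j_n, d_{n+1})), asked about sigma_n
        history.add(d)
        guess = M(guess, answer, d)         # M_{n+1} = M(M_n, A(...), d_{n+1})
    return guess

def count_distinct(prev, seen_before, d):
    """Toy feedback learner: hypothesis n means 'n distinct elements so far'."""
    if prev is None:
        return 1
    return prev if seen_before else prev + 1

def query_same(guess, d):
    """Toy query policy: always ask about the current datum itself."""
    return d
```

Note the ordering: the answer is computed against σ_n before d_{n+1} is added to the history, matching the definition.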
3.2 Learning from Noisy Data
In order to study iterative learning from noisy data we have to provide some more notations and definitions. Let X be the underlying learning domain, let c ⊆ X be a concept, and let t = (x_n)_{n∈ℕ} be an infinite sequence of elements from X. Following [22], t is said to be a noisy text for c provided that every element from c appears infinitely often, i.e., for every x ∈ c there are infinitely many n such that x_n = x, whereas only finitely often some x ∉ c occurs, i.e., x_n ∈ c for all but finitely many n ∈ ℕ. By NText(c) we denote the collection of all noisy texts for c. For every y ∈ ℕ, t_y denotes the initial segment of t of length y + 1. We let t_y^+ = {x_n | n ≤ y}. Next, let i = ((x_n, b_n))_{n∈ℕ} be any sequence of elements from X × {+, −}. Following [22], i is said to be a noisy informant for c provided that every element x of X occurs infinitely often, almost always accompanied by the right classification c(x). More formally, for all x ∈ X, there are infinitely many n ∈ ℕ such that x_n = x and, for all but finitely many of them, b_n = c(x). By NInfo(c) we denote the collection of all noisy informants for c. For every y ∈ ℕ, i_y denotes the initial segment of i of length y + 1, i_y^+ = {x_n | n ≤ y, b_n = +}, and i_y^− = {x_n | n ≤ y, b_n = −}. In contrast to the noise-free case, now an IIM receives as input finite sequences of a noisy text (noisy informant). When an IIM is supposed to identify some target concept class C, then it has to output a hypothesis on every admissible information sequence, i.e., any initial segment of any noisy text (noisy informant) for any c ∈ C. Analogously to the case of learning from noise-free data, we deal with class comprising learning (cf. [27]). The learning types LimNTxt, FinNTxt, ConsvNTxt, ItNTxt, Bem_k NTxt, and FbNTxt as well as LimNInf, FinNInf, ConsvNInf, ItNInf, Bem_k NInf, and FbNInf are defined analogously to their noise-free counterparts by replacing everywhere text and informant by noisy text and noisy informant, respectively.
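A deterministic sketch of a noisy text in the sense of [22]: for a finite concept c, every element of c appears infinitely often, while the noise elements (outside c) occur only in a finite prefix. The generator name and the up-front placement of the noise are illustrative choices, not part of the definition.

```python
from itertools import islice

def noisy_text(c, noise):
    """Yield the noise elements once, then cycle through c forever."""
    elements = sorted(c)
    for x in noise:
        yield x
    n = 0
    while True:
        yield elements[n % len(elements)]
        n += 1

prefix = list(islice(noisy_text({1, 2}, noise=[7, 9]), 8))
# Incorrect data elements (7, 9) occur only finitely often; each element of c
# occurs infinitely often in the full infinite sequence.
```

Any finite reshuffling of the noise into the sequence would also be a noisy text for c; the definition constrains only the tail behavior.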
The following theorem summarizes the known results concerning learning of indexable concept classes from noisy data.

Theorem 2 ([22])
(1) FinTxt ⊂ LimNTxt ⊂ LimTxt.
(2) FinInf ⊂ LimNInf ⊂ LimInf.
(3) LimNTxt # LimNInf.
4 Incremental Learning from Noise-Free Data

4.1 The Text Case
In this subsection, we briefly review the known relations between the different variants of incremental learning and the standard learning models defined above. All the models of incremental learning introduced above pose serious restrictions on the accessibility of the data provided during the learning process. Therefore, one might expect a certain loss of learning power. And indeed, conservative inference already forms an upper bound for any kind of incremental learning.
Theorem 3 ([18])
(1) ItTxt ⊂ ConsvTxt.
(2) FbTxt ⊂ ConsvTxt.
(3) ⋃_{k∈ℕ} Bem_k Txt ⊂ ConsvTxt.

Moreover, bounded example-memory inference and feedback learning enlarge the learning capabilities of iterative identification, but the surplus power gained is incomparable. Moreover, the existence of an infinite hierarchy of more and more powerful bounded example-memory learners has been shown.

Theorem 4 ([18])
(1) ItTxt ⊂ FbTxt.
(2) ItTxt ⊂ Bem_1 Txt.
(3) For all k ∈ ℕ, Bem_k Txt ⊂ Bem_{k+1} Txt.
(4) Bem_1 Txt \ FbTxt ≠ ∅.
(5) FbTxt \ ⋃_{k∈ℕ} Bem_k Txt ≠ ∅.

A comparison of feedback learning and bounded example-memory inference with finite inference from positive and negative data illustrates another difference between both generalizations of iterative learning.

Theorem 5 ([18])
(1) ⋃_{k∈ℕ} Bem_k Txt # FinInf.
(2) FinInf ⊂ FbTxt.

Finally, finite inference from text is strictly less powerful than any kind of incremental learning.

Theorem 6 ([18]) FinTxt ⊂ ItTxt.

4.2 The Informant Case

Next, we study the strengths and the limitations of incremental learning from informant. Our first result deals with the similarities to the text case.

Theorem 7 FinInf ⊂ ItInf ⊂ LimInf.
Moreover, analogously to the case of learning from only positive data, feedback learners and bounded example-memory learners are more powerful than iterative IIMs. But surprisingly, the surplus learning power gained is remarkable. The ability to make queries concerning the history of the learning process fully compensates the limitations in the accessibility of the input data.

Theorem 8 FbInf = LimInf.

Even more surprisingly, the infinite hierarchy of more and more powerful k-bounded example-memory learners, parameterized by the number of data elements the relevant iterative learners may store, collapses in the informant case. The ability to memorize one carefully selected data element is also sufficient to fully compensate the limitations in the accessibility of the input data.

Theorem 9 Bem_1 Inf = LimInf.
Proof. It suffices to show that LimInf ⊆ Bem_1 Inf. Let C be any indexable class and (c_j)_{j∈ℕ} be any indexing of C. By Theorem 1, C ∈ LimInf. Let (w_j)_{j∈ℕ} denote the lexicographically ordered enumeration of all elements in X. For all m ∈ ℕ and all c ⊆ X, we set c^m = {w_z | z ≤ m, w_z ∈ c} and c̄^m = {w_z | z > m, w_z ∈ c}. We let the required 1-bounded example-memory learner M output as hypothesis a triple (F, m, j) along with a singleton set containing the one data element stored. The triple (F, m, j) consists of a finite set F and two numbers m and j. It is used to describe a finite variant of the concept c_j, namely the concept F ∪ c̄_j^m. Intuitively speaking, c̄_j^m is the part of the concept c_j that definitely does not contradict the data seen so far, while F is used to handle exceptions. For the sake of readability, we abstain from explicitly defining a hypothesis space H that provides an appropriate coding of all finite variants h_{(F,m,j)} = F ∪ c̄_j^m of concepts in C. Let c ∈ C and let i = ((x_n, b_n))_{n∈ℕ} ∈ Info(c). M is defined in stages.
Stage 0. On input (x_0, b_0) do the following: Fix m ∈ ℕ with w_m = x_0. Determine the least j such that c_j is consistent with (x_0, b_0). Set S = {(x_0, b_0)}. Output ⟨(c_j^m, m, j), S⟩ and goto Stage 1.
Stage n, n ≥ 1. On input ⟨(F, m, j), S⟩ and (x_n, b_n) proceed as follows: Let S = {(x, b)}. Fix z, z′ ∈ ℕ such that w_z = x and w_{z′} = x_n. If z′ > z, set S′ = {(x_n, b_n)}. Otherwise, set S′ = S. Test whether h_{(F,m,j)} = F ∪ c̄_j^m is consistent with (x_n, b_n). In case it is, goto (A). Otherwise, goto (B).
(A) Output ⟨(F, m, j), S′⟩ and goto Stage n + 1.
(B) If z′ ≤ m, goto (B1). If z′ > m, goto (B2).
(B1) If b_n = +, set F′ = F ∪ {x_n}. If b_n = −, set F′ = F \ {x_n}. Output ⟨(F′, m, j), S′⟩ and goto Stage n + 1.
(B2) Determine ℓ = max{z, z′} and F′ = {w_r | r ≤ ℓ, w_r ∈ h_{(F,m,j)}}. If b_n = +, set F′′ = F′ ∪ {x_n}. If b_n = −, set F′′ = F′ \ {x_n}. Search for the least index k > j such that c_k is consistent with (x_n, b_n). Then, output ⟨(F′′, ℓ, k), S′⟩ and goto Stage n + 1.
Due to space limitations, the verification of M's correctness is skipped. □
On the other hand, it is well-known that, when learning from informant, iterative learning with finitely many anomalies² is exactly as powerful as learning in the limit. Hence, Theorems 8 and 9 demonstrate the error-correcting power of feedback queries and bounded example-memories. Next, we summarize the established relations between the different models of incremental learning from text and their corresponding informant counterparts.

Corollary 10
(1) ItTxt ⊂ ItInf.
(2) ⋃_{k∈ℕ} Bem_k Txt ⊂ Bem_1 Inf.
(3) FbTxt ⊂ FbInf.

² In this setting, it suffices that an iterative learner converges to a hypothesis which describes a finite variant of the target concept.
Finally, we want to point out further differences between incremental learning from text and informant.

Theorem 11
(1) ItInf \ LimTxt ≠ ∅.
(2) Bem_1 Txt \ ItInf ≠ ∅.
(3) FbTxt \ ItInf ≠ ∅.

The figure aside summarizes the observed separations and coincidences. Each learning type is represented as a vertex in a directed graph. A directed edge (or path) from vertex A to vertex B indicates that A is a proper subset of B. Moreover, no edge (or path) between two vertices implies that A and B are incomparable.
[Figure: a directed graph relating FinTxt, ItTxt, Bem_1 Txt, Bem_2 Txt, ..., ⋃_{k∈ℕ} Bem_k Txt, FbTxt, FinInf, ConsvTxt, LimTxt, ItInf, and FbInf = Bem_1 Inf = LimInf; edges denote proper inclusion.]

5 Incremental Learning from Noisy Data

5.1 Characterizations
In this section, we present characterizations of all models of learning from noisy text and noisy informant. First, we characterize iterative learning from noisy text in purely structural terms.

Theorem 12 C ∈ ItNTxt iff C is inclusion-free.
Since LimNTxt exclusively contains indexable classes that are inclusion-free (cf. [22]), by Theorem 12, and since, by definition, ItNTxt ⊆ Bem_1 NTxt and ItNTxt ⊆ FbNTxt, we arrive at the following insight.

Theorem 13
(1) ItNTxt = ConsvNTxt = LimNTxt.
(2) ItNTxt = FbNTxt = ⋃_{k∈ℕ} Bem_k NTxt.

Interestingly, another structural property allows us to characterize the collection of all indexable classes that can be iteratively learned from noisy informant.

Theorem 14 C ∈ ItNInf iff C is discrete.
Proof. Necessity: Recently, it has been shown that every class of recursively enumerable languages that is learnable in the limit from noisy informant has to be discrete (cf. Stephan [22]; see also Case et al. [5] for the relevant details). Clearly, this result immediately translates into our setting. Now, since, by definition, ItNInf ⊆ LimNInf, we are done.
Sufficiency: Let C be an indexable class that is discrete. Informally speaking, the required iterative learner M behaves as follows. In every learning stage, M outputs an index for some concept, say c′, along with a number k. The number k is a lower bound for the length of the shortest initial segment of c′'s
lexicographically ordered informant i_{c′} that separates c′ from all other concepts in the target class C. Since M does not know whether a new data element is really correct, M rejects its actual guess only in case the new data element contradicts the information represented in the initial segment i_{c′}[k]. Moreover, since M is supposed to learn in an iterative manner, M has to use the input data to improve its actual lower bound k. We proceed formally. Let (c_j)_{j∈ℕ} be any indexing of C. For all j ∈ ℕ, i_{c_j} denotes the lexicographically ordered informant of c_j. As before, let (w_j)_{j∈ℕ} be the lexicographically ordered enumeration of all elements in X. Moreover, let f be any total recursive function such that, for all z ∈ ℕ, there are infinitely many j ∈ ℕ with f(j) = z. We select a hypothesis space H = (h_{⟨j,k⟩})_{j,k∈ℕ} that meets, for all j, k ∈ ℕ, h_{⟨j,k⟩} = c_{f(j)}. The required iterative IIM M is defined in stages. Let c ∈ C and let i = ((x_n, b_n))_{n∈ℕ} be any noisy informant for c.
Stage 0. On input (x_0, b_0) do the following: Set j_0 = 0, set k_0 = 0, output ⟨j_0, k_0⟩, and goto Stage 1.
Stage n, n ≥ 1. On input ⟨j_{n−1}, k_{n−1}⟩ and (x_n, b_n) do the following: Determine p ∈ ℕ with w_p = x_n. If p ≤ k_{n−1}, execute Instruction (A). Otherwise, execute Instruction (B).
(A) Test whether or not b_n = c_{f(j_{n−1})}(x_n). In case it is, set j_n = j_{n−1} and k_n = k_{n−1}. Otherwise, set j_n = j_{n−1} + 1 and k_n = 0. Output ⟨j_n, k_n⟩ and goto Stage n + 1.
(B) For all z ≤ p, test whether i_{c_z}[k_{n−1}] = i_{c_{f(j_{n−1})}}[k_{n−1}] and c_z(w_p) ≠ c_{f(j_{n−1})}(w_p). In case there is a z successfully passing this test, set j_n = j_{n−1} and k_n = k_{n−1} + 1. Otherwise, set j_n = j_{n−1} and k_n = k_{n−1}. Output ⟨j_n, k_n⟩ and goto Stage n + 1.
Due to space limitations, the verification of M's correctness is skipped. □

Analogously to the text case, all models of learning from noisy informant coincide, except FinNInf.

Theorem 15
(1) ItNInf = ConsvNInf = LimNInf.
(2) ItNInf = FbNInf = ⋃_{k∈ℕ} Bem_k NInf.

Furthermore, the collection of all indexable classes that can be finitely learned from noisy text (noisy informant) is easily characterized as follows.

Proposition 1 Let C be an indexable class. Then, the following statements are equivalent:
(1) C ∈ FinNTxt.
(2) C ∈ FinNInf.
(3) C contains at most one concept.

5.2 Comparisons with Other Learning Types
The characterizations presented in the last subsection form a firm basis for further investigations. They are useful to prove further results illustrating the relation of learning from noisy text and noisy informant to all the other types of
learning indexable classes defined. Subsequently, ItNTxt and ItNInf are used as representatives for all models of learning from noisy data, except finite inference. The next two theorems sharpen the upper bounds for learning from noisy data established in [22] (cf. Theorem 2 above).

Theorem 16 ItNTxt ⊂ ConsvTxt.

The next theorem puts the weakness of learning from noisy informant in the right perspective.

Theorem 17 ItNInf ⊂ LimTxt.

The reader should note that Theorem 17 cannot be sharpened to ItNInf ⊂ ConsvTxt, since there are discrete indexable classes not belonging to ConsvTxt (cf. [16]). Since the class of all finite concepts is ConsvTxt identifiable and obviously not discrete, we may conclude:

Theorem 18 ItNInf # ConsvTxt.

The next theorem provides us the missing piece in the overall picture.

Theorem 19 ItNTxt # FinInf.

[Figure: a directed graph relating FinNTxt = FinNInf, FinTxt, FinInf, ItNTxt, ItNInf, ConsvTxt, LimTxt, and LimInf; edges denote proper inclusion.] The figure aside displays the established relations between the different models of learning from noisy data and the standard models of learning in the noise-free setting. The semantics of this figure is the same as that of the figure in the previous section. The displayed relations between the learning models FinNTxt, FinNInf, and FinTxt are rather trivial. On the one hand, every singleton concept class is obviously FinTxt identifiable. On the other hand, FinTxt also contains richer indexable classes. Recall that Assertion (3) of Theorem 2 rewrites into ItNTxt # ItNInf. As we shall see, this result generalizes as follows: All models of iterative learning are pairwise incomparable, except ItTxt and ItInf.

Theorem 20
(1) ItNTxt # ItTxt.
(2) ItNTxt # ItInf.
(3) ItNInf # ItTxt.
(4) ItNInf # ItInf.
Acknowledgment

We thank the anonymous referees for their careful reading and comments which improved the paper considerably.
References
1. D. Angluin. Inductive inference of formal languages from positive data. Information and Control, 45:117-135, 1980.
2. J. Case and S. Jain. Synthesizing learners tolerating computable noisy data. In Proc. 9th ALT, LNAI 1501, pp. 205-219. Springer-Verlag, 1998.
3. J. Case, S. Jain, S. Lange, and T. Zeugmann. Incremental concept learning for bounded data mining. Information and Computation, to appear.
4. J. Case, S. Jain, and A. Sharma. Synthesizing noise-tolerant language learners. In Proc. 8th ALT, LNAI 1316, pp. 228-243. Springer-Verlag, 1997.
5. J. Case, S. Jain, and F. Stephan. Vacillatory and BC learning on noisy data. In Proc. 7th ALT, LNAI 1160, pp. 285-289. Springer-Verlag, 1997.
6. A. Cornuejols. Getting order independence in incremental learning. In Proc. ECML, LNAI 667, pp. 196-212. Springer-Verlag, 1993.
7. U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, editors. Advances in Knowledge Discovery and Data Mining. MIT Press, 1996.
8. M. Fulk, S. Jain, and D.N. Osherson. Open problems in systems that learn. Journal of Computer and System Sciences, 49:589-604, 1994.
9. R. Godin and R. Missaoui. An incremental concept formation approach for learning from databases. Theoretical Computer Science, 133:387-419, 1994.
10. M.E. Gold. Language identification in the limit. Information and Control, 10:447-474, 1967.
11. J.E. Hopcroft and J.D. Ullman. Formal Languages and their Relation to Automata. Addison-Wesley, 1969.
12. S. Jain. Program synthesis in the presence of infinite number of inaccuracies. In Proc. 5th ALT, LNAI 872, pp. 333-348. Springer-Verlag, 1994.
13. K.P. Jantke and H.R. Beick. Combining postulates of naturalness in inductive inference. Journal of Information Processing and Cybernetics, 17:465-484, 1981.
14. E. Kinber and F. Stephan. Mind changes, limited memory, and monotonicity. In Proc. 8th COLT, pp. 182-189. ACM Press, 1995.
15. S. Lange and R. Wiehagen. Polynomial-time inference of arbitrary pattern languages. New Generation Computing, 8:361-370, 1991.
16. S. Lange and T. Zeugmann. Language learning in dependence on the space of hypotheses. In Proc. 6th COLT, pp. 127-136. ACM Press, 1993.
17. S. Lange and T. Zeugmann. Learning recursive languages with bounded mind changes. Int. Journal of Foundations of Computer Science, 4:157-178, 1993.
18. S. Lange and T. Zeugmann. Incremental learning from positive data. Journal of Computer and System Sciences, 53:88-103, 1996.
19. D. Osherson, M. Stob, and S. Weinstein. Systems that Learn: An Introduction to Learning Theory for Cognitive and Computer Scientists. MIT Press, 1986.
20. S. Porat and J.A. Feldman. Learning automata from ordered examples. In Proc. 1st COLT, pp. 386-396. Morgan Kaufmann Publ., 1988.
21. R. Rivest. Learning decision lists. Machine Learning, 2:229-246, 1988.
22. F. Stephan. Noisy inference and oracles. In Proc. 6th ALT, LNAI 997, pp. 185-200. Springer-Verlag, 1995.
23. F. Stephan. Noisy inference and oracles. Theoretical Computer Science, 185:129-157, 1997.
24. L. Torgo. Controlled redundancy in incremental rule learning. In Proc. ECML, LNAI 667, pp. 185-195. Springer-Verlag, 1993.
25. L.G. Valiant. A theory of the learnable. Communications of the ACM, 27:1134-1142, 1984.
26. R. Wiehagen. Limes-Erkennung rekursiver Funktionen durch spezielle Strategien. Journal of Information Processing and Cybernetics, 12:93-99, 1976.
27. T. Zeugmann and S. Lange. A guided tour across the boundaries of learning recursive languages. In Algorithmic Learning for Knowledge-Based Systems, LNAI 961, pp. 193-262. Springer-Verlag, 1995.
Learning from Random Text

Peter Rossmanith
Institut für Informatik
Technische Universität München
80333 München, Fed. Rep. of Germany
rossman @ n.tum.de
Abstract. Learning in the limit deals mainly with the question of what can be learned, but not very often with the question of how fast. The purpose of this paper is to develop a learning model that stays very close to Gold's model, but enables questions on the speed of convergence to be answered. In order to do this, we have to assume that positive examples are generated by some stochastic model. If the stochastic model is fixed (measure one learning), then all recursively enumerable sets are identifiable, which strays greatly from Gold's model. In contrast, we define learning from random text as identifying a class of languages for every stochastic model in which examples are generated independently and identically distributed. As it turns out, this model stays close to learning in the limit. We compare both models keeping several aspects in mind, particularly restrictions to several strategies and the existence of locking sequences. Lastly, we present some results on the speed of convergence: In general, convergence can be arbitrarily slow, but for recursive learners it cannot be slower than some magic function. Every language can be learned with exponentially small tail bounds, which are also the best possible. All results apply fully to Gold-style learners, since his model is a proper subset of learning from random text.
1 Introduction
Learning in the limit as defined by Gold [5] has attracted much attention. It can be described as follows: The words of a language are presented to a learner in some order, where duplicates are allowed. At each point of time the learner has seen only a finite subset of the language and thus gets an increasingly improved idea of the language. The learner issues hypotheses that at some point have to converge to a correct one (and never be changed afterwards). The representation of a language as an infinite sequence containing exactly the words of the language is called a text. We will use the terms learning in the limit, identification in the limit, and learning from text synonymously. Much work on learning from text has addressed the question of what classes of languages are learnable and of which restrictions, and combinations of restrictions, on learners restrict the power of identification. A set of learners is called a strategy. The most important strategy is recursive learners. In this paper we will deal with recursive as well as non-recursive learners.

O. Watanabe, T. Yokomori (Eds.): ALT'99, LNAI 1720, pp. 132–144, 1999.
© Springer-Verlag Berlin Heidelberg 1999

Another type of strategy is consistency [1]. A learner is consistent if all hypotheses denote a language that at least contains all words seen thus far. It is well known that consistency does not generally restrict learning power, but it does restrict the power of recursive learners. What has not been greatly addressed is how fast learning takes place. The reason for this is simple: With the exception of trivial cases there exists no upper bound on the learning time, because texts can always be padded with useless information at the beginning. Therefore, this leaves average case analysis as a possible way to proceed. A prerequisite for this is a stochastic model for the texts. In general these models form a text by generating every word in the text independently of the others and with an equal probability distribution, which is well motivated by several scientific contexts. Results on learning pattern languages [3,13,10] and monomials [9] emerged as a consequence. These results are bounds on the average number of examples or on the time required to learn on average with respect to a commonly large class of probability distributions. Some of the papers also deal with aspects other than the average time or number of examples: They present tail bounds on the probability of convergence. More specifically, they show that some algorithms have exponentially small tail bounds [10]. There is also one general result: Every learner that is conservative and set-driven automatically has small tail bounds with respect to every probability distribution [10]. Knowledge about tail bounds enables a learner to stop after a finite time and to announce his hypothesis as correct with high probability. There is a direct connection to stochastically finite learning, which was introduced in [10] (see also additional remarks in [9]).
In general, fixing a probability distribution for each language in some class leads to a learning model that is much more powerful than identification in the limit: The whole class of recursively enumerable languages becomes learnable (see [8]). This model is called measure one learning. Kapur and Bilardi present an elegant way to construct simple learners for a collection of languages if there is some minimal knowledge about the underlying probability distributions [6]. They also show how knowledge about distributions provides indirect negative evidence. In this paper we introduce the model learning from random text, which overcomes the difficulties above: It does not suffice to learn for fixed probability distributions; every reasonable distribution must be considered. In other words, a learner identifies a language from random text if he converges to a correct hypothesis with probability one for every probability distribution in which the probability of a word is positive iff the word is a member of the language. This definition overcomes the difficulties mentioned above. Formal definitions are contained in the next section. (See also Kapur and Bilardi [7] for learning of indexed languages.)

In the first part of this paper we investigate some basic properties of this model. In particular, if a learner learns from text he also learns from random text, but not necessarily vice versa. With this in mind we generalize the classical model, due to the fact that a learner is now allowed to fail on certain texts. These texts, which we call the failure set of the learner, must have measure zero for every admissible distribution. However, if a class is learnable from random text, it can also be learned from text. This is also valid if we restrict ourselves to recursive learners. This equivalence translates to all strategies that impose no restriction on identification in the limit, of which there are many. To further compare the two models, we investigate some strategies that act as a restriction for recursive learning in the limit: set-driven learners (whose hypotheses depend only on the set of examples seen, not on their multiplicity or order), confident learners (who converge on all texts for all languages), and memory limited learners (who can remember only a constant number of examples back in time). It will prove to be the case that being set-driven or confident is a restriction for recursive learning from random text, too. Limited memory, on the other hand, does not act as a restriction.

Every learner that learns in the limit has a locking sequence [2], whose existence plays a crucial part in many proofs. After reading a locking sequence as a prefix of a text, a learner never again changes his hypothesis. Blum and Blum showed that every learner has a locking sequence and that every initial prefix of a text is a prefix of a locking sequence [2]. When learning from random text, locking sequences do not necessarily exist. We offer a simple characterization, in terms of topological properties of the failure set, of whether or not locking sequences nevertheless exist. The topology concerned is the natural topology on the sequences of words from the languages to be learned: A locking sequence exists iff (1) the failure set is not comeager and iff (2) it is not dense. Similarly, every prefix of a text is a prefix of a locking sequence iff (1) the failure set is meager and iff (2) it is nowhere dense.
In the second part of this paper we investigate the function that maps n to the probability that the learner has not converged to the correct hypothesis after reading n examples. This function will be called the convergence indicator. The limit of a convergence indicator is zero for every admissible probability distribution, as is to be expected. We show that, except for trivial cases, a convergence indicator cannot be smaller than exponential, i.e., it is always Ω(1)^n. This is true regardless of the class of languages to be learned and the underlying probability distribution; hence we have a lower bound. Does an upper bound exist as well? In other words, is it possible to learn arbitrarily slowly? The answer is in general 'yes', but for recursive learners 'no'. There exists some function f such that every recursive learner's convergence indicator is o(f(n)). This forms a kind of magical barrier: Either you can learn faster, or not at all. After finding lower and upper bounds for the convergence behavior of individual learners, we consider the more general task of finding the best convergence behavior among all learners that learn a class of languages: We can always construct a learner that learns with exponentially small tail bounds, i.e., its convergence indicator is O(1)^n. We also show that if a learner is set-driven, then his tail bounds are automatically exponentially small. This phenomenon has already been recognised for learners that are simultaneously set-driven and conservative [10].
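The convergence indicator can also be estimated empirically. The following sketch is purely illustrative and not part of the paper: it feeds i.i.d. random texts for a finite language to a toy learner that conjectures exactly the set of examples seen so far, and estimates the probability that this learner has not yet converged after n examples. All function names are assumptions of this sketch.

```python
import random

def random_text(language, weights, n, rng):
    """Draw n examples i.i.d. from `language` according to `weights`
    (this models one admissible distribution m_L)."""
    return rng.choices(language, weights=weights, k=n)

def convergence_indicator(language, weights, n, trials=10_000, seed=1):
    """Monte Carlo estimate of P[learner has not converged after n
    examples] for the toy learner conjecturing the set of examples seen."""
    rng = random.Random(seed)
    target = set(language)
    failures = sum(
        1 for _ in range(trials)
        if set(random_text(language, weights, n, rng)) != target
    )
    return failures / trials

# For L = {1, 2} with m_L(1) = m_L(2) = 1/2 the true indicator is
# P[only one element seen so far] = 2 * (1/2)**n, i.e. exponentially small.
print(convergence_indicator([1, 2], [0.5, 0.5], 5))
```

For this toy learner the indicator is exponentially small, which anticipates the tail bound results of Section 5.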
2 Preliminaries
This section contains the fundamental definitions. The notation is almost identical to that in Osherson, Stob, and Weinstein's textbook [8]. The natural numbers are denoted by N = {1, 2, 3, ...}. A language is a subset of the natural numbers. Let W_i be the language accepted by the ith Turing machine and RE = {W_1, W_2, ...} the recursively enumerable languages. A text is a sequence of natural numbers. If t is a text, then t_i denotes its ith component, i.e., t = (t_1, t_2, ...). The range rng(t) of a text t is the set of all its components, i.e., rng(t) = {t_1, t_2, ...}. We say t is a text for L if rng(t) = L. The set of all texts for L is denoted by T_L, and T is the set of all texts. The prefix of length n of a text t is denoted by t̄_n = (t_1, t_2, ..., t_n).

Let F be the set of all partial functions from N to N. We will identify finite sequences with natural numbers. If φ ∈ F and lim_{n→∞} φ(t̄_n) = i, i.e., φ(t̄_n) = i for all but finitely many n, we say that φ converges on t to i and write φ(t) = i. We say φ converges on t if it converges on t to some i. If W_φ(t) = L for every text t for L, then we say that φ identifies L (in the limit or from text). If L ⊆ RE and φ identifies every L ∈ L, then we say that φ identifies L. Let S ⊆ F be a strategy. Then [S] denotes the set of all classes L ⊆ RE that are identified by some φ ∈ S. In particular, [F] are all learnable classes and [F_rec] all recursively learnable classes.

We define a topology T_L = (T_P(L), O_L), where T_P(L) is the set of all texts for all subsets of L. The open sets O_L are defined by a base consisting of all sets B_σ^L = { t ∈ T_P(L) | t̄_n = σ for n = lh(σ) }, i.e., all texts for all subsets of L that have the prefix σ. (lh(σ) is the length of σ.) The induced topology is the sequence space over L. More details can be found in [8].
With the help of T_L we can define a probability space (T_P(L), A, M_L), where A, the measurable sets, is the smallest σ-algebra that contains all basic open sets B_σ^L, and M_L is an admissible probability measure defined via

M_L(B_σ^L) = ∏_{i=1}^{lh(σ)} m_L(σ_i),

where m_L is a probability measure on L that must fulfill m_L(n) > 0 iff n ∈ L, and m_L(L) = 1. The random variable T denotes a random text for L; technically, T is the identity function on T. If P is a predicate for texts, we use the usual abbreviation M_L[P(T)] = M_L({ t | P(t) }). For example, M_L[T ∈ B_σ^L] = M_L({ t ∈ T | T(t) ∈ B_σ^L }) = M_L({ t ∈ T | t ∈ B_σ^L }) = M_L(B_σ^L). We are now ready to define learning from random text formally.
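As a concrete aside, the product formula for the measure of a basic open set can be evaluated directly. The following sketch is illustrative only; the function names and the example weights are assumptions of this sketch, not part of the paper.

```python
from math import prod

def m_L(n, language, weights):
    """Probability measure on L: positive exactly on the members of L."""
    return weights.get(n, 0.0) if n in language else 0.0

def measure_of_basic_open_set(sigma, language, weights):
    """M_L(B_sigma^L) = product of m_L(sigma_i) over the prefix sigma."""
    return prod(m_L(n, language, weights) for n in sigma)

L = {1, 2, 3}
w = {1: 0.5, 2: 0.25, 3: 0.25}
print(measure_of_basic_open_set((1, 1, 2), L, w))   # 0.5 * 0.5 * 0.25 = 0.0625
print(measure_of_basic_open_set((1, 4), L, w))      # 4 not in L -> 0.0
```

The second call shows the admissibility condition at work: a prefix containing a word outside L spans a basic open set of measure zero.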
Definition 1. A learner φ ∈ F identifies a language L ∈ RE from random text if M_L[W_φ(T) = L] = 1 for all admissible probability measures M_L. He identifies a class L from random text if he identifies all L ∈ L from random text. The set of all classes identified by learners in S from random text is denoted by [[S]].

3 Relations to Identification in the Limit
In this section we compare identification in the limit and identification from random text. The first result is that learning from random text is at least as powerful as learning from text for all strategies. Then we show that [F] = [[F]] and [F_rec] = [[F_rec]], i.e., that general learners, resp. recursive learners, as a whole are equally powerful in both models. The next theorem is based on the simple fact that a random text is a text with probability one.

Theorem 1. [S] ⊆ [[S]] for every set of strategies S ⊆ F.
Proof. If φ ∈ S identifies a language L from text, then it identifies every text for L. We have to show that φ identifies L on a random text with probability one. Let us fix some measure M_L. We want to show that M_L(T_L) = 1. We can write T_L = T_P(L) − ∪_{i∈L} T_{L−{i}}, i.e., the texts for L are all texts whose range is a subset of L minus all texts whose range is a proper subset of L. But T_P(L) is the sure event and therefore M_L(T_P(L)) = 1. All that remains is to show that M_L(T_{L−{i}}) = 0 for every i ∈ L. We can write

T_{L−{i}} ⊆ X_n = ∪ { B_σ^L | lh(σ) = n and σ contains no i }

and M_L(X_n) = (1 − m_L(i))^n, and consequently ∑_{n=1}^∞ M_L(X_n) < ∞. The Borel–Cantelli Lemma implies M_L(lim sup X_n) = 0 and M_L(T_{L−{i}}) = 0, since T_{L−{i}} ⊆ lim sup X_n. (lim sup X_n is the set of all elements that appear in infinitely many X_n.) Consequently M_L(T_L) = 1. We have shown that φ identifies L on all texts except on a measure zero set. Consequently it identifies L from random text.

The difficulty in the next theorem is to find a learner that fails on no text, if we only know that a learner exists that fails only on a set of measure zero.

Theorem 2. [F] = [[F]].

Proof. The ⊆-part is a special case of Theorem 1. To prove the opposite direction, let ψ ∈ F identify a class of languages L from random text. We can assume w.l.o.g. that ψ(σ) = ψ(τ) whenever W_ψ(σ) = W_ψ(τ). We construct a φ ∈ F that identifies L. The purpose of φ is to simulate ψ on a random text constructed from t with a particular probability distribution. The following is a detailed definition of φ:

Let σ = (n_1, n_2, ..., n_m) be some sequence. The measure m_σ is defined as m_σ(n_i) = ∑_{j: n_j = n_i} 2^{−j} for i = 1, ..., m, and m_σ(n) = 0 if n ∉ rng(σ). In this way we obtain a sequence m_{t̄_n} of measures. There is a limit distribution m_t(i) = lim_{n→∞} m_{t̄_n}(i) that is a probability measure. The corresponding probability measure on texts is M_t, which is defined as M_t(B_σ^L) = ∏_{i=1}^{lh(σ)} m_t(σ_i). We know from Theorem 1 that lim_{n→∞} M_t[W_ψ(T̄_n) = L] = 1 if t is a text for L, because M_t is admissible for L. Let us fix the language L and a text t for L. We can then find for every 1/3 > ε > 0 an N ∈ N such that M_t[W_ψ(T̄_n) = L] ≥ 1 − ε for all n ≥ N. To compute φ(t̄_n) it suffices to look at all ψ(σ) where lh(σ) = n and m_t(σ_i) > 0 for all i, so that M_t(B_σ^L) > 0. There must be an i with

∑ M_t(B_σ^L) ≥ 1 − ε,

where the sum is taken over all σ such that M_t(B_σ^L) > 0, lh(σ) = n, and ψ(σ) = i. Moreover, W_i = L. We cannot define φ(t̄_n) as that i because m_t is not known. However, m_{t̄_n} and m_t are very similar: If σ contains no j ∉ rng(t̄_n), then clearly M_{t̄_n}(B_σ^L) = M_t(B_σ^L). The M_t-measure of all B_σ^L with lh(σ) = n and rng(σ) ⊄ rng(t̄_n) is at most n·2^{−n}. Hence,

∑ M_{t̄_n}(B_σ^L) ≥ ∑ M_t(B_σ^L) − ε/2 ≥ 1 − (3/2)ε

for sufficiently large n (such that n·2^{−n} < ε/2). Consequently, we can define φ(t̄_n) as that i for which the sum of all M_{t̄_n}(B_σ^L) over all σ with rng(σ) ⊆ rng(t̄_n), lh(σ) = n, and ψ(σ) = i is bigger than 1/2. If such an i does not exist, then φ(t̄_n) is undefined. It is not hard to see that φ identifies all L ∈ L.

The same is valid for recursive learners:

Theorem 3. [F_rec] = [[F_rec]].
Proof. The construction of φ(t̄_n) in Theorem 2 is not effective (this refers to the w.l.o.g.-assumption in that proof). Let ψ ∈ F_rec identify L from random text. We construct a ψ' ∈ F_rec that also identifies L from random text, but that has an additional property: If ψ' identifies a language L from random text, then for each admissible M_L there is a single index i with M_L[ψ'(T) = i] = 1 and W_i = L. In order to define ψ' we decompose T into a countable number of disjoint texts T^k, for example by T^k_n = T_⟨k,n⟩, where ⟨k,n⟩ = 2^k·3^n. Then ψ'(T̄_⟨k,n⟩) = min{ ψ(T̄^i_n) | 1 ≤ i ≤ k }. (If m cannot be written as ⟨k,n⟩, then ψ'(T̄_m) = ψ'(T̄_⟨k,n⟩) where ⟨k,n⟩ is the next smaller number of this form.) For each i with M_L[ψ(T) = i] > 0 the probability that ψ(T^k) = i for some k is one. Therefore, with probability one ψ'(T) = i, where i is the minimal number with M_L[ψ(T) = i] > 0. If we use ψ' instead of ψ in the construction of Theorem 2, we can avoid deciding whether W_i = W_j. Then the construction becomes effective.

The remainder of this section deals with three strategies: set-driven, memory limited, and confident learners. If [S] = [F], then [[S]] = [[F]] (a simple corollary of the above theorem). If, however, [S] ⊊ [F], then both [[S]] = [[F]] and [[S]] ⊊ [[F]] remain possible (the same applies for F_rec instead of F). It is known that [F_rec ∩ F_set-driven] ⊊ [F_rec] [11,4]. A learner is set-driven if each hypothesis depends only on the set of examples seen so far, but not on their order or multiplicity. The next theorem implies [F_rec ∩ F_set-driven] = [[F_rec ∩ F_set-driven]], since it states that both models are equivalent for set-driven learners: a set-driven learner cannot fail on a single text when learning from random text. Surprisingly, this is not true for rearrangement-independent learners, whose hypotheses depend only on the set of examples seen so far and on their number.

Theorem 4. Let L ∈ RE be a language and let φ ∈ F_set-driven identify L from random text. Then φ identifies L.

Proof. If L is finite, then φ locks as soon as it has seen all words from L. Otherwise it would converge on no text at all to the right hypothesis. Hence, it converges on all texts.
If L is infinite, then take an arbitrary text t = (n_1, n_2, n_3, ...) for L. We can assume that all n_i are pairwise distinct, since φ is set-driven. Define M_n = {n_1, ..., n_n} and S = { s ∈ T | rng(s̄_n) ⊆ M_n for all n ∈ N }. Then there is a measure m_L such that M_L(S) > 0, e.g., m_L(n_k) = c·e^{−4^k} with c = 1/∑_{k=1}^∞ e^{−4^k} (a short computation shows M_L(S) > 1/2). Since φ is set-driven, it converges on all texts from S to the same hypothesis or it diverges on all of them. The latter cannot happen, since S does not have measure zero. Therefore φ converges to the correct hypothesis on all of S. Since t ∈ S, φ identifies t.

Corollary 1. [S] = [[S]] for all S ⊆ F_set-driven.
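Set-drivenness can be made concrete: a set-driven learner factors through the range of its input sequence, so any two sequences with the same range must receive the same hypothesis. A minimal sketch (illustrative only; the wrapper and the toy hypothesis function are assumptions of this sketch):

```python
def set_driven(learner):
    """Wrap a hypothesis function so that it depends only on the set of
    examples seen, not on their order or multiplicity."""
    def wrapped(sequence):
        return learner(frozenset(sequence))
    return wrapped

@set_driven
def guess(examples):
    # Toy hypothesis: conjecture exactly the set of examples seen so far.
    return examples

# Same range -> same hypothesis, regardless of order and repetitions.
print(guess((1, 2, 2, 1)) == guess((2, 1)))   # True
```

A rearrangement-independent learner would instead receive the pair (range, length), which is exactly the extra information exploited in Theorem 5 below.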
The same result does not hold for rearrangement-independent learners, which shows that the number of examples plays a crucial role in this context.

Theorem 5. There is a S ⊆ F_rearrangement-independent such that [[S]] ≠ [S].

Proof. Let L = {{1}, N} and let φ be defined via

φ(σ) = i  if σ = (1, 1, ..., 1) or rng(σ) = {1, 2, 3, ..., lh(σ)},
φ(σ) = j  otherwise.

Obviously, φ is rearrangement-independent (but not set-driven). While φ identifies L from random text, it does not identify L. Now choose S = {φ}. It is easy to see that φ identifies N from random text, since φ identifies every text t for N unless t_i = t_1 for all i > 1. This happens with probability lim_{n→∞} (1 − m_L(1))^n = 0.

Memory limited learners are an example of a natural strategy causing no restriction on learning from random text, but causing a restriction on learning in the limit. A learner is memory limited if there is a number n such that φ(σ) depends only on φ((σ_1, σ_2, ..., σ_{lh(σ)−1})) and on σ_{lh(σ)}, σ_{lh(σ)−1}, ..., σ_{lh(σ)−n} for all sequences σ [12]. A memory limited learner remembers only his last hypothesis and the last n examples.

Theorem 6. [[F_memory limited]] = [[F]].
Proof. First we show that every random text is a fat text (a text that contains everything infinitely often) with probability one. Let f: N × N → N be an arbitrary bijective function. Then T^(i) = (T_f(i,1), T_f(i,2), T_f(i,3), ...) is a random text for every i ∈ N. We have already shown that every random text is a text with probability one. Hence, T contains an infinite number of texts T^(i) with probability one.
Let L ⊆ RE and let φ identify L from random text. Then there is a ψ that identifies L from text (Theorem 2), and we can assume that ψ identifies L memory limited from fat text [8]. Since every random text is a fat text with probability one, ψ identifies L from random text, too.

Confidence is an example where old proof techniques partially transfer to learning from random text. In most cases this is not possible, in particular if the proofs are based on locking sequence arguments. A learner is confident if he converges on all texts, even on texts for a language that he does not identify.

Theorem 7. [[F_confident]] ⊊ [[F]].

Proof. We can adapt the proof of [F_confident] ⊊ [F] of Osherson, Stob, and Weinstein [8]. While the basic principle remains the same, the proof for learning from random text is much more involved, as they use failure on single texts, while we have to provide sets with non-zero measure. Suppose φ ∈ F_confident identifies RE_fin (the finite languages) from random text. Let σ_0 be the shortest sequence of zeros such that W_φ(σ_0) = {0}. Such a σ_0 exists, since (0, 0, ...) is the only text for {0} and therefore every random text for {0} coincides with it with probability 1. Let σ_1 be the shortest sequence of zeros and ones such that W_φ(σ_0 σ_1) = {0, 1}. Let L = {0, 1} and let M_L be admissible. Again, this σ_1 exists because M_L(B_{σ_0}^{{0,1}} − B_{σ_0}^{{0}}) > 0, and therefore there are many sequences starting with σ_0 on which φ even identifies {0, 1}. Generally we define σ_n to be the shortest sequence over {0, 1, ..., n} such that W_φ(σ_0 σ_1 ··· σ_n) = {0, 1, ..., n}. For analogous reasons as above, σ_n must exist.
However, φ obviously does not converge on σ_0 σ_1 σ_2 ··· and is therefore not confident. Since RE_fin ∈ [[F]], the claim follows.
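The decomposition of a text into countably many subtexts via the pairing ⟨k, n⟩ = 2^k · 3^n, used in the proof of Theorem 3 (and, with an arbitrary bijection, in the proof of Theorem 6), can be sketched as follows. This is an illustration only; the helper name and the example are assumptions of this sketch.

```python
def subtext(t, k, length):
    """First `length` entries of the k-th subtext T^k, where
    T^k_n = T_<k,n> and <k,n> = 2**k * 3**n; `t` maps 1-based
    positions to words."""
    return [t(2 ** k * 3 ** n) for n in range(1, length + 1)]

# With the "text" t_i = i the selected positions themselves become visible:
print(subtext(lambda i: i, 1, 3))   # [6, 18, 54]
```

Since the position sets {2^k · 3^n | n ∈ N} are pairwise disjoint and the entries of a random text are drawn independently, each subtext is again an i.i.d. random text for the same distribution.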
4 Locking Sequences
Locking sequences play a crucial role in many proofs. In general, no locking sequence may exist when learning from random text. The following theorems give simple characterizations of when locking sequences exist and when not.

Lemma 1. T_L ∩ B_σ^L and B_σ^L are equivalent modulo a meager set in T_L.

Proof. T_L ∩ B_σ^L = B_σ^L − ∪_{i∈L} B_σ^{L−{i}}, and B_σ^{L−{i}} is nowhere dense if i ∈ L: every non-void open set contains some basic open set B_τ^L, which itself contains B_{τi}^L, and the latter is disjoint from B_σ^{L−{i}}.

Theorem 8. Let φ ∈ F_total identify L ∈ RE from random text. Let M be the measure zero set of texts for L on which φ does not identify L. Then the following three statements are equivalent: (1) Every σ with rng(σ) ⊆ L is a prefix of a locking sequence for L. (2) M is meager in T_L. (3) M is nowhere dense in T_L.
Proof. We show 2 ⇒ 1 ⇒ 3 ⇒ 2. Fix some admissible M_L and let rng(σ) ⊆ L. Since φ identifies L on all texts in T_L − M,

T_L ∩ B_σ^L ⊆ ∪ { F^{−1}({t}) | t is stabilized on L and t ∈ B_σ^L } ∪ M.   (1)

Since B_σ^L and T_L ∩ B_σ^L are equivalent modulo a meager set, T_L ∩ B_σ^L is not contained in a meager set by the Baire category theorem. In particular, the right hand side of (1) is not meager, and since M is meager, some F^{−1}({t}) with t stabilized on L and t ∈ B_σ^L cannot be nowhere dense. Since F^{−1}({t}) is closed, this means that ∅ ≠ Int(F^{−1}({t})) ⊆ F^{−1}({t}). As a closed set with non-void interior, F^{−1}({t}) contains some basic open set B_τ^L. The corresponding sequence τ is a locking sequence for φ on L and has σ as a prefix.
1 ⇒ 3: Let us assume that t is an accumulation point of M and let σ be a prefix of t. Then every open subset of B_σ^L is not disjoint from M. Hence σ is not a locking sequence. Since we can choose σ arbitrarily, no prefix of t is a prefix of a locking sequence.
3 ⇒ 2: Every set that is nowhere dense is meager.
Theorem 9. Let φ ∈ F_total identify L ∈ RE from random text. Let M be the measure zero set of texts for L on which φ does not identify L. Then the following conditions are equivalent: (1) φ has a locking sequence for L. (2) M is not dense in T_L. (3) M is not comeager in T_L.

Proof. 1 ⇒ 2: Assume M is dense and σ is a locking sequence. This is not possible: since M is dense, B_τ^L ∩ M ≠ ∅ for every τ with rng(τ) ⊆ L; since σ is a locking sequence, however, B_σ^L ∩ M = ∅.
2 ⇒ 3: If a set is not dense, it is not comeager.
3 ⇒ 1: If M is not comeager, then T_L − M is not meager, because T_L is itself comeager. As before we can argue as follows:

T_L − M ⊆ ∪ { F^{−1}({t}) | t is stabilized on L }.

Since T_L − M is not meager, some F^{−1}({t}) is not nowhere dense and, as a closed set, contains some basic open set B_σ^L, and σ is a locking sequence.

The above two theorems are stated for total functions. The reason is that the proofs use F, which is a function that maps texts to texts and not to partial texts. It is possible, but very technical, to modify the proofs for partial functions. A simpler way is the following theorem, from which the same generalization immediately follows:

Theorem 10. Let φ ∈ F identify L ⊆ RE from random text. Then there is a ψ ∈ F_total that also identifies L from random text and has the same failure set as φ for every L ∈ L. Moreover, φ and ψ have the same locking sequences for each L ∈ L.
Proof. Let W_i ∉ L. Define ψ as

ψ(σ) = φ(σ)  if φ(σ) is defined,
ψ(σ) = i      otherwise.

Obviously, φ and ψ share the same locking sequences. If φ diverges on a text, then ψ diverges, too, or converges to i, thus failing to identify any L ∈ L. If φ identifies some L ∈ L, then ψ converges to the same hypothesis. Hence, the failure sets are identical.
Since φ ∈ F and the corresponding ψ ∈ F_total share conditions (1), (2), and (3) in Theorems 8 and 9, the three conditions are also equivalent for a partial function φ ∈ F.
An example of how to use these characterizations: Can a learner fail on t iff t consists only of 1's with finitely many exceptions? The answer is no: such a failure set would be meager but dense, contradicting the equivalence of conditions (2) and (3) in Theorem 8. Another application is the following theorem, which guarantees that confident learners have locking sequences. Confident learners can have non-void failure sets, but these cannot be dense.

Theorem 11. Every φ ∈ F_confident that identifies a language L from random text has a locking sequence for L.

Proof. Since φ is confident, it converges on every text. Therefore

T_L ⊆ ∪ { F^{−1}({t}) | t is stabilized }

and again, because of Baire's category theorem, some F^{−1}({t}) is not nowhere dense and contains a basic open set B_σ^L. This σ is a locking sequence. This theorem can also easily be generalized such that even every sequence is a prefix of a locking sequence.
5 Tail Bounds
It can be shown that general learners can learn arbitrarily slowly, i.e., the probability that they still fail after n rounds (the convergence indicator) converges arbitrarily slowly towards zero (the proof is based on ignoring larger and larger segments of the text, which slows down learning). However, the next theorem shows that recursive learners cannot learn arbitrarily slowly: Either they converge fast or they cannot learn at all.

Theorem 12. Let L ∈ RE and let some m_L be fixed. Then there is a function h: N → Q such that f(n) = o(h(n)) for every f that is a convergence indicator for a φ ∈ F_rec that identifies L with probability one.
Proof. We define an oracle O: N × Q⁺ → Q such that |O(i, ε) − m_L(i)| < ε, i.e., O(i, ε) is a rational number that is very near to m_L(i). Moreover, let O(i, ε) = 0 iff m_L(i) = 0, i.e., iff i ∉ L. The oracle O is not uniquely determined; we just choose some O with these properties. An oracle Turing machine that has access to O, and additionally access to the halting problem for Turing machines with oracle O, can compute a lower bound for f(n) (the details are omitted). Hence, whenever φ measure-one identifies a language L with convergence indicator f, then there is a function g with g(n) > f(n). Moreover, g is computable by some kind of oracle Turing machine whose oracle depends only on m_L, but not on φ. Therefore, let h be a function shrinking so slowly towards zero for n → ∞ that no oracle Turing machine as defined above can compute a function that shrinks more slowly (it can easily be constructed by diagonalization). This h is the claimed function.

The following theorem states that for each nontrivial learning problem exponential tail bounds can always be achieved and are also the best bounds possible.

Theorem 13. If L ∈ [[F]] (resp. L ∈ [[F_rec]]) and L contains at least two languages that are not disjoint, then the convergence indicator of every learner for some L ∈ L is always Ω(1)^n. On the other hand, there is always a φ ∈ F (resp. F_rec) that learns L from random text and whose convergence indicator for every L ∈ L is O(1)^n.

Proof. The lower bound follows from a text (i, i, i, ...) where i is contained in two languages of L: each learner fails on at least one of them. The upper bound follows from rearrangement-independent learners: if L is identified by a rearrangement-independent learner, its convergence indicator is O(1)^n, since the learner converges as soon as the examples read contain a locking set.

This proof also shows that every rearrangement-independent learner (and thus every conservative one) automatically has exponentially small tail bounds. It is already known that this is the case for learners that are simultaneously set-driven (or rearrangement-independent) and conservative [10]. Then, however, there exists a tight relationship between the tail bounds and the expected learning time, which is lacking if the learner is not conservative.
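The exponential decay asserted by Theorem 13 can be checked numerically for the toy learner that conjectures the set of examples seen: for a finite language, its convergence indicator is the probability that some element of L is still missing after n i.i.d. draws, which by the union bound is at most the sum of (1 − m_L(i))^n over i ∈ L. A sketch (illustrative only; the function name and example weights are assumptions of this sketch):

```python
def tail_bound(weights, n):
    """Union bound: for the learner conjecturing the set of examples seen,
    P[not converged after n i.i.d. draws] <= sum_i (1 - m_L(i))**n."""
    return sum((1 - p) ** n for p in weights)

# For m_L = (1/2, 1/4, 1/4) the bound decays like (3/4)**n, i.e. it is O(1)**n.
for n in (10, 20, 40):
    print(n, tail_bound((0.5, 0.25, 0.25), n))
```

The base of the exponential is governed by the smallest probability in m_L, which is why the bound holds for each fixed admissible distribution rather than uniformly over all of them.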
6 Conclusion
A stochastic model is a prerequisite for studying the speed of learning in inductive inference, which was the main objective for starting this line of research. There are several stochastic models available in which positive examples are generated independently and identically distributed according to a distribution. Kapur and Bilardi show how to construct learners in a uniform way [6]. If the same learner must identify a language under all reasonable probability distributions, then the only languages that are learnable are those that are also learnable in Gold's model of learning in the limit [5]. We call the latter stochastic model learning from random text. This model captures all classes of languages learnable in Gold's model and none else. However, learners restricted in some ways can learn more classes from random text than from text; an example is memory limited learners. On the other hand, for many strategies the two models coincide.
While locking sequences always exist for Gold-style learners, this is not necessarily the case for learning from random text. The existence of locking sequences is closely related to the topological properties of the failure sets.
The general results on the speed of learning are as follows. One problem of inductive inference is that a learner never knows whether he has already converged or whether he will have to change his hypothesis somewhere in the future. Exponentially small tail bounds let the probability of the latter drop very fast, so exponentially small tail bounds are a useful property of a learner. We have seen that everything that can be learned at all can also be learned with exponentially small tail bounds, but not better. In particular, stochastically finite learning [10] is always possible in principle.
References 1. D. An luin. Findin patterns common to a set of strin s. Journal of Computer and System Sciences, 21(1):46 62, 1980. 2. L. Blum and M. Blum. Toward a mathematical theory of inductive inference. Information and Control, 28:125 155, 1975. 3. T. Erlebach, P. Rossmanith, H. Stadtherr, A. Ste er, and T. Zeu mann. Learnin one-variable pattern lan ua es very e ciently on avera e, in parallel, and by askin queries. In M. Li and A. Maruoka, editors, Proceedin s of the 8th International Workshop on Al orithmic Learnin Theory, number 1316 in Lecture Notes in Computer Science, pa es 260 276. Sprin er-Verla , October 1997. 4. M. A. Fulk. Prudence and other conditions on formal lan ua e learnin . Information and Computation, 85:1 11, 1990. 5. E. M. Gold. Lan ua e identi cation in the limit. Information and Control, 10:447 474, 1967. 6. S. Kapur and G. Bilardi. Lan ua e learnin from stochastic input. In Proceedin s of the 5th International Workshop on Computational Learnin Theory, pa es 303 310. ACM, 1992. 7. S. Kapur and G. Bilardi. Learnin of indexed families from stochastic input. In The Australasian Theory Symposium (CATS’96), pa es 162 167, Melbourne, Australia, January 1996. 8. D. Osherson, M. Stob, and S. Weinstein. Systems That Learn: An Introduction for Co nitive and Computer Scientists. MIT Press, Cambrid e, Mass., 1986. 9. R. Reischuk and T. Zeu mann. A complete and ti ht avera e-case analysis of learnin monomials. In C. Meinel and S. Tison, editors, Proceedin s of the 16th Symposium on Theoretical Aspects of Computer Science, number 1563 in Lecture Notes in Computer Science, pa es 414 423. Sprin er-Verla , 1999. 10. P. Rossmanith and T. Zeu mann. Learnin k-variable pattern lan ua es e ciently stochastically nite on avera e from positive date. In V. Honavar and G. Slutzki, editors, Proceedin s of the 4th International Colloquium on Grammatical Inference,
Peter Rossmanith
number 1433 in Lecture Notes in Artificial Intelligence, pages 13-24, Ames, Iowa, July 1998. Springer-Verlag.
11. G. Schäfer. Über Eingabeabhängigkeit und Komplexität von Inferenzstrategien. PhD thesis, Rheinisch-Westfälische Technische Hochschule Aachen, 1984. In German.
12. K. Wexler and P. Culicover. Formal Principles of Language Acquisition. MIT Press, Cambridge, Mass., 1980.
13. T. Zeugmann. Lange and Wiehagen's pattern learning algorithm: An average-case analysis with respect to its total learning time. Annals of Mathematics and Artificial Intelligence, 23(1-2):117-145, 1998.
Inductive Learning with Corroboration

Phil Watson

Department of Computer Science, University of Kent at Canterbury, Canterbury, Kent CT2 7NZ, United Kingdom.
[email protected]
Abstract. The basis of inductive learning is the process of generating and refuting hypotheses. Natural approaches to this form of learning assume that a data item that causes refutation of one hypothesis opens the way for the introduction of a new (for now unrefuted) hypothesis, and so such data items have attracted the most attention. Data items that do not cause refutation of the current hypothesis have until now been largely ignored in these processes, but in practical learning situations they play the key role of corroborating those hypotheses that they do not refute. We formalise a version of K.R. Popper's concept of degree of corroboration for inductive inference and utilise it in an inductive learning procedure which has the natural behaviour of outputting the most strongly corroborated (non-refuted) hypothesis at each stage. We demonstrate its utility by providing characterisations of several of the commonest identification types in the case of learning from text over class-preserving hypothesis spaces and proving the existence of canonical learning strategies for these types. In many cases we believe that these characterisations make the relationships between these types clearer than the standard characterisations. The idea of learning with corroboration therefore provides a unifying approach for the field.
Keywords: Degree of Corroboration; Inductive Inference; Philosophy of Science.
1 Introduction
The field of machine inductive inference has developed in an ad hoc manner, in particular in the characterisations of identification types which have been achieved. In this paper we wish to propose a new unifying framework for the field based on the philosophical work of K. R. Popper, and in particular his concept of degree of corroboration. We will demonstrate that many of the existing identification types in the case of learning from text allow an alternative characterisation using the concept of learning with corroboration; in particular this approach reveals the existence of canonical learning algorithms for the various types.
O. Watanabe, T. Yokomori (Eds.): ALT'99, LNAI 1720, pp. 145-156, 1999.
© Springer-Verlag Berlin Heidelberg 1999
Phil Watson
We will be concerned with learning indexable recursive families of recursive languages from text. We restrict our attention to the standard case of class-preserving hypothesis spaces for C, i.e. those indexed recursive families H1, H2, ... such that for every L ∈ C there exists at least one (and possibly many) i such that Hi describes L, and every Hi describes some L in C. We assume the standard definitions in the field of machine inductive inference [Go67, An80, AS83]. Definitions of other concepts used in this paper may be found as follows: strong monotonic learning (SMON-TXT) [Ja91]; refuting inductive inference machines (RIIMs) [MA93, LW94]; justified refuting learning (JREF-TXT) [LW94]; set-driven learning (s-*-TXT) [WC80, LZ94]. Our notation will mostly be standard. We mention the following points. IN will be the natural numbers {0, 1, 2, ...} while IN+ will be the positive integers {1, 2, 3, ...}. Our languages L will be non-empty sets of words over a fixed finite alphabet Σ; therefore L ⊆ Σ*. We will write Seq(Σ) for the space of all finite and infinite sequences from Σ*; therefore if t is a text for language L, we have t ∈ Seq(Σ). We write tm for the finite initial subsequence of t of length m + 1, and t+m for the content of tm, i.e. if t = s0, s1, s2, ..., then t+m = {si | i ≤ m}. Index(C) will be the set of all class-preserving recursive indexings L of class C of recursive languages; such indexings will be our hypothesis spaces. If hypothesis H ∈ L describes language L ∈ C, where L ∈ Index(C), then we abuse notation slightly by writing H = L. Similarly if H1, H2 ∈ L describe the same L ∈ C we write H1 = H2. We say that tm refutes H iff t+m ⊈ H. The set of all texts for H will be written Txt(H), while Txts(L) will be the set of all texts t such that (∃H ∈ L) t ∈ Txt(H).
2 Degree of Corroboration

2.1 Popper's 'Logic of Scientific Discovery'
The philosopher K.R. Popper [Po34, Po54, Po57, Po63] defined a philosophical and logical system covering the epistemology and practice of science. A central plank of this system was the concept of degree of corroboration, C(x, y), meaning the degree to which a theory x receives support from the available evidence y. Evidence supporting x causes C(x, y) to increase in value, while evidence undermining x causes C(x, y) to decrease. A set of ten desiderata [Po34, Po57] defined C(x, y). Space precludes a full discussion of these desiderata here; the reader is referred to [Wa99]. Popper's degree of corroboration is a practical measure enabling us to choose between unrefuted theories, given that a scientific theory is, by its very nature, incapable of proof (another major strand of Popper's work was concerned with settling this point). We should tentatively believe the best-corroborated hypothesis at any given time. An interesting recent discussion of Popper's work is to be found in Gillies [Gi93, Gi96]. There a logical system is characterised as one with both inferential and control elements; in both Popper's work and the present paper corroboration plays the role of control. Indeed, given the characterisations using canonical
learners which we have obtained for standard learning types (Section 4), we may say that degree of corroboration is the only control element necessary in machine inductive inference. A more detailed discussion of Gillies's work may be found in [Wa99].

2.2 Our Differences from Popper's Approach - Discussion
Restricted Domain We wish to define a corroboration function analogous to Popper's but for use in the domain of inductive learning theory. This restricted domain enables us to make a number of simplifying assumptions compared to Popper's version. First we note that we always wish to state how well a hypothesis is corroborated by data. This is already more specific than Popper's approach, in which he specifically allows the corroboration of, for example, one theory by another. Our hypotheses will be those of an inductive inference machine and will come from a particular hypothesis space, within which we aim to find a true description of the phenomenon producing the data, which will be a recursive language. The data will be a sequence of examples forming a text (or strictly speaking, forming at any particular time an initial segment of a text) for the phenomenon. c(H, t) will be the degree to which example text t corroborates hypothesis H. Lower case is used to distinguish our versions of Popper's functions.

Fixed Values We assume that data is free of noise, and that we aim to find a hypothesis which exactly describes or explains the concept producing the data. Now the idea that data undermines (Popper's choice of word) a theory can be replaced by outright refutation in the case that data disagrees with the predictions of the theory. Thus all the possible negative values in Popper's scheme may be replaced in ours by −1, the corroboration value of refuted hypotheses. Similarly the value 0, reserved by Popper for the degree of corroboration offered to x by an independent theory y, subtly changes its meaning when we restrict ourselves to corroboration of hypotheses by data. The value 0 is now the corroboration given to any theory by the empty data set ∅, by vacuous data which gives us no help in choosing between competing hypotheses in our space, or in the case that the theory itself is tautological, metaphysical or otherwise not logically refutable.
References to Probability For historical reasons, Popper's desiderata are tied closely to definitions in probability; specifically, Popper sets out to demonstrate that degree of corroboration is in no sense a measure of probability. For our purposes, we have no need of any directly defined probabilistic measures. In a powerful argument, Popper identified the maximum degree of corroboration possible for a hypothesis with its logical improbability, and therefore with its scientific interest. Similarly, we use c(H) to mean the highest degree of corroboration of which H is capable; however we drop the reference to P(x) in Popper's definition of C(x) and instead add some natural restrictions on c(H).
Popper's dependence on probabilistic definitions leads him to restrict the maximum degree of corroboration in any case to the value 1. Objections to this unnecessary restriction led him to drop it in [Po57], and we do likewise. Further, we may drop the restriction of degrees of corroboration to real number values altogether, and use any partially ordered set S with a minimum element −1 such that S − {−1} has a minimum element 0, and decidable (recursive) relations ≤ and <.

2.3 Our Definition of Degree of Corroboration
Let H range over hypotheses from our space L, and t over texts and finite initial segments of texts. We assume that c(H, t) ranges over some partially ordered set S with minimum element −1 and an element 0 minimal in S − {−1}. We write c(H) for the maximum degree of corroboration possible for H. Falsifiers(H) is the set of potential data items in Σ* which refute H. If Falsifiers(Hi) ⊆ Falsifiers(Hj) then we will write Hj ⊒ Hi to capture the natural Popperian sense that Hj is more easily refuted (potentially more strongly corroborable) than Hi. Our model of learning requires that c(H, t) and comparison (≤) between degrees of corroboration are both recursive, but not necessarily that c(H) is recursive or that c(Hi) ≤ c(Hj) is decidable. First we formally define our corroboration functions.

Definition 1. A corroboration function c : L × Seq(Σ) → S over L maps hypotheses and texts to some set S with minimum element −1 and an element 0 minimal in S − {−1} such that S has a decidable partial ordering ≤, and satisfies the following desiderata for all hypotheses H, H′ ∈ L and all texts t, t′ ∈ Seq(Σ):
1. c(H, t) = −1 iff there exists data in t which refutes H.
2. c(H, t) ≥ 0 iff t does not refute H.
3. c(H, t) = 0 if t is empty or contains no data capable of refutation of any hypothesis in our space.
4. c(H) = max{Lim_{n→∞} c(H, tn) | t is a text for H} is uniquely defined.
5. c(H) ≤ c(H′) if H ⊑ H′.
6. If t is a finite initial subsequence of t′ then either c(H, t) ≤ c(H, t′) or c(H, t′) = −1.

Note that item 5 in the definition implies that if H = H′ then c(H) = c(H′). Our definition of degree of corroboration is simpler than Popper's because we have dropped all reference to probability, and this gives us greater freedom when actually assigning values to our functions c(H) and c(H, t). We will see in the next section that certain inductive learning identification criteria will require corroboration functions with additional properties to those specified above.
3 Learning with Corroboration
In this section we cover the remaining assumptions and definitions necessary to define a theory of inductive learning with corroboration.
3.1 Hypotheses and Hypothesis Spaces
All forms of inductive inference suffer from the problem that the learner is required to choose one from among (typically) infinitely many hypotheses at each stage. Clearly no learner can consider all these hypotheses before it outputs a hypothesis or requests further data, so in effect there are only a limited number of hypotheses in play at any given time. Most authors gloss over this question as a matter of detail, or deal with it implicitly, but as we intend to propose a new unifying model for machine inductive inference, we feel constrained to deal with it explicitly. We therefore assume that along with our hypothesis space H1, H2, ... we have a recursive, monotonically increasing function ip : IN → IN with Lim_{n→∞} ip(n) = ∞ which gives the number of hypotheses in play at stage n of any learning procedure with this hypothesis space. This leads to one slight concession with respect to our desiderata: hypotheses Hj which are not yet in play at stage n need not be considered to be either refuted or corroborated by tn, the examples seen to that stage; we therefore arbitrarily assign c(Hj, tn) = 0 for such n, j. This cannot cause confusion as these hypotheses are (by definition) not considered by any algorithm; it serves only to simplify some algorithms defined in the proofs.

3.2 Corroboration Functions and Canonical Learners with Corroboration
In the following section (Section 4) we examine the use of corroboration in inductive learning and prove that many of the most natural inductive learning identification types can be characterised by an existence condition for a suitable corroboration function over the hypothesis space. Our intention is that this corroboration function (which is invariably recursive, so no undecidability results are implied, nor is any additional computing power gained illicitly) will be used as an oracle by a canonical learner for the appropriate type; this demonstrates that there is effectively a single best learning strategy for each identification type, and only the details of the corroboration function change depending on the hypothesis space. The behaviour of a learner with corroboration is defined as follows.

Definition 2. A Turing machine M with oracle c(H, t) is called a learner with corroboration if c(H, t) is a recursive corroboration function and, on input t with hypotheses H1, ..., Hp in play, M outputs some i ≤ p such that c(Hi, t) > 0 is maximal among the c(Hj, t), j = 1, ..., p, if defined, and requests more input otherwise. If additionally M learns within identification type I, we call M an I-learner with corroboration.

Clearly such a learner is consistent with Popper's dictum that we should prefer the most strongly corroborated hypothesis among competing hypotheses.
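As an illustration only (not from the paper), the behaviour required by Definition 2 can be sketched in Python. Everything concrete here is an assumption for the example: the hypotheses are small finite sets, the corroboration oracle `c` simply counts non-refuting evidence, `ip` is constant, and the score set is taken to be totally ordered so that `max` is well defined (the definition itself only requires a decidable partial order).

```python
def canonical_learner(c, ip, text):
    """Yield a conjecture (or None = 'request more input') after each datum.

    c(i, t)  -- recursive corroboration oracle for hypothesis H_i on text t
    ip(n)    -- number of hypotheses in play at stage n
    text     -- a finite initial segment of a text, as a list of data items
    """
    for m in range(len(text)):
        t_m = text[: m + 1]
        p = ip(m)
        # Best_m = { i <= p | c(H_i, t_m) > 0 and c(H_i, t_m) is maximal }
        scores = {i: c(i, t_m) for i in range(1, p + 1)}
        best_value = max(scores.values())
        if best_value > 0:
            yield min(i for i, s in scores.items() if s == best_value)
        else:
            yield None  # request more input

# Toy run: H_1 = {0,1}, H_2 = {0,1,2}; corroboration counts the distinct
# non-refuting examples seen, and refutation gives -1.
H = {1: {0, 1}, 2: {0, 1, 2}}
c = lambda i, t: -1 if not set(t) <= H[i] else len(set(t))
conjectures = list(canonical_learner(c, lambda n: 2, [0, 1, 2]))
print(conjectures)  # [1, 1, 2] -- switches to H_2 once H_1 is refuted
```

Note how ties are broken by `min`, exactly as the canonical learner in the proof of Theorem 1 outputs min(Best_m).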
4 Characterising TXT-Identification Types in Learning with Corroboration
In this section we are concerned only with learning from text, and often abbreviate the names of identification types by dropping the -TXT. Our learners always work with respect to class-preserving hypothesis spaces. Lack of space precludes the inclusion of most proofs. The proofs of the Theorems follow the form of the proof of Theorem 1, with additional details for the more complex learning types. The Corollaries concerning canonical learners rely on the observation that in each case the learner defined in the (⇐) part of the proof of the preceding Theorem depends on C only via c. All proofs may be found in [Wa99].

4.1 LIM- and s-LIM-Learning
Definition 3. A corroboration function c over L is called limiting iff

(∀H ∈ L)(∀t ∈ Txt(H))(∃i)[Hi = H ∧ (∃n)(∀m ≥ n)(∀j)[c(Hi, tm) > c(Hj, tm) ∨ [c(Hi, tm) ≥ c(Hj, tm) ∧ i ≤ j]]]
Theorem 1. C ∈ LIM-TXT iff there exists L ∈ Index(C) such that there is a recursive limiting corroboration function c over L.

Proof. (⇐) We define a learner M which uses such a recursive limiting c to LIM-learn any H ∈ L. Let t be a text. Let the hypotheses in play at stage m be H1, ..., Hp. At the (m + 1)th stage (i.e. on input tm) M behaves as follows:

M(tm) = min(Best_m) if defined; M requests more input otherwise,

where Best_m = {i | i ≤ p ∧ c(Hi, tm) > 0 ∧ (∀j ≤ p) c(Hi, tm) ≥ c(Hj, tm)}.

M is recursive: M recursively computes c(Hi, tm) for i = 1, ..., p and forms the finite set of those i for which c(Hi, tm) is maximal under the recursive relation ≤. M now outputs the minimum such i, unless the set is empty, in which case it requests more input. On presentation of a text t for H, M converges to some j such that Hj = H: fix t, an arbitrary text for H. Let n be the stage defined in Definition 3. Now there is some j with Hj = H such that at stage n and all subsequent stages m, M will output j, because j = min(Best_m) by the assumption that c is a limiting corroboration function and the definition of M.
(⇒) Suppose M is an inductive learning machine which LIM-learns C w.r.t. L. We define a recursive c which produces values (for degree of corroboration) ranging over IN ∪ {−1}. Let

c(Hj, tm) = −1 if tm refutes Hj; m + 1 if M(tm) = j; m otherwise.

c is recursive: it is decidable for any j whether tm refutes Hj, and by assumption M is an IIM. c is a limiting corroboration function over L: it is easily checked that c satisfies the conditions of Definition 1 and so c is a corroboration function. Let t be any text for H ∈ C. By assumption there exists an index j such that Hj = H and a stage n after which M always outputs j. Therefore at all stages m ≥ n we have c(Hj, tm) > c(Hk, tm) for all k ≠ j, which satisfies the requirements of Definition 3.

Corollary 1. If C ∈ LIM-TXT then there exists L ∈ Index(C) such that there is a recursive limiting corroboration function c over L with the property that

(∀H ∈ L)(∀t ∈ Txt(H))(∃i)[Hi = H ∧ (∃n)(∀m ≥ n)(∀j ≠ i) c(Hi, tm) > c(Hj, tm)]
Corollary 2. There is a canonical LIM-learner with corroboration which will learn any C ∈ LIM-TXT w.r.t. any L ∈ Index(C) using any recursive limiting corroboration function c over L as an oracle.

When considering the philosophical background for our model of learning, it seems clear that the order in which examples are presented to the learner, or the number of times the same example is repeated, has no significance. This leads us to the following definition.

Definition 4. A corroboration function c over L = H1, H2, ... is called natural if on all texts t, u, for all m, n: t+m = u+n implies (∀i) c(Hi, tm) = c(Hi, un).

It might be objected that corroboration functions lacking the naturalness property should be disallowed. However, they are no more unnatural than non-set-driven learners (it is known [LZ94] that s-LIM-TXT ⊂ LIM-TXT).

Theorem 2. C ∈ s-LIM-TXT iff there exists L ∈ Index(C) such that there exists a recursive natural limiting corroboration function c over L.

Corollary 3. There is a canonical s-LIM-learner with corroboration which will learn any C ∈ s-LIM-TXT w.r.t. any L ∈ Index(C) using any recursive natural limiting corroboration function c over L as an oracle.
4.2 Conservative and Strong Monotonic Learning
Definition 5. A corroboration function c : L × Seq(Σ) → S over L is called attaining if

(∀H ∈ L)(∀t ∈ Txt(H))[(∃j)(∃n)[Hj = H ∧ c(Hj, tn) = c(Hj)]] and
(∀i)(∀m)[c(Hi, tm) = c(Hi) ⟹ (∀H′ ∈ L)[[tm refutes H′ ∨ Hi ⊑ H′] ⟹ c(H′, tm) > c(Hi, tm)]]

c is a recursive attaining corroboration function if both c and cf : L × S → {0, 1} are total and recursive, where:

cf(Hi, s) = 1 if s = c(Hi); 0 otherwise.

Note that c(H, ∅) = 0 implies (∀i) c(Hi) ≥ 0.
Theorem 3. C ∈ CONSERV-TXT iff there exists L ∈ Index(C) such that there exists a recursive attaining corroboration function c over L.

Corollary 4. There is a canonical CONSERV-learner with corroboration which will learn any C ∈ CONSERV-TXT w.r.t. any L ∈ Index(C) using as an oracle any recursive attaining corroboration function c over L.

Definition 6. A corroboration function c(H, t) over L = H1, H2, ... is called strict if

(∀Hi ∈ L)(∀t ∈ Txt(Hi))(∀n)[c(Hi, tn) = c(Hi) ⟹ (∀Hj not refuted by t+n) Hj ⊑ Hi]

c is called a recursive strict corroboration function if both c and cf are total and recursive, where cf is as defined in Definition 5.

Theorem 4. C ∈ SMON-TXT iff there exists L ∈ Index(C) such that there exists a recursive strict attaining corroboration function c over L.

Corollary 5. There exists a canonical SMON-learner with corroboration which SMON-learns any C ∈ SMON-TXT w.r.t. any L ∈ Index(C) using any recursive strict attaining corroboration function over L as an oracle.

Corollary 6. There is a canonical (CONSERV, SMON)-learner with corroboration which will CONSERV-learn any C ∈ CONSERV-TXT w.r.t. any L ∈ Index(C) using any recursive attaining corroboration function c over L as an oracle, and will SMON-learn any C ∈ SMON-TXT w.r.t. any L ∈ Index(C) using any recursive strict attaining corroboration function c for L as an oracle.
4.3 FIN- and Refuting Learning
Definition 7. Let L = H1, H2, ... be a hypothesis space. Then f : Seq(Σ) × IN → {0, 1} is called a sufficiency function over L if

(∀t)(∀m)(∀n)[f(tm, n) = 1 ⟹ [(∀j) tm refutes Hj ∨ (∃i ≤ n)[t+m ⊆ Hi ∧ (∀k)[Hk = Hi ∨ tm refutes Hk]]]]

and (∀t)(∀j)(∀k ≥ j)(∀n)(∀m ≥ n)[f(tj, n) = 1 ⟹ f(tk, m) = 1]
Definition 8. Let f be a sufficiency function over L. f is called an inner sufficiency function over L if it additionally holds that for every text t ∈ Txts(L), (∃m, n) f(tm, n) = 1. If instead it holds that for every text t ∉ Txts(L), (∃m, n) f(tm, n) = 1, then f is called an outer sufficiency function over L.

Naturally the existence of a recursive (inner or outer) sufficiency function over L is a very strong condition and allows particularly strong forms of learning.

Theorem 5. C ∈ FIN-TXT iff there exists L ∈ Index(C) such that there exists a recursive inner sufficiency function over L.
Corollary 7. There exists a canonical FIN-learner which FIN-learns any C ∈ FIN-TXT w.r.t. any L ∈ Index(C) using any recursive inner sufficiency function over L as an oracle.

We may use a sufficiency function to define a particularly strong form of corroboration function.

Definition 9. c(H, t) is called a sufficient corroboration function over L if there exists an inner sufficiency function f(t, n) over L such that:

(∀t)(∀i)(∀m)[[c(Hi, tm) > 0 ∧ c(Hi, tm) = c(Hi)] ⟹ f(tm, i) = 1]

and

(∀t)(∀m)(∀n)[f(tm, n) = 1 ⟹ (∃i ≤ n) c(Hi, tm) = c(Hi)]

c is called a recursive sufficient corroboration function if both c and cf are total and recursive, where cf is as defined in Definition 5.

Theorem 6. C ∈ FIN-TXT iff there exists L ∈ Index(C) such that there exists a recursive sufficient corroboration function c over L.

Corollary 8. There exists a canonical FIN-learner with corroboration which FIN-learns any C ∈ FIN-TXT w.r.t. any L ∈ Index(C) using any recursive sufficient corroboration function over L as an oracle.
Theorem 7. C ∈ JREF-TXT iff there exists L ∈ Index(C) such that there exists a recursive outer sufficiency function f over L and a recursive limiting corroboration function c over L.

Corollary 9. There exists a canonical JREF-learner with corroboration which JREF-learns any C ∈ JREF-TXT w.r.t. any L ∈ Index(C) using any recursive outer sufficiency function and any recursive limiting corroboration function over L as oracles.
5 Example
The corroboration functions constructed in the proofs in Section 4 were simplistic. However in practical use, the existence or non-existence of appropriate corroboration functions may be suggested naturally by the space of hypotheses in use. We give an example of the use of corroboration functions to prove the learnability under certain identification criteria of a simple class. Our example languages will be sets of points in the rational plane Q², so Σ = {(a, b) | a, b ∈ Q}.

Example 1. Let C be the set of all closed circles of finite radius. Let ⟨·, ·⟩ be a fixed recursive bijection between Q² and IN+, and ⟨⟨·, ·⟩⟩ a fixed recursive bijection between Q² and Q. A suitable hypothesis space L = H1, H2, ... is given by

H⟨a,b⟩ = {(p, q) | a = ⟨⟨x, y⟩⟩ ∧ (p − x)² + (q − y)² ≤ b²}
It is easily seen that L is a class-preserving recursive indexing of C. Consider the following corroboration function c : L × Seq(Σ) → Q ∪ {∞}, which is based on the naturalistic idea that the further away a point is from a, the more severe a test it is of hypothesis H⟨a,b⟩. For circles of non-zero radius b we also include a scaling multiplier of 1/b² into the corroboration function, so that smaller circles are potentially more highly corroborable than large ones.

c(H⟨a,b⟩, tm) =
0 if t+m = ∅;
−1 if tm refutes H⟨a,b⟩, i.e. [a = ⟨⟨x, y⟩⟩ ∧ (∃(c, d) ∈ t+m)[(c − x)² + (d − y)² > b²]];
∞ if b = 0 ∧ a = ⟨⟨x, y⟩⟩ ∧ t+m = {(x, y)};
max(a, b, tm) otherwise,

where max(a, b, tm) = (1/b²) · max{((c − x)² + (d − y)²)/b² | a = ⟨⟨x, y⟩⟩ ∧ (c, d) ∈ t+m}.

With a little checking we see that c is indeed a corroboration function under Definition 1, and is recursive and natural. c is limiting because on any text t for Hi we have a stage m at which tm contains two diametrically opposed points on the circumference of the circle defined by Hi. Then if we let i = ⟨a, b⟩:
(∀j)[[j = ⟨c, d⟩ ∧ d² < b²] ⟹ [tm refutes Hj ∧ c(Hj, tm) = −1]]
(∀j)[[j = ⟨c, d⟩ ∧ d² > b²] ⟹ (∀n ≥ m) c(Hi, tn) = 1/b² > 1/d² ≥ c(Hj, tn)]
(∀j)[[j = ⟨c, d⟩ ∧ d² = b²] ⟹ [c = a ∨ tm refutes Hj]]

These are the only cases, so at all stages n ≥ m we have that H⟨a,b⟩ is the most strongly corroborated hypothesis (except for H⟨a,−b⟩, which is equally strongly corroborated and describes the same circle). c is also attaining because

if b = 0 then (∀a) c(H⟨a,0⟩) = ∞;
if b ≠ 0 then (∀a) c(H⟨a,b⟩) = 1/b²,

and for example c(H⟨a,b⟩, t0) = c(H⟨a,b⟩) where t = (x + b, y), ... is a text for H⟨a,b⟩ and a = ⟨⟨x, y⟩⟩. The above suffices to prove that C ∈ s-CONSERV-TXT, by Theorem 3. Finally we can see that c is not strict because for example (let b > 0) t = (x + b, y), ... results in c(H⟨⟨⟨x,y⟩⟩,b⟩, t0) = 1/b² = c(H⟨⟨⟨x,y⟩⟩,b⟩) although many hypotheses Hj with H⟨⟨⟨x,y⟩⟩,b⟩ ⊑ Hj remain unrefuted. Nevertheless it is possible to find a recursive, strict, attaining, limiting, set-driven corroboration function over L by requiring that two diametrically opposed points on the circumference of Hi must appear in the text before we set c(Hi, tm) = c(Hi). This proves that C ∈ s-SMON-TXT. The details are left as an exercise for the reader.
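As a hedged illustration (not part of the paper), the circle corroboration function can be sketched directly in Python. The encoding is an assumption: a hypothesis is given by its centre (x, y) and radius b rather than through the pairing bijections ⟨·⟩ and ⟨⟨·⟩⟩, and exact rational arithmetic stands in for Q.

```python
from fractions import Fraction
import math

def corroboration(x, y, b, text):
    """c(H_<a,b>, t_m) for the closed circle of centre (x, y) and radius b."""
    points = set(text)                       # t_m^+ , the content of t_m
    if not points:
        return 0                             # empty text corroborates nothing
    d2 = [(c - x) ** 2 + (d - y) ** 2 for (c, d) in points]
    if any(v > b * b for v in d2):
        return -1                            # a point falls outside: refuted
    if b == 0:
        return math.inf                      # degenerate circle, t_m^+ = {(x, y)}
    # scaled severity (d^2/b^2) * (1/b^2); attains 1/b^2 on the circumference
    return max(Fraction(v, b ** 4) for v in d2)

# Unit circle at the origin:
print(corroboration(0, 0, 1, []))                 # 0
print(corroboration(0, 0, 1, [(2, 0)]))           # -1 (refuted)
print(corroboration(0, 0, 1, [(1, 0), (-1, 0)]))  # 1, i.e. 1/b^2 attained
```

A point on the circumference attains the ceiling c(H⟨a,b⟩) = 1/b², matching the attaining argument above; points strictly inside score less.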
6 Conclusions and Future Work
We have proposed a unifying model for machine inductive inference based on the philosophical work of K.R. Popper, and obtained characterisations of many of the standard identification types in learning indexed families of recursive languages from text. In our model canonical learners use recursive oracles which compute a version of Popper's degree of corroboration. These learners then follow the natural strategy of preferring the most strongly (or at least a maximally strongly) corroborated hypothesis at any given time. Membership of a class of concepts within a particular identification criterion is then equivalent to the existence of a recursive corroboration function with certain properties depending on the identification type. We intend to extend this unifying model of learning to include language learning from informant and related problems such as learning of partial recursive functions. An extension of our approach to learning from noisy data would be particularly interesting; in this case it is no longer certain that a single adverse data item refutes a hypothesis and we would be obliged to allow negative corroboration values other than −1, as in Popper's original model. Given the crucial role played by the hypothesis space in our model, it would also be interesting to extend this approach to cover exact and class-comprising learning. Another interesting direction is to drop the requirement that our corroboration functions are recursive, thus obtaining a structure of 'degrees of unlearnability' analogous to the degrees of unsolvability of classical recursion theory.
Acknowledgements The author wishes to thank Prof. Dr. Steffen Lange of Universität Leipzig for his comments on an earlier draft, and also the anonymous referees.
References

[An80] D. Angluin, Inductive inference of formal languages from positive data, Information and Control 45, 117-135, 1980.
[AS83] D. Angluin, C.H. Smith, Inductive inference: theory and methods, Computing Surveys 15, 237-269, 1983.
[Gi93] D. Gillies, Philosophy of Science in the Twentieth Century, Blackwell, 1993.
[Gi96] D. Gillies, Artificial Intelligence and Scientific Method, Oxford University Press, 1996.
[Go67] E.M. Gold, Language identification in the limit, Information and Control 10, 447-474, 1967.
[Ja91] K.P. Jantke, Monotonic and non-monotonic inductive inference, New Generation Computing 8, 349-460.
[LW94] S. Lange, P. Watson, Machine discovery in the presence of incomplete or ambiguous data, in S. Arikawa, K.P. Jantke (Eds.) Algorithmic Learning Theory, Proc. of the Fifth International Workshop on Algorithmic Learning Theory, Reinhardsbrunn, Germany, Springer LNAI 872, 438-452, 1994.
[LZ94] S. Lange, T. Zeugmann, Set-driven and rearrangement-independent learning of recursive languages, in S. Arikawa, K.P. Jantke (Eds.) Algorithmic Learning Theory, Proc. of the Fifth International Workshop on Algorithmic Learning Theory, Reinhardsbrunn, Germany, Springer LNAI 872, 453-468, 1994.
[MA93] Y. Mukouchi, S. Arikawa, Inductive inference machines that can refute hypothesis spaces, in K.P. Jantke, S. Kobayashi, E. Tomita, T. Yokomori (Eds.), Algorithmic Learning Theory, Proc. of the Fourth International Workshop on Algorithmic Learning Theory, Tokyo, Japan, Springer LNAI 744, 123-136, 1993.
[Po34] K.R. Popper, The Logic of Scientific Discovery, 1997 Routledge reprint of the 1959 Hutchinson translation of the German original.
[Po54] K.R. Popper, Degree of confirmation, British Journal for the Philosophy of Science 5, 143ff., 334, 359, 1954.
[Po57] K.R. Popper, A second note on degree of confirmation, British Journal for the Philosophy of Science 7, 350ff., 1957.
[Po63] K.R. Popper, Conjectures and Refutations, Routledge, 1963 (Fifth Edition, 1989).
[Wa99] P. Watson, Inductive Learning with Corroboration, Technical Report no. 6-99, Department of Computer Science, University of Kent at Canterbury, May 1999. Obtainable from http://www.cs.ukc.ac.uk/pubs/1999/782.
[WC80] K. Wexler, P. Culicover, Formal Principles of Language Acquisition, MIT Press, Cambridge, MA, 1980.
Flattening and Implication

Kouichi Hirata

Department of Artificial Intelligence, Kyushu Institute of Technology, Kawazu 680-4, Iizuka 820-8502, Japan
hirata@ai.kyutech.ac.jp
Abstract. Flattening is a method to make a definite clause function-free. For a definite clause C, flattening replaces every occurrence of a term f(t1, ..., tn) in C with a new variable v and adds an atom pf(t1, ..., tn, v), with the predicate symbol pf associated with f, to the body of C. Here, we denote the resulting function-free definite clause from C by flat(C). In this paper, we discuss the relationship between flattening and implication. For a definite program Π and a definite clause D, it is known that if flat(Π) ⊨ flat(D) then Π ⊨ D, where flat(Π) is the set of flat(C) for each C ∈ Π. First, we show that the converse of this statement does not hold even if Π = {C}, that is, there exist definite clauses C and D such that C ⊨ D but flat(C) ⊭ flat(D). Furthermore, we investigate the conditions on C and D satisfying that C ⊨ D if and only if flat(C) ⊨ flat(D). Then, we show that the statement holds if (1) C is not self-resolving and D is not tautological, (2) D is not ambivalent, or (3) C is singly recursive.
1 Introduction
The purpose of Inductive Logic Programming is to find a hypothesis that explains a given sample. It is a normal setting of Inductive Logic Programming that a hypothesis is a definite clause or a definite program and a sample is a set of (labeled) ground definite clauses. In this setting, the word "explain" is interpreted as either "subsume" (denoted by ⪰) or "imply" (denoted by ⊨). In the latter case, note that the problem of whether or not a definite clause C implies another definite clause, called an implication problem, is undecidable in general [8]. On the other hand, if C is function-free, then it is obvious that the implication problem is decidable. Flattening, which was first introduced in the context of Inductive Logic Programming by Rouveirol [14] (though similar ideas had already been used in other fields), is a method to make a definite clause function-free. For a definite clause C, flattening replaces every occurrence of a term f(t1, ..., tn) in C with a new variable v and adds an atom pf(t1, ..., tn, v), with the predicate symbol pf associated with f, to the body of C. Additionally, the unit clause

This work is partially supported by Japan Society for the Promotion of Science, Grants-in-Aid for Encouragement of Young Scientists 11780284.

O. Watanabe, T. Yokomori (Eds.): ALT'99, LNAI 1720, pp. 157-168, 1999.
© Springer-Verlag Berlin Heidelberg 1999
158
Kouichi Hirata
pf (x1 xn f (x1 xn )) is introduced to the back round theory for each function symbol f in C. We denote the resultin function-free de nite clause by flat (C) and the set of unit clauses by defs(C). Rouveirol [14] has investi ated the several properties of flattenin . Mu leton [9,11] has dealt with flattenin in order to characterize his invertin implication. De Raedt and Dzeroski [2] have analyzed their PAC-learnability of jk-clausal theories by transformin possibly in nite Herbrand models into approximately nite models accordin to flattenin . Recently, Nienhuys-Chen and de Wolf [13] have studied the properties of flattenin with sophisticated discussion. Rouveirol [14] (and Nienhuys-Chen and de Wolf [13]) has shown that flattenin preserves subsumption: Let C and D be de nite clauses. Then, it holds that: C
D if and only if flat (C)
flat (D).
Also, Rouveirol [14] (and Nienhuys-Cheng and de Wolf [13]) has claimed that flattening preserves implication: Let Π be a definite program {C1,...,Cn} and D be a definite clause. We denote flat(C1) ∪ ... ∪ flat(Cn) by flat(Π) and defs(C1) ∪ ... ∪ defs(Cn) by defs(Π), respectively. Then, Rouveirol's Theorem is described as follows:

Π ⊨ D if and only if flat(Π) ∪ defs(Π) ⊨ flat(D).
As a relationship between flattening and implication stronger than Rouveirol's Theorem, Nienhuys-Cheng and de Wolf [13] have shown the following theorem:

If flat(Π) ⊨ flat(D), then Π ⊨ D.
If the converse of this theorem held, then several learning techniques for propositional logic, such as [1,3], could be directly applied to Inductive Logic Programming. On the other hand, if the converse held, then the implication problem Π ⊨ D would be decidable, because flat(Π) and flat(D) are function-free. However, this contradicts the undecidability of the implication problem [8,15] and of the satisfiability problem [5].

In this paper, we show that the converse does not hold even if Π = {C}, that is, there exist definite clauses C and D such that:

C ⊨ D but flat(C) ⊭ flat(D).

Furthermore, we investigate conditions on C and D under which C ⊨ D if and only if flat(C) ⊨ flat(D). Gottlob [4] has introduced the concepts of self-resolving and ambivalent clauses. A definite clause C is self-resolving if C resolves with a copy of C, and ambivalent if there exists an atom in the body of C with the same predicate symbol as the head of C. As a corollary of Gottlob's results [4], we show that, if C is not self-resolving and D is not tautological, or if D is not ambivalent, then the statement holds. Furthermore, note that the C in the counterexample stated above is given as a doubly recursive definite clause, that is, the body of C contains two atoms that are unifiable with the head of a variant of C. Then, we show that, if C is singly recursive, that is, the body of C contains at most one atom that is unifiable with the head of a variant of C, then the statement also holds.
2 Preliminaries
A literal is an atom or the negation of an atom. A positive literal is an atom and a negative literal is the negation of an atom. A clause is a finite set of literals. A unit clause is a clause consisting of exactly one positive literal. A definite clause is a clause containing exactly one positive literal (and any number of negative literals). A set of definite clauses is called a definite program. Conventionally, a definite clause is represented as A ← A1,...,Am, where A and the Ai (1 ≤ i ≤ m) are atoms.

Let C be a definite clause A ← A1,...,Am. Then, the atom A is called the head of C and denoted by head(C), and the sequence A1,...,Am of atoms is called the body of C and denoted by body(C).

Let C and D be definite clauses. We say that C subsumes D, denoted by C ≥ D, if there exists a substitution θ such that Cθ ⊆ D, i.e., every literal in Cθ also appears in D. Also, we say that C implies D, or D is a logical consequence of C, denoted by C ⊨ D, if every model of C is also a model of D. C is logically equivalent to D, denoted by C ≡ D, if C ⊨ D and D ⊨ C. For definite programs Π and Σ, Π ≥ D, Π ⊨ D and Π ⊨ Σ are defined similarly.

Let C and D be two clauses L1 ∨ ... ∨ Ll and M1 ∨ ... ∨ Mm which have no variables in common. If the substitution θ is an mgu for the set {Li, ¬Mj}, then the clause ((C − Li) ∨ (D − Mj))θ is called a (binary) resolvent of C and D. The set of all resolvents of C and D is denoted by Res(C,D).

Let Π be a definite program and C be a definite clause. An SLD-derivation of C from Π is a sequence (R1,C0,θ1),...,(Rk,Ck−1,θk) such that R0 ∈ Π, Ci−1 is a variant of an element of Π, Ri ∈ Res(Ri−1,Ci−1), and Rk = C, where θi is an mgu of the selected literals of Ri−1 and Ci−1 for each 1 ≤ i ≤ k. If an SLD-derivation of C from Π exists, we write Π ⊢ C. In particular, {C} ⊢ D is denoted by C ⊢ D.

Theorem 1 (Subsumption Theorem [13]). Let Π be a definite program and D be a definite clause. Then, Π ⊨ D if and only if there exists a definite clause E such that Π ⊢ E and E ≥ D.

For a definite clause C, the lth self-resolving closure of C, denoted by S^l(C), is defined inductively as follows:
1. S^0(C) = {C},
2. S^l(C) = S^{l−1}(C) ∪ {R ∈ Res(C,D) | D ∈ S^{l−1}(C)} (l ≥ 1).
Here, logically equivalent clauses are regarded as identical. Note that C ⊢ D if and only if D ∈ S^l(C) for some l ≥ 0. Then:

Corollary 2 (Implication between Definite Clauses [12]). Let C and D be definite clauses. Then, C ⊨ D if and only if there exists a definite clause E such that E ∈ S^l(C) and E ≥ D for some l ≥ 0.
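Because flat(C) and flat(D) are function-free, the test E ≥ D in Corollary 2 is a finite search for a substitution θ with Eθ ⊆ D. The following Python sketch is illustrative only (not from the paper): it encodes an atom as a tuple whose first element is the predicate symbol, a variable as an uppercase string, and a clause as a list of atoms of the same polarity (literal signs are left implicit).

```python
from itertools import product

def is_var(t):
    return isinstance(t, str) and t[:1].isupper()

def match(atom, target, theta):
    # extend theta so that atom instantiated by theta equals target, else None
    if atom[0] != target[0] or len(atom) != len(target):
        return None
    theta = dict(theta)
    for a, b in zip(atom[1:], target[1:]):
        if is_var(a):
            if theta.setdefault(a, b) != b:
                return None
        elif a != b:
            return None
    return theta

def subsumes(c, d):
    # C >= D iff one substitution maps every atom of C onto some atom of D
    for targets in product(d, repeat=len(c)):
        theta = {}
        for atom, tgt in zip(c, targets):
            theta = match(atom, tgt, theta)
            if theta is None:
                break
        if theta is not None:
            return True
    return False
```

For instance, subsumes([('p','X','Y'), ('p','Y','Z')], [('p','a','b'), ('p','b','c')]) holds with θ = {X/a, Y/b, Z/c}, while subsumes([('p','X','X')], [('p','a','b')]) fails. The brute-force search is exponential in clause length, which is acceptable only for clauses of the size appearing in this paper.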
For each n-ary function symbol f, the associated (n+1)-ary predicate symbol pf, called a flattened predicate symbol (on f), is introduced uniquely in the process of flattening. Also, we call a definite clause C or a definite program Π regular if C or Π contains no flattened predicate symbols.

Let C be a definite clause, t be a term appearing in C, and v be a variable not appearing in C. Then, C{v/t} denotes the definite clause obtained from C by replacing all occurrences of t in C with v.

There exist several (but equivalent) variants of the definition of flattening:
1. Do we introduce an equality theory [9,14] or not [2,13]?
2. Do we transform a constant symbol to an atom with a unary predicate symbol [2,14] or not [13]?
As the definition of flattening, we adopt a definition similar to that of De Raedt and Džeroski [2], which introduces no equality theory and does not transform constant symbols.

Let C be a definite clause. Then, the flattened clause flat(C) of C is defined as follows:

flat(C) = C, if C is function-free;
flat(C) = flat(C′), if a term t = f(t1,...,tn) (n ≥ 1) appears in C, where C′ = C{v/t} ∪ {¬pf(t1,...,tn,v)} and each ti (1 ≤ i ≤ n) is a variable or a constant.

Also, defs(C) is the set {pf(x1,...,xn,f(x1,...,xn)) ← | f(t1,...,tn) appears in C} of unit clauses. Furthermore, the number of calls of flat that is necessary to obtain the function-free clause flat(C) of C is called the rank of C and denoted by rank(C).

For a definite program Π = {C1,...,Cn}, we define flat(Π) and defs(Π) as follows:
flat(Π) = flat(C1) ∪ ... ∪ flat(Cn),
defs(Π) = defs(C1) ∪ ... ∪ defs(Cn).
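The recursive definition of flat and defs can be implemented directly. The sketch below is an illustrative rendering (not the paper's code): a compound term f(t1,...,tn) is the tuple ('f', t1, ..., tn), variables are uppercase strings, constants lowercase strings, and the flattened predicate pf is written 'p_f'.

```python
def flatten_clause(head, body):
    """Return (flat head, flat body, defs): the function-free clause
    flat(C) together with the unit clauses defs(C)."""
    extra = []                 # pf atoms produced by flattening
    arity = {}                 # function symbol -> arity, for defs(C)
    fresh = iter(f"V{i}" for i in range(1, 10**6))

    def flat_term(t):
        if not isinstance(t, tuple):
            return t                          # variable or constant
        f, *args = t
        args = [flat_term(a) for a in args]   # innermost subterms first
        v = next(fresh)                       # t is replaced by v ...
        extra.append((f"p_{f}", *args, v))    # ... and p_f(t1,...,tn,v) added
        arity[f] = len(args)
        return v

    def flat_atom(atom):
        p, *args = atom
        return (p, *[flat_term(a) for a in args])

    new_head = flat_atom(head)
    new_body = [flat_atom(a) for a in body] + extra
    defs = []
    for f, n in arity.items():
        xs = [f"X{i}" for i in range(1, n + 1)]
        defs.append((f"p_{f}", *xs, (f, *xs)))  # p_f(x1,...,xn,f(x1,...,xn))
    return new_head, new_body, defs
```

Applied to C = p(f(x1),f(x2)) ← p(x1,x3),p(x3,x2), this yields p(V1,V2) ← p(X1,X3),p(X3,X2),p_f(X1,V1),p_f(X2,V2) and defs(C) = {p_f(X1,f(X1)) ←}, which agrees, up to variable renaming, with the flat(C) computed in Section 3.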
3 Flattening and Implication
As the relationship between flattening and subsumption, Rouveirol [14] (and Nienhuys-Cheng and de Wolf [13]) has shown the following theorem:

Theorem 3 (Rouveirol [14], Nienhuys-Cheng & de Wolf [13]). Let C and D be regular definite clauses. Then, C ≥ D if and only if flat(C) ≥ flat(D).

Also, Rouveirol [14] (and Nienhuys-Cheng and de Wolf [13]) has proposed the following relationship between flattening and implication. Let Π be a regular definite program and D be a regular definite clause. Then, Rouveirol's Theorem is described as follows:

Theorem 4 (Rouveirol [14], Nienhuys-Cheng & de Wolf [13]). Let Π be a regular definite program and D be a regular definite clause. Then, Π ⊨ D if and only if flat(Π) ∪ defs(Π) ⊨ flat(D).

In the Appendix, we discuss the proof of Rouveirol's Theorem.
Furthermore, Nienhuys-Cheng and de Wolf [13] have shown the following theorem, which is a stronger relationship between flattening and implication than Rouveirol's Theorem:

Theorem 5 (Nienhuys-Cheng & de Wolf [13]). Let Π be a regular definite program and D be a regular definite clause. If flat(Π) ⊨ flat(D), then Π ⊨ D.

On the other hand, the converse of Theorem 5 does not hold even if Π = {C}:
Theorem 6. There exist regular definite clauses C and D such that C ⊨ D but flat(C) ⊭ flat(D).

Proof. Let C and D be the following regular definite clauses:

C = p(f(x1),f(x2)) ← p(x1,x3),p(x3,x2)
D = p(f(f(x1)),f(f(x2))) ← p(x1,x3),p(x3,x4),p(x4,x5),p(x5,x2)

By resolving C with a copy of C itself twice, it holds that C ⊢ D, as shown in Figure 1. Hence, it holds that C ⊨ D.

C = p(f(x1),f(x2)) ← p(x1,x3),p(x3,x2)
C = p(f(y1),f(y2)) ← p(y1,y3),p(y3,y2)
  {f(y1)/x1, f(y2)/x3}
p(f(f(y1)),f(x2)) ← p(f(y2),x2),p(y1,y3),p(y3,y2)
C = p(f(z1),f(z2)) ← p(z1,z3),p(z3,z2)
  {z1/y2, f(z2)/x2}
p(f(f(y1)),f(f(z2))) ← p(y1,y3),p(y3,z1),p(z1,z3),p(z3,z2)
= p(f(f(x1)),f(f(x2))) ← p(x1,x3),p(x3,x4),p(x4,x5),p(x5,x2)

Fig. 1. The SLD-derivation of D from C

On the other hand, flat(C) and flat(D) are constructed as follows:

flat(C) = p(x1,x2) ← p(x3,x4),p(x4,x5),pf(x3,x1),pf(x5,x2)
flat(D) = p(x1,x2) ← p(x3,x4),p(x4,x5),p(x5,x6),p(x6,x7),pf(x3,x8),pf(x8,x1),pf(x7,x9),pf(x9,x2)
The first and second self-resolving closures of flat(C) are constructed as in Figure 2. By paying attention to the number of atoms with the predicate pf and their relation, it holds, by Corollary 2, that flat(C) ⊨ flat(D) if and only if there exists a definite clause E ∈ S^2(flat(C)) such that E ≥ flat(D). However, there exists no definite clause E ∈ S^2(flat(C)) such that E ≥ flat(D). Hence, it holds that flat(C) ⊭ flat(D). □

S^1(flat(C)) = {flat(C)} ∪ {
p(x1,x2) ← p(x3,x4),p(x5,x6),p(x6,x7),pf(x5,x8),pf(x8,x1),pf(x4,x2),pf(x7,x3),
p(x1,x2) ← p(x3,x4),p(x5,x6),p(x6,x7),pf(x3,x1),pf(x7,x8),pf(x8,x2),pf(x5,x4) }
S^2(flat(C)) = S^1(flat(C)) ∪ {
p(x1,x2) ← p(x3,x4),p(x4,x5),p(x6,x7),p(x7,x8),pf(x3,x9),pf(x9,x1),pf(x8,x10),pf(x10,x2),pf(x5,x11),pf(x6,x11),
p(x1,x2) ← p(x3,x4),p(x5,x6),p(x7,x8),p(x8,x9),pf(x7,x10),pf(x10,x11),pf(x11,x1),pf(x4,x2),pf(x6,x3),pf(x9,x5),
p(x1,x2) ← p(x3,x4),p(x5,x6),p(x7,x8),p(x8,x9),pf(x5,x10),pf(x10,x1),pf(x4,x2),pf(x7,x6),pf(x9,x11),pf(x11,x3),
p(x1,x2) ← p(x3,x4),p(x5,x6),p(x7,x8),p(x8,x9),pf(x3,x1),pf(x6,x10),pf(x10,x2),pf(x7,x11),pf(x11,x4),pf(x9,x5),
p(x1,x2) ← p(x3,x4),p(x5,x6),p(x7,x8),p(x8,x9),pf(x3,x1),pf(x9,x10),pf(x10,x11),pf(x11,x2),pf(x5,x4),pf(x7,x6) }

Fig. 2. The first and second self-resolving closures of flat(C)

For the definite clauses C and D given in Theorem 6, it holds that flat(C) ∪ defs(C) ⊢ flat(D), as shown in Figure 3, so it holds that flat(C) ∪ defs(C) ⊨ flat(D).
4 Improvement
In this section, we investigate conditions on definite clauses C and D under which C ⊨ D if and only if flat(C) ⊨ flat(D). First, we give the following lemma by Gottlob [4]. A definite clause C is self-resolving if C resolves with a copy of C. A definite clause C is ambivalent if there exists an atom in body(C) with the same predicate symbol as head(C). Then:

Lemma 7 (Gottlob [4]). Let C and D be definite clauses.
1. Suppose that C is not self-resolving and D is not tautological. Then, C ⊨ D if and only if C ≥ D.
2. Suppose that D is not ambivalent. Then, C ⊨ D if and only if C ≥ D.
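Both of Gottlob's conditions are syntactic, and ambivalence in particular is a one-line check. The snippet below is a small illustration with invented path/edge clauses (not examples from the paper); note that a clause can only resolve with its own copy if its head predicate recurs in its body, so a non-ambivalent clause is automatically non-self-resolving.

```python
# A clause is (head, body) with atoms encoded as tuples whose first
# element is the predicate symbol.

def is_ambivalent(clause):
    head, body = clause
    return any(atom[0] == head[0] for atom in body)

# Non-ambivalent: "path" never recurs in the body; with this clause as D,
# Lemma 7.2 applies, so C |= D coincides with C >= D for every C.
base = (("path", "X", "Y"), [("edge", "X", "Y")])

# Ambivalent (and indeed self-resolving): "path" recurs in the body.
step = (("path", "X", "Y"), [("edge", "X", "Z"), ("path", "Z", "Y")])
```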
flat(C) = p(x1,x2) ← p(x3,x4),p(x4,x5),pf(x3,x1),pf(x5,x2)
flat(C) = p(y1,y2) ← p(y3,y4),p(y4,y5),pf(y3,y1),pf(y5,y2)
  {y1/x3, y2/x4}
p(x1,x2) ← p(y2,x5),p(y3,y4),p(y4,y5),pf(y1,x1),pf(x5,x2),pf(y3,y1),pf(y5,y2)
flat(C) = p(z1,z2) ← p(z3,z4),p(z4,z5),pf(z3,z1),pf(z5,z2)
  {z1/y2, z2/x5}
p(x1,x2) ← p(y3,y4),p(y4,y5),p(z3,z4),p(z4,z5),pf(y1,x1),pf(z2,x2),pf(y3,y1),pf(y5,z1),pf(z3,z1),pf(z5,z2)
pf(x,f(x)) ←
  {f(z3)/z1}
p(x1,x2) ← p(y3,y4),p(y4,y5),p(z3,z4),p(z4,z5),pf(y1,x1),pf(z2,x2),pf(y3,y1),pf(y5,f(z3)),pf(z5,z2)
pf(x,f(x)) ←
  {y5/z3}
p(x1,x2) ← p(y3,y4),p(y4,y5),p(y5,z4),p(z4,z5),pf(y1,x1),pf(z2,x2),pf(y3,y1),pf(z5,z2)
= p(x1,x2) ← p(x3,x4),p(x4,x5),p(x5,x6),p(x6,x7),pf(x8,x1),pf(x9,x2),pf(x3,x8),pf(x7,x9)
= flat(D)

Fig. 3. The SLD-derivation of flat(D) from flat(C) ∪ defs(C) in Theorem 6
By incorporating Lemma 7 with the previous theorems, we obtain the following corollary:

Corollary 8. Let C and D be regular definite clauses.
1. Suppose that C is not self-resolving and D is not tautological. Then, C ⊨ D if and only if flat(C) ⊨ flat(D).
2. Suppose that D is not ambivalent. Then, C ⊨ D if and only if flat(C) ⊨ flat(D).

Proof. 1. By Lemma 7, C ⊨ D if and only if C ≥ D. By Theorem 3, C ≥ D if and only if flat(C) ≥ flat(D). By the definitions of ≥ and ⊨, if flat(C) ≥ flat(D) then flat(C) ⊨ flat(D). So it holds that if C ⊨ D then flat(C) ⊨ flat(D). Hence, the statement holds by Theorem 5.

2. By the definition of ambivalence, the predicate symbol of the head of D is different from all of the predicate symbols appearing in the body of D. This condition is preserved in flat(D), because the flattened predicate symbols, which do not appear in D, are introduced only into the body of D. Hence, flat(D) is not ambivalent. By Lemma 7 and Theorem 3, the statement is obvious. □

In Theorem 6, C is given as a doubly recursive definite clause, that is, body(C) contains two atoms that are unifiable with head(C′), where C′ is a variant of C. In the remainder of this section, we restrict the form of C to singly recursive clauses. Here, a definite clause C is singly recursive if body(C) contains at most one atom that is unifiable with head(C′), where C′ is a variant of C.

Lemma 9 (Gottlob [4]). If C ⊨ D, then head(C) ≥ head(D) and body(C) ≥ body(D).
Let C be a singly recursive definite clause. It is obvious that |S^l(C)| ≤ l+1 and |S^{l+1}(C) − S^l(C)| ≤ 1 for each l ≥ 0. Then, the lth self-resolvent C_l of C is defined inductively as follows:
1. C_0 = C,
2. C_{l+1} = D if S^{l+1}(C) − S^l(C) = {D}, and C_{l+1} = C_l otherwise.
Lemma 10. Let C be a singly recursive regular definite clause with function symbols. Suppose that C contains a term t = f(t1,...,tn), where each ti is either a variable or a constant. Also, let C′ be the definite clause C{v/t} ∪ {¬pf(t1,...,tn,v)}. Then, it holds that flat(C_l) ≡ flat(C′_l) for each l ≥ 0.

Proof. We show the statement by induction on l. If l = 0, then the statement is obvious, since C_0 = C, C′_0 = C′ and flat(C) = flat(C′).

Suppose that the statement holds for l ≤ k. It is sufficient to show the case that C is of the form p(t) ← p(s). Consider C_{k+1} and C′_{k+1}. By the definition of the (k+1)th self-resolvent, C_{k+1} is a resolvent of C and C_k, and C′_{k+1} is a resolvent of C′ and C′_k. Then, we can suppose that C_{k+1} is of the form (head(C) ← body(C_k))θσ, where θ is an mgu of head(C_k) and body(C), and σ is a renaming substitution. Hence, C′_{k+1} is of the form (head(C′) ← body(C′_k),pf(t1,...,tn,v))θ′σ′, where θ′ is the substitution obtained from θ by replacing each binding t/x in θ with v/x, and σ′ is the renaming substitution obtained by adding the binding v/u (u a new variable) to σ. By the induction hypothesis, it holds that flat(C_k) ≡ flat(C′_k) and flat(C) ≡ flat(C′). By the forms of C_{k+1} and C′_{k+1}, it holds that flat(C_{k+1}) ≡ flat(C′_{k+1}). □

Lemma 11. For a singly recursive definite clause C, it holds that flat(C_l) ≡ (flat(C))_l for each l ≥ 0.

Proof. We show the statement by induction on rank(C). If rank(C) = 0, then the statement is obvious, because flat(C_l) = C_l and flat(C) = C for each l ≥ 0.

Suppose that the statement holds for every C such that rank(C) ≤ k, and let C be a singly recursive definite clause such that rank(C) = k+1. Since C contains some function symbols, suppose that C contains the term t = f(t1,...,tn), where each ti is either a variable or a constant. Let C′ be the definite clause C{v/t} ∪ {¬pf(t1,...,tn,v)}. Then, it holds that flat(C′) = flat(C) and rank(C′) = k. By Lemma 10, it holds that flat(C′_l) ≡ flat(C_l) for each l ≥ 0. By the induction hypothesis, it holds that flat(C_l) ≡ flat(C′_l) ≡ (flat(C′))_l ≡ (flat(C))_l for each l ≥ 0. Hence, the statement holds for rank(C) = k+1. □

Lemma 12. For a singly recursive definite clause C, it holds that flat(C) ⊨ flat(C_l) for each l ≥ 0.

Proof. We show the statement by induction on l. If l = 0, then C_0 = C, so the statement is obvious. Suppose that the statement holds for l ≤ k. Since C_{k+1} is a resolvent of C and C_k, by Lemma 11, flat(C_{k+1}) is a resolvent of flat(C) and flat(C_k). By the soundness of SLD-resolution (cf. [7,13]), it holds that {flat(C), flat(C_k)} ⊨ flat(C_{k+1}). By the induction hypothesis, it holds that flat(C) ⊨ flat(C_k). Hence, it holds that flat(C) ⊨ flat(C_{k+1}), so the statement holds for l = k+1. □

Theorem 13. Let C be a singly recursive regular definite clause and D be a regular definite clause. Then, C ⊨ D if and only if flat(C) ⊨ flat(D).

Proof. By Theorem 5, it is sufficient to show the only-if direction. We show it by induction on rank(D). If rank(D) = 0, that is, D is function-free, then so is C by Lemma 9. Then, flat(C) = C and flat(D) = D, so the statement is obvious.

Suppose that the statement holds for every D such that rank(D) ≤ k, and let D be a regular definite clause such that rank(D) = k+1. Since D contains some function symbols, suppose that D contains a term t = f(t1,...,tn), where each ti (1 ≤ i ≤ n) is a variable or a constant. Also, let D′ be the definite clause D{v/t} ∪ {¬pf(t1,...,tn,v)}. Then, rank(D′) = k and flat(D′) = flat(D). Suppose that C ⊨ D. Then, by Corollary 2 and the definition of the lth self-resolvent, there exists an index l ≥ 0 such that C_l ≥ D. As in the proof of Lemma 19.6 in [13], we can construct a definite clause C′ from C_l such that C′ ≥ D′ and flat(C_l) ≡ flat(C′), as follows. Suppose C_lθ ⊆ D′.

Let {s1,...,sm} be the set of distinct terms occurring in C_l such that siθ = t. If si is a variable, then replace the binding t/si in θ with v/si. If si is of the form f(r1,...,rn), in which case the rj are variables or constants, then replace all occurrences of si in C_l with a new variable vi, add ¬pf(r1,...,rn,vi) to C_l, and add the binding v/vi to θ. We call the definite clause resulting from these m adjustments C′. Finally, replace all occurrences of t in bindings in θ with v, and call the resulting substitution θ′. Then, it holds that C′θ′ ⊆ D′, so C′ ≥ D′. Hence, C′ ⊨ D′.

Furthermore, by the construction of C′, it holds that flat(C_l) ≡ flat(C′). By the induction hypothesis, it holds that flat(C′) ⊨ flat(D′), so flat(C_l) ⊨ flat(D). By Lemma 12, it holds that flat(C) ⊨ flat(D). Hence, the statement holds for rank(D) = k+1. □
5 Conclusion
In this paper, we have investigated the relationship between flattening and implication [13,14]. Let Π be a regular definite program and C and D be definite clauses. As a relationship between flattening and implication stronger than Rouveirol's Theorem [14], Nienhuys-Cheng and de Wolf [13] have shown the following theorem:

If flat(Π) ⊨ flat(D), then Π ⊨ D.

In this paper, we have shown that there exist definite clauses C and D such that:

C ⊨ D but flat(C) ⊭ flat(D).

Furthermore, we have shown that if C and D satisfy one of the following conditions, then it holds that C ⊨ D if and only if flat(C) ⊨ flat(D):
1. C is not self-resolving and D is not tautological,
2. D is not ambivalent,
3. C is singly recursive.

The class of definite clauses for which flattening preserves implication corresponds to the class for which the implication problem is decidable [4,6,7], and the class of definite clauses for which flattening does not preserve implication in the above sense corresponds to the class for which the implication problem is undecidable [8,15]. It is future work to investigate further the relationship between the class of definite clauses for which flattening preserves implication and the class for which implication is decidable.
Acknowledgment. The author would like to thank the anonymous referees for their valuable comments.
References
1. Angluin, D., Frazier, M. and Pitt, L.: Learning conjunctions of Horn clauses, Machine Learning 9, 147-164, 1992.
2. De Raedt, L. and Džeroski, S.: First-order jk-clausal theories are PAC-learnable, Artificial Intelligence 70, 375-392, 1994.
3. Frazier, M. and Pitt, L.: Learning from entailment: An application to propositional Horn sentences, Proc. 10th International Conference on Machine Learning, 120-127, 1993.
4. Gottlob, G.: Subsumption and implication, Information Processing Letters 24, 109-111, 1987.
5. Hanschke, P. and Würtz, J.: Satisfiability of the smallest binary program, Information Processing Letters 45, 237-241, 1993.
6. Leitsch, A.: Implication algorithms for classes of Horn clauses, Statistik, Informatik und Ökonomie, Springer, 172-189, 1988.
7. Leitsch, A.: The resolution calculus, Springer-Verlag, 1997.
8. Marcinkowski, J. and Pacholski, L.: Undecidability of the Horn-clause implication problem, Proc. 33rd Annual IEEE Symposium on Foundations of Computer Science, 354-362, 1992.
9. Muggleton, S.: Inverting implication, Proc. 2nd International Workshop on Inductive Logic Programming, ICOT Technical Memorandum TM-1182, 1992.
10. Muggleton, S. (ed.): Inductive logic programming, Academic Press, 1992.
11. Muggleton, S.: Inverse entailment and Progol, New Generation Computing 13, 245-286, 1995.
12. Muggleton, S. and Page Jr., C. D.: Self-saturation of definite clauses, Proc. 4th International Workshop on Inductive Logic Programming, 162-174, 1994.
13. Nienhuys-Cheng, S.-H. and de Wolf, R.: Foundations of inductive logic programming, Lecture Notes in Artificial Intelligence 1228, 1997.
14. Rouveirol, C.: Extensions of inversion of resolution applied to theory completion, in [10], 63-92.
15. Schmidt-Schauss, M.: Implication of clauses is undecidable, Theoretical Computer Science 59, 287-296, 1988.
Appendix: Rouveirol's Theorem

For Rouveirol's Theorem, it is clear that Rouveirol's original proof [14] is insufficient. On the other hand, Nienhuys-Cheng and de Wolf [13] have shown Rouveirol's Theorem as a consequence of Theorem 5 and the following lemma:

Lemma 14. Let Π be a regular definite program. Then, flat(Π) ∪ defs(Π) ⊨ Π.

However, we obtain the following theorem:

Theorem 15. There exist regular definite clauses C and D such that C ⊢ D but flat(C) ∪ defs(C) ⊬ flat(D).
Proof. Let C and D be the following regular definite clauses:

C = p(f(x1,x3),f(x3,x2)) ← p(x1,x3),p(x3,x2)
D = p(f(f(x1,x2),f(x2,x3)),f(f(x2,x3),f(x3,x4))) ← p(x1,x2),p(x2,x3),p(x2,x3),p(x3,x4)

By resolving C with C itself twice, we can show that C ⊢ D. On the other hand, flat(C) and flat(D) are constructed as follows:

flat(C) = p(x1,x2) ← p(x3,x4),p(x4,x5),pf(x3,x4,x1),pf(x4,x5,x2)
flat(D) = p(x1,x2) ← p(x3,x4),p(x4,x5),p(x4,x5),p(x5,x6),pf(x3,x4,x7),pf(x4,x5,x8),pf(x5,x6,x9),pf(x7,x8,x1),pf(x8,x9,x2)

Also, defs(C) = {pf(x,y,f(x,y)) ←}. The first and second self-resolving closures of flat(C) are constructed as in Figure 4. Note that flat(C) ∪ defs(C) ⊢ flat(D) if and only if there exists a definite clause E ∈ S^2(flat(C)) such that flat(D) is obtained by resolving E
with pf(x,y,f(x,y)) ← some number of times. Then, we cannot obtain such an E from any element of S^2(flat(C)) except E1 and E2. Furthermore, the resolvent of Ei (i = 1,2) with pf(x,y,f(x,y)) ← twice, where the selected atoms in Ei are the atoms whose third argument is x10, contains a term with f. Hence, it holds that flat(C) ∪ defs(C) ⊬ flat(D). □

S^1(flat(C)) = {flat(C)} ∪ {
p(x1,x2) ← p(x3,x4),p(x5,x6),p(x6,x7),pf(x5,x6,x8),pf(x8,x3,x1),pf(x3,x4,x2),pf(x6,x7,x3),
p(x1,x2) ← p(x3,x4),p(x5,x6),p(x6,x7),pf(x3,x4,x1),pf(x6,x7,x8),pf(x4,x8,x2),pf(x5,x6,x4) }
S^2(flat(C)) = S^1(flat(C)) ∪ {
E1: p(x1,x2) ← p(x3,x4),p(x4,x5),p(x6,x7),p(x7,x8),pf(x3,x4,x9),pf(x9,x10,x1),pf(x7,x8,x11),pf(x10,x11,x2),pf(x4,x5,x10),pf(x6,x7,x10),
E2: p(x1,x2) ← p(x3,x4),p(x4,x5),p(x6,x7),p(x7,x8),pf(x3,x4,x10),pf(x9,x10,x1),pf(x4,x5,x11),pf(x10,x11,x2),pf(x7,x8,x10),pf(x6,x7,x9),
p(x1,x2) ← p(x3,x4),p(x5,x6),p(x7,x8),p(x8,x9),pf(x8,x3,x1),pf(x3,x4,x2),pf(x11,x5,x10),pf(x5,x6,x3),pf(x7,x8,x11),pf(x8,x9,x5),
p(x1,x2) ← p(x3,x4),p(x5,x6),p(x7,x8),p(x8,x9),pf(x10,x3,x1),pf(x3,x4,x2),pf(x5,x6,x10),pf(x6,x11,x3),pf(x3,x8,x6),pf(x8,x9,x11),
p(x1,x2) ← p(x3,x4),p(x5,x6),p(x7,x8),p(x8,x9),pf(x3,x4,x1),pf(x4,x10,x2),pf(x11,x5,x4),pf(x5,x6,x10),pf(x7,x8,x11),pf(x8,x9,x5),
p(x1,x2) ← p(x3,x4),p(x5,x6),p(x7,x8),p(x8,x9),pf(x3,x4,x1),pf(x4,x10,x2),pf(x5,x6,x4),pf(x6,x11,x10),pf(x7,x8,x6),pf(x8,x9,x11) }

Fig. 4. The first and second self-resolving closures of flat(C)

Hence, we cannot directly conclude Rouveirol's Theorem from Theorem 5 and Lemma 14. Note that the definite clauses C and D in Theorem 15 are not a counterexample to the if-direction of Rouveirol's Theorem, because E1 and E2 subsume flat(D) by the following substitutions θ1 and θ2:
θ1 = {x4/x6, x5/x7, x6/x8, x7/x9, x8/x10, x9/x11}
θ2 = {x4/x3, x5/x4, x6/x5, x4/x7, x5/x8, x7/x9, x8/x10, x9/x11}
Rouveirol's Theorem seems to be correct, but in view of Theorems 6 and 15 it is necessary to improve the proofs of [13,14].
Induction of Logic Programs Based on ψ-Terms

Yutaka Sasaki

NTT Communication Science Laboratories, 2-4 Hikaridai, Seika-cho, Soraku-gun, Kyoto 619-0237, Japan
sasaki@cslab.kecl.ntt.co.jp
Abstract. This paper extends the traditional inductive logic programming (ILP) framework to a ψ-term capable ILP framework. Aït-Kaci's ψ-terms have interesting and significant properties for markedly widening the applicable areas of ILP. For example, ψ-terms allow partial descriptions of information, generalization and specialization of sorts (or types) placed instead of function symbols, and abstract descriptions of data using sorts; they have representation power comparable to the feature structures used in natural language processing. We have developed an algorithm that learns logic programs based on ψ-terms, made possible by a bottom-up approach employing the least general generalization (lgg) extended for ψ-terms. As an area of application, we have selected information extraction (IE) tasks, in which sort information is crucial in deciding the generality of IE rules. Experiments were conducted on a set of test examples and background knowledge consisting of case frames of newspaper articles. The results showed high precision and recall rates for the learned rules on the IE tasks.
1 Introduction
In the traditional setting of inductive logic programming (ILP) [14], the input is a set of examples, which are usually ground instances, and background knowledge, which is a set of ground instances or logic programs. The output of ILP systems is a set of logic programs, such as pure Prolog programs. The form (i.e., language) of the output is called a hypothesis language. The task of ILP systems is to find, based on the background knowledge, good hypotheses that cover most positive examples and fewest negative examples (if any).

Previously, as one direction of extending the scope of the representation power of examples and of the hypothesis language of ILP, RHB+ [19] was presented for learning logic programs based on τ-terms, which are logic terms whose variables have sorts (or types). τ-terms, however, are a very restricted form of the terms used in LOGIN [1] and LIFE [3]. For example, in the previously proposed framework, a positive example expressing "Jack was injured" was represented as

injured(agent ⇒ Jack)

O. Watanabe, T. Yokomori (Eds.): ALT'99, LNAI 1720, pp. 169-181, 1999.
© Springer-Verlag Berlin Heidelberg 1999
using a feature (or attribute) agent. If Jack is defined as a sub-sort of people, this example could be generalized to

injured(agent ⇒ people)

However, the example

injured(agent ⇒ passenger(count ⇒ 10))

could not be generalized to

injured(agent ⇒ people(count ⇒ number))

This is because RHB+ would treat passenger as a function symbol and would not be able to generalize passenger to the sort people. This restricted the range of applications, which had to learn from data that preserved the original structures of natural language sentences.

This paper presents the design and algorithm of our new ILP system, which is capable of handling ψ-terms. After explaining the attractive points of ψ-terms, we formally define ψ-terms and explain their properties. Then the previous type-oriented learner is briefly described, and an algorithm for achieving an ILP system that learns logic programs based on ψ-terms is presented. As an application to test the feasibility of our system, information extraction (IE) is briefly introduced. After that, experimental results on IE tasks are shown. A discussion and conclusions conclude this paper.
2 Attractive Points of ψ-Terms
Owing to the introduction of features (or attributes) and sorts, ψ-terms enable the following advantages.

partial descriptions. For example, the term name(first ⇒ peter) expresses the information of a person whose first name is known. This is equivalent to name(first ⇒ peter, last ⇒ ⊤). In the process of unification [4], further features can possibly be added to the term.

dynamic generalization and specialization. Sorts, placed at the positions of function symbols, can be dynamically generalized and specialized. For example, person(id ⇒ name(first ⇒ person)) is a generalized form of man(id ⇒ name(first ⇒ Jack)).

abstract representation. Abstract representations of examples using sorts can reduce the amount of data. For example,

familiar(agent ⇒ person(residence ⇒ France), obj ⇒ French)

represents a number of ground instances, such as

familiar(agent ⇒ Serge(residence ⇒ France), obj ⇒ French)
coreference. Coreference enables the recursive representation of terms. For example, X : person(spouse ⇒ person(spouse ⇒ X)) refers to itself recursively.¹

NLP applicability. ψ-terms have the same representation power as the feature structures [8] used for formally representing the syntax and semantics of natural language sentences.
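The first two advantages can be made concrete with a small sketch of ψ-term unification over a toy sort tree. Everything below (the sort names, the PARENT table, the pair encoding (sort, {feature: subterm})) is invented for illustration; the actual definitions in [1,3] also handle coreference tags and arbitrary sort lattices.

```python
# Sorts as a child -> parent tree; "top" is the greatest sort.
PARENT = {"man": "person", "woman": "person", "person": "top",
          "peter": "top", "smith": "top"}

def leq(s, t):
    # s <= t: t is an ancestor of (or equal to) s
    while True:
        if s == t:
            return True
        if s not in PARENT:
            return False
        s = PARENT[s]

def inf(s, t):
    # infimum in a tree: one sort must be a subsort of the other
    if leq(s, t):
        return s
    if leq(t, s):
        return t
    return "bottom"

def unify(x, y):
    # x, y: (sort, {feature: subterm}); sorts meet, feature sets union,
    # so each side may contribute information the other leaves open.
    (sx, fx), (sy, fy) = x, y
    s = inf(sx, sy)
    if s == "bottom":
        return None                      # incompatible sorts: failure
    feats = {}
    for l in set(fx) | set(fy):
        if l in fx and l in fy:
            sub = unify(fx[l], fy[l])
            if sub is None:
                return None
            feats[l] = sub
        else:
            feats[l] = fx.get(l) or fy.get(l)
    return (s, feats)
```

Unifying person(first ⇒ peter) with man(last ⇒ smith) yields man(first ⇒ peter, last ⇒ smith): the sorts meet, and each side contributes a feature the other left open; man and woman fail to unify because their infimum is ⊥.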
3 ψ-Terms

Given a set of sorts S containing the sorts ⊤ and ⊥, a partial order ≤ on S such that ⊥ is the least and ⊤ is the greatest element, and features (i.e., labels or attributes) F, ψ-terms are defined as follows [2].

3.1 Ordered Sorts
Sorts S have a partial order ≤; s1 ≤ s2 means that s2 is more general than s1.² A set of sorts S must include the greatest element ⊤ and the least element ⊥. s ∨ t is defined as the supremum of the sorts s and t, and s ∧ t as their infimum; then, S forms a lattice [6]. If the given sort hierarchy is a tree without ⊤ and ⊥, we add r ≤ ⊤ for the root sort r and ⊥ ≤ l for each leaf sort l. As a special treatment, we distinguish constants from sorts when we have to distinguish them for ILP purposes, although constants are usually regarded as sorts. Formally, the constants C satisfy C ⊆ S, and for c ∈ C, t ≤ c implies t = c or t = ⊥.
3.2 Definition of ψ-Terms
Informally, ψ-terms are Prolog terms whose variables are replaced with variables Var of sort s, which is denoted as Var:s. Function symbols are also replaced with sorts. Terms have features (labels or attributes) for readability and for representing partial information. For example,

injured(agent ⇒ X : people(of ⇒ number))

is an atomic formula based on ψ-terms whose features are agent and of and whose sorts are people and number. The recursive definition of ψ-terms is as follows.

Definition 1 (ψ-terms). A ψ-term is either an untagged ψ-term or a tagged ψ-term.

Definition 2 (tagged ψ-term).
A variable is a tagged ψ-term. If X is a variable and t is an untagged ψ-term, X : t is a tagged ψ-term.

Definition 3 (untagged ψ-term).
A sort symbol is an untagged ψ-term. If s is a sort symbol, l1,...,ln are features and t1,...,tn are ψ-terms, s(l1 ⇒ t1,...,ln ⇒ tn) is an untagged ψ-term.

Definition 4 (atomic formula). If p is a predicate, l1,...,ln are features and t1,...,tn are ψ-terms, p(l1 ⇒ t1,...,ln ⇒ tn) is an atomic formula.

While the terms have features, they are compatible with the usual atomic formulae of Prolog: the first-order term notation p(t1,...,tn) is syntactic sugar for the ψ-term notation p(1 ⇒ t1,...,n ⇒ tn) [2].

¹ Variables are also used as coreference tags. This is one of the most elegant ways to represent coreference.
² This is defined as ψ2 ≤ ψ1 in [8].

3.3 Least General Generalization
In the definition of the least general generalization (lgg) [17], the part that defines the lgg of terms should be extended to the lgg of ψ-terms.³ Now, we operationally define the lgg of ψ-terms using the following notations: a and b represent untagged ψ-terms; s, t, and u represent ψ-terms; f, g, and h represent sorts; X, Y, and Z represent variables. Given t = f(l1 ⇒ t1,...,ln ⇒ tn), the li projection of t is defined as t.li = ti. For simplicity of the algorithm, we regard an untagged ψ-term a appearing without a variable as typing V : a, where V is a fresh variable. Note that a variable appearing nowhere typed is implicitly typed by ⊤.

Definition 5 (lgg of ψ-terms).
1. lgg(X : a, X : a) = X : a.
2. lgg(s, t) = u, where s ≠ t and the tuple (s, t, u) is in the history Hist.
3. If s = X : f(l1s ⇒ s1,...,lns ⇒ sn), t = Y : g(l1t ⇒ t1,...,lmt ⇒ tm) and s ≠ t, then lgg(s, t) = u, where L = {l1s,...,lns} ∩ {l1t,...,lmt} and, for the features li ∈ L, u = Z : h(l1 ⇒ lgg(s.l1, t.l1),...,l|L| ⇒ lgg(s.l|L|, t.l|L|)) with h = f ∨ g.⁴ (s, t, u) is added to Hist.

³ The lgg of ψ-terms has already been described in [1]. We present an algorithm as an extension of Plotkin's lgg. The lgg of feature terms, which are equivalent to ψ-terms, can be found in [16].
⁴ Here, ∨ means the supremum of two sorts.
Induction of Logic Programs Based on ψ-Terms    173

For example, the lgg of

injured(agent ⇒ passenger(of ⇒ 10))

and

injured(agent ⇒ men(of ⇒ 2))

is

injured(agent ⇒ people(of ⇒ number))
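To make Definition 5 and the example above concrete, here is a minimal Python sketch (not the author's implementation). Tagged ψ-terms are encoded as (variable, sort, feature-dict) triples; the sort hierarchy is a hypothetical hand-coded table covering only the running example, and the supremum ⊔ is computed as the most specific common ancestor.

```python
from itertools import count

# Hypothetical toy sort hierarchy (child -> parent); 'top' is the greatest sort.
PARENT = {"passenger": "people", "men": "people", "people": "top",
          "10": "number", "2": "number", "number": "top"}

def ancestors(s):
    chain = [s]
    while s in PARENT:
        s = PARENT[s]
        chain.append(s)
    return chain

def sup(a, b):
    """h = f 'join' g: the most specific common ancestor of two sorts."""
    common = set(ancestors(b))
    for s in ancestors(a):
        if s in common:
            return s
    return "top"

_fresh = count()

def lgg(s, t, hist=None):
    """lgg of tagged psi-terms (tag, sort, features) after Definition 5.
    The history Hist is approximated by a dict keyed on object identity."""
    if hist is None:
        hist = {}
    if s == t:                      # case 1: lgg(X:a, X:a) = X:a
        return s
    key = (id(s), id(t))            # case 2: reuse a remembered pair
    if key in hist:
        return hist[key]
    (_, fs, feats_s), (_, ft, feats_t) = s, t          # case 3
    shared = [l for l in feats_s if l in feats_t]      # L = common features
    u = ("Z%d" % next(_fresh), sup(fs, ft),
         {l: lgg(feats_s[l], feats_t[l], hist) for l in shared})
    hist[key] = u
    return u

# lgg of the argument terms of the example atoms:
s = ("X", "passenger", {"of": ("X1", "10", {})})
t = ("Y", "men", {"of": ("Y1", "2", {})})
u = lgg(s, t)   # sort 'people', with feature of => a term of sort 'number'
```

Applying the atom-level lgg of Definition 6 on top of this term lgg reproduces the injured(agent ⇒ people(of ⇒ number)) generalization shown above.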
Definition 6 (lgg of atoms based on ψ-terms) Let P and Q be atomic formulae (atoms). The lgg of atoms is defined as follows.
1. If P = p(l1^s ⇒ s1, ..., ln^s ⇒ sn) and Q = p(l1^t ⇒ t1, ..., lm^t ⇒ tm), then lgg(P, Q) = p(l1 ⇒ lgg(s.l1, t.l1), ..., l|L| ⇒ lgg(s.l|L|, t.l|L|)), where L = {l1^s, ..., ln^s} ∩ {l1^t, ..., lm^t}.
2. If P = p(l1^s ⇒ s1, ..., ln^s ⇒ sn), Q = q(l1^t ⇒ t1, ..., lm^t ⇒ tm) and p ≠ q, lgg(P, Q) is undefined.
Let P and Q be atoms and L1 and L2 be literals. The lgg of literals and clauses are defined as follows [12].

Definition 7 (lgg of literals based on ψ-terms)
1. If L1 and L2 are atomic, then lgg(L1, L2) is the lgg of atoms.
2. If L1 = ¬P and L2 = ¬Q, then lgg(L1, L2) = lgg(¬P, ¬Q) = ¬lgg(P, Q).
3. If L1 = ¬P and L2 = Q, or L1 = P and L2 = ¬Q, then lgg(L1, L2) is undefined.

Definition 8 (lgg of clauses) Let clauses c1 = {L1, ..., Ln} and c2 = {K1, ..., Km}. Then lgg(c1, c2) = {Lij = lgg(Li, Kj) | Li ∈ c1, Kj ∈ c2 and lgg(Li, Kj) is defined}.
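Definitions 7 and 8 reduce clause generalization to pairwise literal lggs. A small Python sketch (ours, not the paper's code) shows the mechanics; the atom lgg of Definition 6 is replaced by a deliberately simplified stand-in that joins differing sorts to a hypothetical greatest sort 'top'.

```python
def lgg_literals(l1, l2, lgg_atoms):
    """Definition 7: literals are (negated?, atom); mixed signs are undefined
    (returned as None); negation distributes over the atom lgg."""
    (n1, a1), (n2, a2) = l1, l2
    if n1 != n2:
        return None
    a = lgg_atoms(a1, a2)
    return None if a is None else (n1, a)

def lgg_clauses(c1, c2, lgg_atoms):
    """Definition 8: keep every pairwise literal lgg that is defined."""
    return [lit
            for L in c1 for K in c2
            for lit in [lgg_literals(L, K, lgg_atoms)]
            if lit is not None]

# Simplified atom lgg: atoms are (predicate, sort); differing predicates are
# undefined; differing sorts generalize to a hypothetical sort 'top'.
def toy_lgg_atoms(a1, a2):
    (p1, s1), (p2, s2) = a1, a2
    if p1 != p2:
        return None
    return (p1, s1 if s1 == s2 else "top")

c1 = [(False, ("injured", "passenger")), (True, ("dead", "men"))]
c2 = [(False, ("injured", "men")), (False, ("dead", "men"))]
g = lgg_clauses(c1, c2, toy_lgg_atoms)
```

Only the positive injured/injured pair survives: the mixed-sign and differing-predicate pairs are undefined and dropped, exactly as in Definition 8.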
4 Previous Type-Oriented ILP System
This section briefly summarizes a previous ILP system. RHB+ learns logic programs based on τ-terms. τ-terms are restricted forms of ψ-terms: informally, τ-terms are Prolog terms whose variables are replaced with Var:type and whose function symbols have features. It employs a combination of bottom-up and top-down approaches, following the result described in [23]. In the definition of the least general generalization (lgg) [17], the definition of the term lgg was extended to the τ-term lgg. The other definitions of lgg were equivalent to the originals. The special feature of RHB+ is the dynamic type restriction by positive examples during clause construction. The restriction uses the positive examples currently covered in order to determine appropriate types. For each variable X appearing
in the clause, RHB+ computes the supremum of all types bound to X when covered positive examples are unified with the current head in turn.

The search heuristic PWI is a weighted informativity employing a Laplace estimate [9]. Let T = Head :− Body ∧ BK. Then

PWI(T) = − (1/P) · log2 ((P + 1) / (Q(T) + 2)),
where P denotes the number of positive examples covered by T and Q(T) is the empirical content. Type information was made use of for computing Q(T). Let Hs be the set of instances of Head generated by proving Body using backtracking, and let |τ| be the number of constants under type τ in the type hierarchy (when τ is a constant, |τ| is defined as 1). Then

Q(T) = Σ_{h ∈ Hs} Π_{τ ∈ Types(h)} |τ|,

where Types(h) returns the set of types in h. The stopping condition also utilized Q(T) in the computation of the Model Covering Ratio (MCR):

MCR(T) = P / Q(T).

RHB+ was successfully applied to the IE task of extracting key information from 100 newspaper articles related to new product releases [21]. This implies the potential of applying our ψ-term capable ILP system to the IE task.
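As a sanity check of the PWI, Q(T) and MCR formulas, here is a short Python sketch; the type sizes below are hypothetical illustrations, not values from the paper's experiments.

```python
from math import log2

def empirical_content(head_instances, size):
    """Q(T): sum over generated head instances of the product of |tau|
    over the types tau occurring in the instance."""
    total = 0
    for types in head_instances:
        prod = 1
        for tau in types:
            prod *= size(tau)
        total += prod
    return total

def pwi(P, Q):
    """PWI(T) = -(1/P) * log2((P + 1) / (Q(T) + 2))."""
    return -(1.0 / P) * log2((P + 1) / (Q + 2))

def mcr(P, Q):
    """Model Covering Ratio: MCR(T) = P / Q(T)."""
    return P / Q

# Hypothetical type sizes: |people| = 3, |number| = 5; constants count as 1.
sizes = {"people": 3, "number": 5}
size = lambda tau: sizes.get(tau, 1)

Q = empirical_content([("people", "number"), ("alice",)], size)  # 3*5 + 1
```

Since (P + 1)/(Q + 2) < 1 whenever Q > P − 1, the PWI is positive there and shrinks as the empirical content Q(T) approaches the number of covered positives.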
5 New ILP Capable of ψ-Term
This section describes a novel relational learner ψ-RHB, which learns logic programs based on ψ-terms. Extending ILP to a ψ-term capable ILP is not straightforward. In top-down learning, the learner constructs all possible literals to be added to the current body. When considering only simple sorts which do not have any arguments, top-down approaches, like Foil, are efficient. However, when it comes to learning clauses with ψ-terms, it is not realistic to produce all kinds of literals that contain possible terms. For example, if we have predicate p, sorts t1 and t2, features l1 and l2, and variable X, one of the possible literals is p(l1 ⇒ t1(l2 ⇒ X:t1(l1 ⇒ t2))), because the predicate arity and term depth are unbounded. The maximum depth of modification to the sort in a possible literal must be more than the maximum depth of modification to a sort seen in the given training examples. Moreover, at each level of a modification to a sort, the maximum number of features of the sort
is the number of features F. Therefore, generating all patterns of literals with ψ-terms is too time consuming and so not very practical. To cope with this problem, the learning strategy should be bottom-up.

5.1 Algorithms of ψ-Term Capable ILP
The positive examples are atomic formulae based on ψ-terms. The hypothesis language is a set of Horn clauses based on ψ-terms. The background knowledge also consists of atomic formulae. ψ-RHB, a ψ-term capable ILP system, employs a bottom-up approach, like Golem [15].

Learning Algorithm The learning algorithm of our ILP system is based on Golem's algorithm [15], extended for ψ-terms. The steps are as follows.

Algorithm 1 Learning algorithm
1. Given positive examples P, background knowledge BK.
2. Link sorts which have the same names in P and BK.
3. A set of hypotheses H = ∅.
4. Select K pairs of examples (Ai, Bi) as EPi (0 ≤ i ≤ K).
5. Select sets of literals ARi and BRi as selected background knowledge according to the variable depth D.
6. Compute lggs of clauses Ai :− ARi and Bi :− BRi.
7. Simplify the lggs by evaluating them with the weighted informativity PWI defined in Section 4.
8. Select the best clause C, and add it to H if the score of C is better than the given threshold.
9. Remove covered examples from P.
10. If P is empty then return H; otherwise, goto Step 4.
In Step 2, we have to link sorts in the examples and the background knowledge because the OSF theory [4], which underlies the theory of ψ-terms, is not formed under the unique name assumption. For example, if we have two terms f(t) and g(t), the t in f(t) and the t in g(t) are not identical. The OSF theory requires that they be represented as f(X : t) and g(X : t) if t is identical in both terms. Therefore, the same sort symbols in the examples and in the background knowledge are linked and will be treated as identical symbols in the later steps.

In Step 5, to speed up the learning process, literals related to each pair of examples are selected. At first, ARi and BRi are empty. Then, (1) select the background knowledge literals Asel so that it has all literals whose sort symbols are identical to the sorts in Ai or ARi, and select Bsel in the same manner using Bi or BRi. (2) Add the literals Asel and Bsel to the sets ARi and BRi, respectively. Repeat (1) and (2) n times when the predefined variable depth is n. This iteration creates sets of literals.
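The Step 5 iteration can be sketched as follows; this is our reading of the text, with literals simplified to (predicate, set of sort symbols) pairs and all names hypothetical.

```python
def select_background(example, bk, depth):
    """Step 5 sketch: repeatedly add background-knowledge literals that share
    a sort symbol with the example or with the already selected set; the
    number of iterations bounds the variable depth of the later lgg."""
    selected = set()
    reachable = set(example[1])          # sort symbols seen so far
    for _ in range(depth):
        added = {lit for lit in bk if set(lit[1]) & reachable}
        selected |= added
        for _, sorts in added:
            reachable |= set(sorts)
    return selected

example = ("injured", frozenset({"people"}))
bk = {("drive", frozenset({"people", "car"})),
      ("speed", frozenset({"car", "number"})),
      ("weather", frozenset({"sky"}))}
```

With depth 1 only drive (sharing the sort people) is selected; at depth 2 speed is reached through car, while weather never becomes relevant.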
We use ARi as the selected background knowledge for Ai, and BRi for Bi, in Step 6. What is computed in Step 6 is the following lgg of clauses:

lgg((Ai :− ∧ARi), (Bi :− ∧BRi)),
where ∧S is a conjunction of all of the elements in S. The lggs of clauses have a variable depth of at most n.

In Step 7, simplification of the lggs is achieved by checking all literals in the body as to whether the removal of literals makes the score of the weighted informativity worse or not. For the purpose of informativity estimation, we use the concept of ground instances of atomic formulae based on ψ-terms. We call an atomic formula A a ground instance if all of the sorts appearing in A are constants. For example,

familiar(agent ⇒ Serge(residence ⇒ France), obj ⇒ French).
Moreover, literals in the body are checked as to whether they satisfy the input-output mode declarations of the predicates.
6 NLP Application

6.1 Information Extraction
This section presents a brief introduction to our target application: information extraction (IE). The task of information extraction involves extracting key information from a text corpus, such as newspaper articles or WWW pages, in order to fill empty slots of given templates. Information extraction techniques have been investigated by many researchers and institutions in a series of Message Understanding Conferences (MUC), which are not only technical meetings but also IE system contests conducted on common benchmarks. The input for the information extraction task is a set of natural language texts (usually newspaper articles) with an empty template. In most cases, the articles describe a certain topic, such as corporate mergers or terrorist attacks in South America. The given templates have some slots which have field names, e.g., "company name" and "merger date". The output of the IE task is a set of filled templates. IE tasks are highly domain dependent because the rules and dictionaries used to fill values in the template slots depend on the domain.

6.2 Problem in IE System Development
The domain dependence has been a serious problem for IE system developers. As an example, UMass/MUC-3 needed about 1500 person-hours of skilled labor to build the IE rules, represented as a dictionary [13]. Worse, new rules have to be constructed from scratch when the target domain is changed.
[Figure: block diagram. A template with examples, case frames as background knowledge, and a type hierarchy are fed to the ILP system, which outputs filled templates and IE rules; the case frames relating to the examples are selected.]

Fig. 1. Block diagram of the preparation of data for ILP
To cope with this problem, some researchers have studied methods to learn information extraction rules. Against this background, we selected the IE task as an application of a ψ-term capable ILP. An IE task is appropriate for our application because natural languages contain a vast variety of nouns relating to a taxonomy (i.e., a sort hierarchy).
7 Experimental Results
For the purpose of estimating the performance of our system, we conducted experiments on the learning of IE rules. The IE tasks here involved MUC-4 style IE, and the template elements to be filled included two items⁵. We extracted articles related to accidents from a one-year newspaper corpus written in Japanese⁶. Forty-two articles were related to accidents which resulted in some deaths and injuries. The template we used consisted of two slots: the number of deaths and injuries. We filled one template for each article. After parsing the sentences, tagged parse trees were converted into atomic formulae representing case frames. Figure 1 shows the learning block diagram. Those case frames were given to our learner as background knowledge. All of the 42 articles were able to be represented as case frames thanks to the representation power of ψ-terms, while only 25 articles were able to be represented using τ-terms. Each slot of a filled template was given as a positive example. For precision and recall, the standard evaluation metrics for IE tasks, we counted them by using four-fold cross validation on the 42 examples.
⁵ This is a relatively simple setting compared to state-of-the-art IE tasks.
⁶ We thank the Mainichi Newspaper Co. for permitting us to use the articles of 1992.
Table 1. Comparison between RHB+ and ψ-RHB

[deaths]    Time (sec)  Hypo  P / Q(T)
RHB+        172.7       4     25/25
ψ-RHB       954.2       4     25/25

[injuries]  Time (sec)  Hypo  P / Q(T)
RHB+        508.5       8     25/25
ψ-RHB       3218.0      10    25/25
Precision = correct output answers / output answers
Recall = correct output answers / all correct answers
7.1 Natural Language Processing Tools
We used a sort hierarchy hand-crafted for the Japanese-English machine translation system ALT-J/E [11]. This hierarchy is a sort of concept thesaurus represented as a tree structure in which each node is called a category (i.e., a sort). An edge in the tree represents an is-a relation among the categories. The current version of the sort hierarchy is 12 levels deep and contains about 3000 category nodes. We also used the commercial-quality morphological analyzer, parser, and semantic analyzer of ALT-J/E. The results after semantic analysis were parse trees with case and semantic tags. We developed a logical form translator FEP2 which generates case frames expressed as atomic formulae from the parse trees.

7.2 Results
Table 1 and Table 2 show the results of our experiments. Table 1 shows the experimental results of RHB+ and ψ-RHB using the same 25 examples as used in [19]. We used a SparcStation 20 for this experiment. ψ-RHB showed a high accuracy like RHB+ but slowed down in exchange for its extended representation power in the hypothesis language. Table 2 shows the experimental results on forty-two examples. We used an AlphaStation 500/333MHz for this experiment. Overall, a very high precision, 90-97%, was achieved. 63-80% recall was achieved with all case frames, including errors in the case selection and semantic tag selection. These selections had an error range of 2-7%. With only correct case frames, 67-88% recall was achieved. It is important to note that the extraction of two different pieces of information showed good results. This indicates that our learner has high potential in IE tasks.
Table 2. Learning results of accidents

                              deaths    injuries
Precision (all case frames)   96.7%     89.9%
Recall (all case frames)      80.0%     63.2%
Recall (correct case frames)  87.5%     66.7%
Average time (sec.)           966.8     828.0

8 Discussion
The benefits of ψ-term capability, beyond what τ-terms already offer, depend in an application on the writing style of the topic. In English, the expression "ABC Corp.'s printer" is commonly used, and the logical term representation can be printer(pos ⇒ ABC Corp.). However, if the expression "ABC Corp. released a printer and ..." were very common, it could be release(ABC Corp., printer). In this case, since the required representation is within the τ-term, extending the language from τ-term based to ψ-term based does not pay for the higher computing cost.

In the IE community, some previous research has looked at generating IE rules from texts with filled templates, for example, AutoSlog-TS [18], CRYSTAL [22], LIEP [10], and RAPIER [7]. The main differences between our approach and the others are that:
- we use semantic representations (i.e., case frames) created by a domain-independent parser and semantic analyzer;
- we use ILP techniques independent of both the parser and semantic analyzer.

Note that the second item means that learned logic programs may have several atomic formulae in their bodies. This point differentiates our approach from the simple generalization of a single case frame. INDIE [5] learns a set of feature terms equivalent to ψ-terms. The learning is equivalent to the learning of a set of atomic formulae based on ψ-terms which cover all positive examples and no negative examples. Because its hypotheses are generated so as to exclude any negatives, it might be intolerant to noise. Sasaki [20] reported a preliminary ILP system which was capable of limited features of ψ-terms. Preliminary experiments were conducted on learning IE rules to extract information from only twenty articles.
9 Conclusions and Remarks

This paper has described an algorithm of a ψ-term capable ILP and its application to information extraction. The lgg of logic terms was extended to the lgg of ψ-terms. The learning algorithm is based on the lgg of clauses with ψ-terms.
Natural language processing relies on a vast variety of nouns relating to the sort hierarchy (or taxonomy), which plays a crucial role in generalizing data generated from natural language. Therefore, the information extraction task matches the requirements of the ψ-term capable ILP. Because of the modest robustness and performance of current natural language analysis techniques (for Japanese texts), errors were found in parsing, case selection, and semantic tag selection. The experimental results, however, show that the learned rules achieve high precision and recall in IE tasks. Moreover, an important point is that all of the 42 articles related to the topic were able to be represented as case frames, which demonstrates the representation power of ψ-terms. This indicates that applying ILP to learning from case frames will become more practical as NLP techniques progress in the near future.
References

1. H. Aït-Kaci and R. Nasr, LOGIN: A logic programming language with built-in inheritance, J. Logic Programming, 3, pp.185-215, 1986.
2. H. Aït-Kaci and A. Podelski, Toward a Meaning of LIFE, PRL-RP-11, Digital Equipment Corporation, Paris Research Laboratory, 1991.
3. H. Aït-Kaci, B. Dumant, R. Meyer, A. Podelski, and P. Van Roy, The Wild LIFE Handbook, 1994.
4. H. Aït-Kaci, A. Podelski and S. C. Goldstein, Order-Sorted Feature Theory Unification, J. Logic Programming, Vol.30, No.2, pp.99-124, 1997.
5. E. Armengol and E. Plaza, Induction of Feature Terms with INDIE, ECML-97, pp.33-48, 1997.
6. G. Birkhoff, Lattice Theory, American Mathematical Society, 1979.
7. M. E. Califf and R. J. Mooney, Relational Learning of Pattern-Match Rules for Information Extraction, ACL-97 Workshop in Natural Language Learning, 1997.
8. B. Carpenter, The Logic of Typed Feature Structures, Cambridge University Press, 1992.
9. B. Cestnik, Estimating Probabilities: A Crucial Task in Machine Learning, ECAI-90, pp.147-149, 1990.
10. S. B. Huffman, Learning Information Extraction Patterns from Examples, Statistical and Symbolic Approaches to Learning for Natural Language Processing, pp.246-260, 1996.
11. S. Ikehara, M. Miyazaki, and A. Yokoo, Classification of Language Knowledge for Meaning Analysis in Machine Translations, Transactions of Information Processing Society of Japan, Vol.34, pp.1692-1704, 1993 (in Japanese).
12. N. Lavrac and S. Dzeroski, Inductive Logic Programming: Techniques and Applications, Ellis Horwood, 1994.
13. W. Lehnert, C. Cardie, D. Fisher, J. McCarthy, E. Riloff and S. Soderland, University of Massachusetts: MUC-4 Test Results and Analysis, Fourth Message Understanding Conference, pp.151-158, 1992.
14. S. Muggleton, Inductive Logic Programming, New Generation Computing, 8(4), pp.295-318, 1991.
15. S. Muggleton and C. Feng, Efficient Induction of Logic Programs, in Inductive Logic Programming, Academic Press, 1992.
16. E. Plaza, Cases as terms: A feature term approach to the structured representation of cases, First Int. Conf. on Case-Based Reasoning, pp.263-276, 1995.
17. G. Plotkin, A Note on Inductive Generalization, in B. Meltzer et al. eds., Machine Intelligence 5, pp.153-163, Edinburgh University Press, 1969.
18. E. Riloff, Automatically Generating Extraction Patterns from Untagged Text, AAAI-96, pp.1044-1049, 1996.
19. Y. Sasaki and M. Haruno, RHB+: A Type-Oriented ILP System Learning from Positive Data, IJCAI-97, pp.894-899, 1997.
20. Y. Sasaki, Learning of Information Extraction Rules using ILP - Preliminary Report, The Second International Conference on the Practical Application of Knowledge Discovery and Data Mining, pp.195-205, London, 1998.
21. Y. Sasaki, Applying Type-Oriented ILP to IE Rule Generation, AAAI-99 Workshop on Machine Learning for Information Extraction, pp.43-47, 1999.
22. S. Soderland, D. Fisher, J. Aseltine, W. Lehnert, CRYSTAL: Inducing a Conceptual Dictionary, IJCAI-95, pp.1314-1319, 1995.
23. J. M. Zelle, R. J. Mooney, and J. B. Konvisser, Combining Top-down and Bottom-up Methods in Inductive Logic Programming, ML-94, pp.343-351, 1994.
Complexity in the Case Against Accuracy: When Building One Function-Free Horn Clause Is as Hard as Any

Richard Nock

Department of Mathematics and Computer Science, Université des Antilles-Guyane, Campus de Fouillole, 97159 Pointe-à-Pitre, France
rnock@univ-ag.fr
Abstract. Some authors have repeatedly pointed out that the use of the accuracy, in particular for comparing classifiers, is not adequate. The main argument discusses the validity of some assumptions underlying the use of this criterion. In this paper, we study the hardness of the accuracy's replacement in various ways, using a framework very sensitive to these assumptions: Inductive Logic Programming. Replacement is investigated in three ways: completion of the accuracy with an additional requirement, replacement of the accuracy by ROC analysis, recently introduced from signal detection theory, and replacement of the accuracy by a single criterion. We prove strong hardness results for most of the possible replacements. The major point is that allowing arbitrary multiplication of clauses appears to be totally useless. Another point is the equivalence in difficulty of various criteria. In contrast, the accuracy criterion appears to be tractable in this framework.
1 Introduction
As the number of classification learning algorithms is rapidly increasing, the question of finding efficient criteria to compare their results is of particular relevance. This is also of importance for the algorithms themselves, as they can naturally optimize such criteria directly to achieve good results. A criterion frequently encountered to address both problems is the accuracy, which recently received some criticisms about its adequacy on these topics [7]. The primary inadequacy of the accuracy stems from a tacit assumption that the overall accuracy controls by-class accuracies, or similarly that class distributions among examples are constant and relatively balanced [6]. This is obviously not true: skewed distributions are frequent in agronomy, or more generally in life sciences. As an example, consider the human DNA, in which no more than 6% are coding genes [7]. In such cases, the interesting, unusual class is often the rare one, and the well-balanced hypothesis may not lead to discovering the unusual individuals. Moreover, in real-world problems, not only is this assumption false, but the misclassification of some examples may also have heavy consequences, another cost which is not integrated in the accuracy. Fraud detection is a good

O. Watanabe, T. Yokomori (Eds.): ALT'99, LNAI 1720, pp. 182-193, 1999.
© Springer-Verlag Berlin Heidelberg 1999
Complexity in the Case Against Accuracy    183
example of such situations [7], but medical domains are typical. As an example, consider the case where a mutagen molecule is predicted as non-mutagen, and the case where a harmless molecule is predicted as mutagen. In these cases, the interesting class has the heaviest misclassification costs, and the equal error costs assumption may produce bad results. Finally, the accuracy may be inadequate in some cases because other parameters are to be taken into account. Constraints on size parameters are sometimes to be used because we want to obtain small formulae, for interpretation purposes. As an example, consider again the problem of mutagenesis prediction, where two equally accurate formulae are obtained. If one is much smaller, it is more likely to provide useful descriptions for the mining expert. We have chosen for our framework a field particularly sensitive to these problems, Inductive Logic Programming (ILP). ILP is a rapidly growing research field concerned with the use of variously restricted subclasses of Horn clauses to build Machine Learning (ML) algorithms. According to [9], almost seventy applications use ILP formalism, twenty of which are science applications, which can be partitioned into biological (four) and drug design (sixteen) applications. ILP-ML algorithms have been applied with some success in areas of biochemistry and molecular biology [9]. Using ILP formalism, we argue that the replacement of the accuracy raises structural complexity issues. The argument is structured as follows. First, to address the latter problems, we explain that the single accuracy requirement can be completed by an additional requirement to provide more adequate criteria. We integrate various constraints over two important kinds of parameters: by-class error functions, and representation parameters such as feature selection ratios and size constraints.
We show that any such integration leads to a very negative structural complexity result, similar to NP-Hardness, which is not faced by accuracy optimization alone. The result has a side effect which can be presented as a loss in the formalism's expressiveness, a rare property in classical ML complexity issues. Indeed, it authorizes the construction of arbitrarily large (even exponential sized) sets of Horn clauses, but which we prove have no more expressive power than a single Horn clause. We prove a threshold in intractability, since it appears immediately with the additional requirement and is not a function of its tightness. Furthermore, the effects of the constraints on optimal accuracies vanish as the number of predicates increases, as optimal accuracies with or without the additional constraints are asymptotically equal. Finally, for some criteria, their mixing with the accuracy brings the most negative result: not only does the intractability appear immediately with the criterion, but also the error cannot be dropped below that of the unbiased coin. We then study the replacement of the accuracy criterion using a general method [6,7], derived from statistical decision theory, based on a specific bicriteria optimization. We show that this method leads to the same drawbacks. Finally, we investigate the replacement of the error by a single criterion, and show that it is also to be analyzed very carefully, as some of the candidates lead exactly to the same negative results presented before. The reductions are
184    Richard Nock
presented for a subclass of Horn formalism simple enough to be an element of the intersection of all classically encountered theoretical ILP studies.
2 Mono and Bi-criteria Solutions to Replace the Accuracy
Denote as C and H two classes of concept representations, respectively called the target's class and the hypothesis class. In real-world domains, we do not know the target concept's class; that is why we have to make ad hoc choices for H with a powerful enough formalism, yet ensuring tractability. Even if some benchmark problems appear to be easily solvable [3], ML applications, and particularly ILP, face more difficult problems [9], for which the choice of H is crucial. Since most of the studies dealing with the accuracy replacement problem have been investigated with two classes [7], we also consider two-class problems and not multi-class cases. This is not really a restriction for us, as results already become hard in that setting. Let c ∈ C. Suppose that we have drawn examples following some unknown but fixed distribution D, labelled according to c. We can denote the accuracy of h ∈ H with respect to (w.r.t.) c by PD(h = c) = Σ_{h(x)=c(x)} D(x).

2.1 Extending the Accuracy
The principal drawbacks of the accuracy are of two kinds: the equal costs assumption [6], and the well-balanced assumption [7]. We propose a solution to the problem by the maximization of the accuracy subject to constraints. We also propose criteria on related problems, an example of such being the feature selection problem, in which we want to build formulae on restricted windows of the total features set. For any fixed positive rational g, we use the following adequate notion of distance between two reals u, v: d_g(u, v) = |u − v| / (u + v + g). We also use eight rates on the examples (definitions differ slightly from [7]):

TP = Σ_{h(x)=1=c(x)} D(x);  TPR = TP/P;  FP = Σ_{h(x)=1≠c(x)} D(x);  FPR = FP/N;
TN = Σ_{h(x)=0=c(x)} D(x);  TNR = TN/N;  FN = Σ_{h(x)=0≠c(x)} D(x);  FNR = FN/P,

with N = Σ_{c(x)=0} D(x) and P = Σ_{c(x)=1} D(x). In order to complete the accuracy requirements, we imagine seven types of additional constraints, each of them being parameterized by a number α (between 0 and 1). Each of them defines a subset of H, which shall be parameterized by D if the distribution controls the subset through the constraint. The first three subsets of H contain hypotheses for which FP and FN are not far from each other, or a one-side error is upper bounded:

HD,1(α) = {h ∈ H : d_g(FP, FN) ≤ α};  HD,2(α) = {h ∈ H : FN ≤ α};  HD,3(α) = {h ∈ H : FP ≤ α}.

The two following subsets are parameterized by constraints equivalent to some frequently encountered in the information retrieval community [8], respectively (1 minus) the precision and (1 minus) the recall criteria:

HD,4(α) = {h ∈ H : FP/(TP + FP) ≤ α};  HD,5(α) = {h ∈ H : FN/(TP + FN) ≤ α}.

Define #P(h) as the total number of different predicates of h, #W(h) as the whole number of predicates of h (if one predicate is present k times, it is counted k times), and #T as the total number of different available predicates. The two last subsets of H are parameterized by formulae respectively having a sufficiently small fraction of the available predicates, or having a sufficiently small overall size:

H6(α) = {h ∈ H : #P(h)/#T ≤ α};  H7(α) = {h ∈ H : #W(h)/#T ≤ α}.

The division by the total number of different predicates in H7(α) is made only for technical reasons: to obtain hardness results for small values of α and thus, already for small sizes of formulae (in the last constraint). The first problem we address can be summarized as follows:

Problem 1: Given α and a ∈ {1, 2, ..., 7}, can we find an algorithm returning a set of Horn clauses from H(D,)a(α) whose error is no more than a given γ, if such an hypothesis exists?
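The eight rates and the distance d_g are straightforward to compute from a weighted sample; the following Python sketch (with a hypothetical sample of our own) also shows a membership check against one of the constrained subsets.

```python
def rates(sample):
    """Eight rates from triples (D(x), h(x), c(x)), with weights summing to 1."""
    TP = sum(d for d, h, c in sample if h == 1 and c == 1)
    FP = sum(d for d, h, c in sample if h == 1 and c == 0)
    TN = sum(d for d, h, c in sample if h == 0 and c == 0)
    FN = sum(d for d, h, c in sample if h == 0 and c == 1)
    P, N = TP + FN, TN + FP
    return dict(TP=TP, FP=FP, TN=TN, FN=FN, P=P, N=N,
                TPR=TP / P, FPR=FP / N, TNR=TN / N, FNR=FN / P)

def d_g(u, v, g):
    """d_g(u, v) = |u - v| / (u + v + g) for a fixed positive rational g."""
    return abs(u - v) / (u + v + g)

# A hypothetical weighted sample: (weight, prediction, label).
sample = [(0.25, 1, 1), (0.25, 1, 0), (0.25, 0, 0), (0.25, 0, 1)]
r = rates(sample)
# Membership in H_{D,1}(alpha): d_g(FP, FN) <= alpha.
in_H1 = d_g(r["FP"], r["FN"], g=1) <= 0.1
```

Here FP = FN, so the hypothesis trivially satisfies the balance constraint of H_{D,1} even though its error rate is that of the unbiased coin.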
2.2 Replacing the Accuracy: The ROC Analysis
Receiver Operating Characteristic (ROC) analysis is a traditional methodology from signal detection theory [1]. It has been used in machine learning recently [6,7] in order to correct the main drawbacks of the accuracy. In ROC space (this is the coordinate system), we visualize the performance of a classifier by plotting TPR on the Y axis and FPR on the X axis. Figure 1 presents the ROC analysis, along with three possible outputs which we present and analyze.

[Figure: ROC space with TPR on the Y axis and FPR on the X axis, showing a continuous prediction curve, the random continuous prediction line y = x, and a single point for a set of Horn clauses.]

Fig. 1. The ROC analysis of a learning algorithm.

If a classifier produces a continuous output (such as an estimate of the posterior probability of an instance's class membership [7]), for any possible value of FPR we can get a value for TPR, by thresholding the output between its extreme bounds. If a classifier produces a discrete output (such as Horn clauses), then the classifier gives rise to a single point. If the classifier is the random choice of the class, either (if it is continuous) the curve is the line y = x, or (if it is discrete) there is a single dot on the line y = x. One important thing to note is that the ROC representation gives the behavior of an algorithm without regarding the class distribution or the error cost [6]. And it allows to choose the best of some classifiers, by the following procedure. Fix as K+ the cost
of misclassifying a positive example, and K− the cost of misclassifying a negative example (these two costs depend on the problem). Then the expected cost of some classifier represented by point (FPR, TPR) is given by the following formula:

(Σ_{c(x)=1} D(x)) · (1 − TPR) · K+ + (Σ_{c(x)=0} D(x)) · FPR · K−.

Two algorithms, whose corresponding points are respectively (FPR1, TPR1) and (FPR2, TPR2), have the same expected cost iff

(TPR2 − TPR1) / (FPR2 − FPR1) = (Σ_{c(x)=0} D(x) K−) / (Σ_{c(x)=1} D(x) K+).

This gives the slope of an isoperformance line, which only depends on the relative weights of the examples and the respective misclassification costs. Given one point on the ROC, the classifiers performing better are those on the northwest of the isoperformance line with the preceding slope, and to which the point belongs. If we want to find an algorithm A performing surely better than an algorithm B, we therefore should strive to find A such that its point lies in the rectangle whose opposite vertices are the (0, 1) point (the perfect classification) and B's point (a grey rectangle is shown on the top left of Figure 1). From that, the second problem we address is the following (note the constraint's weakness: the algorithm is required to work only on a single point):

Problem 2: Given one point (TPRx, FPRx) on the ROC, can we find an algorithm returning a set of Horn clauses whose point falls into the rectangle with opposite vertices (0, 1) and (TPRx, FPRx), if such an hypothesis exists?
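The cost computation and the dominance test behind Problem 2 can be sketched in a few lines of Python; the weights and costs below are hypothetical.

```python
def expected_cost(P, N, tpr, fpr, k_pos, k_neg):
    """Expected cost of a classifier at ROC point (fpr, tpr):
    P * (1 - TPR) * K+  +  N * FPR * K-."""
    return P * (1 - tpr) * k_pos + N * fpr * k_neg

def iso_slope(P, N, k_pos, k_neg):
    """Slope (N * K-) / (P * K+) of an isoperformance line."""
    return (N * k_neg) / (P * k_pos)

def dominates(a, b):
    """True iff ROC point a = (fpr, tpr) lies in the rectangle with opposite
    vertices (0, 1) and b, i.e. a is surely at least as good as b."""
    (fa, ta), (fb, tb) = a, b
    return fa <= fb and ta >= tb

# Two points on the same isoperformance line (P = N = 0.5, K+ = K- = 1,
# so the slope is 1) have equal expected cost:
c1 = expected_cost(0.5, 0.5, 0.4, 0.2, 1, 1)
c2 = expected_cost(0.5, 0.5, 0.6, 0.4, 1, 1)
```

Note that the rectangle test is a partial order: two ROC points can be incomparable, which is exactly why the isoperformance slope is needed to rank them under given class weights and costs.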
2.3 Replacing the Accuracy by a Single Criterion
The question of whether the accuracy can be replaced by a single criterion instead of two (such as in ROC) has been raised in [6]. Some researchers [6] propose the use of the following criterion: (1 − FPR)·TPR. A geometric interpretation of the criterion is the following [6]: it corresponds to the area of a rectangle whose opposite vertices are (FPR, TPR) and (1, 0). The typical isoperformance curve is now a hyperbola. The third problem we address is therefore:
Problem 3: Given γ, can we find an algorithm returning a set of Horn clauses such that (1 − FPR)·TPR ≥ γ, if such a hypothesis exists?
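A small numerical sketch of the criterion (with illustrative values of our own): points of equal criterion value (1 − FPR)·TPR lie on one hyperbola in ROC space.

```python
# Single replacement criterion from [6]: the area of the rectangle with
# opposite vertices (FPR, TPR) and (1, 0).
def criterion(fpr, tpr):
    return (1.0 - fpr) * tpr

# Points with equal criterion value lie on the hyperbola (1 - x) * y = gamma.
points = [(0.0, 0.5), (0.2, 0.625), (0.5, 1.0)]
values = [criterion(f, t) for f, t in points]
assert all(abs(v - 0.5) < 1e-12 for v in values)  # all on the gamma = 0.5 hyperbola
```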
3 Introduction to the Proof Technique
We present here the basic ILP notions which we use, with a basic introduction to our proofs. Technical parts are given in two appendices.
3.1 ILP Background Needed
The ILP background needed to understand this article can be summarized as follows. More formalization and details are given in [4], but they are not needed here. Given a Horn clause language L and a correct inference relation on L, an ILP learning problem can be formalized as follows. Assume a background knowledge BK expressed in a language LB ⊆ L, and a set of examples E in
Complexity in the Case Against Accuracy
a language LE ⊆ L. The goal is to produce a hypothesis h in a hypothesis class H ⊆ L consistent with BK and E, such that h and the background knowledge cover all positive examples and none of the negative ones. Sometimes the formalism cannot correctly classify all examples according to the preceding scenario, for the reason that the examples describe a complex concept. We may transform the ILP learning problem into a relaxed version, where we want the formulae to make sufficiently small errors over the examples. The choice of the representation languages for the background knowledge and the examples, and of the inference relation, greatly influences the complexity (or decidability) of the learning problem. A common restriction for both BK and E is to use ground facts. As in [5], we use θ-subsumption as the inference relation (a clause h1 θ-subsumes a clause h2 iff there exists a substitution θ such that h1θ ⊆ h2 [5,4]). In order to treat our problem as a classical ML problem, we use the following lemma, which authorizes us to create ordinary examples:
Lemma 1. [5] Learning a Horn clause program from a set of ground background knowledge BK and ground examples E, the inference relation being generalized subsumption, is equivalent to learning the same program with θ-subsumption, empty background knowledge, and examples defined as ground Horn clauses of the form e ← b, where e ∈ E and b ∈ BK.
In the following, we are interested in learning concepts in the form of (sets of) non-recursive Horn clauses. It is important to note that all results are still valid when considering propositional, determinate or local Horn clauses, similarly to the study of [4], to which we refer for all necessary definitions. For the sake of simplicity in stating our results, we sometimes abbreviate Function-free Horn Clauses by the acronym FfHC.
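θ-subsumption as defined above can be decided by a backtracking search over substitutions; the following sketch is our own illustration (the clause encoding and names are ours, not the paper's):

```python
# A clause is a list of literals; a literal is (predicate, args), where each
# argument is a string: variables start uppercase, constants lowercase.
def theta_subsumes(h1, h2):
    """True iff some substitution theta maps h1 into a subset of h2."""
    h1, h2 = list(h1), list(h2)

    def is_var(t):
        return t[0].isupper()

    def search(i, theta):
        if i == len(h1):
            return True
        pred, args = h1[i]
        for pred2, args2 in h2:          # try to map literal i onto some literal of h2
            if pred2 != pred or len(args2) != len(args):
                continue
            new = dict(theta)
            ok = True
            for a, b in zip(args, args2):
                if is_var(a):
                    if new.get(a, b) != b:   # variable already bound differently
                        ok = False
                        break
                    new[a] = b
                elif a != b:                 # constants must match exactly
                    ok = False
                    break
            if ok and search(i + 1, new):
                return True
        return False

    return search(0, {})

# q(X) <- a1(X), t(X)   theta-subsumes   q(c) <- a1(c), a2(c), t(c)
general = [("q", ("X",)), ("a1", ("X",)), ("t", ("X",))]
ground  = [("q", ("c",)), ("a1", ("c",)), ("a2", ("c",)), ("t", ("c",))]
assert theta_subsumes(general, ground)
assert not theta_subsumes(ground, general)
```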
3.2 Basic Tools for the Hardness Results
Concerning problem 1, fix a ∈ {1, 2, 3, 4, 5, 6, 7}. We want to approximate the best concept in H_{(D,ρ),a} by one still in H_{(D,ρ),a}. However, the best concept in H_{(D,ρ),a} generally does not have an error equal to the optimal one over H. In fact, given D, it has an error that we can denote opt_{H_{(D,ρ),a}}(c) = min_{h∈H_{(D,ρ),a}} Σ_{h(x)≠c(x)} D(x) ≥ opt_{H,D}(c). The goodness of the accuracy of a concept taken from H_{(D,ρ),a} should be appreciated with respect to this latter quantity. Our results on problem 1 are all obtained by showing the hardness of solving the following decision problem:
Definition 1. Approx-Constrained(H, (a, ρ)):
Instance: A set of negative examples S−, a set of positive examples S+, a rational weight 0 < w(x) = n_x/d < 1 for each example x, a rational 0 ≤ γ < 1. We assume that Σ_{x∈S+∪S−} w(x) = 1.
Question: ∃? h ∈ H_{(D,ρ),a} satisfying Σ_{h(x)≠c(x)} w(x) ≤ γ?
Define as n_e the size of the largest example we dispose of. Note that when the constraint is too tight, it can be the case that H_{(D,ρ),a} = ∅. Define as |h| the
size of some h ∈ H (in our case, it is the number of Horn clauses of h). In the nonempty subset of H_{(D,ρ),a} where formulae are the most constrained (i.e. strengthening further the constraint gives an empty subset), define as nopt_{H_{(D,ρ),a}}(c) the size of the smallest hypothesis. Then, our reductions all satisfy nopt_{H_{(D,ρ),a}}(c) ≤ (n_e)³. Note that the constraint generally makes opt_{H_{(D,ρ),a}}(c) > opt_{H,D}(c). However, the reductions all satisfy |opt_{H,D}(c) − opt_{H_{(D,ρ),a}}(c)| = o(1), i.e. asymptotic values coincide. In addition, the principal result we get (similar for all other problems) is that we can suppose that the whole time used to write the total set of Horn clauses is assimilated to O(n_e), for any set. By writing time, we mean the time of any procedure consisting only in writing down clauses. Examples of such procedures are "write down all clauses having at most k literals", or even "write down all Horn clauses". Such procedures can be viewed as for-to, or repeat, algorithms. This property authorizes the construction of Horn clause sets having arbitrary sizes, even exponential. Problem 2 is addressed by studying the complexity of the following decision problem.
Definition 2. Approx-Constrained-ROC(H, γ_FPR, γ_TPR):
Instance: A set of negative examples S−, a set of positive examples S+, a rational weight 0 < w(x) = n_x/d < 1 for each example x, a rational 0 ≤ γ < 1. We assume that Σ_{x∈S+∪S−} w(x) = 1.
Question: ∃? h ∈ H satisfying 1 − FPR ≥ 1 − γ_FPR and TPR ≥ γ_TPR?
Concerning problem 3, the reductions study a single replacement criterion Γ, and the following decision problem.
Definition 3. Approx-Constrained-Single(H, Γ, γ):
Instance: A set of negative examples S−, a set of positive examples S+, a rational weight 0 < w(x) = n_x/d < 1 for each example x, a rational 0 ≤ γ < 1. We assume that Σ_{x∈S+∪S−} w(x) = 1.
Question: ∃? h ∈ H satisfying Γ(h) ≥ γ?
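Once a hypothesis h is fixed, the question asked in each of these definitions is only a weighted-error (or weighted-rate) threshold test; verifying a candidate is easy, and the hardness lies entirely in finding one. A minimal sketch with made-up weights (the names are ours):

```python
from fractions import Fraction

def weighted_error(predictions, labels, weights):
    """Sum of the weights of examples on which h disagrees with c."""
    return sum(w for h, c, w in zip(predictions, labels, weights) if h != c)

# Rational weights n/d summing to 1, as in the definitions.
w = [Fraction(1, 4), Fraction(1, 4), Fraction(1, 2)]
labels      = [1, 0, 1]
predictions = [1, 1, 1]     # one mistake, on the weight-1/4 negative example
assert weighted_error(predictions, labels, w) == Fraction(1, 4)
assert weighted_error(predictions, labels, w) <= Fraction(1, 3)  # accept for gamma = 1/3
```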
4 Hardness Results
Theorem 1. We have:
[1] ∀ 0 < ρ < 1, Approx-Constrained(FfHC, (1, ρ)) is Hard, when γ < (1 − ρ)·ρ.
[2] ∀ 0 < ρ < 1/2, Approx-Constrained(FfHC, (2, ρ)) is Hard.
[3] ∀ a ∈ {3, 4, 5, 6, 7}, ∀ 0 < ρ < 1, Approx-Constrained(FfHC, (a, ρ)) is Hard.
At that point, the notion of hardness needs to be clarified. By "Hard" we mean "cannot be solved in polynomial time under some particular complexity assumption". The notion of hardness used encompasses that of classical NP-completeness, since we use the results of [2], involving randomized complexity classes. All our hardness results are to be read with that precision in mind. Due to space constraints, only the proof of point [1] is presented in appendix 2; all other results strictly use the same type of reduction. Also, in appendix 1, we sketch the proof that all distributions under which our negative results are
proven lead to trivial positive results for the same problem when we remove the additional constraint and optimize the accuracy alone. While negative results for optimizing the accuracy itself would naturally hold when considering the additional constraints, we therefore prove that optimizing the accuracy under constraint is a strictly more difficult problem, with non-trivial additional drawbacks. Furthermore, the upper-bound error value (γ in def. 1) in constraints 4, 5, 6, 7 can be fixed arbitrarily in ]0, 1/2[, i.e. requiring the Horn clause set to perform slightly better than the unbiased coin does not make the problem easier. We now show that the classical ROC components as described by [7] lead to the same results as those we claimed for the preceding bi-criteria optimizations. The problem is all the more difficult as the difficulty appears as soon as we choose to use ROC analysis, and is not a function of the ROC bounds.
Theorem 2. Approx-Constrained-ROC(FfHC, γ_FPR, γ_TPR) is Hard; the result holds ∀ 0 < γ_FPR, γ_TPR < 1.
The distribution under which the negative result is proven is an easy distribution for the accuracy's optimization alone, similarly to those of the seven constraints. We now investigate the replacement of the accuracy by a single criterion. The negative result stated in the following theorem is to be read with all the additional drawbacks mentioned for the previous theorems. Again, the distribution under which the theorem is proven is easy when optimizing the accuracy alone.
Theorem 3. ∃ γ_max > 0 such that ∀ 0 < γ < γ_max, the problem Approx-Constrained-Single(FfHC, (1 − FPR)·TPR, γ) is Hard.
(Proof sketch included in appendix 2.) As far as we know, γ_max ≥ 175/41616 (roughly 4.2·10⁻³), but we think that this bound can be much improved.
5 Appendix 1: The Global Reduction
Reductions are achieved from the NP-complete problem Clique [2], whose instance is a graph G = (X, E) and an integer k. The question is "Does there exist a clique of size k in G?". Of course, Clique is not hard to solve for any value of k. The following lemma establishes values of k for which we can suppose that the problem is hard to solve (C(n, k) = n!/((n − k)!·k!) is the binomial coefficient):
Theorem 4. (i) We can suppose that C(k, 2) ≤ |E|, and that k is not a constant, otherwise Clique is polynomial. (ii) For any ε ∈ ]0, 1[, Clique is hard for the value k = ⌈|X|^ε⌉ or k = ⌊|X|^ε⌋.
Proof. (i) is immediate; (ii) follows from [2]: it is proven there that the largest clique size is not approximable to within |X|^{1−ε}, for any constant 0 < ε < 1. Therefore, the graphs generated have a clique number which is either ≤ l, or greater than l·|X|^{1−ε}, with l < |X|^ε. Therefore, the decision problem is intractable for values of k > l, which is the case if k = ⌈|X|^ε⌉ or k = ⌊|X|^ε⌋, with ε ∈ ]0, 1[.
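For intuition, the Clique decision problem from which the reductions start can be written as a naive exponential-time check (the encoding below is our own illustration; the point of the hardness results is that no polynomial-time procedure is expected for the stated values of k):

```python
from itertools import combinations

def has_clique(vertices, edges, k):
    """Naive decision procedure for Clique: is there a k-clique in G = (X, E)?"""
    edge_set = {frozenset(e) for e in edges}
    return any(
        all(frozenset(pair) in edge_set for pair in combinations(subset, 2))
        for subset in combinations(vertices, k)
    )

# Triangle {1, 2, 3} plus a pendant vertex 4.
X = [1, 2, 3, 4]
E = [(1, 2), (2, 3), (1, 3), (3, 4)]
assert has_clique(X, E, 3)       # the triangle
assert not has_clique(X, E, 4)
```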
The structure of the examples is the same for any of our reductions. Define a set of |X| unary predicates a_1(·), …, a_|X|(·), in bijection with the vertices of G. To this set of unary predicates, we add two unary predicates, g(·) and t(·). The inferred predicate is denoted q(·). The choice of unary predicates is made only for a simplicity purpose. We could have replaced each of them by l-ary predicates without changing our proof. Define a set of constant symbols useful for the description of the examples: l1, l2, l3, l4, {m_i}_{i∈{1,2,…,|X|}}, {l_{ij} | (i,j) ∈ E}. Examples are described in the following way. Positive examples from S+ are as follows:

∀(i,j) ∈ E, p_{ij} = q(l_{ij}) ← ∧_{k∈{1,2,…,|X|}\{i,j}} a_k(l_{ij}) ∧ t(l_{ij})   (1)
p1 = q(l1) ← ∧_{k∈{1,2,…,|X|}} a_k(l1) ∧ t(l1)   (2)
p2 = q(l2) ← a_1(l2)   (3)

Negative examples from S− are as follows:

∀i ∈ {1,2,…,|X|}, n_i = q(m_i) ← ∧_{k∈{1,2,…,|X|}\{i}} a_k(m_i) ∧ t(m_i)   (4)
n'1 = q(l3) ← ∧_{k∈{1,2,…,|X|}} a_k(l3) ∧ g(l3) ∧ t(l3)   (5)
n'2 = q(l4) ← ∧_{k∈{1,2,…,|X|}} a_k(l4) ∧ g(l4)   (6)

It comes that nopt_{H_{(D,ρ),a}}(c) = O(|X|³) (coding size of the positive examples) and n_e = O(|X|). Non-uniform weights are given to each example, depending on the constraint to be tackled. The common point to all reductions is that the weights of all examples n_i (resp. all p_{ij}) are equal, to w− (resp. w+). In each reduction, examples and clauses satisfy:
H1: p2 is forced to be badly classified. H2: n'1 is always badly classified. H3: w(n'2) ensures that n'2 is always given the right class, forcing any clause to contain the literal t(·) (when we remove n'2, we ensure that p2 is removed too).
Lemma 2. Any clause containing the literal g(·) can be removed.
Proof. Suppose that one clause contains g(·). Then it can be θ-subsumed by n'1 and by no other example (even if n'2 exists, because of H3); but n'1 θ-subsumes any clause and also the empty clause. Therefore, removing the clause does not modify the value of any criterion based on the examples' weights. Concerning the sixth constraint, the fraction of predicates used after removing the clause is at most the one before; thus, if the clause is an element of H_6(ρ) before, it is still an element after. The same remark holds for the seventh constraint.
As a consequence, p1 is always given the positive class (even by the empty clause!). We now give a general outline of the proof for Problem 1; reductions are similar for the other problems. Given h = {h1, h2, …, h_l} a set of Horn clauses, we define the set I = {i ∈ {1, 2, …, |X|} | ∃j ∈ {1, 2, …, l}: a_i(·) ∉ h_j}, and we fix |I| = k'. In our proofs, we define two functions taking rational values, E(k') and F_a(k') (∀ k' ∈ {1, 2, …, |X|}, ∀ a ∈ {1, 2, 3, 4, 5, 6, 7}). They are chosen such that:
– E(k') is strictly increasing, Σ_{x∈S+∪S− : h(x)≠c(x)} w(x) ≥ E(k'), and E(k) = γ.
– F_a(k') is strictly decreasing, is a lower bound of the function defining H_{(D,ρ),a}, and F_a(k) = ρ (except for a = 3, whose fixed value differs).
∀ a ∈ {1, 2, 3, 4, 5, 6, 7}, if there exists an unbounded set of Horn clauses h ∈ H_{(D,ρ),a} satisfying Σ_{(x∈S+∧h(x)=0)∨(x∈S−∧h(x)=1)} w(x) ≤ γ, its error rate implies k' ≥ k and the constraint implies k' ≤ k. So |I| = k' = k. The interest of the weights is then to force C(k,2) positive examples from the set {p_{ij}}_{(i,j)∈E} to be well classified, while we ensure the misclassification of at most k negative examples of the set {n_i}_{i∈{1,2,…,|X|}}. It comes that these C(k,2) examples correspond to the C(k,2) edges linking the |I| = k vertices corresponding to the negative examples badly classified. We therefore dispose of a clique of size k. Conversely, ∀ a ∈ {1, 2, 3, 4, 5, 6, 7}, given some clique of size k whose set of vertices is denoted I, we show that the singleton h = {q(X) ← ∧_{i∈{1,2,…,|X|}\I} a_i(X) ∧ t(X)} is in H_{(D,ρ),a}, satisfying Σ_{(x∈S+∧h(x)=0)∨(x∈S−∧h(x)=1)} w(x) ≤ γ. In this case, nopt_{H_{(D,ρ),a}}(c) drops down to O(n_e).
All distributions used in theorems 1 and 3 are such that w+ < w−/|X|, at least for graphs exceeding a fixed constant size. Also, due to the negative examples of weights w−, if we remove the additional constraints and optimize the accuracy alone, we can suppose that the optimal Horn clause is a singleton: merging all clauses by keeping, over the predicates a_j(·), only those present in all clauses does not decrease the accuracy. Under such a distribution, the optimal Horn clause necessarily contains all predicates a_j(·), and the problem becomes trivial. The distribution in theorem 2 satisfies w+ = w−. This is also a simple distribution for the accuracy's optimization alone: indeed, the optimal Horn clause over the predicates a_j(·) is such that it contains no predicate a_j(·) that does not appear in at least one positive example. If the graph instance of Clique is connected (and we can suppose so, otherwise the problem boils down to finding the largest clique in one of the connected components), then the optimal Horn clause does not contain any of the a_j(·).
6 Appendix 2: Proofs of Negative Results

6.1 Proof of Point [1], Theorem 1
Weights of positive examples: w(p2) = (1 + |X|²·w−·(1 + ρ)) / (2(1 − ρ)); ∀(i,j) ∈ E, w(p_{ij}) = w+ and w(p1) are set to explicit rational functions of |X|, |E|, k, ρ and w−. Weights of negative examples: w(n'2) = 1/2; ∀i ∈ {1, 2, …, |X|}, w(n_i) = w− = 1/(|X|²·|E|²); w(n'1) is a rational function of |X|, k, ρ and w−. Fix γ = w(p2) + w(n'1) + k·w− + (|E| − C(k,2))·w+ (note that w(n'2) ensures that n'2 is given the right class), and k_max = 1 + max{2 ≤ k'' ≤ |X| : |E| − C(k'',2) ≥ 0}. From the choice of weights, lcm({d_x}_{x∈S+∪S−}) = O(|X|⁸) (lcm is the least common multiple), which is polynomial. Define the functions:
∀ 0 ≤ k' ≤ k_max: E(k') = (|E| − C(k',2))·w+ + k'·w− + w(p2) + w(n'1);
∀ k_max < k' ≤ |X|: E(k') = |E|·w+ + k'·w− + w(p2) + w(n'1).
From the choice of weights, E(k) = γ. Similarly:
∀ 0 ≤ k' ≤ 1: F1(k') = (|E|·w+ − k'·w− + w(p2) − w(n'1)) / q;
∀ 2 ≤ k' ≤ k_max: F1(k') = ((|E| − C(k',2))·w+ − k'·w− + w(p2) − w(n'1)) / q;
∀ k_max < k' ≤ |X|: F1(k') = (−k'·w− + w(p2) − w(n'1)) / q,
with q = |E|·w+ + k'·w− + w(p2) + w(n'1). From the choice of weights, F1(k) = ρ.
The expression obtained when k' ≤ k_max takes its maximum for integer values when k' = (|X| + k)²/2 + 0.5 > |X|; furthermore, the choice of weights leads to E(k_max − 1) < E(k_max). In a more general way, E(k') is strictly increasing over the natural integers. Now remark that the numerator of F1(k') is strictly decreasing, and its denominator strictly increasing; therefore F1(k') is strictly decreasing. Furthermore, the constraint function of H_{{w},1}(ρ) is lower-bounded by F1(k'). If h ∈ H_{{w},1}(ρ) satisfies Σ_{h(x)≠c(x)} w(x) ≤ γ, the error rate implies k' ≥ k and the constraint implies k' ≤ k. Thus |I| = k' = k. As pointed out in the preceding appendix, this leads to the existence of a clique of size k. Reciprocally, the Horn clause h constructed in Appendix 1 satisfies both relations h ∈ H_{{w},1}(ρ) and Σ_{h(x)≠c(x)} w(x) ≤ γ. Indeed, we have Σ_{h(x)≠1=c(x)} w(x) = (|E| − C(k,2))·w+ + w(p2), but also Σ_{h(x)≠0=c(x)} w(x) = k·w− + w(n'1). Therefore the constraint function equals F1(k) = ρ, and h ∈ H_{{w},1}(ρ). We also have Σ_{h(x)≠c(x)} w(x) = E(k) = γ. The reduction is achieved. We end by remarking that |opt_{H,w}(c) − opt_{H_{{w},1}(ρ)}(c)| ≤ (|E|·w+ + |X|·w−)·(1 − ρ)/(1 + ρ), which is o(1) (as |X| or |E| grows), as claimed in subsection 3.2.
6.2 Proof Sketch of Theorem 3
Remark that TPR·(1 − FPR) = TPR·TNR. Weights are as follows for positive examples (we do not use p2): w(p1) and, ∀(i,j) ∈ E, w(p_{ij}) = w+, are set to explicit rational functions of |X|, k, γ and w−. Weights are as follows for negative examples (we do not use n'2): ∀i ∈ {1, 2, …, |X|}, w(n_i) = w− = 1/(|X| + k); w(n'1) = 1 − |E|·w+ − |X|·w− − w(p1). The choice of γ_max comes from the necessity of keeping weights within correct limits. We explain how to prove the existence of a clique, by describing a polynomial of degree 3, F(k'), which upper-bounds TPR·TNR, and of course has the desirable property of having its maximum for k' = k, with value γ, and with no other equal or greater values on the interval [0, |X|]. Similarly to the other proofs, the value γ can only be reached when k' = k represents k "holes" among the predicates a_j(·), and this induces a size-k clique in the graph. Define the function F(k') as follows:
∀ 0 ≤ k' ≤ 1: F(k') = w(p1)·(|X| − k')·w−;
∀ 2 ≤ k' ≤ k_max: F(k') = (C(k',2)·w+ + w(p1))·(|X| − k')·w−;
∀ k_max < k' ≤ |X|: F(k') = (|E|·w+ + w(p1))·(|X| − k')·w−.
With our choice of weights, and inside the values of k' for which we described k (clearly, in the second branch of F(k')), F describes a polynomial of degree 3, shown in figure 2. F upper-bounds TPR·TNR of any set of Horn clauses, and the demand on TPR·TNR leads to a single favorable case: the holes inside the set of Horn clauses describe a clique of size k' = k in the graph.

Fig. 2. Scheme of F(k') = (C(k',2)·w+ + w(p1))·(|X| − k')·w−.
A Method of Similarity-Driven Knowledge Revision for Type Specializations

Nobuhiro Morita, Makoto Haraguchi, and Yoshiaki Okubo

Division of Electronics and Information Engineering, Hokkaido University, N-13 W-8, Sapporo 060-8628, JAPAN
{morita,makoto,yoshiaki}@db-ei.eng.hokudai.ac.jp
Abstract. This paper proposes a new framework of knowledge revision, called Similarity-Driven Knowledge Revision. Our revision is invoked based on a similarity observation by users and is intended to match with the observation. Particularly, we are concerned with a revision strategy according to which an inadequate variable typing in describing an object-oriented knowledge base is revised by specializing the typing to a more specific one without loss of the original inference power. To realize it, we introduce a notion of extended sorts that can be viewed as concepts not appearing explicitly in the original knowledge base. If a variable typing with some sort is considered over-general, the typing is modified by replacing it with a more specific extended sort. Such an extended sort can efficiently be identified by forward reasoning with SOL-deduction from the original knowledge base. Some experimental results show the use of SOL-deduction can drastically improve the computational efficiency.
1 Introduction
This paper proposes a novel method of knowledge revision (KR), called Similarity-Driven Knowledge Revision. In traditional KR methods previously proposed (e.g. [1]), the revision process is triggered by some examples logically inconsistent with an original knowledge base. Then the knowledge base should be revised so that its extension (defined by the set of facts derived from it) becomes consistent with the examples. As a result, the inconsistency in the knowledge base will be removed. Although we can have several revision methods to remove the conflict as a logical inconsistency, a minimal revision principle is especially preferred. Here the principle implies a strategy for selecting a new knowledge base that has a minimal extension consistent with the examples. Thus, both the notion of conflicts in knowledge bases and the basic revision strategy have been investigated in terms of extensions. However, even if we find no extensional conflict in our knowledge base, we might need to revise it from a non-extensional viewpoint. We intend to utilize the revision from the non-extensional viewpoint to make our revision faster in the presence of extensional conflicts, and to do revision even for cases not involving them.

O. Watanabe, T. Yokomori (Eds.): ALT'99, LNAI 1720, pp. 194–205, 1999.
© Springer-Verlag Berlin Heidelberg 1999
Fig. 1. Extensional Relationship between male and female
As a first approach to such a non-extensional revision, this paper considers a conflict caused by an over-general typing of variables in an object-oriented knowledge base. When the knowledge base contains an over-general typing of some variable, then it deduces negative facts that will eventually be observed. However, this is not the only way to find the over-general typing. This paper proposes to use a notion of similarities between types to find the inappropriateness, and tries to present a revision method based on it. As a type becomes more general, we have more chances to observe properties and rules shared with another type. Thus the possible class of similarities between types will contain a similarity that does not meet a user's intention. Suppose we have a method for finding a similarity for a given knowledge base. There exist two cases where the similarity does not fit a user's intention: 1) it shows the similarity between types s1 and s2, while the user considers they are not similar, and 2) it shows the dissimilarity between s1 and s2, while the user considers they are similar. This paper is especially concerned with a knowledge revision in the former case, assuming GDA, Goal-Dependent Abstraction, as an algorithm to find the similarities between types under an order-sorted logic, where the types are called sorts in the logic. Given a knowledge base KB and a goal G to be proved, GDA detects an appropriate similarity for G between sorts s1 and s2 reflected in KB if the same property relevant to G is shared by both s1 and s2. For example, suppose we have the knowledge "if a female has the property has_long_hair, then she has the property takes_long_time_shower" and "if a male has the property has_long_hair, then he has the property takes_long_time_shower". It can be represented as the following order-sorted clauses:

takes_long_time_shower(X : female) ← has_long_hair(X : female) and
takes_long_time_shower(X : male) ← has_long_hair(X : male).
In these clauses, the description X : s is called a variable typing and means that the range of the value is restricted to an object (constant) belonging to the sort s. For the knowledge, GDA detects a similarity between female and male with respect to the property takes_long_time_shower. However, this similarity seems not to fit our intuition very well. Such an unfitness would come from the fact that has_long_hair would be considered as a feasible feature for female, while not so for male, as illustrated in Fig. 1. This implies that the variable typing X : male is inappropriate in the sense of being over-general. We consider such a wrong typing as an intensional conflict in the original knowledge base and try to resolve it. The wrong typing X : male is specialized (revised) by finding a new sort concept s' that is a subconcept of male and has the concept male_with_long_hair
as its subconcept, as shown in Fig. 1. In a word, such a new s' is an imaginable subconcept of male that has the property has_long_hair as its feasible feature, like female has. For the revised knowledge base, GDA detects a similarity between s' and female that seems to fit our intuition. Thus, our revision is triggered when we recognize an unfitness of a detected similarity. Therefore, we call it Similarity-Driven Knowledge Revision. A newly introduced imaginable concept is defined as an extended sort. An extended sort denotes a concept that does not appear explicitly in the original knowledge base. If a variable typing X : s is considered over-general, the typing is modified by replacing the general sort s with a more specific extended sort s'. In order for the extended sort to be meaningful, we present four conditions to be satisfied. It is theoretically shown that such an extended sort can be identified by forward reasoning from the original knowledge base. Especially, forward reasoning with SOL-deduction is adopted for efficient computation. Some experimental results show that the use of SOL-deduction can drastically improve the computational efficiency.
2 Preliminaries
In this section, we introduce some fundamental terminologies and definitions¹. Our knowledge base consists of the following three components.
Sort Hierarchy: A sort hierarchy is given as a partially ordered set of sort symbols, H = (S, ≤). Each sort s ∈ S denotes a sort concept and is interpreted as the set of possible instances of the concept. The relation s ≤ s' means that s is a subconcept of s'.
Type Declaration: A type declaration, TD, is a finite set of typed constants of the form c1 : s1, …, cn : sn, where ci is a constant symbol denoting an object and si is a sort in S. A typed constant ci : si declares that the object denoted by ci primarily belongs to the sort concept si, where si is called the primary type (or simply, type) of ci, which is referred to as [ci]. The extension of a sort is defined as follows:
Definition 1 (Extension of Sort). Let s be a sort. The extension of s, As, is defined as As = { c | [c] ≤ s }. We say that c belongs to s if c ∈ As.
From the interpretation, it is obvious that if s ≤ s' and [c] = s, then c belongs to s' as well as to s.
Domain Theory: A domain theory (or simply, theory), T, is a set of function-free Horn clauses of the form A ← B1 ∧ … ∧ Bk, where A and Bi are positive literals of the form p(t1, …, tm) and each term ti is a typed variable Xi : si or a typed constant. A typed variable Xi : si means that the range of values is restricted to instances belonging to si. On the other hand, we assume that a predicate p can take objects of any sort concept as its arguments.
¹ Throughout this paper, we assume that the notions with respect to First-Order Logic are familiar to the reader.
The inference under our knowledge base is the same as standard first-order deduction except for the following point: every substitution θ = { Xi : si ← ti } should satisfy the type constraint that [ti] ≤ si. That is, each variable must be instantiated to a variable or a constant of a more specific sort concept².
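The type constraint on substitutions can be sketched as follows; the sort hierarchy, constants and their primary types below are our own illustrative assumptions, not taken from the paper:

```python
# Sort hierarchy as a set of direct subsort links; s <= s' is its
# reflexive-transitive closure.
SUBSORT = {("father", "parent"), ("mother", "parent"), ("parent", "person")}

def is_subsort(s, s_prime):
    if s == s_prime:
        return True
    return any(a == s and is_subsort(b, s_prime) for a, b in SUBSORT)

# Primary types of constants ([c] in the text).
TYPE = {"john": "father", "mary": "mother", "rex": "dog"}

def respects_constraint(theta):
    """theta maps (variable, sort) pairs to constants; check [t] <= s for each binding."""
    return all(is_subsort(TYPE[t], s) for (_, s), t in theta.items())

assert respects_constraint({("X", "person"): "john"})     # [john] = father <= person
assert not respects_constraint({("X", "parent"): "rex"})  # [rex] = dog, not a parent
```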
3 Goal-Dependent Abstraction
In this section, we briefly present the framework of Goal-Dependent Abstraction that is used as the basis of our similarity-finding algorithm GDA.
3.1 Abstraction Based on Sort Mapping
We first introduce a framework of abstraction based on sort mapping, an extended version of the framework of abstraction based on predicate mapping [3].
Definition 2 (Sort Mapping). Let (S, ≤) be a sort hierarchy. A sort mapping φ is an onto mapping φ : S → S', where S' is an abstract sort set such that S ∩ S' = ∅.
A sort mapping can easily be extended to a mapping over a set of any expressions. For an expression E, φ(E) is defined as the expression obtained by mapping the sort symbols in E under φ. For any expressions E1 and E2, if φ(E1) = φ(E2) = E', then E1 and E2 are said to be similar and to be instantiations of E' under φ. Thus, a sort mapping can be viewed as a representation of similarity between sorts. Therefore, we often use the term similarity as a synonym for sort mapping.
Definition 3 (Theory Abstraction Based on Sort Mapping). Let KB = <(S, ≤), TD, T> be an order-sorted (concrete) knowledge base and φ : S → S' be an onto sort mapping. The abstract domain theory of T based on φ, SortAbs_φ(T), is defined as SortAbs_φ(T) = { C' | φ⁻¹(C') ⊆ T }, where for an expression E', φ⁻¹(E') is defined as φ⁻¹(E') = { E | φ(E) = E' }.
For example, assume we have a domain theory consisting of three clauses (facts) p(X : s1), p(X : s2) and q(X : s1). Based on a similarity φ such that φ(s1) = φ(s2) = s', the first two clauses can be preserved as an abstract clause p(X : s'), while the last one cannot be so. Thus, only properties p shared among all similar sorts are preserved by the abstraction process. Based on this interesting characteristic, a dynamic abstraction framework, called Goal-Dependent Abstraction, has been proposed [4,5].
² We consider in this paper only substitutions satisfying this constraint.
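The abstraction of Definition 3 can be sketched on the three-fact example above; the encoding of facts as (predicate, sort) pairs is our own simplification for unary predicates:

```python
# Concrete theory: facts p(X:s1), p(X:s2), q(X:s1), encoded as (pred, sort).
T = {("p", "s1"), ("p", "s2"), ("q", "s1")}
phi = {"s1": "sp", "s2": "sp"}   # sort mapping: s1, s2 -> abstract sort sp

def sort_abs(theory, phi):
    """Keep an abstract fact only if *all* its preimages are in the theory."""
    abstract = {(pred, phi[s]) for pred, s in theory}
    kept = set()
    for pred, s_abs in abstract:
        preimages = {(pred, s) for s in phi if phi[s] == s_abs}
        if preimages <= theory:
            kept.add((pred, s_abs))
    return kept

# p is shared by s1 and s2 and survives; q holds only for s1 and is dropped.
assert sort_abs(T, phi) == {("p", "sp")}
```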
3.2 Appropriate Similarity for Goal
When we try to prove a goal from our knowledge base, the (necessary) knowledge used is completely dependent on the goal. Observing which properties (knowledge) can be preserved in the abstraction process based on a similarity, we define an appropriateness of similarity with respect to a given goal.
Definition 4 (Appropriateness of Similarity for Goal). Let T be a theory, φ a similarity and G a goal. Assume that G can be proved from T and Proof(G) is the set of clauses in T that are used in a proof of G. The similarity φ is said to be appropriate for G if φ(Proof(G)) ⊆ SortAbs_φ(T).
If a similarity φ is appropriate for a goal G, it is implied that the properties appearing in Proof(G) (that is, the properties relevant to G) are shared among all similar sorts defined by φ³. Since such an appropriate similarity depends on the goal we try to prove, we can realize a goal-dependent aspect of abstraction based on an appropriate similarity for the given goal. Given a knowledge base and a goal to be proved, the algorithm GDA detects an appropriate similarity for the goal. Discussing GDA in more detail is beyond the scope of this paper. Its precise description can be found in the literature [4,5]. In our knowledge revision, GDA is used to detect a similarity reflected in our knowledge base. Our revision process is invoked when the detected similarity does not fit our intuition.
4 Similarity-Driven Knowledge Revision for Type Specialization
In this section, we propose a novel method of knowledge revision, called Similarity-Driven Knowledge Revision. Before giving formal descriptions, we present some assumptions imposed in the following discussion. Our knowledge revision is triggered by an observation of an undesirable similarity that is reflected in a given original knowledge base. For a knowledge base and a goal G, assume GDA detects a similarity φ such that s1 ∼ s2. For the similarity, assume that a user considers they are not similar, because the user recognizes some difference between s1 and s2 with respect to the plausibility of a property p relevant to G, as illustrated in the first section. In this case, our knowledge base is to be revised by specializing a variable typing with s1 or s2 that is considered too general with respect to the property p. However, since such a recognition of over-generality seems to depend highly on the user's subjectivity, our revision system would not be able to decide itself which variable typing is to be specialized. Therefore, we assume that a variable typing to be specialized is given to the system by the user as an input. In our knowledge base, the occurrence of a given variable typing to be specialized might not be identified uniquely. In that case, it would be desired to
³ Briefly speaking, in terms of the EBL method [6], this means that for any similar sort, the goal G can be explained in the same way (that is, with the same explanation structure).
A Method of Similarity-Driven Knowledge Revision for Type Specializations
adequately specialize all of the occurrences. Although one might consider that a naive way is to specialize them individually, the result of one specialization process deeply affects subsequent specializations. It is not easy to obtain a good characterization of such a complicated revision at the present time. In this paper, therefore, we deal only with the case where the variable typing to be specialized can be uniquely identified.
4.1
Extended Sort
We first introduce the notion of an extended sort, which is a core concept in our revision. In our method, a modification of the type s of a variable is realized by finding a new sort s′ that is more specific than s. Such a newly introduced sort is precisely defined as an extended sort.

Definition 5 (Extended Sort). Let us consider a conjunction of atoms, es, and focus on a typed variable X : s in es. Then es is called an extended sort with the root variable X. The root variable is referred to as Root(es).

An extended sort is interpreted as the set of possible instances of its root variable, as defined below. In the definition, a Herbrand model of a theory T is a subset M of the Herbrand base B = { p(a1, ..., an) | p is a predicate symbol and each ai is a typed constant symbol } which satisfies the condition that for any clause A ← Bs in T and any ground substitution θ, Aθ ∈ M whenever Bsθ ⊆ M.
Definition 6 (Extension of Extended Sort). Let T be a theory and M be a Herbrand model of T. Consider an extended sort es such that Root(es) = X. The extension of es with respect to M, EM(es), is defined as

EM(es) = { Xθ | θ is a ground substitution to es such that esθ ⊆ M }.

For a constant c, if c ∈ EM(es), then it is said that c belongs to es.
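As a toy illustration of Definition 6, the extension EM(es) over a finite Herbrand model can be computed by brute-force enumeration of ground substitutions. The model, predicates and constants below are invented for illustration; a real order-sorted reasoner would also check the sorts of constants:

```python
from itertools import product

def extension(es, root, model, constants):
    """Brute-force E_M(es): collect the root variable's value under every
    ground substitution theta that makes es.theta true in the model M.
    An atom is a tuple (pred, term, ...); variables start uppercase."""
    variables = sorted({t for (_, *args) in es for t in args if t[0].isupper()})
    ext = set()
    for values in product(constants, repeat=len(variables)):
        theta = dict(zip(variables, values))
        ground = {(p, *[theta.get(t, t) for t in args]) for (p, *args) in es}
        if ground <= model:                      # es.theta holds in M
            ext.add(theta[root])
    return ext

# Invented toy model: ann has a child bob, and bob likes the dog rex.
M = {("has_child", "ann", "bob"), ("likes", "bob", "rex"), ("dog", "rex")}
consts = ["ann", "bob", "rex"]

# es = has_child(X, Y) & likes(Y, Z) & dog(Z), with root variable X.
es = [("has_child", "X", "Y"), ("likes", "Y", "Z"), ("dog", "Z")]
print(extension(es, "X", M, consts))             # {'ann'}

# The extension-based ordering of Definition 7, checked on this one model:
es2 = [("has_child", "X", "Y")]
print(extension(es, "X", M, consts) <= extension(es2, "X", M, consts))  # True
```

Definition 7 quantifies over all Herbrand models of T; the sketch only checks inclusion on one given model.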
For example, es = has_child(X : person, Y : person) ∧ likes(Y, Z : dog) with the root variable X is interpreted as the set of persons each of which has a child who likes a dog. Thus the variables other than the root variable are existentially quantified. An ordering on extended sorts can be introduced based on their extensions.

Definition 7 (Ordering on Extended Sorts). Let T be a theory and E be the set of all extended sorts. For any extended sorts es1 and es2 in E, es1 ⊑T es2 iff for any Herbrand model M of T, EM(es1) ⊆ EM(es2).

Let us consider an extended sort es1 = has_son(X : father, Y : boy) ∧ dislikes(Y, X), whose root variable is X. es1 means fathers each of which has a (young) son who dislikes his father. By the definition, es1 is a subsort of father and parent under a sort hierarchy involving father ⊑ parent and boy ⊑ young_person. As a more general concept placed between es1 and parent, we can suppose fathers each of which has a child who dislikes his/her parent, father or mother (possibly both
Nobuhiro Morita, Makoto Haraguchi, and Yoshiaki Okubo
of them). It will be expressed by an extended sort es2 = has_child(X : father, W : young_person) ∧ dislikes(W, Z : parent). It can be shown from the following knowledge that es1 is a special case of es2:

Domain Theory: We know has_child(X, Y) ← has_son(X, Y). Therefore es1 is a special case of es3 = has_child(X : father, Y : boy) ∧ dislikes(Y, X).

Sort Hierarchy: For the existentially quantified variables Z : parent and W : young_person in es2 to have their values, it suffices to have values in their subsorts father and boy, respectively. In addition, it is possible to identify Z with X (that is, Z and X are the same person). This is done by a substitution θ = { Z : parent → X : father, W : young_person → Y : boy }. Thus any instance of es3 turns out to be an instance of es2.

As a result, it is found that es1 is more specific than es2 (via es3). We can summarize this argument by the following proof-theoretic characterization of the ordering.

Theorem 1. Let T be a theory and es1 and es2 be extended sorts such that Root(es1) = Root(es2) = X. Consider a ground substitution σ to es1 that substitutes no constant appearing in T. Then, for a ground substitution θ to es2 such that Xσ = Xθ, es1 ⊑T es2 iff T ∪ {es1σ} ⊢ es2θ.

4.2
Revising Knowledge Base by Specializing Type of Variable
We describe here our specialization process for an over-general variable typing. For a knowledge base KB = ⟨H, TD, T⟩ and a goal G, let us assume that GDA detects a similarity ∼ such that s1 ∼ s2 and that a user considers s1 and s2 not to be similar. In addition, assume the user considers a variable typing by s1 to be over-general.

Let C = P ← ES be a clause in Proof(G), where ES is a conjunction of atoms and contains a typed variable X : s1. Here we consider ES as an extended sort whose root variable is X. For an expression E, E(t → t′) denotes the resultant expression obtained by replacing every occurrence of t in E with t′. Since GDA detects s1 ∼ s2, it is found from Definition 4 that the clause C(X:s1 → X:s2) = P(X:s1 → X:s2) ← ES(X:s1 → X:s2) is true (provable) under the original knowledge base. Nevertheless, this similarity does not fit the user's intuition. This unfitness would be caused in a situation such as the one illustrated in Fig. 1. In this example, the objects of s1 satisfying ES are a small part of s1, while those of s2 satisfying ES(X:s1 → X:s2) are most of s2. In a word, the objects of s2 satisfying ES(X:s1 → X:s2) would be considered typical, while those of s1 satisfying ES would not.

In order for GDA not to detect a similarity between them, we introduce a new sort s1′ which is a subsort of s1 and subsumes ES (refer to Fig. 1). Then the original clause C is modified into a new clause C(X:s1 → X:s1′) = P(X:s1 → X:s1′) ← ES(X:s1 → X:s1′). Furthermore, the newly introduced sort s1′ is adequately inserted into the original sort hierarchy, and the original type declaration is modified. As a result, for the revised knowledge base, GDA detects a similarity between s1′ and s2 instead of one between s1 and s2. It should be emphasized here that since s1′ subsumes the extended sort ES, the extension of the original knowledge base is not affected by this revision. That is, our revision can be considered minimal. Below we discuss our specialization process in more detail.
Identifying the Newly Introduced Sort: The newly introduced sort s1′ is precisely defined as an extended sort that is identified according to the following criterion.

Definition 8 (Appropriateness of Introduced Extended Sort). If an extended sort es satisfies the following conditions, then it is considered as an appropriate extended sort to be introduced:
C1: ES ⊑T es.
C2: For any answer substitution θ to es w.r.t. the knowledge base KB, there exists a term t such that X : s1 → t ∈ θ.
C3: es ∩ ES = ∅.
C4: For any es′ such that es′ ⊆ es and any selection of root variable in P and es′, P ⋢T es′.

By the condition C1, it is guaranteed that the revision causes no influence on the extension of the original knowledge base. C2 is required in order for s1′ to become a proper subsort of s1. C3 and C4 are imposed to remove redundant descriptions.

An appropriate extended sort is basically computed in a generate-and-test manner. According to Theorem 1, a forward reasoning from T ∪ {ESσ} is performed to obtain a candidate es such that T ∪ {ESσ} ⊢ esθ. Then the candidate is tested for its appropriateness based on the conditions C2, C3 and C4. If the candidate is verified to satisfy those conditions, the newly introduced sort s1′ is defined by es. Based on the definition of es, the user might assign an adequate sort symbol to s1′ if he/she prefers.

For an efficient computation of es, we present a useful theorem and propose to adopt a reasoning method, SOL-deduction [7].

Theorem 2. Let C = P ← ES be the clause in Proof(G) to be replaced in the revision, and es be an extended sort such that Root(es) = Root(ES) = X : s. If es satisfies C1, C2 and C4, then (T − {C}) ∪ {ESσ} ⊢ esθ and T − {C} ⊬ esθ, where σ is a ground substitution to ES that substitutes no constant appearing in T and θ is a ground substitution to es such that Xσ = Xθ.

From the theorem, it is sufficient for our candidate generation to perform a forward reasoning from (T − {C}) ∪ {ESσ}, instead of from T ∪ {ESσ}.
Since removing C from T will reduce the cost of the forward reasoning, we can expect an efficient candidate generation. In addition, the theorem says that any candidate derived from T − {C} alone is quite useless for obtaining an appropriate extended sort. That is, it is sufficient to have candidates that can newly be derived by adding ESσ to T − {C}. If we have a method by which we can efficiently obtain such candidates, the desired extended sorts can be obtained more efficiently. We can obtain such an efficient method with the help of SOL-deduction [7]. Briefly speaking, for a theory T and a clause C, by performing SOL-deduction from T ∪ {C}, we can directly obtain all clauses that can newly be derived by adding C to T. This characteristic of SOL-deduction is quite helpful for our task of candidate generation. However, it should be noted that we still have to test for appropriateness based on C2, C3 and C4.
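The point that only consequences *newly* derivable after adding ESσ matter can be illustrated with a naive ground-level sketch. SOL-deduction computes these new consequences directly; here we merely approximate the effect by comparing two forward-chaining closures over invented propositional facts:

```python
def closure(facts, rules):
    """Forward-chaining closure of ground Horn rules given as (body, head)."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for body, head in rules:
            if set(body) <= derived and head not in derived:
                derived.add(head)
                changed = True
    return derived

# Invented toy theory standing in for T - {C}:  p -> q,  and  q & r -> t.
rules = [(("p",), "q"), (("q", "r"), "t")]
background = {"r"}                        # facts already in T - {C}

old = closure(background, rules)          # consequences of T - {C} alone
new = closure(background | {"p"}, rules)  # after adding "p" (the ES-sigma role)
print(new - old)                          # only the newly derivable part matters
```

The difference `new - old` is exactly the candidate set the paper's method cares about; SOL-deduction avoids computing `old` at all.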
Inserting the Extended Sort into the Sort Hierarchy: After the computation of the newly introduced sort s1′ (defined as es), we have to modify the original sort hierarchy by inserting s1′ into an adequate position. Such a position is identified according to the following theorems.

Theorem 3. Let es be an extended sort such that Root(es) = X : s. For any Herbrand model M, EM(es) ⊆ As.

The theorem says that the new sort s1′ should be a subsort of s1 in the resultant hierarchy.

Theorem 4. Let es be an extended sort such that Root(es) = X : s. If there exists an answer substitution θ to es with respect to KB such that Xθ = Y : t (where Y is a new variable), then for any Herbrand model M, At ⊆ EM(es).

Let Θ be the set of answer substitutions to es. Consider the set of sorts Subes = { t | θ ∈ Θ and X : s1 → Y : t appears in θ }. Theorem 4 says that s1′ should have every sort in Subes as its subsort in the resultant hierarchy.

Modifying the Type Declaration: As well as the modification of the sort hierarchy, a modification of the type declaration might be needed. The modification has to be performed according to the condition that if c ∈ EMT(es), then c belongs to s1′, where MT is the Least Herbrand Model of T. The next theorem shows how to modify the original type declaration.

Theorem 5. Let es be an extended sort such that Root(es) = X : s. The next two statements are equivalent:
- c ∈ EMT(es).
- There exists an answer substitution θ to es with respect to KB such that Xθ = c, or Xθ = Y : t (where Y is a new variable) and c ∈ At.

In the case where an answer substitution θ such that Xθ = Y : t is obtained, no modification of the type declaration is necessary. In this case, since the original sort hierarchy is modified and t becomes a subsort of s1′ as the result, c ∈ At obviously belongs to s1′. In the case where an answer substitution θ such that Xθ = c is obtained, let Θ be the set of answer substitutions to es, and consider the set of constants Instes defined as Instes = { c | θ ∈ Θ and X : s1 → c appears in θ }.
From Theorem 5, for any c ∈ Instes, we have to modify the original type declaration TD into TD′, based on which it is implied that c belongs to s1′. Let us assume that for a constant c ∈ Instes, c : t ∈ TD. We have to take two cases into account. In the first case, where t = s1, the original type s1 of c is simply replaced with s1′. In the second case, where t ≠ s1, the type of c is replaced with a sort sc that corresponds to a common subsort of t and s1′, since the type of a constant should be unique. As such an sc, it might be necessary to introduce a quite new sort, and then the sort hierarchy might be modified following the introduction. Our revision process is summarized as an algorithm in Fig. 2.
Input:
A knowledge base KB = ⟨H, TD, T⟩. Proof(G): the set of clauses that is used to prove a goal G from KB. A sort s to be specialized.
Output: A revised knowledge base KB′ = ⟨H′, TD′, T′⟩.
1. Extract a clause C = P ← ES from Proof(G) such that a variable typed with s appears in its body, and set the variable X typed with s to the root variable of ES.
2. Derive es from (T − {C}) ∪ {ESσ} by a forward reasoning with SOL-deduction, where σ is a ground substitution to ES that substitutes no constant appearing in KB. Then make sure that es satisfies the appropriateness conditions C2, C3 and C4. If such an es cannot be found, terminate with failure.
3. Inform that a new sort s′ to be introduced can be defined as es.
4. Revise T to T′ by replacing the clause C in T with C(X:s → X:s′) = P(X:s → X:s′) ← ES(X:s → X:s′).
5. Modify H into H′ by inserting the new sort s′ into an adequate position.
6. Modify TD into TD′ by redeclaring with s′, or with a quite new sort s′′ if necessary. In the latter case, modify H′ following the introduction of s′′.
7. Output the new knowledge base KB′ = ⟨H′, TD′, T′⟩ and terminate.
Fig. 2. Algorithm for Knowledge Revision by Type Specialization

[Fig. 3 shows an order-sorted knowledge base. Its sort hierarchy H contains cup with subsorts aluminum_cup, plastic_cup and ceramic_cup; can with subsorts steel_can and aluminum_can; and material with subsorts metal, plastic and ceramic, where metal in turn has subsorts aluminum and steel. The type declaration TD declares, e.g., aluminum : metal.]

Domain Theory T:
throw_away_on_friday(X) ← incombustible(X)
throw_away_on_tuesday(X) ← combustible(X)
incombustible(X : cup) ← made_of(X : cup, Y : metal)
incombustible(X : can) ← made_of(X : can, Y : metal)
has_handle(X : cup) ← made_of(X : cup, Y : metal)
made_of(X : aluminum_cup, aluminum : metal)
made_of(X : plastic_cup, plastic : material)
made_of(X : ceramic_cup, ceramic : material)
made_of(X : steel_can, steel : metal)
made_of(X : aluminum_can, aluminum : metal)

Fig. 3. Order-Sorted Knowledge Base
4.3
Example of Similarity-Driven Knowledge Revision
We illustrate here our revision process for the knowledge base shown in Fig. 3.⁴ For the knowledge base and a goal G = throw_away_on_friday(X), we obtain Proof(G) = { throw_away_on_friday(X) ← incombustible(X), incombustible(X : cup) ← made_of(X : cup, Y : metal), made_of(X : cup, aluminum : metal) }. And GDA detects a similarity ∼ such that cup ∼ can. Let us assume that, contrary to the similarity, a user considers that cup and can are not similar, and considers a variable typing with cup to be over-general.

A clause in Proof(G) whose body contains a variable typed with cup is C = incombustible(X : cup) ← made_of(X : cup, Y : metal), where made_of(X : cup, Y : metal) is considered as an extended sort ES whose root variable is
⁴ In the figure, the sort hierarchy and type declaration are given in a hierarchical form. A plain line denotes an "is a subsort of" relation, and a dotted line corresponds to a declaration of a constant's type. For example, it is declared that the type of the constant aluminum is the sort metal.
Table 1. Experimental Results

KB Size | R. Type | Num. of Atoms | Exec. Time (msec.)
     10 | FWR-T   | 4             | 80
     10 | FWR-TC  | 2             | 50
     10 | SOL-T   | 4             | 10
     10 | SOL-TC  | 2             | 0
     40 | FWR-T   | 6             | 1130
     40 | FWR-TC  | 2             | 1020
     40 | SOL-T   | 6             | 40
     40 | SOL-TC  | 2             | 20
X. According to Theorem 2, consider a ground instance ESσ of ES with a substitution σ = { X : cup → a, Y : metal → b }, where a and b are constants not appearing in the original knowledge base. By a forward reasoning from (T − {C}) ∪ { made_of(a, b) }, an atom has_handle(a) can be derived. As the next step, has_handle(X : cup) is tested for its appropriateness. The answer substitution to the goal has_handle(X : cup) is { X : cup → Y : aluminum_cup }, where Y is a new variable. Therefore, has_handle(X : cup) satisfies C2. It is obvious that it satisfies C3. Furthermore, since incombustible(X : cup) ⋢T has_handle(X : cup), C4 is satisfied as well. Therefore, it is verified that has_handle(X : cup) is an appropriate extended sort.

Let s′ be a new sort defined by the extended sort has_handle(X : cup). The sort s′ is then put into an adequate position in the original sort hierarchy. It is easily found that s′ becomes a subsort of cup. Moreover, since s′ subsumes the extended sort ES = made_of(X : cup, Y : metal), only aluminum_cup can become a subsort of s′. The original type declaration does not need to be modified in this revision. Finally, replace the original clause C with C(X:cup → X:s′) = incombustible(X : s′) ← made_of(X : s′, Y : metal). The extension of the newly introduced sort s′ is defined as that of has_handle(X : cup). Some sort symbol meaning "cup with handle" might be assigned to s′.
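The forward-reasoning step of this example can be replayed with a tiny ground-level sketch. It is a naive stand-in for the SOL-deduction step; the predicates follow Fig. 3, but the ground instantiation for the fresh constants a and b is my own:

```python
def forward(facts, rules):
    """Naive forward chaining over ground Horn rules given as (body, head)."""
    derived = set(facts)
    while True:
        new = {h for body, h in rules if set(body) <= derived} - derived
        if not new:
            return derived
        derived |= new

# Ground instances, for the fresh constants a : cup and b : metal, of the
# Fig. 3 theory *without* the clause C being revised
# (incombustible(X : cup) <- made_of(X : cup, Y : metal)):
rules_minus_C = [
    ({"incombustible(a)"}, "throw_away_on_friday(a)"),
    ({"combustible(a)"}, "throw_away_on_tuesday(a)"),
    ({"made_of(a,b)"}, "has_handle(a)"),
]

derived = forward({"made_of(a,b)"}, rules_minus_C)
print("has_handle(a)" in derived)        # True: the candidate atom is derived
print("incombustible(a)" in derived)     # False once C is removed
```

This mirrors the text: with C removed, the only atom newly derivable from made_of(a, b) is has_handle(a), which then becomes the candidate extended sort.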
5
Experimental Results
We have implemented a knowledge revision system based on our algorithm and carried out an experiment to verify its usefulness.⁵ We show the results here. As discussed previously, we have four ways to obtain an extended sort to be introduced: 1) by forward reasoning from T ∪ {ESσ}, 2) by forward reasoning from (T − {C}) ∪ {ESσ}, 3) by SOL-deduction from T ∪ {ESσ}, and 4) by SOL-deduction from (T − {C}) ∪ {ESσ}. In our experimental results, they are referred to as the reasoning types FWR-T, FWR-TC, SOL-T and SOL-TC, respectively. We provided two knowledge bases consisting of 10 clauses and 40 clauses. For each knowledge base and reasoning type, the number of atoms defining an extended sort and the execution time (msec.) were examined. Our experimental results are shown in Table 1.

The results show that the removal of C affects the quality of computed extended sorts. For both knowledge bases, some unnecessary atoms were derived
⁵ Our system has been written in SICStus Prolog and run on a SPARCstation 5.
by the reasonings from T. Moreover, such meaningless derivations undesirably affect the execution times. It is, therefore, considered that the removal of C is effective in improving both the efficiency and the quality of our knowledge revision. The experimental results also show that the use of SOL-deduction can drastically reduce the execution time. The reduction ratios tend to become large as the knowledge base grows. This implies that the use of SOL-deduction would be useful for improving the efficiency of our revision even in a real domain. We are currently considering a legal domain [4] as an attractive application field for our method.
6
Concluding Remarks
In this paper, we presented a novel framework of Similarity-Driven Knowledge Revision. Our revision process is invoked by an observation of an undesirable similarity reflected in the original knowledge base. The knowledge base is revised by specializing an over-general variable typing X : s into a more specific X : s′. The central task is to find an extended sort es by which s′ is precisely defined and which is an imaginable subsort of s. For the efficient computation of the extended sort, we presented an effective proof-theoretic property and proposed the use of SOL-deduction. A drastic improvement of the efficiency was empirically verified.

An important future work is to investigate this kind of knowledge revision in the case where several over-general variable typings are found. Further investigation would be needed to obtain a good characterization of this complicated revision. Contrary to the assumption on similarity observation in this paper, it would also be worth studying a revision method for the case where a dissimilarity between two sorts is detected by GDA while the user considers them similar. In this case, it might be necessary to add some new clauses to the original knowledge base as well as modifying variable typings.
References
1. S. Wrobel. Concept Formation and Knowledge Revision, Kluwer Academic Publishers, Netherlands, 1994.
2. C. Walther. Many-Sorted Unification, Journal of the Association for Computing Machinery, Vol. 35, No. 1, 1988, pp. 1-17.
3. D. J. Tenenberg. Abstracting First-Order Theories, Change of Representation and Inductive Bias (P. D. Benjamin, ed.), Kluwer Academic Publishers, USA, 1989, pp. 67-79.
4. T. Kakuta, M. Haraguchi and Y. Okubo. A Goal-Dependent Abstraction for Legal Reasoning by Analogy, Artificial Intelligence & Law, Vol. 5, Kluwer Academic Publishers, Netherlands, 1997, pp. 97-118.
5. Y. Okubo and M. Haraguchi. Constructing Predicate Mappings for Goal-Dependent Abstraction, Annals of Mathematics and Artificial Intelligence, Vol. 23, 1998, pp. 169-197.
6. T. Mitchell, R. Keller and S. Kedar-Cabelli. Explanation-Based Generalization: A Unifying View, Machine Learning, Vol. 1, 1986, pp. 47-80.
7. K. Inoue. Linear Resolution for Consequence-Finding, Artificial Intelligence, Vol. 56, No. 2-3, 1992, pp. 301-353.
PAC Learning with Nasty Noise

Nader H. Bshouty*, Nadav Eiron, and Eyal Kushilevitz

Computer Science Department, Technion, Haifa 32000, Israel.
{bshouty,nadav,eyalk}@cs.technion.ac.il
Abstract. We introduce a new model for learning in the presence of noise, which we call the Nasty Noise model. This model generalizes previously considered models of learning with noise. The learning process in this model, which is a variant of the PAC model, proceeds as follows: Suppose that the learning algorithm during its execution asks for m examples. The examples that the algorithm gets are generated by a nasty adversary that works according to the following steps. First, the adversary chooses m examples (independently) according to the fixed (but unknown to the learning algorithm) distribution D, as in the PAC model. Then the powerful adversary, upon seeing the specific m examples that were chosen (and using his knowledge of the target function, the distribution D and the learning algorithm), is allowed to remove a fraction of the examples at its choice, and replace these examples by the same number of arbitrary examples of its choice; the m modified examples are then given to the learning algorithm. The only restriction on the adversary is that the number of examples that the adversary is allowed to modify should be distributed according to a binomial distribution with parameters η (the noise rate) and m.

On the negative side, we prove that no algorithm can achieve accuracy better than 2η in learning any non-trivial class of functions. On the positive side, we show that a polynomial (in the usual parameters, and in 1/(ε − 2η)) number of examples suffices for learning any class of finite VC-dimension with accuracy ε > 2η. This algorithm may not be efficient; however, we also show that a fairly wide family of concept classes can be efficiently learned in the presence of nasty noise.
1
Introduction
Valiant's PAC model of learning [23] is one of the most important models for learning from examples. Although being an extremely elegant model, the PAC model has some drawbacks. In particular, it assumes that the learning algorithm has access to a perfect source of random examples. Namely, upon request, the learning algorithm can ask for random examples and in return gets pairs (x, ct(x)), where all the x's are points in the input space distributed identically and independently according to some fixed probability distribution D, and ct(x) is the correct classification of x according to the target function ct that the algorithm tries to learn. Since Valiant's seminal work, there were several attempts to relax these assumptions by introducing models of noise. The first such noise model, called the
* Some of this research was done while this author was at the Department of Computer Science, the University of Calgary, Canada.
O. Watanabe, T. Yokomori (Eds.): ALT'99, LNAI 1720, pp. 206-218, 1999.
© Springer-Verlag Berlin Heidelberg 1999
Random Classification Noise model, was introduced in [2] and was extensively studied, e.g., in [1,6,10,14,15,17]. In this model the adversary, before providing each example (x, ct(x)) to the learning algorithm, tosses a biased coin; whenever the coin shows H, which happens with probability η, the classification of the example is flipped, and so the algorithm is provided with the wrongly classified example (x, 1 − ct(x)). Another (stronger) model, called the Malicious Noise model, was introduced in [24], revisited in [18], and was further studied in [8,11,12,13,21]. In this model the adversary, whenever the η-biased coin shows H, can replace the example (x, ct(x)) by some arbitrary pair (x′, b), where x′ is any point in the input space and b is a boolean value. (Note that this in particular gives the adversary the power to distort the distribution D.)

In this work, we present a new model which we call the Nasty (Sample) Noise model. In this model, the adversary gets to see the whole sample of examples requested by the learning algorithm before giving it to the algorithm, and may then modify E of the examples, at its choice, where E is a random variable distributed by the binomial distribution with parameters η and m, where m is the size of the sample.¹ The modification applied by the adversary can be arbitrary (as in the Malicious Noise model).² Intuitively speaking, the new adversary is more powerful than the previous ones: it can examine the whole sample and then remove from it the most informative examples and replace them by less useful and even misleading examples (whereas in the Malicious Noise model, for instance, the adversary may also insert misleading examples into the sample but does not have the freedom to choose which examples to remove). The relationships between the various models are shown in Table 1.

                       | Random Noise-Location       | Adversarial Noise-Location
Label Noise Only       | Random Classification Noise | Nasty Classification Noise
Point and Label Noise  | Malicious Noise             | Nasty Sample Noise
Table 1. Summary of models for PAC-learning from noisy data

We argue that the newly introduced model not only generalizes the previous noise models, including variants such as Decatur's CAM model [13] and CPCN model [14], but also that in many real-world situations the assumptions previous models made about the noise seem insufficient. For example, when training data is the result of some physical experiment, noise may tend to be stronger in boundary areas rather than being uniformly distributed over all inputs. While special models were devised to describe this situation in the exact-learning setting (for example, the incomplete boundary query model of Blum et al. [5]), it may be regarded as a special case of Nasty Noise, where the adversary chooses to provide unreliable answers on sample points that are near the boundary of the target concept (or to remove such points from the sample). Another situation to which our model is related is the setting of Agnostic Learning. In this model, a
¹ This distribution makes the number of examples modified be the same as if it were determined by m independent tosses of an η-biased coin. However, we allow the adversary's choice to depend on the sample drawn.
² We also consider a weaker variant of this model, called the Nasty Classification Noise model, where the adversary may modify only the classification of the chosen points (as in the Random Classification Noise model).
concept class is not given. Instead, the learning algorithm needs to minimize the empirical error while using a hypothesis from a predefined hypotheses class (see, for example, [19] for a definition of the model). Assuming the best hypothesis classifies the input correctly up to an η fraction, we may alternatively see the problem as that of learning the hypotheses class under nasty noise of rate η. However, we note that the success criterion in the agnostic learning literature is different from the one used in our PAC-based setting.

We show two types of results. Sections 3 and 4 show information-theoretic results, and Sect. 5 shows algorithmic results. The first result, presented in Sect. 3, is a bound on the quality of learning possible with a nasty adversary. This result shows that no learning algorithm can learn any non-trivial concept class with accuracy less than 2η when the sample contains nasty noise of rate η. It is complemented by a matching positive result in Sect. 4 that shows that any class of finite VC-dimension can be learned by using a sample of polynomial size, with any accuracy ε > 2η. The size of the sample required is polynomial in the usual PAC parameters and in 1/Δ, where Δ = ε − 2η is the margin between the requested accuracy and the above-mentioned lower bound.

The main, quite surprising, result (presented in Sect. 5) is another positive result showing that efficient learning algorithms are still possible in spite of the powerful adversary. More specifically, we present a composition theorem (analogous to [3,8], but for the nasty-noise learning model) that shows that any concept class constructed by composing concept classes that are PAC-learnable from a hypothesis class of fixed VC-dimension is efficiently learnable when using a sample subject to nasty noise. This includes, for instance, the class of all concepts formed by any boolean combination of half-spaces in a constant-dimension Euclidean space. The complexity here is, again, polynomial in the usual parameters and in 1/Δ.
The algorithm used in the proof of this result is an adaptation to our model of the PAC algorithm presented in [8]. Our results may be compared to similar results available for the Malicious Noise model. For this model, Cesa-Bianchi et al. [11] show that the accuracy of learning with malicious noise is lower bounded by η/(1 − η). A matching algorithm for learning classes similar to those presented here with malicious noise is presented in [8]. As for the Random Classification Noise model, learning with arbitrarily small accuracy, even when the noise rate is close to a half, is possible. Again, the techniques presented in [8] may be used to learn the same type of classes we examine in this work with Random Classification Noise.
2
Preliminaries
In this section we provide basic definitions related to learning in the PAC model, with and without noise. A learning task is specified using a concept class, denoted C, of boolean concepts defined over an instance space, denoted X. A boolean concept c is a function c : X → {0, 1}. The concept class C is a set of boolean concepts: C ⊆ {0, 1}^X. Throughout this paper we sometimes treat a concept as a set of points instead of as a boolean function. The set that corresponds to a concept c is simply { x | c(x) = 1 }. We use c to denote both the function and the corresponding set interchangeably. Specifically, when a probability distribution D is defined over X, we use the notation D(c) to refer to the probability that a point x drawn from X according to D will have c(x) = 1.
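The set view of a concept, the notation D(c), and the disagreement probability used later in Definition 1 can be made concrete on a tiny finite instance space. The four points and their (dyadic, so exactly representable) probabilities below are invented for illustration:

```python
# A concept as a set of points, and D(c) on an invented instance space.
X = ["x1", "x2", "x3", "x4"]
D = {"x1": 0.5, "x2": 0.25, "x3": 0.125, "x4": 0.125}   # a distribution over X

ct = {"x1", "x2"}        # target concept, viewed as the set {x | ct(x) = 1}
h = {"x2", "x3"}         # some hypothesis

D_ct = sum(D[x] for x in ct)                            # D(ct)
error = sum(D[x] for x in X if (x in h) != (x in ct))   # Pr_D[h(x) != ct(x)]
print(D_ct, error)                                      # 0.75 0.625
```

The `error` quantity is exactly the probability a PAC learner must drive below ε.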
2.1
The Classical PAC Model
The Probably Approximately Correct (PAC) model was originally presented by Valiant [23]. In this model, the learning algorithm has access to an oracle PAC that returns on each call a labeled example (x, ct(x)), where x ∈ X is drawn (independently) according to a fixed distribution D over X, unknown to the learning algorithm, and ct ∈ C is the target function the learning algorithm should learn.

Definition 1. A class C of boolean functions is PAC-learnable using hypothesis class H in polynomial time if there exists an algorithm that, for any ct ∈ C, any 0 < ε < 1/2, 0 < δ < 1 and any distribution D on X, when given access to the PAC oracle, runs in time polynomial in log|X|, 1/ε, 1/δ and, with probability at least 1 − δ, outputs a function h ∈ H for which Pr_D[ct(x) ≠ h(x)] ≤ ε.
Models for Learnin in the Presence of Noise
Next, we define the model of PAC-learning in the presence of Nasty Sample Noise (NSN for short). In this model, a learning algorithm for the concept class C is given access to an (adversarial) oracle NSN_C(m). The learning algorithm is allowed to call this oracle once during a single run. The learning algorithm passes a single natural number m to the oracle, specifying the size of the sample it needs, and gets in return a labeled sample S ∈ (X × {0, 1})^m. (It is assumed, for simplicity, that the algorithm knows in advance the number of examples it needs; it is possible to extend the model to circumvent this problem.)

The sample required by the learning algorithm is constructed as follows: As in the PAC model, a distribution D over the instance space X is defined, and a target concept ct ∈ C is chosen. The adversary then draws a sample Sg of m points from X according to the distribution D. Having full knowledge of the learning algorithm, the target function ct, the distribution D, and the sample drawn, the adversary chooses E = E(Sg) points from the sample, where E(Sg) is a random variable. The E points chosen by the adversary are removed from the sample and replaced by any other E point-and-label pairs of the adversary's choice. The m − E points not chosen by the adversary remain unchanged and are labeled by their correct labels according to ct. The modified sample of m points, denoted S, is then given to the learning algorithm. The only limitation on the number of examples that the adversary may modify is that it should be distributed according to the binomial distribution with parameters m and η, namely:

Pr[E = n] = (m choose n) η^n (1 − η)^(m−n)

where the probability is taken by first choosing Sg according to D^m and then choosing E according to the corresponding random variable E(Sg).
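The oracle construction above can be sketched as a short simulation. The adversary strategy, function names, and the threshold target below are all invented for illustration; the model allows an adversary with full knowledge of the learner, which this sketch does not exercise:

```python
import random

def nsn_oracle(m, eta, draw, ct, adversary):
    """Sketch of NSN_C(m) (simplified): draw m labeled examples i.i.d.,
    pick E ~ Binomial(m, eta), then let the adversary -- which sees the
    whole sample -- replace E examples of its choice."""
    sample = [(x, ct(x)) for x in (draw() for _ in range(m))]
    E = sum(random.random() < eta for _ in range(m))   # Binomial(m, eta)
    return adversary(sample, E)

def adversary(sample, E):
    """An invented nasty adversary for a threshold target: replace the E
    points closest to the decision boundary by mislabeled copies."""
    sample = sorted(sample, key=lambda p: abs(p[0] - 0.5))
    return [(x, 1 - y) for x, y in sample[:E]] + sample[E:]

random.seed(0)
ct = lambda x: int(x >= 0.5)                   # a threshold concept on [0, 1]
S = nsn_oracle(20, 0.2, random.random, ct, adversary)
print(len(S))                                   # still m = 20 examples
```

Note how the adversary inspects the whole sample before choosing what to corrupt, which is exactly what distinguishes this model from Malicious Noise.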
An algorithm A is said to learn a class C with nasty sample noise of rate η ≥ 0 with accuracy parameter ε > 0 and confidence parameter δ < 1 if, given access to an oracle NSN_C(m), for any distribution D and any target ct ∈ C, it outputs a hypothesis h : X → {0,1} such that, with probability at least 1 − δ, the hypothesis satisfies PrD[h △ ct] ≤ ε.
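As a concrete illustration, a call to NSN_C(m) can be sketched in Python. The helper names, the toy target, and the label-flipping adversary are our own assumptions (a truly "nasty" adversary may choose its E points maliciously and replace the points themselves); only the binomial choice of E is mandated by the model.

```python
import random

def nsn_oracle(m, eta, draw_x, target, adversary):
    """Simulate one call to NSN_C(m): draw m labeled examples i.i.d.,
    then let the adversary corrupt E ~ Binomial(m, eta) of them."""
    sample = [(x, target(x)) for x in (draw_x() for _ in range(m))]
    # E is binomially distributed with parameters m and eta
    E = sum(random.random() < eta for _ in range(m))
    return adversary(sample, E)

def label_flipping_adversary(sample, E):
    # A weak NCN-style adversary: flip the labels of E random points.
    idx = random.sample(range(len(sample)), E)
    out = list(sample)
    for i in idx:
        x, y = out[i]
        out[i] = (x, 1 - y)
    return out
```

A stronger adversary would inspect the whole sample (and the learner) before picking its E points, as the model allows.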
Nader H. Bshouty, Nadav Eiron, and Eyal Kushilevitz
We are also interested in a restriction of this model, which we call the Nasty Classification Noise learning model (NCN for short). The only difference between the NCN and NSN models is that the NCN adversary is only allowed to modify the labels of the E chosen sample-points, but it cannot modify the E points themselves. Previous models of learning in the presence of noise can also be readily shown to be restrictions of the Nasty Sample Noise model: The Malicious Noise model corresponds to the Nasty Noise model with the adversary restricted to introducing noise into points that are chosen uniformly at random, with probability η, from the original sample. The Random Classification Noise model corresponds to the Nasty Classification Noise model with the adversary restricted so that noise is introduced into points chosen uniformly at random, with probability η, from the original sample, and each point that is chosen gets its label flipped.

2.3
VC Theory Basics
The VC-dimension [25] is widely used in learning theory to measure the complexity of concept classes. The VC-dimension of a class C, denoted VCdim(C), is the maximal integer d such that there exists a subset Y ⊆ X of size d for which all 2^d possible behaviors are present in the class C, and VCdim(C) = ∞ if such a subset exists for every natural d. It is well known (e.g., [4]) that, for any two classes C and H (over X) of VC-dimension d, the class of negations {X \ c | c ∈ C} has VC-dimension d, and the class of unions {c ∪ h | c ∈ C, h ∈ H} has VC-dimension at most 2 max{VCdim(C), VCdim(H)} + 1 = O(d). Following [3], we define the dual of a concept class:

Definition 3. The dual H⊥ ⊆ {0,1}^H of a class H ⊆ {0,1}^X is defined to be the set {x⊥ | x ∈ X}, where x⊥ is defined by x⊥(h) = h(x) for all h ∈ H.
If we view a concept class H as a boolean matrix M_H where each row represents a concept and each column a point from the instance space X, then the matrix corresponding to H⊥ is the transpose of M_H. The following claim, from [3], gives a tight bound on the VC-dimension of the dual class:

Claim 1: For every class H, VCdim(H) ≥ ⌊log VCdim(H⊥)⌋.

In the following discussion we limit ourselves to instance spaces X of finite cardinality. The main use we make of the VC-dimension is in constructing ε-nets. The following definition and theorem are from [7]:

Definition 4. A set of points Y ⊆ X is an ε-net for a concept class H ⊆ {0,1}^X under a distribution D over X if, for every h ∈ H such that D(h) ≥ ε, Y ∩ h ≠ ∅.

Theorem 1. For any class H ⊆ {0,1}^X of VC-dimension ≤ d, any distribution D over X, and any ε > 0, δ > 0, if m ≥ max{(4/ε) log(2/δ), (8d/ε) log(13/ε)} examples are drawn i.i.d. from X according to the distribution D, they constitute an ε-net for H with probability at least 1 − δ.

In [22], Talagrand proved a similar result:
PAC Learning with Nasty Noise
Definition 5. A set of points Y ⊆ X is an ε-sample for the concept class H ⊆ {0,1}^X under the distribution D over X if every h ∈ H satisfies |D(h) − |Y ∩ h|/|Y|| ≤ ε.

Theorem 2. There is a constant c1 such that, for any class H ⊆ {0,1}^X of VC-dimension ≤ d, any distribution D over X, and any ε > 0, δ > 0, if m ≥ (c1/ε²)(d + log(1/δ)) examples are drawn i.i.d. from X according to the distribution D, they constitute an ε-sample for H with probability at least 1 − δ.

2.4
Consistency Algorithms
Let P and N be subsets of points from X. We say that a function h : X → {0,1} is consistent on (P, N) if h(x) = 1 for every positive point x ∈ P and h(x) = 0 for every negative point x ∈ N. A consistency algorithm (see [8]) for a pair of classes (C, H) (both over the same instance space X) receives as input two subsets (P, N) of the instance space, runs in time t(|P|, |N|), and satisfies the following: if there is a function in C that is consistent with (P, N), the algorithm outputs YES and some h ∈ H that is consistent with (P, N); it outputs NO if no consistent h ∈ H exists (there is no restriction on the output in the case that there is a consistent function in H but not in C). Given a subset Q ⊆ X of points of the instance space, we will be interested in the set of all possible partitions of Q into positive and negative examples such that there is a function h ∈ H and a function c ∈ C that are both consistent with this partition. This may be formulated as:

S_CON(Q) = {P | CON(P, Q \ P) = YES}

where CON is a consistency algorithm for (C, H). Bshouty [8] shows the following, based on the Sauer Lemma [20]:

Lemma 1. For any set of points Q, |S_CON(Q)| ≤ |Q|^VCdim(H).
Furthermore, an efficient algorithm for generating this set of partitions (along with the corresponding functions h ∈ H) is presented, assuming that C is PAC-learnable from H of constant VC-dimension. The algorithm's output is denoted

Ŝ_CON(Q) = {((P, Q \ P), h) | P ∈ S_CON(Q) and h is consistent with (P, Q \ P)}.
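To make Ŝ_CON concrete, here is a hedged sketch for the toy case C = H = threshold functions {x : x ≥ t} on the line (VC-dimension 1). The function name and the representation of partitions are our own; for thresholds, every realizable partition can be enumerated by scanning cut points.

```python
def scon_thresholds(Q):
    """Enumerate (a sketch of) \\hat{S}_CON(Q) for C = H = thresholds
    {x >= t} on the line: every realizable positive/negative partition
    of Q, paired with a consistent threshold hypothesis."""
    pts = sorted(Q)
    # candidate thresholds: below all points, between adjacent points,
    # and above all points — one per realizable partition
    cuts = ([pts[0] - 1.0]
            + [(a + b) / 2 for a, b in zip(pts, pts[1:])]
            + [pts[-1] + 1.0])
    return [(frozenset(x for x in pts if x >= t), t) for t in cuts]
```

For |Q| points this yields |Q| + 1 partitions, in line with the polynomial bound of Lemma 1 for constant VC-dimension.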
3
Information Theoretic Lower Bound
In this section we show that no learning algorithm (not even an inefficient one) can learn a non-trivial concept class with accuracy better than 2η under the NSN model; in fact, we prove that this impossibility result holds even for the NCN model.

Definition 6. A concept class C over an instance space X is called non-trivial if there exist two points x1, x2 ∈ X and two concepts c1, c2 ∈ C such that c1(x1) = c2(x1) and c1(x2) ≠ c2(x2).

Theorem 3. Let C be a non-trivial concept class, η be a noise rate, and ε < 2η be an accuracy parameter. Then, there is no algorithm that learns the concept class C with accuracy ε under the NCN model (with rate η).
Proof sketch: We base our proof on the method of induced distributions introduced in [18, Theorem 1]. We show that there are two concepts c1, c2 ∈ C and a probability distribution D such that PrD(c1 △ c2) = 2η, and an adversary can force the labeled examples shown to the learning algorithm to be distributed identically both when c1 is the target and when c2 is the target. Let c1 and c2 be the two concepts whose existence is guaranteed by the fact that C is a non-trivial class, and let x1, x2 ∈ X be the two points that satisfy c1(x1) = c2(x1) and c1(x2) ≠ c2(x2). We define the probability distribution D to be D(x1) = 1 − 2η, D(x2) = 2η, and D(x) = 0 for all x ∈ X \ {x1, x2}. Clearly, we indeed have PrD(c1 △ c2) = PrD(x2) = 2η. The adversary will modify exactly half of the sample points of the form (x2, ℓ) to (x2, 1 − ℓ). This results in the learning algorithm being given a sample effectively drawn from the following induced distribution:

Pr(x1, c1(x1)) = 1 − 2η  and  Pr(x2, c1(x2)) = Pr(x2, c2(x2)) = η.
This induced distribution is the same no matter whether the true target is c1 or c2. Therefore, according to the sample that the learning algorithm sees, it is impossible to differentiate between the case where the target function is c1 and the case where it is c2. Note that in the above proof we indeed take advantage of the nastiness of the adversary. Unlike the malicious adversary, our adversary can focus all its power on just the point x2, causing it to suffer a relatively high error rate, while examples in which the point is x1 do not suffer any noise. Finally, since any NCN adversary is also an NSN adversary, Theorem 3 implies the following:

Corollary 1. Let C be a non-trivial concept class, η > 0 be the noise rate, and ε < 2η be an accuracy parameter. There is no algorithm that learns the concept class C with accuracy ε under the NSN model with noise rate η.
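The induced-distribution argument can be checked numerically. The sketch below is our own illustration (point names and the alternating relabeling rule are assumptions consistent with the proof): it draws noisy samples under either target and shows the observed example distribution is the same in both cases.

```python
import random

def noisy_sample(m, eta, target_label_x2):
    """Draw m examples with D(x1) = 1 - 2*eta, D(x2) = 2*eta; the two
    candidate targets agree on x1 (label 0 here) and disagree on x2.
    The adversary relabels every other x2-example, so about half of the
    x2-examples carry each label regardless of the target."""
    sample, x2_seen = [], 0
    for _ in range(m):
        if random.random() < 2 * eta:
            x2_seen += 1
            label = target_label_x2 if x2_seen % 2 == 0 else 1 - target_label_x2
            sample.append(("x2", label))
        else:
            sample.append(("x1", 0))
    return sample
```

Running this with the same random seed for both targets, the frequency of (x2, 1) is (up to one example) identical, matching Pr(x2, ·) = η in the induced distribution.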
4
Information Theoretic Upper Bound
In this section we provide a positive result that complements the negative result of Sect. 3. This result shows that, given a sufficiently large sample, any hypothesis that performs sufficiently well on the sample (even when this sample is subject to nasty noise) satisfies the PAC learning condition. Formally, we analyze the following generic algorithm for learning any class C of VC-dimension d, whose inputs are a certainty parameter δ > 0, the nasty error rate parameter η < 1/2, and the required accuracy ε = 2η + Δ:

Algorithm NastyConsistent:
1. Request a sample S = {(x, bx)} of size m ≥ (c/Δ²)(d + log(2/δ)).
2. Output any h ∈ C such that |{x ∈ S : h(x) ≠ bx}| ≤ m(η + Δ/4) (if no such h exists, choose any h ∈ C arbitrarily).
Theorem 4. Let C be any class of VC-dimension d. Then (for some constant c) Algorithm NastyConsistent is a PAC learning algorithm under nasty sample noise of rate η.
Proof. By Hoeffding's inequality [16], with probability 1 − δ/2 the number E of sample points that are modified by the adversary is at most m(η + Δ/4). Now, we note that the target function ct errs on at most E points of the sample shown to the learning algorithm (as it is completely accurate on the non-modified sample S'). Thus, with high probability, Algorithm NastyConsistent will be able to choose a function h ∈ C that errs on no more than (η + Δ/4)m points of the sample shown to it. However, in the worst case, these errors of the function h occur on points that were not modified by the adversary. In addition, h may be erroneous on all the points that the adversary did modify. Therefore, all we are guaranteed in this case is that the hypothesis h errs on no more than 2E points of the original sample S'. By Theorem 2, there exists a constant c such that, with probability 1 − δ/2, the sample S' is a Δ/2-sample for the class of symmetric differences between functions from C. By the union bound we therefore have that, with probability at least 1 − δ, E ≤ (η + Δ/4)m, meaning that |S' ∩ (ct △ h)| ≤ (2η + Δ/2)m, and that S' is a Δ/2-sample for the class of symmetric differences, and so PrD[(ct △ h)] ≤ 2η + Δ = ε, as required.
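As an illustration of NastyConsistent, the sketch below instantiates C as the class of thresholds {x : x ≥ t} on the line (VC-dimension 1), where a minimal-disagreement hypothesis can be found by exhaustive scan; for richer classes this step may be computationally hard. All names and the fallback behavior are our own assumptions.

```python
def nasty_consistent_threshold(sample, eta, delta_acc):
    """NastyConsistent for C = thresholds {x >= t}: return a hypothesis
    whose empirical error on the (possibly nastily corrupted) sample is
    within the budget m*(eta + delta_acc/4), found by exhaustive scan."""
    m = len(sample)
    budget = m * (eta + delta_acc / 4)
    xs = sorted({x for x, _ in sample})
    candidates = [float("-inf")] + xs + [float("inf")]
    errs = {t: sum((x >= t) != bool(y) for x, y in sample) for t in candidates}
    best_t = min(errs, key=errs.get)
    # the algorithm accepts ANY hypothesis within the budget; the
    # empirical minimizer is within it whenever any hypothesis is
    return best_t if errs[best_t] <= budget else candidates[0]
```

On a sample from threshold 0.5 with 10% flipped labels, the returned threshold lands near 0.5, as Theorem 4 predicts.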
5
Composition Theorem for Learning with Nasty Noise
Following [3] and [8], we define the notion of a composition class: Let C be a class of boolean functions g : X → {0,1}. Define the class ⊕(C) to be the set of all boolean functions F(x) that can be represented as f(g1(x), …, gk(x)), where f is any boolean function and gi ∈ C for i = 1, …, k. We define the size of f(g1, …, gk) to be k. Given a vector of hypotheses (h1, …, ht) ∈ H^t, define W(h1, …, ht) to be the set of sub-domains Wa = {x | (h1(x), …, ht(x)) = a} for all possible vectors a ∈ {0,1}^t. We now show a variation of the algorithm presented in [8] that can learn the class ⊕(C) with a nasty sample adversary, assuming that the class C is PAC-learnable from a class H of constant VC-dimension d. The algorithm builds on the fact that a consistency algorithm CON for (C, H) can be constructed, given an algorithm that PAC-learns C from H [8]. This algorithm can learn the concept class ⊕(C) with any confidence parameter δ and with accuracy that is arbitrarily close to the lower bound of 2η proved in the previous section. Its sample complexity and computational complexity are both polynomial in k, 1/δ and 1/Δ, where Δ = ε − 2η. The algorithm is based on the following idea: Request a large sample from the oracle and randomly pick a smaller sub-sample from the sample retrieved. The random choice of a sub-sample neutralizes some of the power the adversary has, since the adversary cannot know which examples are the ones that will be most informative for us. Then use the consistency algorithm for (C, H) to find one representative from H for every possible behavior on the smaller sub-sample. These hypotheses from H now define a division of the instance space into "cells", where each cell is characterized by a specific behavior of all the hypotheses picked. The final hypothesis is simply based on taking a majority vote among the complete sample inside each such cell. To demonstrate the algorithm, we consider (informally) the specific, relatively simple case where the class to be learned is the class of k intervals on the line (see Fig. 1).
The algorithm, given a sample as input, proceeds as follows:
1. The algorithm uses a small, random sub-sample to divide the line into sub-intervals. Each two adjacent points in the sub-sample define such a sub-interval.
2. For each such sub-interval the algorithm calculates a majority vote over the complete sample. The result is our hypothesis.

The number of points (which in this specific case is the number of sub-intervals) the algorithm chooses in the first step depends on k. Intuitively, we want the total weight of the sub-intervals containing the target's end-points to be relatively small (this is what is called the "bad" part in the formal analysis that follows). Naturally, there will be 2k such "bad" sub-intervals, so the larger k is, the larger the sub-sample needed. Except for these "bad" sub-intervals, all other sub-intervals on which the algorithm errs must have at least half of their points modified by the adversary. Thus the total error will be roughly 2η, plus the weight of the "bad" sub-intervals.
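The two informal steps can be sketched as follows for intervals on the line; the sub-sample size M is left as a parameter (the formal analysis below dictates its value). This is an illustrative, assumption-laden sketch in our own notation, not the paper's exact procedure.

```python
import random

def nasty_learn_intervals(sample, M):
    """Sketch of NastyLearn for intervals on the line: pick a random
    sub-sample of size M, let its points cut the line into sub-intervals,
    and label each sub-interval by the majority label of the full
    sample inside it (0 on empty sub-intervals)."""
    sub = sorted(x for x, _ in random.sample(sample, M))
    cuts = [float("-inf")] + sub + [float("inf")]
    regions = []
    for lo, hi in zip(cuts, cuts[1:]):
        votes = [y for x, y in sample if lo <= x < hi]
        label = 1 if sum(votes) * 2 > len(votes) else 0
        regions.append((lo, hi, label))
    def h(x):
        for lo, hi, label in regions:
            if lo <= x < hi:
                return label
        return 0
    return h
```

With, say, 10% corrupted labels, the learned hypothesis misclassifies essentially only the two "bad" sub-intervals straddling the target's end-points, matching the intuition above.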
[Figure: the target concept, the (erroneous) sample points, the sub-sample with its induced sub-intervals ("bad" sub-intervals marked), and the algorithm's hypothesis.]
Fig. 1. Example of NastyLearn for intervals.

Now, we proceed to a formal description of the learning algorithm. Given the constant d, the size k of the target function³, the bound η on the error rate, the parameters ε and δ, and two additional parameters M, N (to be specified below), the algorithm proceeds as follows:

Algorithm NastyLearn:
1. Request a sample S of size N.
2. Choose uniformly at random a sub-sample R ⊆ S of size M.
3. Use the consistency algorithm for (C, H) to compute Ŝ_CON(R) = {((P1, R \ P1), h1), …, ((Pt, R \ Pt), ht)}.
4. Output the hypothesis H(h1, …, ht), computed as follows: For any Wa ∈ W(h1, …, ht) that is not empty, set H to be the majority of labels in S ∩ Wa. If Wa is empty, set H to be 0 on any x ∈ Wa.
³ The algorithm can use the doubling technique in case k is not given to it. In this case, however, the sample size is not known in advance, and so we need to use the extended definition of the Nasty Noise model that allows repetitive queries to the oracle.
Theorem 5. Let

M = max{ (24k/Δ) log(8/δ), (c2 dk/Δ) log(78k/Δ) }

and

N = (c1 k²/Δ²) · M^(d·2^d) · (d·2^d + log(4/δ)),
where c1 and c2 are constants. Then, Algorithm NastyLearn learns the class ⊕(C) with accuracy ε = 2η + Δ and confidence δ, in time polynomial in k, 1/Δ, and 1/δ.

Before commencing with the actual proof, we present a technical lemma:

Lemma 2. Assuming N is set as in the statement of Theorem 5, with probability at least 1 − δ/4, E (the number of points in which errors are introduced) is at most (η + Δ/12)N.

For lack of space, the proof is omitted (see [9] for details). We are now ready to present the proof of Theorem 5:

Proof. To analyze the error made by the hypothesis that the algorithm generates, let us denote the adversary's strategy as follows:
1. Generate a sample of the requested size N according to the distribution D, and label it by the target concept F. Denote this sample by S'.
2. Choose a subset Sout ⊆ S' of size E = E(S'), where E(S') is a random variable (as defined in Sect. 2.2).
3. Choose (maliciously) some other set of points Sin ⊆ X × {0,1} of size E.
4. Hand to the learning algorithm the sample S = (S' \ Sout) ∪ Sin.

Assume the target function F is of the form F = f(g1, …, gk). For all i ∈ {1, …, k}, denote by hji, where ji ∈ {1, …, t}, the hypothesis the algorithm has chosen in step 3 that exhibits the same behavior gi has over the points of R (from the definition of Ŝ_CON we are guaranteed that such a hypothesis exists). By definition, there are no points from R in hji △ gi, so:

R ∩ (hji △ gi) = ∅.   (1)
As the VC-dimension of both the class C of all the gi's and the class H of all the hi's is d, the class of all their possible symmetric differences also has VC-dimension O(d) (see Sect. 2.3). By applying Theorem 1, when viewing R as a sample taken from S according to the uniform distribution, and by choosing M as in the statement of the theorem, R will be an α-net (with respect to the uniform distribution over S) for the class of symmetric differences, with α = Δ/(6k), with probability at least 1 − δ/4. Note that there may still be points in S which are in hji △ gi. Hence, we let S(i) = S ∩ (hji △ gi). Now, by using (1) we get that

|S(i)| / |S| ≤ Δ/(6k)

with probability at least 1 − δ/4, simultaneously for all i. For every sub-domain B ∈ W(h1, …, ht) we define:

N_B = |S' ∩ B|
N_B^in = |Sin ∩ B|
N_B^out,g = |Sout ∩ B \ ∪i(hji △ gi)|
N_B^out,b = |Sout ∩ B ∩ ∪i(hji △ gi)|
N_B^△ = |S ∩ B ∩ ∪i(hji △ gi)|
In words, N_B and N_B^in simply stand for the size of the restriction to the sub-domain B of the original (noise-free) sample S' and of the noisy examples Sin introduced by the adversary, respectively. The other definitions are based on the distinction between the "good" part of B, where the gi's and the hji's behave the same, and the "bad" part, which is present due to the fact that the gi's and the hji's exhibit the same behavior only on the sub-sample R, rather than on the complete sample S. Since the hypothesis takes a majority vote in each sub-domain, it will err on the domain B \ ∪i(hji △ gi) if the number of examples left untouched in B is less than the number of examples in B that were modified by the adversary, plus those that were misclassified by the hji's (with respect to the gi's). This may be formulated as the following condition: N_B^in + N_B^△ ≥ N_B − N_B^out,g − N_B^△. Therefore, the total error the algorithm may experience is at most:

D(∪i (hji △ gi)) + Σ_{B : N_B ≤ N_B^in + N_B^out,g + 2N_B^△} D(B).
We now calculate a bound for each of the two terms above separately. To bound the second term, note that by Theorem 2 our choice of N guarantees S' to be a Δ/(6|W(h1, …, ht)|)-sample for our domain with probability at least 1 − δ/4. Note that from the definition of W(h1, …, ht) and from the Sauer Lemma [20] we have that |W(h1, …, ht)| ≤ t^VCdim(H⊥), which, with Claim 1 and Lemma 1, yields:

|W(h1, …, ht)| ≤ M^(VCdim(H)·VCdim(H⊥)) ≤ M^(d·2^d).

Our choice of N indeed guarantees, with probability at least 1 − δ/4, that:

Σ_{B : N_B ≤ N_B^in + N_B^out,g + 2N_B^△} D(B) ≤ Σ_{B : N_B ≤ N_B^in + N_B^out,g + 2N_B^△} ( N_B/N + Δ/(6|W(h1, …, ht)|) ) ≤ Δ/6 + Σ_{B ∈ W(h1, …, ht)} (N_B^in + N_B^out,g + 2N_B^△)/N.
From the above choice of N , it follows that S is also a 6jW(h1 ht )j -sample for the class of symmetric di erences of the form hj i . Thus, with probability at least 1 − 4, we have: k X
D(hj
i)
i=1
2 + 6
X B2W(h1
ht )
NBout,b N
The total error made by the hypothesis (assuming that none of the four "bad" events happens) is therefore bounded by:

Pr_D[H ≠ F] ≤ Δ/3 + Δ/6 + Σ_{B ∈ W(h1, …, ht)} (N_B^in + N_B^out,g + N_B^out,b + 2N_B^△)/N ≤ 2η + Δ = ε,

as required. This bound holds with probability at least 1 − δ.
References

1. J. A. Aslam and S. E. Decatur, "General Bounds on Statistical Query Learning and PAC Learning with Noise via Hypothesis Boosting", FOCS '93, pp. 282-291, 1993.
2. D. Angluin and P. Laird, "Learning from Noisy Examples", Machine Learning, Vol. 2, pp. 343-370, 1988.
3. S. Ben-David, N. H. Bshouty and E. Kushilevitz, "A Composition Theorem for Learning Algorithms with Applications to Geometric Concept Classes", STOC '97, pp. 324-333, 1997.
4. S. Ben-David and A. Litman, "Combinatorial Variability of Vapnik-Chervonenkis Classes with Applications to Sample Compression Schemes", Discrete Applied Math 86, pp. 3-25, 1998.
5. A. Blum, P. Chalasani, S. A. Goldman, and D. K. Slonim, "Learning with Unreliable Boundary Queries", COLT '95, pp. 98-107, 1995.
6. A. Blum, M. Furst, J. Jackson, M. J. Kearns, Y. Mansour, and S. Rudich, "Weakly Learning DNF and Characterizing Statistical Query Learning Using Fourier Analysis", STOC '94, 1994.
7. A. Blumer, A. Ehrenfeucht, D. Haussler and M. K. Warmuth, "Learnability and the Vapnik-Chervonenkis Dimension", J. of the ACM, 36(4), pp. 929-965, 1989.
8. N. H. Bshouty, "A New Composition Theorem for Learning Algorithms", STOC '98, pp. 583-589, 1998.
9. N. H. Bshouty, N. Eiron and E. Kushilevitz, "PAC Learning with Nasty Noise", Technical Report CS0693, Department of Computer Science, The Technion - Israel Institute of Technology.
10. N. H. Bshouty, S. A. Goldman, H. D. Mathias, S. Suri, and H. Tamaki, "Noise-Tolerant Distribution-Free Learning of General Geometric Concepts", STOC '96, pp. 151-160, 1996.
11. N. Cesa-Bianchi, E. Dichterman, P. Fischer, and H. U. Simon, "Noise-Tolerant Learning near the Information-Theoretic Bound", STOC '96, pp. 141-150, 1996.
12. N. Cesa-Bianchi, P. Fischer, E. Shamir and H. U. Simon, "Randomized Hypotheses and Minimum Disagreement Hypotheses for Learning with Noise", EuroCOLT '97, pp. 119-133, 1997.
13. S. E. Decatur, "Learning in Hybrid Noise Environments Using Statistical Queries", in Learning from Data: Artificial Intelligence and Statistics V, D. Fisher and H. J. Lenz (Eds.), 1996.
14. S. E. Decatur, "PAC Learning with Constant-Partition Classification Noise and Applications to Decision Tree Induction", Proceedings of the Sixth International Workshop on Artificial Intelligence and Statistics, pp. 147-156, 1997.
15. S. E. Decatur and R. Gennaro, "On Learning from Noisy and Incomplete Examples", COLT '95, pp. 353-360, 1995.
16. W. Hoeffding, "Probability Inequalities for Sums of Bounded Random Variables", J. of the American Statistical Association, 58(301), pp. 13-30, 1963.
17. M. Kearns, "Efficient Noise-Tolerant Learning from Statistical Queries", Proc. 25th ACM Symposium on the Theory of Computing, pp. 392-401, 1993.
18. M. J. Kearns and M. Li, "Learning in the Presence of Malicious Errors", SIAM J. on Computing, 22:807-837, 1993.
19. M. J. Kearns, R. E. Schapire, and L. M. Sellie, "Toward Efficient Agnostic Learning", Machine Learning, 17(2), pp. 115-142, 1994.
20. N. Sauer, "On the Density of Families of Sets", Journal of Combinatorial Theory, 13:145-147, 1972.
21. R. E. Schapire, The Design and Analysis of Efficient Learning Algorithms, MIT Press, 1991.
22. M. Talagrand, "Sharper Bounds for Gaussian and Empirical Processes", Annals of Probability, 22:28-76, 1994.
23. L. G. Valiant, "A Theory of the Learnable", Comm. ACM, 27(11):1134-1142, 1984.
24. L. G. Valiant, "Learning Disjunctions of Conjunctions", IJCAI '85, pp. 560-566, 1985.
25. V. N. Vapnik and A. Y. Chervonenkis, "On the Uniform Convergence of Relative Frequencies of Events to Their Probabilities", Theory Probab. Appl., 16(2), pp. 264-280, 1971.
Positive and Unlabeled Examples Help Learning⋆

Francesco De Comité, François Denis, Rémi Gilleron, and Fabien Letouzey
LIFL, URA 369 CNRS, Université de Lille 1
59655 Villeneuve d'Ascq, FRANCE
{decomite, denis, gilleron, letouzey}@lifl.fr
Abstract. In many learning problems, labeled examples are rare or expensive while numerous unlabeled and positive examples are available. However, most learning algorithms only use labeled examples. Thus we address the problem of learning with the help of positive and unlabeled data given a small number of labeled examples. We present both theoretical and empirical arguments showing that learning algorithms can be improved by the use of both unlabeled and positive data. As an illustrating problem, we consider the learning algorithm from statistics for monotone conjunctions in the presence of classification noise and give empirical evidence for our assumptions. We give theoretical results for the improvement of Statistical Query learning algorithms from positive and unlabeled data. Lastly, we apply these ideas to tree induction algorithms. We modify the code of C4.5 to get an algorithm which takes as input a set LAB of labeled examples, a set POS of positive examples and a set UNL of unlabeled data, and which uses these three sets to construct the decision tree. We provide experimental results based on data taken from the UCI repository which confirm the relevance of this approach.
Key words: PAC model, Statistical Queries, Unlabeled Examples, Positive Examples, Decision Trees, Data Mining
1
Introduction
Usual learning algorithms only use labeled examples. But, in many machine learning settings, gathering large sets of unlabeled examples is easy. This remark has been made about text classification tasks, and learning algorithms able to classify text from labeled and unlabeled documents have recently been proposed ([BM98], [NMTM98]). We also argue that, for many machine learning problems, a natural source of positive examples (that belong to a single class) is available, and positive data are abundant and cheap. For example, consider a classical domain such as the diagnosis of diseases: unlabeled data are abundant (all patients); positive data may be numerous (all the patients who have the disease); but labeled data are rare if detection tests for this disease are expensive. As a

⋆ This research was partially supported by "Motricité et Cognition": Contrat par objectif région Nord/Pas-de-Calais

O. Watanabe, T. Yokomori (Eds.): ALT'99, LNAI 1720, pp. 219-230, 1999.
© Springer-Verlag Berlin Heidelberg 1999
Francesco De Comité et al.
second example, consider mailing for a specific marketing action: unlabeled data are all the clients in the database; positive data are all the clients who asked for information about the product concerned by the marketing action before the mailing was done; but labeled data are rare and expensive because a survey has to be done on a part of the database of all clients. We do not address text classification problems in the present paper, but they are concerned too: for a web-page classification problem, unlabeled web-pages can be inexpensively gathered, a set of web-pages you are interested in is available in your bookmarks, and labeled web-pages are fairly expensive, but a small set of hand-labeled web-pages can be designed. It has been proved in [Den98] that many concept classes, namely those which are learnable from statistical queries, can be efficiently learned in a PAC framework using positive and unlabeled data only. But the price to pay is an increase in the number of examples needed to achieve learning (although it remains of polynomial size). We consider the problem of learning with a small set of labeled examples, a set of positive examples and a large set of unlabeled examples. We assume that unlabeled examples are drawn according to some hidden distribution D, that labeled examples are drawn according to the standard example oracle EX(f, D), and that positive examples are drawn according to the oracle EX(f, Df), where Df is the distribution D restricted to positive examples. The reader should note that our problem is different from the problem of learning with imbalanced training sets (see [KM97]) because we use three sources of examples. In the method we discuss here, labeled examples are only used to estimate the target weight (the proportion of positive examples among all examples); therefore, if an estimate of the target weight is available for the problem, only positive and unlabeled data are needed.
We present experimental results showing that unlabeled data and positive data can efficiently boost accuracy of the statistical query learning algorithm for monotone conjunctions in the presence of classification noise. Such boosting can be explained by the fact that SQ algorithms are based on the estimation of probabilities. We prove that these estimates can be replaced by: an estimate of the weight of the target concept with respect to (w.r.t.) the hidden distribution, using the (small) set of labeled examples, and estimates of probabilities which can be computed from positive and unlabeled data only. If the sets of unlabeled and positive data are large enough, all estimates can be calculated within the accuracy of the estimate of the weight of the target concept. We present theoretical arguments in the PAC framework showing that a gain in the size of the query space (or its VC-dimension) can be obtained on the number of labeled examples. But, as usual, the results could be better for real problems. In the last section of the paper, we consider standard methods of decision tree induction and examine the commonly used C4.5 algorithm described in [Qui93]. In this algorithm, when refining a leaf into an internal node, the decision criterion is based on statistical values. Therefore, C4.5 can be seen as a statistical query algorithm and the above ideas can be applied. We adapt the code of C4.5. Our algorithm takes as inputs three sets: a set of labeled examples, a set of positive
examples and a set of unlabeled examples. The information gain criterion used by C4.5 is modified such that the three sets are used. The reader should note that labeled examples are used only once, for the computation of the weight of the target concept under the hidden distribution. We provide some promising experimental results, but further experiments are needed for an experimental validation of our approach.
2 Preliminaries

2.1 Basic Definitions and Notations
For each n ≥ 1, Xn denotes an instance space on n attributes. A concept f is a subset of some instance space Xn or, equivalently, a {0,1}-valued function defined on Xn. For each n ≥ 1, let Cn ⊆ 2^Xn be a set of concepts. Then C = ∪ Cn denotes a concept class over X = ∪ Xn. The size of a concept f is the size of a smallest representation for a given representation scheme. An example of a concept f is a pair ⟨x, f(x)⟩, which is positive if f(x) = 1 and negative otherwise. We denote by Pos(f) the set of all x such that f(x) = 1. If D is a distribution defined over X and if A is a subset of the instance space X, we denote by D(A) the probability of the event [x ∈ A] and we denote by DA the induced distribution. For instance, if f is a concept over X such that D(f) ≠ 0, Df(x) = D(x)/D(f) if x ∈ Pos(f) and 0 otherwise. We denote by f̄ the complement of the set f in X and by f △ g the symmetric difference between f and g. A monotone conjunction is a conjunction of boolean variables. For each x ∈ {0,1}^n, we use the notation x(i) to indicate the i-th bit of x. If V is a subset of {x1, …, xn}, the conjunction of variables in V is denoted by ∧_{xi∈V} xi. If V = ∅, ∧_{xi∈V} xi ≡ 1.

2.2
PAC and SQ Models
Let f be a target concept in some concept class C. Let D be the hidden distribution defined over X. In the PAC model [Val84], the learner is given access to an example oracle EX(f, D) which returns at each call an example ⟨x, f(x)⟩ drawn randomly according to D. A concept class C is PAC learnable if there exist a learning algorithm L and a polynomial p(·,·,·,·) with the following property: for any f ∈ C, for any distribution D on X, and for any 0 < ε < 1 and 0 < δ < 1, if L is given access to EX(f, D) and to inputs ε and δ, then with probability at least 1 − δ, L outputs a hypothesis concept h satisfying error(h) = D(f △ h) ≤ ε in time bounded by p(1/ε, 1/δ, n, size(f)). The SQ-model [Kea93] is a specialization of the PAC model in which the learner forms its hypothesis solely on the basis of estimates of probabilities. A statistical query over Xn is a mapping χ : Xn × {0,1} → {0,1} associated with a tolerance 0 < τ ≤ 1. In the SQ-model the learner is given access to a statistics oracle STAT(f, D) which, at each query (χ, τ), returns an estimate of D({x | χ(⟨x, f(x)⟩) = 1}) within accuracy τ. Let C be a concept class over X. We say that C is SQ-learnable if there exist a learning algorithm L and polynomials
p(·,·,·), q(·,·,·) and r(·,·,·) with the following property: for any f ∈ C, for any distribution D over X, and for any 0 < ε < 1, if L is given access to STAT(f, D) and to an input ε, then, for every query (χ, τ) made by L, the predicate χ can be evaluated in time q(1/ε, n, size(f)), and 1/τ is bounded by r(1/ε, n, size(f)); L halts in time bounded by p(1/ε, n, size(f)); and L outputs a hypothesis h ∈ C satisfying D(f △ h) ≤ ε. It is clear that, given access to the example oracle EX(f, D), it is easy to simulate the statistics oracle STAT(f, D) by drawing a sufficiently large set of labeled examples. This is formalized by the following result:

Theorem 1. [Kea93] Let C be a class of concepts over X. Suppose that C is SQ learnable by algorithm L. Then C is PAC learnable, and furthermore:
If L uses a finite query space Q and τ is a lower bound on the allowed approximation error for every query made by L, the number of calls of EX(f, D) is O(1/τ² log(|Q|/δ)).
If L uses a query space Q of finite VC dimension d and τ is a lower bound on the allowed approximation error for every query made by L, the number of calls of EX(f, D) is O(d/τ² log(1/δ)).

The reader should note that this result has been extended to white noise PAC models: the Classification Noise model of Angluin and Laird [AL88] and the Constant Partition Classification Noise Model [Dec97]. The proofs may be found in [Kea93] and [Dec97]. Also note that almost all the concept classes known to be PAC learnable are SQ learnable and are therefore PAC learnable with classification noise.
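The simulation of STAT(f, D) from EX(f, D) can be sketched directly; the Hoeffding-based sample size used below is one standard choice, not necessarily the constant used in [Kea93], and all names are our own.

```python
import math
import random

def simulate_stat(chi, tau, delta, draw_example):
    """Answer a single statistical query (chi, tau) using the example
    oracle: estimate D({x : chi(x, f(x)) = 1}) to within tau with
    probability >= 1 - delta, via a Hoeffding-bound sample size."""
    m = math.ceil(math.log(2 / delta) / (2 * tau ** 2))
    hits = sum(chi(x, label) for x, label in (draw_example() for _ in range(m)))
    return hits / m
```

Answering all of L's queries this way yields the O(1/τ² log(|Q|/δ)) total example count of Theorem 1 (up to a union bound over the queries).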
3 Learning Monotone Conjunctions in the Presence of Classification Noise
In this section, the target concept is a monotone conjunction over x1, …, xn. In the noise-free case, a learning algorithm for monotone conjunctions is:

Learning Monotone Conjunctions – Noise-Free Case
input: ε, δ
V = ∅
Draw a sample S of m(ε, δ) examples
for i = 1 to n do
  if for every positive example ⟨x, 1⟩ ∈ S, x(i) = 1 then V ← V ∪ {xi}
output: h = ∧_{xi∈V} xi
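The noise-free algorithm above is a one-liner per variable. Here is a minimal sketch; the toy target, number of variables and sample size are our own invented choices.

```python
import random

def learn_monotone_conjunction(sample):
    """Keep variable i iff x(i) = 1 in every positive example of the sample.
    sample: list of pairs (x, label) with x a 0/1 tuple and label in {0,1}."""
    n = len(sample[0][0])
    return [i for i in range(n)
            if all(x[i] == 1 for x, label in sample if label == 1)]

# Hypothetical target: x1 AND x4 over 6 variables, uniform distribution.
random.seed(1)
target = {0, 3}
sample = [(x, int(all(x[i] for i in target)))
          for x in (tuple(random.randint(0, 1) for _ in range(6))
                    for _ in range(500))]
V = learn_monotone_conjunction(sample)
```

With 500 uniform examples, every irrelevant variable is 0 in some positive example with overwhelming probability, so the recovered variable set coincides with the target.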
It can be proved that O((1/ε)(log(1/δ) + n)) examples are enough to guarantee that the hypothesis h output by the learning algorithm has error less than ε with confidence at least 1 − δ. The given algorithm is not noise-tolerant. In the presence of classification noise, it is necessary to compute an estimate p̂1(xi = 0) of p1(xi = 0), which is the probability that a random example drawn according to the
Positive and Unlabeled Examples Help Learning
hidden distribution D is positive and satisfies x(i) = 0. Then only variables such that this estimate is small enough are included in the output hypothesis. Let us suppose that examples are drawn according to a noisy oracle which, on each call, first draws an instance x according to D together with its correct label and then flips the label with probability 0 ≤ η < 1/2. Let us suppose that the noise rate η is known; then we can consider the following learning algorithm for monotone conjunctions from statistics in the presence of classification noise:

Learning Monotone Conjunctions – Noise-Tolerant Case
input: ε, δ, η
V = ∅
Draw a sample S of m(ε, δ, η) examples; the size of S is sufficient to ensure that the following estimates are accurate to within ε/(2n) with confidence greater than 1 − δ
for i = 1 to n do
  compute an estimate p̂1(xi = 0) of p1(xi = 0)
  if p̂1(xi = 0) ≤ ε/(2n) then V ← V ∪ {xi}
output: h = ∧_{xi∈V} xi
If the noise rate η is not known, we can estimate it with techniques described in [AL88]. We do not consider that case because we want to show the best expected gain. Let q1(xi = 0) (resp. q0(xi = 0)) be the probability that a random example drawn according to the noisy oracle is positive (resp. negative) and satisfies x(i) = 0. We have:
p1(xi = 0) = [(1 − η) q1(xi = 0) − η q0(xi = 0)] / (1 − 2η)    (1)
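Equation (1) simply inverts the label-noise process. A quick numeric sanity check (the probability values below are made up for illustration):

```python
def p1_estimate(q1, q0, eta):
    """Equation (1): recover the noise-free probability p1(xi = 0) from the
    noisy-oracle probabilities q1(xi = 0) and q0(xi = 0)."""
    return ((1.0 - eta) * q1 - eta * q0) / (1.0 - 2.0 * eta)

# With true p1 = 0.3, p0 = 0.2 and noise rate eta = 0.2, the noisy oracle sees
#   q1 = (1 - eta) * p1 + eta * p0 = 0.28
#   q0 = (1 - eta) * p0 + eta * p1 = 0.22
# and equation (1) recovers p1 exactly.
recovered = p1_estimate(0.28, 0.22, 0.2)
```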
Thus we can estimate p1(xi = 0) using estimates of q1(xi = 0) and q0(xi = 0). Then simple algebra and standard Chernoff bounds may be used to prove that O(((n² log n) / (ε²(1 − 2η)²)) log(1/δ)) examples are sufficient to guarantee that the hypothesis h output by the learning algorithm has error less than ε with confidence at least 1 − δ. The reader should note that this bound is quite a bit larger than the noise-free one. We now make the assumption that labeled examples are rare, but that sources of unlabeled examples and positive examples are available to the learner. Unlabeled examples are drawn according to D. A noisy positive oracle, on each call, draws examples from the noisy oracle until it gets one with label 1. We raise the following problems: How can we use positive and unlabeled examples in the previous learning algorithm? What could be the expected gain? In the learning algorithm of conjunctions from statistics with noise, an estimate of p1(xi = 0) is calculated using (1). From usual formulas for conditional probabilities, q1(xi = 0) may be expressed as the probability q(1) that
a labeled example is positive according to the noisy oracle times the probability qf(xi = 0) that a positive example (drawn according to the positive noisy oracle) satisfies x(i) = 0. Now, using the formula for probabilities of disjoint events, q0(xi = 0) is equal to the probability q(xi = 0) that an unlabeled example (drawn according to D) satisfies x(i) = 0 minus q1(xi = 0). Thus, to compute an estimate of p1(xi = 0), we use the following equations: q1(xi = 0) = q(1) · qf(xi = 0), q0(xi = 0) = q(xi = 0) − q1(xi = 0), and p1(xi = 0) = [(1 − η) q1(xi = 0) − η q0(xi = 0)] / (1 − 2η). Consequently, to compute estimates of p1(xi = 0) for all i, we have to compute an estimate of q(1) with labeled examples, estimates of qf(xi = 0) for every i using the source of positive examples, and estimates of q0(xi = 0) for every i using the source of unlabeled examples. The reader should note that labeled examples are used only once, for the calculation of an estimate of the probability that a labeled example is positive. Thus, we have given a positive answer to our first question: unlabeled examples and positive examples can be used in the learning algorithm of conjunctions from statistics. We now raise the second question, that is: what could be the expected gain? We give below an experimental answer to this question. We compare three algorithms:

– The first one is the learning algorithm of conjunctions from statistics where only labeled examples are used.
– The second one computes an estimate of q(1) from labeled examples and uses exact values for qf(xi = 0) and q0(xi = 0). That amounts to saying that an infinite pool of positive and unlabeled data is available.
– The third one computes an estimate of q(1) from labeled examples and estimates of qf(xi = 0) and q0(xi = 0) from a finite number of positive and unlabeled examples.

Each of these three algorithms outputs an ordered list V = (x_σ(1), …, x_σ(n)) of variables such that, for each i, p̂1(x_σ(i) = 0) ≤ p̂1(x_σ(i+1) = 0).
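The decomposition of q1 and q0 into positive and unlabeled quantities can be checked numerically; the values below are invented but chosen consistent with the previous example.

```python
def p1_from_pos_unl(q_pos, qf, q_unl, eta):
    """Estimate p1(xi = 0) from the three sources:
    q_pos = q(1),       estimated once from labeled examples;
    qf    = qf(xi = 0), estimated from positive examples;
    q_unl = q(xi = 0),  estimated from unlabeled examples.
    Then q1 = q(1) * qf, q0 = q - q1, and equation (1) removes the noise."""
    q1 = q_pos * qf
    q0 = q_unl - q1
    return ((1.0 - eta) * q1 - eta * q0) / (1.0 - 2.0 * eta)

# q(1) = 0.5, qf(xi=0) = 0.56 and q(xi=0) = 0.5 give q1 = 0.28 and q0 = 0.22,
# hence the same p1 = 0.3 as before.
p1 = p1_from_pos_unl(0.5, 0.56, 0.5, 0.2)
```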
For a given ordered list V, and for each i, we define h_i(V) = ∧_{j≤i} x_σ(j). The minimal error of an ordered list V is defined as

errormin(V) = min_{0≤i≤n} error(h_i(V)),

which is the least error rate we can hope for. We compare the minimal errors for the three algorithms. First, let us make these three algorithms more precise. We recall that labeled examples are drawn from a noisy oracle, that positive examples are drawn from the noisy oracle restricted to positive examples, and that the noise rate η is known.

Algorithm L(LAB_N)
input: a sample LAB of N labeled examples
for i = 1 to n do
  q̂1(xi = 0) = |{⟨x, c⟩ ∈ LAB : x(i) = 0, c = 1}| / N
  q̂0(xi = 0) = |{⟨x, c⟩ ∈ LAB : x(i) = 0, c = 0}| / N
  p̂1(xi = 0) = [(1 − η) q̂1(xi = 0) − η q̂0(xi = 0)] / (1 − 2η)
output: ordered list V = (x_σ(1), …, x_σ(n))
Algorithm L(LAB_N, POS_M, UNL_M)
input: a sample LAB of N labeled examples, a sample POS of M positive examples and a sample UNL of M unlabeled examples
q̂(1) = |{⟨x, c⟩ ∈ LAB : c = 1}| / N
for i = 1 to n do
  q̂f(xi = 0) = |{⟨x, c⟩ ∈ POS : x(i) = 0}| / M
  q̂(xi = 0) = |{x ∈ UNL : x(i) = 0}| / M
  q̂1(xi = 0) = q̂(1) · q̂f(xi = 0)
  q̂0(xi = 0) = q̂(xi = 0) − q̂1(xi = 0)
  p̂1(xi = 0) = [(1 − η) q̂1(xi = 0) − η q̂0(xi = 0)] / (1 − 2η)
output: ordered list V = (x_σ(1), …, x_σ(n))

Algorithm L(LAB_N, POS_∞, UNL_∞)
input: a sample LAB of N labeled examples
q̂(1) = |{⟨x, c⟩ ∈ LAB : c = 1}| / N
for i = 1 to n do
  compute exactly qf(xi = 0)
  compute exactly q(xi = 0)
  q̂1(xi = 0) = q̂(1) · qf(xi = 0)
  q̂0(xi = 0) = q(xi = 0) − q̂1(xi = 0)
  p̂1(xi = 0) = [(1 − η) q̂1(xi = 0) − η q̂0(xi = 0)] / (1 − 2η)
output: ordered list V = (x_σ(1), …, x_σ(n))
Now, we describe the experiments and experimental results.¹ The concept class is the class of monotone conjunctions over n variables x1, …, xn for some n. The target concept is a conjunction containing five variables. The class D of distributions is defined as follows: D ∈ D is characterized by a tuple (α1, …, αn) ∈ [0, α]^n where α = 2(1 − 2^(−1/5)) ≈ 0.26; for a given D ∈ D, all values x(i) are selected independently of each other, and x(i) is set to 0 with probability αi and set to 1 with probability 1 − αi. Note that α has been chosen such that, if each αi is drawn randomly and independently in [0, α], the average weight of the target concept f w.r.t. D is 0.5. We suppose that examples are drawn according to noisy oracles where the noise rate η is set to 0.2.

Experiment 1. The number n of variables is set to 100. We compare the averages of minimal errors for algorithms L(LAB_N) and L(LAB_N, POS_∞, UNL_∞) as functions of the number N of labeled examples. For a given N, averages of minimal errors for an algorithm are obtained by doing the following k times:

– f is a randomly chosen conjunction of five variables;
– D is chosen randomly in D by choosing each αi randomly and independently;
– N examples are drawn randomly w.r.t. D; they are labeled according to the target concept f, then the correct label is flipped with probability η;
– minimal errors for L(LAB_N) and L(LAB_N, POS_∞, UNL_∞) are computed;

and then computing the averages over the k iterations. We set k to 100. The results can be seen in Fig. 1. The top plot corresponds to L(LAB_N) and the bottom plot to L(LAB_N, POS_∞, UNL_∞).

Experiment 2. We now consider a more realistic case: there are M positive examples and an equal number of unlabeled examples. We show that, together with a small number N of labeled examples, these positive and unlabeled examples give about as much information as M labeled examples alone do. We compare the averages of minimal errors for algorithms L(LAB_N, POS_M, UNL_M) and L(LAB_M) as functions of M. The number n of variables is set to 100. The number N of labeled examples is set to 10.
The results can be seen in Fig. 1.

¹ Sources and scripts can be found at ftp://grappa.univ-lille3.fr/pub/Softs/posunlab.
[Fig. 1 consists of two error-rate plots: the left one, error rate vs. LAB size (0–100), with curves "LAB only" and "LAB + unlimited number of POS and UNL examples"; the right one, error rate vs. POS, UNL and LAB2 sizes (0–1000), with curves "LAB + POS + UNL" and "LAB2".]
Fig. 1. Results of experiments 1 and 2. For experiment 1: target size = 5; 100 variables; 100 iterations. This figure shows the gain we can expect using free positive and unlabeled data. For experiment 2: target size = 5; 100 variables; 100 iterations; size(LAB) = 10; size(POS) = size(UNL) = size(LAB2) = M; M ranges from 10 to 1000 by steps of 10. These curves show that, with only 10 labeled examples, the learning algorithm performs almost as well with M positive and M unlabeled examples as with M labeled examples.
4 Theoretical Framework
Let C be a class of concepts over X. Suppose that C is SQ learnable by some algorithm L. Let f be the target concept and let us consider a statistical query χ made by L. The statistics oracle STAT(f, D) returns an estimate D̂_χ of D_χ = D({x : χ(x, f(x)) = 1}) within some given accuracy. We may write:

D_χ = D({x : χ(x, 1) = 1, f(x) = 1}) + D({x : χ(x, 0) = 1, f(x) = 0})
    = D({x : χ(x, 1) = 1} ∩ f) + D({x : χ(x, 0) = 1} ∩ f̄)
    = D(A ∩ f) + D(B ∩ f̄)

where the sets A and B are defined by A = {x : χ(x, 1) = 1} and B = {x : χ(x, 0) = 1}. Furthermore, for any subset A of the instance space X and any concept f over X, we have D(A ∩ f) = D_f(A) · D(f) and D(A ∩ f̄) = D(A) − D(A ∩ f). From the preceding equations, we obtain that

D_χ = D(f) · (D_f(A) − D_f(B)) + D(B).

Now, in order to estimate D_χ, it is sufficient to estimate D(f), D_f(A), D_f(B) and D(B). If we get an estimate of D(f) within accuracy α and estimates of D_f(A), D_f(B) and D(B) within accuracy β, it can easily be shown that D̂(f) · (D̂_f(A) − D̂_f(B)) + D̂(B) is an estimate of D_χ within accuracy α + β(3 + 2α). As usual, D(f) can be estimated using the oracle EX(f, D). We can estimate D_f(A) and D_f(B) using the POS oracle EX(f, D_f). We can estimate D(B) with the UNL oracle EX(1, D). So we can modify any statistical query based algorithm so that it uses the EX, POS and UNL oracles. Furthermore, if the standard algorithm makes N queries, the labeled, positive and unlabeled example sources will be used to estimate respectively 1, 2N and N probabilities. In this paper, we make the assumption that labeled examples are expensive and that unlabeled and positive examples are cheap. If we make the stronger assumption that positive and unlabeled data are free, we can estimate D_f(A),
D_f(B) and D(B) within arbitrary accuracy, i.e. β = 0. If τ_min is the smallest tolerance needed by the learning algorithm L, then, whatever the number of queries made by L, we only need labeled examples to estimate a single probability, namely D(f), within accuracy τ_min. Let Q be the query space used by L. Theorem 1 gives an upper bound on the number of calls of EX(f, D) necessary to simulate the statistical queries needed by L. We see that we can expect to divide this number of calls by VCDIM(Q). A more precise theoretical study remains to be done. For instance, it would be interesting to estimate the expected improvement of the accuracy, when the number of labeled examples is fixed, as a function of the number of positive and unlabeled examples. This could be done for each usual statistical query learning algorithm.
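The decomposition D_χ = D(f)(D_f(A) − D_f(B)) + D(B) at the heart of this section can be exercised on a toy one-dimensional target. The oracle names, the sample size and the target below are our own illustrative choices.

```python
import random

def estimate_sq_from_pos_unl(chi, draw_pos, draw_unl, d_f, m=20000):
    """Estimate D_chi = D({x : chi(x, f(x)) = 1}) via the decomposition
    D_chi = D(f) * (Df(A) - Df(B)) + D(B), with A = {x : chi(x,1) = 1} and
    B = {x : chi(x,0) = 1}.  Df(A) and Df(B) are estimated with the positive
    oracle, D(B) with the unlabeled one; d_f is the single labeled-data
    estimate of D(f)."""
    pos = [draw_pos() for _ in range(m)]
    unl = [draw_unl() for _ in range(m)]
    df_A = sum(chi(x, 1) for x in pos) / m
    df_B = sum(chi(x, 0) for x in pos) / m
    d_B = sum(chi(x, 0) for x in unl) / m
    return d_f * (df_A - df_B) + d_B

# Toy setting: f(x) = [x >= 0.5] under the uniform distribution on [0, 1].
random.seed(2)
f = lambda x: int(x >= 0.5)
draw_unl = lambda: random.random()
def draw_pos():                       # rejection-sample the positive oracle
    while True:
        x = random.random()
        if f(x):
            return x

# Query chi(x, l) = [l = 1 and x < 0.7]: its exact value is
# D(0.5 <= x < 0.7) = 0.2, recovered here as D(f) * Df(x < 0.7) = 0.5 * 0.4.
est = estimate_sq_from_pos_unl(lambda x, l: int(l == 1 and x < 0.7),
                               draw_pos, draw_unl, d_f=0.5)
```

Note that the labeled source only enters through the scalar `d_f`, matching the observation above that one labeled-data estimate serves every query.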
5 Tree Induction from Labeled, Positive, and Unlabeled Data
C4.5, and more generally decision tree based learning algorithms, are SQ-like algorithms because attribute test choices depend on statistical queries. After a brief presentation of C4.5, and as C4.5 is an SQ algorithm, we describe in Sect. 5.2 how to adapt it for the treatment of positive and unlabeled data. We finally discuss experimental results of a modified version of C4.5 which uses positive and unlabeled examples.

5.1 C4.5, a Top-Down Decision Tree Algorithm
Most algorithms for tree induction use a top-down, greedy search through the space of decision trees. The splitting criterion used by C4.5 [Qui93] is based on a statistical property called information gain, itself based on a measure from information theory called entropy. Given a sample S of some target concept, the entropy of S is

Entropy(S) = Σ_{i=1}^{c} −p_i log₂ p_i

where p_i is the proportion of examples in S belonging to class i. The information gain is the expected reduction in entropy caused by partitioning the sample according to an attribute test t. It is defined as

Gain(S, t) = Entropy(S) − Σ_{v∈Values(t)} (N_v / N) Entropy(S_v)    (2)
where Values(t) is the set of all possible values for the attribute test t, N_v is the cardinality of the set S_v of examples in S for which t has value v, and N is the cardinality of S.

5.2 C4.5 with Positive and Unlabeled Data
Let X be the instance space; we only consider binary classification problems. The classes are denoted by 0 and 1; an example is said to be positive if its
label is 1. Let POS be a sample of positive examples of some target concept f, let LAB be a sample of labeled examples, and let UNL be a set of unlabeled data. Let D be the hidden distribution defined over X. POS is a set of examples ⟨x, f(x) = 1⟩ returned by an example oracle EX(f, D_f), LAB is a set of examples ⟨x, f(x)⟩ returned by an example oracle EX(f, D), and UNL is a set of instances x drawn according to the distribution D. The entropy of a sample S is defined by Entropy(S) = −p0 log₂ p0 − p1 log₂ p1. In this formula, S is the set of training examples associated with the current node n, and p1 is the proportion of positive examples in S. Let D_n be the filtered distribution, that is, the hidden distribution D restricted to instances reaching the node n, and let X_n be the set of instances reaching the node n: p1 is an estimate of D_n(f). Now, in the light of the results of Sect. 4, we modify the formulas for the calculation of the information gain. We have D_n(f) = D(X_n ∩ f) / D(X_n). Using the equation D(X_n ∩ f) = D_f(X_n) · D(f), we obtain D_n(f) = D_f(X_n) · D(f) · (1 / D(X_n)). We can estimate D_f(X_n) using the set of positive examples associated with the node n, we can estimate D(f) with the complete set of labeled examples, and we can estimate D(X_n) with unlabeled examples. More precisely, let POS_n be the set of positive examples associated with the node n, let UNL_n be the set of unlabeled examples associated with the node n, and let LAB1 be the set of positive examples in the set of labeled examples LAB; the entropy of the node n is calculated using the following:

p1 = inf( (|POS_n| / |POS|) · (|LAB1| / |LAB|) · (|UNL| / |UNL_n|), 1 );  p0 = 1 − p1    (3)
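Equation (3) only needs six counts. A minimal helper, with invented counts for illustration:

```python
def node_proportions(pos_n, pos, lab1, lab, unl, unl_n):
    """Equation (3): p1 estimates Dn(f) = Df(Xn) * D(f) / D(Xn) with
    Df(Xn) ~ |POS_n|/|POS|, D(f) ~ |LAB1|/|LAB|, D(Xn) ~ |UNL_n|/|UNL|,
    capped at 1 since a product of three estimates may exceed it."""
    p1 = min((pos_n / pos) * (lab1 / lab) * (unl / unl_n), 1.0)
    return p1, 1.0 - p1

# e.g. 30 of 100 positive examples and 50 of 200 unlabeled examples reach the
# node, and 40 of the 80 labeled examples are positive:
p1, p0 = node_proportions(pos_n=30, pos=100, lab1=40, lab=80, unl=200, unl_n=50)
```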
The reader should note that |LAB1| / |LAB| is independent of the node n. We now define the information gain of the node n by

Gain(n, t) = Entropy(n) − Σ_{v∈Values(t)} (N_v^n / N^n) Entropy(n_v)

where Values(t) is the set of all possible values for the attribute test t, N^n is the cardinality of UNL_n, N_v^n is the cardinality of the set UNL_v^n of examples in UNL_n for which t has value v, and n_v is the node below n corresponding to the value v of the attribute test t.
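The entropy and unlabeled-weighted gain computations transcribe directly; the node values below are hypothetical.

```python
import math

def entropy(p1):
    """Entropy = -p0 log2 p0 - p1 log2 p1 (a term is 0 when its p is 0)."""
    p0 = 1.0 - p1
    return -sum(p * math.log2(p) for p in (p0, p1) if p > 0.0)

def gain(p1_node, unl_counts, p1_children):
    """Gain(n, t) = Entropy(n) - sum_v (N_v^n / N^n) Entropy(n_v), where the
    weights N_v^n / N^n come from the unlabeled examples at node n."""
    total = sum(unl_counts)
    return entropy(p1_node) - sum(
        (nv / total) * entropy(p1v)
        for nv, p1v in zip(unl_counts, p1_children))

# A perfectly splitting binary test: a node with p1 = 0.5 and two equally
# weighted pure children gains one full bit of entropy.
g = gain(0.5, [50, 50], [1.0, 0.0])
```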
5.3 Experimental Results
We applied the results of the previous section to C4.5 and called the resulting algorithm C4.5PosUnl. The differences with respect to C4.5 are the following:

– C4.5PosUnl takes as input three sets: LAB, POS and UNL;
– |LAB1| / |LAB|, which appears in (3), is calculated only once;
– for the current node, entropy and gain are calculated using (3);
– when the gain ratio is used, the split information is calculated with unlabeled examples;
– the majority class is chosen using (3);
– halting criteria during the top-down tree generation are evaluated on UNL;
– when pruning the tree, classification errors are estimated with the help of the proportions p0 and p1 from (3).
We consider two data sets from the UCI Machine Learning Database [MM98]: kr-vs-kp and adult. The majority class is chosen as positive. We fix the sizes of the test set, the set of positive examples, and the set of unlabeled examples. These values are set to:

– for kr-vs-kp: 1000 for the test set, 600 for the set of positive examples, and 600 for the set of unlabeled examples;
– for adult: 15000 for the test set, 10000 for the set of positive examples, and 10000 for the set of unlabeled examples.

We let the number of labeled examples vary, and compare the error rates of C4.5 and C4.5PosUnl. For a given size of LAB, we iterate 100 times the following: all sets are selected randomly (for POS, a larger set is drawn and only the selected number of positive examples is kept), and we compute the error rate for C4.5 with input LAB and the error rate for C4.5PosUnl with inputs LAB, POS and UNL. Then, we average the error rates over the 100 experiments. The results can be seen in Fig. 2. The error rates are promising when the number of labeled examples is small (e.g. less than 100). We think that the better results of C4.5 for higher numbers of examples are due to our pruning algorithm, which does not use positive and unlabeled examples in the best way (C4.5PosUnl trees are consistently larger than C4.5 ones).
[Fig. 2 consists of two plots of error rate vs. LAB size, each with curves "LAB only" and "LAB + fixed number of POS and UNL examples": the left one for kr-vs-kp (LAB size 0–300), the right one for adult (LAB size 0–800).]
Fig. 2. Error rate of C4.5 and C4.5PosUnl averaged over 100 trials. The left figure shows the results on the kr-vs-kp data set and the right one corresponds to the adult data set. adult and kr-vs-kp were selected for this paper because they are well known and contain many examples. The experiments were run on all other two-class UCI problems (ftp://grappa.univ-lille3.fr/pub/Experiments/C45PosUnl).
6 Conclusion
In many practical learning situations, labeled data are rare or expensive to collect, while a great number of positive and unlabeled data are available. In this paper,
we have given experimental and theoretical evidence that these kinds of examples can efficiently be used to boost statistical query learning algorithms. A lot of work remains to be done, in several directions:

– more precise theoretical results must be stated, at least for specific statistical query learning algorithms;
– C4.5 should be modified further, especially the pruning algorithm, which must be adapted to the data types presented in this paper;
– we intend to collect real data of the kind studied here (labeled, positive and unlabeled) to test this new variant of C4.5;
– our method can be applied to any statistical query algorithm. It would be interesting to know whether it is appropriate elsewhere.
References

[AL88] D. Angluin and P. Laird. Learning from noisy examples. Machine Learning, 2(4):343–370, 1988.
[BM98] A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. In Proc. 11th Annu. Conf. on Comput. Learning Theory, pages 92–100. ACM Press, New York, NY, 1998.
[Dec97] S. E. Decatur. PAC learning with constant-partition classification noise and applications to decision tree induction. In Proceedings of the Fourteenth International Conference on Machine Learning, 1997.
[Den98] F. Denis. PAC learning from positive statistical queries. In ALT 98, 9th International Conference on Algorithmic Learning Theory, volume 1501 of Lecture Notes in Artificial Intelligence, pages 112–126. Springer-Verlag, 1998.
[Kea93] M. Kearns. Efficient noise-tolerant learning from statistical queries. In Proceedings of the 25th ACM Symposium on the Theory of Computing, pages 392–401. ACM Press, New York, NY, 1993.
[KM97] M. Kubat and S. Matwin. Addressing the curse of imbalanced training sets: One-sided selection. In Proceedings of the 14th International Conference on Machine Learning, pages 179–186, 1997.
[MM98] C. J. Merz and P. M. Murphy. UCI repository of machine learning databases, 1998.
[NMTM98] K. Nigam, A. McCallum, S. Thrun, and T. Mitchell. Learning to classify text from labeled and unlabeled documents. In Proceedings of the 15th National Conference on Artificial Intelligence, AAAI-98, 1998.
[Qui93] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA, 1993.
[Val84] L. G. Valiant. A theory of the learnable. Commun. ACM, 27(11):1134–1142, November 1984.
Learning Real Polynomials with a Turing Machine

Dennis Cheung

Department of Mathematics, City University of Hong Kong, 83 Tat Chee Avenue, Kowloon, HONG KONG
[email protected]
Abstract. We provide an algorithm to PAC learn multivariate polynomials with real coefficients. The instance space from which labeled samples are drawn is IR^N, but the coordinates of such samples are known only approximately. The algorithm is iterative, and the main ingredient of its complexity, the number of iterations it performs, is estimated using the condition number of a linear programming problem associated to the sample. To the best of our knowledge, this is the first study of PAC learning concepts parameterized by real numbers from approximate data.
1 Introduction
In the PAC model of learning one often finds concepts parameterized by real numbers. Examples of such concepts appear in the first pages of well-known textbooks such as [7]. The algorithmics for learning such concepts follows the same pattern as that for learning concepts parameterized by Boolean values. One randomly selects a number of elements x1, …, xm in the instance space X. Then, with the help of an oracle, one decides which of them satisfy the target concept c*. Finally, one computes a hypothesis ch which is consistent with the sample, i.e. a concept ch which is satisfied by exactly those xi which satisfy c*. A main result from Blumer et al. [3] provides a bound for the size m of the sample above in order to guarantee that the error of ch is less than ε with probability at least 1 − δ, namely

m ≥ C0 (1/ε) ( log(1/δ) + VCdim(C) log(1/ε) )    (1)

where C0 is a universal constant and VCdim(C) is the Vapnik–Chervonenkis dimension of the concept class at hand. This result is especially useful when concepts are not discrete entities (i.e. not representable using words over a finite alphabet), since in this case one cannot bound the size of the sample without using the VC dimension.
where C0 is a universal constant and VCdim(C) is the Vapnick-Chervonenkis dimension of the concept class at hand. This result is specially useful when concepts are not discrete entities (i.e. not representable usin words over a nite alphabet) since in this case one can bound the size of the sample without usin the VC dimension. A particularly important case of concepts parameterized by real numbers is the one in which the membership test of an instance x X to a concept c in the concept class C can be expressed by a quanti er-free rst-order formula. In this O. Watanabe, T. Yokomori (Eds.): ALT’99, LNAI 1720, pp. 231 240, 1999. c Sprin er-Verla Berlin Heidelber 1999
Dennis Cheung
case, concepts in C_{n,N} are parameterized by elements in IR^n, the instance space X is the Euclidean space IR^N, and the membership of x ∈ X to c ∈ C is given by the truth of Ψ_{n,N}(x, c), where Ψ_{n,N} is a quantifier-free first-order formula of the theory of the reals with n + N free variables. In this case, a result of Goldberg and Jerrum [5] bounds the VC dimension of C_{n,N} by

VCdim(C_{n,N}) ≤ 2n log(8eds)    (2)
Here d is a bound for the degrees of the polynomials appearing in Ψ_{n,N} and s is a bound for the number of distinct atomic predicates in Ψ_{n,N}. One may say that, at this stage, the problem of PAC learning a concept c* ∈ C_{n,N} is solved. Given ε, δ > 0, we simply compute m satisfying (1) with the VC dimension replaced by the bound in (2). Then we randomly draw x1, …, xm ∈ X and finally we compute a hypothesis ch ∈ C_{n,N} consistent with the membership of xi to c*, i = 1, …, m (which we obtain from some oracle). To obtain ch we may use any of the algorithms proposed recently to solve the first-order theory of the reals (cf. [6, 8, 1]).

It is however at this stage that our research has its starting point, by remarking that, from a practical viewpoint, we can not exactly read the elements xi. Instead, we obtain rational approximations x̃i. As an example, imagine that we want to learn how to classify some kind of stones according to whether a stone satisfies a certain concept c* or not. For each stone we measure N parameters, e.g. radioactivity, weight, etc., and we have access to a collection of such stones already classified. That is, for each stone in the collection, we know whether the stone satisfies c*. When we measure one of the parameters x_j of a stone, say the weight, we don't obtain the exact weight but an approximation x̃_j. The membership of this stone to c* depends nevertheless on x = (x1, …, xN) and not on x̃. Our problem thus becomes that of learning c* from approximate data. A key feature is that we know the precision ε̄ of these approximations and that we can actually modify ε̄ in our algorithm to obtain better approximations. In our example this corresponds to fixing the number of digits appearing on the display of our measuring instrument.

In this paper we give an algorithm to learn from approximate data for a particular learning problem, namely, PAC learning the coefficients of a multivariate polynomial from the signs (≥ 0 or < 0) the polynomial takes over a sample of points.
While there are several papers dealing with this problem (e.g. [11, 2]), they either consider Boolean variables, i.e. X = {0, 1}^N, or they work over finite fields. To the best of our knowledge, the consideration of rounded-off real data is new. In studying the complexity of our algorithm we will naturally deal with a classical theme in numerical analysis, that of conditioning, and we will find the common dependence of running time on the condition number of the input (cf. [4]).
2 The Problem
Consider the class P^O_{d,N} of real polynomials of degree d in N variables which have a certain fixed monomial structure. That is, fix a subset O ⊆ IN^N such that, for all α = (α1, …, αN) ∈ O, α1 + … + αN ≤ d. The elements of P^O_{d,N} have the form

f = Σ_{α∈O} c_α y^α
with c_α ∈ IR and y^α = y1^{α1} ⋯ yN^{αN}. Let n be the cardinality of O. We will denote by c the vector of coefficients of f and assume an ordering of the elements in O so that c = (c1, …, cn). Also, to emphasize the dependence of f on its coefficient vector, we will write the polynomial above as f_c.

Our goal is to PAC learn a target polynomial f_{c*} with coefficients c*. The instance space is IR^N and we assume a probability distribution D over it. For an instance y ∈ IR^N, we say that y satisfies c* when f_{c*}(y) ≥ 0. This makes P^O_{d,N} into a concept class by associating to each f ∈ P^O_{d,N} the concept set {y ∈ IR^N : f(y) ≥ 0}. The error of a hypothesis f_{ch} is given by

Error(ch) = Prob( sign(f_{c*}(y)) ≠ sign(f_{ch}(y)) )

where the probability is taken according to D and the sign function is defined by sign(z) = 1 if z ≥ 0 and sign(z) = 0 otherwise. As usual, we will suppose that an oracle EX_{c*} : IR^N → {0, 1} is available computing EX_{c*}(y) = sign(f_{c*}(y)). We finally recall that a randomized algorithm PAC learns f_{c*} with error ε and confidence δ when it returns a concept ch satisfying Error(ch) ≤ ε with probability at least 1 − δ. Should we be able to deal with arbitrary real numbers, the following algorithm would PAC learn f_{c*}.

Algorithm 1
input: N, d, ε and δ
1. Compute m using (1) and (2)
2. Draw m random points y^(i) ∈ IR^N
3. Use the function EX_{c*} to obtain sign(f_{c*}(y^(i))) for i = 1, …, m
4. From step 3, obtain a number of linear inequalities in c*; these inequalities can be written in matrix form B1 c < 0, B2 c ≤ 0
5. Find any vector ch satisfying the system in step 4
6. Output: ch
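Steps 3–4 of Algorithm 1 amount to turning signed samples into rows of B1 and B2. A sketch under our own naming, taking O to be all monomials of degree ≤ d; the sign convention (row negation for sign 1) is one possible encoding of f(y) ≥ 0 as a ≤ 0 constraint.

```python
import itertools
import math

def monomials(N, d):
    """Exponent vectors alpha with |alpha| <= d: a concrete choice for O."""
    return [a for a in itertools.product(range(d + 1), repeat=N)
            if sum(a) <= d]

def build_system(points, signs, O):
    """Each sampled point y with sign(f(y)) gives a linear constraint on c:
    sign 0 (f(y) < 0) yields a row of B1 (B1 c < 0); sign 1 (f(y) >= 0)
    yields, after negation, a row of B2 (B2 c <= 0)."""
    B1, B2 = [], []
    for y, s in zip(points, signs):
        row = [math.prod(yj ** aj for yj, aj in zip(y, a)) for a in O]
        if s == 1:
            B2.append([-r for r in row])
        else:
            B1.append(row)
    return B1, B2

# Univariate toy target f(y) = y - 0.5, i.e. c* = (-0.5, 1) on O = {1, y}.
O = monomials(1, 1)                        # [(0,), (1,)]
B1, B2 = build_system([(0.0,), (1.0,)], [0, 1], O)
```

By construction, the true coefficient vector c* satisfies the resulting system, which is why step 5 is a feasibility problem with a non-empty feasible set.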
Note that, to execute step 5, we don't need the general algorithms for solving the first-order theory over the reals mentioned in the preceding section, but only an algorithm to find a feasible point of a linear programming instance whose feasible set is non-empty (since c* belongs to it). If real data can not be dealt with exactly, we need to proceed differently. We begin to do so in the next section, by discussing our model of round-off.
3 Round-Off and Errors
Let y ∈ IR^N. We say that ỹ ∈ Q^N approximates y with precision ε̄, 0 < ε̄ < 1 (or that ỹ is a measure of y with such precision), when

|ỹ_j − y_j| ≤ ε̄ |y_j|  for j = 1, …, N.
Remark. The definition above is the usual definition of relative precision found in numerical analysis. Numbers here are represented in the form

z = ± a · 10^e

where e ∈ ZZ, a = 0 or a ∈ [1, 10), and a is written with log(1/ε̄) digits. The number a is called the mantissa of z and the number e its exponent. For instance, 3.14159 · 10^0 approximates π with precision 10^{−6}.

There is a strong correlation between ε̄ and the number of digits (or bits) necessary to write down ỹ_j. However, this correlation does not translate into a definite relation, since the magnitude of ỹ_j (among other things) also contributes to determining this number of digits. For instance, the number 1.23456 · 10^{12} will use 13 digits when written down. On the other hand, 4.00000 · 10^0 only needs one digit. And both have precision 10^{−6}.

In our learning problem, we fix a precision ε̄ and we measure each instance y^(i) in our sample to obtain ỹ^(i). Consequently, when we compute B1 and B2 we do not obtain those matrices but rather some approximations B̃1 and B̃2. Our first result bounds the relative error (the precision) of the entries of B̃1 and B̃2.
Theorem 1. Let b̃ be any entry of B̃1 or B̃2. If |ỹ_j^(i) − y_j^(i)| ≤ ε̄ |y_j^(i)| for i = 1, …, m and j = 1, …, N, then

|b̃ − b| / |b| ≤ δ̄ = (1 + ε̄)^d − 1.
A crucial remark at this stage is that, if the feasible set of the system B1 c < 0, B2 c ≤ 0 has empty interior, then, no matter how small ε̄ is, the system B̃1 c < 0, B̃2 c ≤ 0 may have no solutions at all. Therefore, in the sequel, we will search only for interior solutions of the system. That is, we will search for solutions of the system

(LP1)  B c < 0

where B is the matrix obtained by stacking B1 and B2.
Our problem remains to find a solution of Bc < 0 knowing only B̃. In the next section we give a first step in this direction.
4 Narrowing the Feasible Set
Consider the system

(LP2)  B̃ c < 0.

Let A = (B  −B) and

(LP3)  A x < 0,  x ≥ 0

with x ∈ IR^{2n}. Similarly, let Ã = (B̃  −B̃) and

(LP4)  Ã x < 0,  x ≥ 0.

The following lemma is immediate.

Lemma 2. For i = 1, …, m and j = 1, …, 2n,

|ã_ij − a_ij| / |a_ij| ≤ δ̄ = (1 + ε̄)^d − 1.

Define Ā as follows. For i = 1, …, m and j = 1, …, 2n let

ā_ij = ã_ij / (1 − δ̄)  if ã_ij ≥ 0,
ā_ij = ã_ij / (1 + δ̄)  if ã_ij < 0.

Now consider the system

(LP5)  Ā x < 0,  x ≥ 0.
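The enlargement of Ã into Ā is a two-line transformation. A sketch (matrix values invented):

```python
def widen(A_tilde, delta):
    """Build A-bar from the measured matrix A-tilde: each entry is pushed up
    by the worst-case relative error delta, so that a_bar_ij >= a_ij whenever
    |a_tilde_ij - a_ij| <= delta * |a_ij|.  Together with x >= 0, this is why
    a solution of A-bar x < 0 also satisfies A x < 0 (Theorem 3)."""
    return [[a / (1.0 - delta) if a >= 0.0 else a / (1.0 + delta)
             for a in row]
            for row in A_tilde]

# With delta = 1/2, a positive entry is doubled and a negative entry is
# shrunk by a factor 2/3 (both moved upward):
row = widen([[1.0, -1.0]], 0.5)[0]
```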
Theorem 3. If x̄ is a solution of (LP5) and x̄ = (u, v) with u, v ∈ IR^n, then c = u − v is a solution of (LP1).

Theorem 3 inspires the following algorithm.

Algorithm 2
input: N, d, ε and δ
1. Compute m using (1) and (2)
2. Get m random points y^(i) ∈ IR^N
3. Use the function EX_{c*} to obtain sign(f_{c*}(y^(i))) for i = 1, …, m
4. ε̄ := (3/2)^{1/d} − 1
5. Measure y^(i) with precision ε̄ to obtain ỹ^(i), i = 1, …, m
6. Write down the system (LP2), i.e., B̃ c < 0
7. Transform system (LP2) to (LP4) and then to (LP5), as described above
8. If there is any vector x̄h satisfying (LP5), return ch := uh − vh and HALT; else ε̄ := ε̄², go to step 5
Remark. The initial value ε̄0 for the precision is set to (3/2)^{1/d} − 1 since this implies δ̄ = 1/2. We actually can take for ε̄0 the largest power of 2 smaller than (3/2)^{1/d} − 1. At each iteration of the algorithm, ε̄ is squared. This corresponds to doubling the number of bits of the mantissas in the measures ỹ_j^(i).

Before stepping into the analysis of Algorithm 2, we derive an upper bound for the relative error of Ā as an approximation of A. In the next statement, and in the rest of this paper, ‖ ‖ denotes the 2-norm in Euclidean space.

Proposition 4. If ε̄ ≤ (3/2)^{1/d} − 1 then, for i = 1, …, m,

‖ā_i − a_i‖ / ‖a_i‖ ≤ 4 δ̄.

5 Complexity and Condition
Algorithm 2 can be implemented on a Turing machine (modulo the oracle EX_{c*}). Its running time is determined by two quantities: 1) the number of iterations performed by the algorithm, and 2) the bit-size of the rational numbers involved in the intermediate computations.
Notice that the cost of each iteration is dominated by step 8 and, more precisely, by checking the existence of a solution of (LP5). This is a linear programming problem over the rationals and, as such, it can be solved in polynomial time by either the ellipsoid method or the interior point method (see, e.g., [10]). These methods work in time bounded by a polynomial of low degree in the dimension of the input system and linear in the largest bit-size of its entries. In our problem, the dimension of the input system is fixed (m × 2n) through all the iterations and is itself polynomial in n, d, 1/ε and log(1/δ).

The bit-size of the entries of (LP5) presents a more complicated issue. It depends, on the one hand, on the largest and smallest (in absolute value) quantities among the ỹ_j^(i) and, on the other hand, on the precision ε̄ with which these quantities are measured. The first number,

L = max { |ỹ_j^(i)|, 1/|ỹ_j^(i)| : i ≤ m, j ≤ N, ỹ_j^(i) ≠ 0 },
is not controlled by Algorithm 2. It is actually a random variable depending on the distribution D. The second number, the precision ε, affects the bit-size of ȳ_j^(i) as observed in Remark 1. In the rest of this paper we will focus on estimating the number of iterations performed by Algorithm 2. In doing so, it will be necessary to traverse the territory of linear programming and numerical analysis.

Let b_i ∈ IR^n be the ith row of B and b_i^⊥ be the hyperplane perpendicular to b_i and passing through the origin. Note that b_i^⊥ is the boundary of the half-space defined by the inequality b_i c < 0.

Definition 5. For every c ∈ IR^n let θ_i(B, c) be the acute angle, i.e. 0 ≤ θ_i ≤ π/2, between c and the hyperplane b_i^⊥. Also, let

    θ(B, c) = min_{i=1,...,m} θ_i(B, c).

Finally, let the condition number of B be

    C(B) = min_{c ∈ Sol(B)} 1 / sin θ(B, c).

Here Sol(B) denotes the set of points c ∈ IR^n such that Bc < 0. We will denote by c* any point in Sol(B) for which this minimum is attained.

Remark. Note that c* actually maximizes θ(B, c). Also, for c ∈ Sol(B), let d_i be the distance between c and b_i^⊥. Then, d_i = ‖c‖ sin θ_i(B, c). So, we can rewrite C(B) as

    C(B) = min_{c ∈ Sol(B)} ‖c‖ / min_{i ≤ m} d_i.

The expression min_i d_i / ‖c‖ can be seen as the (normalized) distance from c to the boundary of Sol(B). So, c* is a solution of (LP1) having a maximal distance to this boundary and C(B) is the inverse of this distance.
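For a concrete feasible point c, the quantity 1/sin θ(B, c) of Definition 5 is easy to evaluate, using the elementary identity sin θ_i(B, c) = |⟨b_i, c⟩| / (‖b_i‖ ‖c‖). The sketch below (our illustration, not part of the paper) evaluates it for a system whose solution set is the open negative quadrant of IR²:

```python
import math

def sin_angle_to_hyperplane(b, c):
    """sin of the acute angle between the vector c and the hyperplane
    b^perp, equal to |<b, c>| / (||b|| ||c||)."""
    dot = sum(bi * ci for bi, ci in zip(b, c))
    norm_b = math.sqrt(sum(bi * bi for bi in b))
    norm_c = math.sqrt(sum(ci * ci for ci in c))
    return abs(dot) / (norm_b * norm_c)

def inverse_sin_theta(B, c):
    """1 / sin theta(B, c) for a feasible point c (Bc < 0); minimizing
    this quantity over Sol(B) gives the condition number C(B)."""
    assert all(sum(bi * ci for bi, ci in zip(row, c)) < 0 for row in B)
    return 1.0 / min(sin_angle_to_hyperplane(row, c) for row in B)

# Sol(B) is the open negative quadrant of IR^2; the bisector c = (-1, -1)
# maximizes theta(B, c), so C(B) = sqrt(2) for this B.
B = [[1.0, 0.0], [0.0, 1.0]]
print(inverse_sin_theta(B, [-1.0, -1.0]))
```

Moving c toward the boundary of Sol(B) makes 1/sin θ(B, c) blow up, matching the distance-to-boundary reading of C(B) in the remark above.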
238
Dennis Cheung
Intuitively, if C(B) is small (i.e. if θ(B, c*) is large) a greater error can be allowed in the coefficients of B and we may need fewer iterations in Algorithm 2. The following theorem, our main result, quantifies this fact.

Theorem 6. If C(B) < ∞ then the algorithm will halt and return a solution c_h. Furthermore, the number of iterations is bounded by the smallest integer greater than

    1 + log₂( log₂(ε*) / log₂(ε₀) ),

where ε₀ is the value of ε = (3/2)^(1/d) − 1 set in step 4 and

    ε* = ( 1 + 2 / (1 + √(1 + 8C(B))) )^(1/d) − 1.

Remark. Theorem 6 bounds the number of iterations in Algorithm 2 as a function of C(B). We end this section by noting that C(B) is actually a random variable, since B depends on the random sample y^(1), ..., y^(m). The number of iterations in Algorithm 2 is a random variable as well. Its expected value can be bounded by replacing C(B) by its expected value in the bound of Theorem 6.
6
A Characterization of C(B)
Condition numbers are defined in numerical analysis mainly for continuous functions φ : IR^n → IR^m. At a point x ∈ IR^n, the condition number μ(x) measures the largest possible value of ‖φ(x + Δ) − φ(x)‖ over all infinitesimal perturbations Δ of the point x. A recurrent theme is the relation between μ(x) and the distance from x to the set Σ of ill-posed inputs, i.e. to the set of points x ∈ IR^n such that μ(x) = ∞. For a number of problems, the condition number μ(x) is the inverse of the distance from x to Σ (often multiplied by ‖x‖ to scale properly). For computational problems which are not describable by a function as above, the definition of condition number is less clear (we have been very sketchy here; for a more detailed discussion on condition numbers see [4, 12]).

For linear programming problems, several condition numbers were defined in the last few years (e.g. [9, 13]). The one introduced by Renegar is defined precisely in terms of distance to ill-posedness. Let Bc < 0 be a feasible system, i.e. Sol(B) ≠ ∅. Also, let

    D(B) = sup { Δ : Sol(B′) ≠ ∅ for every B′ with max_{i,j} |b_{ij} − b′_{ij}| < Δ }.

Renegar defines the condition number of B to be

    C̄(B) = max_{i,j} |b_{ij}| / D(B).
A variation of Renegar's condition number, also in the spirit of the inverse of the distance to infeasibility (but normalized differently), is the following. Let

    D′(B) = sup { Δ : Sol(B′) ≠ ∅ for every B′ with max_i ‖b_i − b′_i‖ / ‖b_i‖ < Δ }

and

    C′(B) = 1 / D′(B).

In this section, we will state some relationships between C(B), C′(B) and C̄(B).

Theorem 7. C(B) = C′(B).

Proposition 8. C′(B) ≤ n C̄(B).
One can prove that an upper bound for C̄(B) in terms of C′(B) with the format of Proposition 8, i.e. with the form C̄(B) ≤ f(n, m) C′(B) for some function f of the dimensions n and m, cannot exist. A key difference between C̄(B) and C′(B) is that, while both condition numbers are homogeneous of degree zero in B (i.e. C(λB) = C(B) for all λ > 0), C′(B) is actually multihomogeneous in its rows, while C̄(B) is sensitive to differences in row scaling. The next two propositions make this more precise.

Proposition 9. Let λ₁, ..., λ_m > 0 and b′_i = λ_i b_i for i = 1, ..., m. Denote by B′ the matrix whose ith row is b′_i. Then C′(B) = C′(B′).

Proposition 10. C̄(B) ≤ n (max_j ‖b_j‖ / min_j ‖b_j‖) C′(B).
Acknowledgement. The ideas of Renegar [9] have been a source of inspiration for our work. We are also indebted to Steve Smale, who first suggested to us the subject of this paper.
References

[1] Basu, S., R. Pollack, and M.-F. Roy (1994). On the combinatorial and algebraic complexity of quantifier elimination. In 35th Annual IEEE Symp. on Foundations of Computer Science, pp. 632–641.
[2] Bergadano, F., N. Bshouty, and S. Varricchio (1996). Learning multivariate polynomials from substitutions and equivalence queries. Preprint.
[3] Blumer, A., A. Ehrenfeucht, D. Haussler, and M. Warmuth (1989). Learnability and the Vapnik-Chervonenkis dimension. J. of the ACM 36, 929–965.
[4] Cucker, F. (1999). Real computations with fake numbers. To appear in Proceedings of ICALP'99.
[5] Goldberg, P. and M. Jerrum (1995). Bounding the Vapnik-Chervonenkis dimension of concept classes parameterized by real numbers. Machine Learning 18, 131–148.
[6] Heintz, J., M.-F. Roy, and P. Solernó (1990). Sur la complexité du principe de Tarski-Seidenberg. Bulletin de la Société Mathématique de France 118, 101–126.
[7] Kearns, M. and U. Vazirani (1994). An Introduction to Computational Learning Theory. The MIT Press.
[8] Renegar, J. (1992). On the computational complexity and geometry of the first-order theory of the reals. Part I. Journal of Symbolic Computation 13, 255–299.
[9] Renegar, J. (1995). Incorporating condition measures into the complexity theory of linear programming. SIAM Journal of Optimization 5, 506–524.
[10] Schrijver, A. (1986). Theory of Linear and Integer Programming. John Wiley & Sons.
[11] Schapire, R. and L. Sellie (1993). Learning sparse multivariate polynomials over a field with queries and counterexamples. In 6th ACM Workshop on Computational Learning Theory, pp. 17–26.
[12] Smale, S. (1997). Complexity theory and numerical analysis. In A. Iserles (Ed.), Acta Numerica, pp. 523–551. Cambridge University Press.
[13] Vavasis, S. and Y. Ye (1995). Condition numbers for polyhedra with real number data. Oper. Res. Lett. 17, 209–214.
Faster Near-Optimal Reinforcement Learning: Adding Adaptiveness to the E3 Algorithm

Carlos Domingo

Dept. of Math. and Comp. Science, Tokyo Institute of Technology, Meguro-ku, Ookayama, Tokyo, Japan
carlos@is.titech.ac.jp
Abstract. Recently, Kearns and Singh presented the first provably efficient and near-optimal algorithm for reinforcement learning in general Markov decision processes. One of the key contributions of the algorithm is its explicit treatment of the exploration-exploitation trade-off. In this paper, we show how the algorithm can be improved by replacing the exploration phase, which builds a model of the underlying Markov decision process by estimating the transition probabilities, with an adaptive sampling method more suitable for the problem. Our improvement is twofold. First, our theoretical bound on the worst-case time needed to converge to an almost optimal policy is significantly smaller. Second, due to the adaptiveness of the sampling method we use, we discuss how our algorithm might perform better in practice than the previous one.
1
Introduction
In reinforcement learning, an agent faces the problem of learning how to behave in an unknown dynamic environment in order to achieve a goal. Instead of receiving examples as in the supervised learning model, the learning agent must discover by interaction with the environment how to behave to get the most reward [9]. Reinforcement learning has been receiving increasing attention in the last few years from both machine learning practitioners and theoreticians. The formal modeling of the environment as a Markov decision process (see the next section for a formal definition) makes it particularly suitable for obtaining theoretical results that might also be applicable to real-world problems. Recently, learning-theoretic style results have been obtained, the algorithm E3 of Kearns and Singh [5] being one of the most relevant. The E3 algorithm is the first reinforcement learning algorithm that provably achieves near-optimal performance in polynomial time for general Markov decision processes, in contrast with previous asymptotic results. One of the key contributions of the algorithm is an explicit treatment of the exploration versus exploitation dilemma inherent to any reinforcement learning problem. In other words, any reinforcement learning

Supported by the EU Science and Technology Fellowship Program (STF13) of the European Commission.

O. Watanabe, T. Yokomori (Eds.): ALT'99, LNAI 1720, pp. 241–251, 1999.
© Springer-Verlag Berlin Heidelberg 1999
242
Carlos Domingo
algorithm has to spend some time obtaining information about the environment, the exploration phase, and then use that information to discover an almost optimal policy, the exploitation phase. Obviously, a too long exploration phase might lead to a poor bound while a too short one might not allow the algorithm to find an almost optimal policy. The E3 algorithm provides an explicit method for deciding when to switch between the two phases.

A close look at the analysis of the E3 algorithm reveals that the factor that strongly dominates the time bound (a polynomial of degree 4 in all the relevant problem parameters) comes from the sampling process used by the authors to estimate the transition probabilities during the exploration phase. Not only is the bound too large for any practical purposes, it also fails to satisfy the following intuitively desirable property. Given a certain state i in a Markov decision process, suppose that one of the transition probabilities from state i to state j executing action a is very high, for instance 0.9. Intuitively, most of the time when we apply action a from state i we will land on state j and then, we should be able to realize very quickly that the probability of that particular transition is large. On the other hand, suppose that another state i′ has transition probabilities uniformly distributed among all the reachable states. In this case, when applying action a from i′ we will be landing all the time in different states. Intuitively, in this case, more experience will be required to obtain a good approximation of all these transition probabilities. However, the E3 algorithm will execute action a from both states, i and i′, exactly the same number of times. In other words, the number of times that we need to execute an action from a certain state in the exploration phase is fixed in advance to a worst-case bound independent of the underlying Markov decision process.
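This asymmetry is easy to see in simulation. The sketch below is illustrative only (the threshold c is an arbitrary choice, not the bound from the analysis): it counts how many executions of an action are needed before a target transition has been observed a fixed number of times, so a 0.9-probability transition is certified far sooner than a 0.05-probability one.

```python
import random

def trials_until_count(p, c, rng):
    """Execute Bernoulli(p) trials until c successes have been seen,
    mimicking a count-based sequential stopping rule: the expected
    number of trials is c / p, so likely transitions finish early."""
    successes = 0
    trials = 0
    while successes < c:
        trials += 1
        if rng.random() < p:
            successes += 1
    return trials

rng = random.Random(0)
c = 50                                   # hypothetical stopping threshold
fast = trials_until_count(0.90, c, rng)  # around c / 0.90, i.e. ~56 trials
slow = trials_until_count(0.05, c, rng)  # around c / 0.05, i.e. ~1000 trials
```

A fixed worst-case scheme, by contrast, would run both transitions for the larger of the two budgets.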
The improvement proposed here solves the two problems just mentioned. We substitute the static batch sampling method used in the original E3 algorithm by a sequential sampling method adapted to this problem from the one proposed in [8]. This algorithm does the sampling sequentially and has a stopping condition that depends on the current estimates. Thus, the amount of sampling needed to estimate a certain transition probability will depend adaptively on the unknown underlying value being estimated, satisfying the intuition outlined above. In other words, our version of E3 will perform differently depending on the underlying Markov decision process. That is, it will adapt to the situation at hand instead of always assuming the worst case. Moreover, even in the worst case, we will still use fewer examples than the fixed worst-case bound provided in [5]. Due to the nature of our sampling method, we can obtain estimators that are multiplicatively close to the original probabilities instead of additively close as in the original E3. This will allow us to modify the proof of correctness so that the amount of sampling required will be smaller, even in the worst case. Since the time spent estimating the underlying model dominates the overall time bound, this will lead us to reduce it from a polynomial of degree 4 to a polynomial of degree 2 in the worst case, a significant reduction.
Faster Near-Optimal Reinforcement Learning
243
Adaptive sampling has been studied for a long time (see, for instance, the book by Wald [10]) and has also been recently used in the context of database query estimation [8] and knowledge discovery [1,2]. Furthermore, adaptivity is a very desirable property for an algorithm that is expected to be used in practical applications. See the discussion about the relevance of adaptivity in the context of learning and discovery science in [11]. As noted by the authors, a practical implementation based on the algorithmic ideas provided by them would enjoy performance on natural problems that is considerably better than what their bounds indicate. In fact, our improvement corroborates that intuition, since our modification uses all the ideas strongly related to the reinforcement learning problem they proposed while substituting the sampling method used by a more appropriate one for this problem in an ad hoc manner. It is important to notice that our method, while being more efficient, also keeps the same theoretical guarantees of reliability as the one used by the original E3 algorithm.

This paper is organized as follows. In Section 2 we provide the formal definitions related to reinforcement learning. Then, we move on to Section 3 where we review in detail the E3 algorithm and its proof. In Section 4 we show how to modify the E3 algorithm with the adaptive sampling method and sketch a proof of its correctness. We conclude in Section 5 discussing our result and future related work.
2
Preliminaries and Definitions
A Markov decision process (MDP) M can be defined by a tuple (S, A, T, R) where: S is a set of states S = {1, ..., N}; A is a set of actions A = {a₁, ..., a_k}; T represents the set of transition probabilities for each state-action pair (i, a), where P^a_M(ij) ≥ 0 specifies the probability of landing in state j when executing action a from state i in M¹; R represents the rewards, R : S → [0, R_max], where R(i) is the reward obtained by the agent in state i.

A policy defines how the agent behaves on the Markov decision process. More formally, a policy π in a Markov decision process M over a set of states {1, ..., N} with actions {a₁, ..., a_k} is a mapping π : {1, ..., N} → {a₁, ..., a_k}.

Once a Markov decision process and a policy have been defined, we will discuss how good that policy is using the two standard asymptotic measures of return of a policy: the expected average return and the expected discounted return. Since the goal of the E3 algorithm, as well as of our improved version, is to obtain finite-time convergence results, in this draft we will be talking only about the T-step expected average return; this can be translated to the expected discounted return through the horizon time 1/(1 − γ), or to the expected average return through the mixing time of the optimal policy, as described in [5].

Let M be a Markov decision process and let π be a policy in M. The T-step (expected) average return from state i is defined as U^π_M(i, T) = Σ_p Pr^π_M[p] U_M(p),
¹ Note that Σ_j P^a_M(ij) = 1 for any pair (i, a).
244
Carlos Domin o
where the sum is over all T-paths p in M that start at i, Pr^π_M[p] is the probability of crossing path p in M executing π, and U_M(p) is the average return along path p, defined as U_M(p) = (1/T)(R(i₁) + ... + R(i_T)). Moreover, we define the optimal T-step average return from i in M by U*_M(i, T) = max_π U^π_M(i, T).

Thus, the goal in reinforcement learning will be the following. Given parameters ε (the error), δ (the confidence) and T (the time horizon in the case of discounted return, and the mixing time of the optimal policy in the case of infinite-horizon average return), we would like to have a reinforcement learning algorithm that, with probability larger than 1 − δ, obtains a policy such that the expected return of the policy is ε-close to the optimal one.
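The T-step average return defined above can be computed without enumerating the exponentially many T-paths p, by propagating the distribution over states for T steps. A sketch (the two-state MDP is our own toy example, not from the paper):

```python
def t_step_average_return(P, R, policy, i, T):
    """U^pi_M(i, T): expected average of the rewards R(i_1), ..., R(i_T)
    along a T-path started at i, computed by pushing the distribution
    over states forward instead of summing over paths.
    P[a][s][j] = P^a_M(sj), R[s] = reward, policy[s] = action index."""
    n = len(R)
    dist = [0.0] * n
    dist[i] = 1.0
    total = 0.0
    for _ in range(T):
        total += sum(dist[s] * R[s] for s in range(n))  # reward at this step
        nxt = [0.0] * n
        for s in range(n):
            if dist[s] > 0.0:
                a = policy[s]
                for j in range(n):
                    nxt[j] += dist[s] * P[a][s][j]
        dist = nxt
    return total / T

# Hypothetical MDP: action 0 stays put, action 1 swaps the two states.
P = [
    [[1.0, 0.0], [0.0, 1.0]],  # action 0
    [[0.0, 1.0], [1.0, 0.0]],  # action 1
]
R = [0.0, 1.0]
# policy: swap out of state 0, stay in state 1 -> rewards 0,1,1,1 over T=4
print(t_step_average_return(P, R, policy=[1, 0], i=0, T=4))  # 0.75
```

The per-path sum and the distribution-propagation form agree because expectation is linear in the path probabilities.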
3
The E3 Algorithm
Before going into the details of our improved version, we will need to review in certain detail the E3 algorithm and its proof of correctness and reliability. The E3 algorithm is what is usually called an indirect or model-based reinforcement learning algorithm. That is, it builds a partial model of the underlying Markov decision process and then attempts to compute an optimal policy from it. Thus, there are two clearly distinct phases that we review in the following.

One phase consists in obtaining knowledge of the underlying Markov model from experience on it. The kind of experience that the algorithm has access to consists of the following: given that the agent is in a particular state at a particular time step, choose the action to perform, execute it, and land on a (possibly) different state from where a further experiment can be performed. Kearns and Singh coined the term balanced wandering for this, meaning that upon arrival in a state, the algorithm tries the action that has been tried the fewest number of times. Since we assume that the world where the agent moves is modeled by an MDP, the state reached when choosing an action from a certain state is determined according to the transition probability distribution. This experience is gathered at each state the algorithm visits and used to build an approximate model of the transition probability distribution in the obvious way. This phase is what is known as the exploration phase.

The other phase, known as the exploitation phase, consists of making use of the statistics gathered so far to compute a policy that, hopefully, at some point will be close to the optimal one in the sense made precise in Section 2. These two phases are interleaved during the learning process. That is, the algorithm collects statistics by experimenting on the MDP (therefore, it is in the exploration phase) and at some point it decides to switch to the exploitation phase and attempts to compute an optimal policy using the approximate model constructed so far.
Whenever the policy is still not close enough to the optimal one, it goes back to the exploration phase and tries to collect new statistics about the unknown part of the underlying MDP, and so on. Thus, it is important to notice that the algorithm might choose in some cases not to build a complete
model of the underlying MDP if doing so is not necessary to achieve a close to optimal return. One of the main contributions of the E3 algorithm was to provide an explicit method for deciding when to switch between phases; hence the name Explicit Explore or Exploit (E3) for the algorithm. For this, a crucial definition was that of a known state, a state that has been visited a sufficient number of times so that the estimates of the transition probabilities from that state are close enough to their true values. We state here their definition for future comparison with the new one we will provide in Section 4. Recall that ε, δ and T are the input parameters of the algorithm as described in Section 2.

Definition 1. [5] Let M be a Markov decision process over N states. We say that a state i of M is known if each action has been executed from i at least m_kn = O(((N T R_max)/ε)⁴ log(1/δ)) times.

An important observation is that, given the definition above, we cannot do balanced wandering forever before at least one state becomes known. By the Pigeonhole Principle, after at most N(m_kn − 1) + 1 steps of balanced wandering, some state becomes known. When the agent lands in a known state (either previously known or one that just becomes known at this point for the first time), the algorithm does the following. The algorithm builds a new MDP with the set of currently known states. More precisely, if S is the set of currently known states in M, it constructs an MDP M_S induced on S by M, where all transitions between states in S are preserved and the rest are redirected to a new absorbing state that intuitively represents all the unknown and unvisited states. Notice that even though the algorithm does not know M_S, it has a good approximation of it thanks to the definition of known state; we will refer to this approximation as M̂_S. The notion of approximation used is the following.

Definition 2. [5] Let M and M̂ be two Markov decision processes over the same set of states.
Then, we will say that M̂ is an α-approximation of M if for any pair of states i and j and any action a, the following inequalities are satisfied:

    P^a_M(ij) − α ≤ P^a_M̂(ij) ≤ P^a_M(ij) + α.

It can be shown by a straightforward application of the Hoeffding bound that if a state is known as defined in Definition 1, then M̂_S is an O((ε/(N T R_max))²)-approximation of M_S. Moreover, Kearns and Singh proved a Simulation Lemma stating that, in this case, the expected T-return of any policy in M̂_S is close to the expected return of the same policy in M_S. The other key lemma in their analysis is the Exploit or Explore Lemma, which states that either the optimal policy achieves high return just by using the set of known states S (which can be detected thanks to M̂_S and the Simulation Lemma) or the optimal policy has high probability of leaving S (which again the algorithm can detect by finding an exploration policy that quickly reaches the additional absorbing state in M̂_S). Thus, performing two off-line computations on M̂_S, the
algorithm is provided either with a way to compute a policy with near-optimal return for the next T steps, or with a way to obtain new statistics on an unknown or unvisited state. The computation time required for these off-line computations is shown to be bounded by O(N²T). Putting all these pieces together, Kearns and Singh showed that the E3 algorithm achieves, with probability larger than 1 − δ, a return that is close to the return of the optimal policy in time polynomial in N, T, 1/ε and log(1/δ). The main factor in the overall bound comes from the definition of known states, that is, the number of times that a state needs to be visited during the exploration phase before we can attempt to compute an optimal policy. We refer to their paper [5] for further details.

Our improvement affects only the part concerning the exploration phase. In the following section we will show how a more efficient sampling method will allow us to declare a state as known more quickly, obtaining a significant reduction in the overall running time while keeping the same theoretical guarantees of reliability. Our improvement does not affect the rest of the algorithm and thus, the same lemmas can be used almost exactly the same way as in the original E3 algorithm to prove the correctness of the overall algorithm.
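The construction of the induced MDP M_S described above is mechanical. The sketch below is our illustration (one action, three states; names and indices are ours): transitions inside the known set S are preserved and the remaining probability mass is redirected to a fresh absorbing state.

```python
def induced_mdp(P, known):
    """Build M_S: keep transitions among known states, redirect everything
    else to a new absorbing state (the last index of the new model).
    P[a][s][j] is the (estimated) probability of s -> j under action a."""
    n = len(P[0])
    order = sorted(known)
    idx = {s: t for t, s in enumerate(order)}
    absorbing = len(order)                    # extra state for "unknown"
    PS = []
    for Pa in P:
        Qa = [[0.0] * (absorbing + 1) for _ in range(absorbing + 1)]
        for s in order:
            for j in range(n):
                if j in known:
                    Qa[idx[s]][idx[j]] += Pa[s][j]
                else:
                    Qa[idx[s]][absorbing] += Pa[s][j]
        Qa[absorbing][absorbing] = 1.0        # absorbing state loops forever
        PS.append(Qa)
    return PS

# one action, three states; states 0 and 1 are known, state 2 is not
P = [[[0.5, 0.3, 0.2],
      [0.1, 0.8, 0.1],
      [0.0, 0.0, 1.0]]]
PS = induced_mdp(P, known={0, 1})
```

Because each redirected row keeps its total mass, every row of the induced model still sums to one, so M_S is again an MDP.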
4
Knowing the States Faster: Adding Adaptivity to the Exploration Phase
As we mentioned in the introduction, the key idea for improving E3 relies on using a different sampling method for estimating the transition probabilities. This will result in a substantial reduction in the number of statistics needed before a state is declared known. For this, we first need to modify the notion of approximation given in Definition 2.

Definition 3. Let M and M̂ be two Markov decision processes over the same set of states. Then, we say that M̂ is an α-strong approximation of M if for any pair of states i and j, and any action a such that P^a_M(ij) ≥ α, the following inequalities are satisfied:

    (1 − α)P^a_M(ij) ≤ P^a_M̂(ij) ≤ (1 + α)P^a_M(ij).
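Definition 3 is mechanical to check. The sketch below is our illustration (flat lists of transition probabilities rather than full MDPs): transitions below α are unconstrained, while the rest must lie within a multiplicative (1 ± α) factor.

```python
def is_strong_approx(p_true, p_hat, alpha):
    """Check the alpha-strong approximation condition on two aligned lists
    of transition probabilities: every true probability >= alpha must
    satisfy (1 - alpha) * p <= p_hat <= (1 + alpha) * p; smaller ones
    are free."""
    return all(
        (1 - alpha) * p <= q <= (1 + alpha) * p
        for p, q in zip(p_true, p_hat)
        if p >= alpha
    )

p_true = [0.90, 0.08, 0.02]   # transitions out of one (state, action) pair
p_hat  = [0.85, 0.09, 0.00]   # 0.08 and 0.02 are below alpha: unconstrained
print(is_strong_approx(p_true, p_hat, alpha=0.10))  # True
```

Shrinking α tightens the relative band and pulls more transitions under the constraint; the same p_hat already fails with alpha = 0.05.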
The differences between this definition and the original one are the following. We require that only the transitions that are not too small be approximated, in a multiplicative way, while previously every transition was required to be approximated, but only in an additive way. The following key lemma, related to the one already proved in [5], holds under the definition of approximation given above.

Lemma 1. (Modified Simulation Lemma) Let M be any Markov decision process over N states and let M̂ be an O(ε/(N T R_max))-strong approximation of M. Then for any policy π, number of steps T and any state i,

    U^π_M(i, T) − ε ≤ U^π_M̂(i, T) ≤ U^π_M(i, T) + ε.
Proof. The proof follows the same lines as the proof of the related lemma provided in [5]. For simplicity of notation, let us denote α = cε/(4N T R_max), where c is some constant smaller than 1, and let us fix throughout the proof a policy π and a start state i.

We first consider the contribution to the return of the transitions that are not approximated in M̂, that is, the transitions whose transition probabilities are smaller than α. Since we have N states, the total probability of all these transitions is at most Nα. Moreover, the probability that in T steps we cross any of these transitions can be bounded by N Tα. Thus, the total contribution to the expected return of any of these transitions is at most N Tα R_max. By our definition of α this quantity is smaller than cε.

Let us now consider the paths that do not cross any transition whose transition probability is smaller than α. By our definition of α-strong approximation, for any such path p of length T the following holds:

    (1 − α)^T Pr^π_M[p] ≤ Pr^π_M̂[p] ≤ (1 + α)^T Pr^π_M[p].

Recall that U^π_M(i, T) = Σ_p Pr^π_M[p] U_M(p). Since the inequality above holds for any such path, it also holds when we take the expected value. In other words, the following inequality can be derived from the inequality above:

    −cε + (1 − α)^T U^π_M(i, T) ≤ U^π_M̂(i, T) ≤ (1 + α)^T U^π_M(i, T) + cε,
where the term cε comes from our previous calculation of the error introduced by the small transitions not required to be approximated in the definition of α-strong approximation.

Now we should show that our choice of α, together with the inequality obtained above, implies the lemma. We will show it only for the upper bound; the lower bound can be shown in a similar manner. For the upper bound, we need to show that the following two inequalities hold:

    (1 + α)^T U^π_M(i, T) ≤ U^π_M(i, T) + ε/2   and   cε ≤ ε/2.

The second inequality easily follows by choosing an appropriate value for c in the definition of α. For the first one, showing that (1 + α)^T ≤ 1 + ε/(2R_max) holds will suffice, since the average reward is bounded by R_max. To see this, notice that from the Taylor expansion of log(1 + α) one can show that T log(1 + α) is less than Tα, and standard calculus shows that e^x ≤ 1 + 2x for 0 ≤ x ≤ 1. Thus, choosing α such that 2Tα ≤ ε/(2R_max) will suffice, and this can be obviously satisfied by our choice of α and an appropriate choice of c.

As in [5], the appropriate definition of known state should be derived from the Simulation Lemma. We will use the following one.

Definition 4. Let α = O(ε/(N T R_max)). A state i will be called well-known when, for any action a such that P^a_M(ij) ≥ α,

    (1 − α)P^a_M(ij) ≤ P^a_M̂(ij) ≤ (1 + α)P^a_M(ij).
Notice that a straightforward application of the Chernoff bound is not appropriate to determine how many times we need to visit a state before it can be declared well-known. This is because the required approximation is multiplicative, and thus the number of visits we would obtain by deriving it from the Chernoff bound would depend on the true transition probabilities, numbers that are unknown to the algorithm. In fact, they are precisely what is being estimated. On the other hand, the Hoeffding bound cannot be used here to derive the number of necessary steps, since it only provides an additive approximation while we want a multiplicative one. For the statements of the Chernoff and Hoeffding bounds we refer the reader to [6].

To get around this difficulty, we will use a sequential sampling algorithm based on the one proposed by Lipton et al. [7,8] for database query estimation. This algorithm will substitute the following steps of the E3 algorithm. Recall that when the algorithm is in the exploration phase, it does balanced wandering. That is, it just executes the least used action from the state where the agent is currently located, updates the estimates according to the landing state and, in case the number of times that the state has been visited becomes larger than that required by the definition of known state given in Section 3, declares the state known.
AdaExploStep /* for state i */
1   if state i is visited for the first time then
2       α = O(ε/(N T R_max));
3       d = 10 ln(2kN/δ)/α²;
4       for all states j ∈ S and actions a ∈ A do
5           m_a = 0; N^a_M(ij) = 0;
6   apply action a (least used, breaking ties randomly);
7   let j be the landing state;
8   N^a_M(ij) = N^a_M(ij) + 1; m_a = m_a + 1;
9   if (N^a_M(ij) ≥ 3 ln(2kN/δ)(1 + α)/α²) then
10      declare P^a_M(ij) as estimated by P̂^a_M(ij) = N^a_M(ij)/m_a;
11  if (m_a ≥ d) then
12      declare all remaining P^a_M(ij′) as estimated by P̂^a_M(ij′) = N^a_M(ij′)/m_a;
13  if all P^a_M(ij) are estimated for all states j and actions a then
14      declare i as well-known.
Fig. 1. Adaptive exploration step for state i.

Our approach is the following. Every time we land on an unknown or unvisited state i in the exploration phase, we will execute an Adaptive Exploration Step (AdaExploStep for short). Pseudocode for AdaExploStep is provided in Figure 1 and we discuss it now. First, if the state is unvisited, we will initialize the variables N^a_M(ij) that will be used for estimating the real transition probabilities P^a_M(ij), the counters m_a of the number of times every action a has been used from i, and the two parameters α and d of the algorithm. Then, we will choose the action to
execute (the least used, breaking ties randomly), we will denote by j the landing state, and we will update N^a_M(ij) accordingly. Then, we will check two conditions whose meaning is the following. The first condition (line 9 of Figure 1) controls whether the estimator P̂^a_M(ij) = N^a_M(ij)/m_a of P^a_M(ij) is already close to it in a multiplicative sense, as desired. Notice that for doing this we are just using the value N^a_M(ij). The second condition (line 11 of Figure 1) checks whether we have done enough sampling so that we can guarantee, with high probability, that all the transition probabilities that have not been declared estimated yet must be smaller than α. Finally, when all the transition probabilities for all the actions are declared estimated, the state is declared well-known. In the following theorem we discuss the reliability and complexity of the procedure just described.

Theorem 1. Let i be a state in a Markov decision process M and let α = O(ε/(N T R_max)). Then, if procedure AdaExploStep declares state i as well-known, with probability more than 1 − δ it does so correctly, in at most m = O(k ln(2kN/δ)/(ρα²)) steps, where ρ is the maximum between the smallest transition probability and α.
estimate p is within the desired ran e, that is, (1− )p p (1+ )p. Therefore, the probability of error can be bounded by the probability of stoppin before l1 = c (p(1 + )) steps (and thus, p > p(1 + )) plus the probability of stoppin p(1 − )). Moreover, notice after l2 = (c + 1) (p(1 − )) steps (and thus, p that the stoppin condition is monotone in the followin sense. If it is satis ed at step l then it will also be satis ed at any step l0 > l and if it has not been satis ed at step l, then it has also not been satis ed at any step l0 l, no matter what are the results of the random trials. Therefore, we need to consider only the two extreme points l1 and l2 . Applyin the Cherno bound to bound those two probabilities and by our choice of c we can conclude that the probability that p does not correctly estimate p is less than (2kN ). Since there are at most kN transition probabilities bein estimated simultaneously, by the union bound, the probability that any of them fails is at most 2. We have just shown that, with hi h probability, any transition probability p that is declared as estimated by the rst condition satis es De nition 4. More over, it takes at most (c + 1) (p(1 + )) steps. By our choice of c and
Carlos Domingo
noticing that (1 + α)/(1 − α) is smaller than 5/3 for any α less than 0.25, it follows that the algorithm will estimate p in at most 5 ln(2kN/δ)/(α²p) steps.

Now, let us discuss the second if-then condition. The meaning of this condition is the following. If we are doing too many steps in the same state and there are still transition probabilities that have not yet been declared as estimated, then those transitions are small with high probability and we can declare the state as well-known, satisfying Definition 4, without further steps. To show the correctness of this condition, suppose that the algorithm stops in the second condition and let q = P_M^a(ij) be a transition probability that was not yet declared as estimated. Thus, from the first if-then condition (not yet satisfied) we know that N(q) = N_M^a(ij) should be smaller than c, and from the second if-then condition (just satisfied) we know that m_a ≥ d, where d = 10 ln(2kN/δ)/(θα²). Thus, the probability that q is larger than θ can be bounded by the probability that q − N(q)/d is larger than θ − c/d. Applying the Hoeffding bound, and by our choice of d and c, it can be derived that this probability is smaller than δ/(2kN). Again, using the union bound, we can bound by δ/2 the probability that a transition probability larger than θ is incorrectly classified by the second condition. Finally, the probability that the algorithm makes a mistake in either condition is bounded by δ/2 + δ/2 = δ, and the theorem follows.

We have just seen that AdaExploStep correctly declares the states as well-known. Moreover, if S is the set of well-known states, then M̂_S will be a strong approximation of M_S in the sense of Definition 3 and thus, by Lemma 1, it will appropriately simulate M_S. Thus, it can be used to compute the off-line policies the same way as it was used in the original E3 algorithm.
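The two stopping conditions of the procedure can be sketched in code. This is an illustrative reconstruction, not the paper's pseudocode: the function name is ours, the constants c and d follow the reconstruction of Theorem 1 above, and the transition is simulated by coin flips rather than observed in a real MDP.

```python
import math
import random

def adapt_explore_step(p_true, k, N, delta, eps, theta, max_steps=10**7):
    """Adaptively estimate one transition probability p.

    Declares p 'estimated' once the visit count N(p) exceeds
    c = 3*ln(2kN/delta)*(1+eps)/eps**2 (first condition), or declares it
    'small' once the action count m_a exceeds
    d = 10*ln(2kN/delta)/(theta*eps**2) (second condition).
    """
    c = 3 * math.log(2 * k * N / delta) * (1 + eps) / eps**2
    d = 10 * math.log(2 * k * N / delta) / (theta * eps**2)
    n_p = 0                               # N(p): times transition (i,j) observed
    for m_a in range(1, max_steps + 1):   # m_a: times action a executed
        if random.random() < p_true:      # simulate one step from state i
            n_p += 1
        if n_p >= c:                      # first condition: p estimated
            return "estimated", n_p / m_a
        if m_a >= d:                      # second condition: p < theta whp
            return "small", n_p / m_a
    return "undecided", n_p / max_steps

random.seed(0)
status, p_hat = adapt_explore_step(p_true=0.3, k=4, N=10, delta=0.05,
                                   eps=0.2, theta=0.05)
```

Note how the estimate adapts: a large transition probability such as 0.3 reaches the first condition after a few thousand steps, while a tiny one would run up to the cutoff d and be declared small.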
5
Conclusion
We have seen how we can modify the exploration phase of the E3 algorithm by an adaptive sampling method so that it is possible to improve its overall time bound. Moreover, we have argued that, due to the adaptiveness of our method, it should be more suitable for practical purposes, and part of our future work will be to implement our version of E3 and test it experimentally. Notice that algorithm E3, as well as our improvement, suffers the problem of being polynomial in the number of states N, something that might be impractical in certain problems. One possible way around this problem is to consider factored MDPs, that is, MDPs whose transition model can be factored as a dynamic Bayesian network. A generalization of the E3 algorithm to that case has been recently obtained in [4]. Our method seems to be applicable to improve that generalization as well, although some technical points need to be carefully checked; this will be part of our future work.
Faster Near-Optimal Reinforcement Learning
6
Acknowledgments
I would like to thank Denis Thérien for inviting me to the McGill Annual Workshop on Computational Complexity, where I learned all the neat things about reinforcement learning from this year's main speaker, Michael Kearns. I would also like to thank Michael for kindly providing me with all the details about the E3 algorithm. Finally, thanks to the anonymous referees for several comments that helped me improve the presentation of the paper.
References
1. Carlos Domingo, Ricard Gavaldà and Osamu Watanabe. Practical Algorithms for On-line Selection. In Proceedings of the First International Conference on Discovery Science, DS'98. Lecture Notes in Artificial Intelligence 1532:150-161, 1998.
2. Carlos Domingo, Ricard Gavaldà and Osamu Watanabe. Adaptive Sampling Methods for Scaling Up Knowledge Discovery Algorithms. To appear in Proceedings of the Second International Conference on Discovery Science, DS'99, December 1999.
3. Leslie Pack Kaelbling, Michael L. Littman and Andrew W. Moore. Reinforcement Learning: A Survey. Journal of Artificial Intelligence Research 4 (1996) 237-285.
4. Michael Kearns and Daphne Koller. Efficient Reinforcement Learning in Factored MDPs. To appear in the Proc. of the International Joint Conference on Artificial Intelligence, IJCAI'99.
5. Michael Kearns and Satinder Singh. Near-Optimal Reinforcement Learning in Polynomial Time. In Machine Learning: Proceedings of the 15th International Conference, ICML'98, pages 260-268, 1998.
6. M.J. Kearns and U.V. Vazirani. An Introduction to Computational Learning Theory. MIT Press, 1994.
7. Richard J. Lipton and Jeffrey F. Naughton. Query Size Estimation by Adaptive Sampling. Journal of Computer and System Sciences, 51:18-25, 1995.
8. Richard J. Lipton, Jeffrey F. Naughton, Donovan Schneider and S. Seshadri. Efficient sampling strategies for relational database operations. Theoretical Computer Science, 116:195-226, 1993.
9. R.S. Sutton and A.G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998.
10. Abraham Wald. Sequential Analysis. Wiley Mathematical Statistics Series, 1947.
11. Osamu Watanabe. From Computational Learning Theory to Discovery Science. In Proc. of the 26th International Colloquium on Automata, Languages and Programming, invited talk of ICALP'99. Lecture Notes in Computer Science 1644:134-148, 1999.
A Note on Support Vector Machine Degeneracy

Ryan Rifkin, Massimiliano Pontil, and Alessandro Verri

Center for Biological and Computational Learning, MIT
45 Carleton Street E25-201, Cambridge, MA 02142, USA
{rif,pontil,verri}@ai.mit.edu
Abstract. When training Support Vector Machines (SVMs) over non-separable data sets, one sets the threshold b using any dual cost coefficient that is strictly between the bounds of 0 and C. We show that there exist SVM training problems with dual optimal solutions with all coefficients at bounds, but that all such problems are degenerate in the sense that the optimal separating hyperplane is given by w = 0, and the resulting (degenerate) SVM will classify all future points identically (to the class that supplies more training data). We also derive necessary and sufficient conditions on the input data for this to occur. Finally, we show that an SVM training problem can always be made degenerate by the addition of a single data point belonging to a certain unbounded polyhedron, which we characterize in terms of its extreme points and rays.
1
Introduction
We are given l examples (x₁, y₁), …, (x_l, y_l), with xᵢ ∈ IRⁿ and yᵢ ∈ {−1, 1} for all i. The SVM training problem is to find a hyperplane and threshold (w, b) that separates the positive and negative examples with maximum margin, penalizing misclassifications linearly in a user-selected penalty parameter C > 0.¹ This formulation was introduced in [2]. For a good introduction to SVMs and the nonlinear programming problems involved in their training, see [3] or [1]. We train an SVM by solving either of the following pair of dual quadratic programs:

(P)  min_{w,b,ξ}  (1/2)‖w‖² + C Σᵢ₌₁ˡ ξᵢ
     subject to  yᵢ(w·xᵢ + b) ≥ 1 − ξᵢ,  ξᵢ ≥ 0

(D)  max_Λ  Λ·1 − (1/2) ΛᵀDΛ
     subject to  Λ·y = 0,  0 ≤ Λᵢ ≤ C

where we used the vector notations Λ = (Λ₁, …, Λ_l) and ξ = (ξ₁, …, ξ_l). D is the symmetric positive semidefinite matrix defined by D_ij ≡ yᵢyⱼxᵢ·xⱼ. Throughout this note, we use the convention that if an equation contains i as an unsummed subscript, the corresponding equation is replicated for all i ∈ {1, …, l}.
INFM-DISI, Università di Genova, Via Dodecaneso 35, 16146 Genova, Italy
¹ Actually, we penalize linearly points for which yᵢ(w·xᵢ + b) < 1; such points are not actually misclassifications unless yᵢ(w·xᵢ + b) < 0.
O. Watanabe, T. Yokomori (Eds.): ALT'99, LNAI 1720, pp. 252-263, 1999.
© Springer-Verlag Berlin Heidelberg 1999
In practice, the dual program is solved.² However, for this pair of primal-dual problems, the KKT conditions are necessary and sufficient to characterize optimal solutions. Therefore, w, b, ξ, and Λ represent a pair of primal and dual optimal solutions if and only if they satisfy the KKT conditions. Additionally, any primal and dual feasible solutions with identical objective values are primal and dual optimal. The KKT conditions (for the primal problem) are as follows:

    w − Σᵢ₌₁ˡ Λᵢyᵢxᵢ = 0                        (1)
    Σᵢ₌₁ˡ Λᵢyᵢ = 0                              (2)
    C − Λᵢ − μᵢ = 0                             (3)
    yᵢ(xᵢ·w + b) − 1 + ξᵢ ≥ 0                   (4)
    Λᵢ(yᵢ(xᵢ·w + b) − 1 + ξᵢ) = 0               (5)
    μᵢξᵢ = 0                                    (6)
    ξᵢ, Λᵢ, μᵢ ≥ 0                              (7)

The μᵢ are Lagrange multipliers associated with the ξᵢ; they do not appear explicitly in either (P) or (D). The KKT conditions will be our major tool for investigating the properties of solutions to (P) and (D). Suppose that we have solved (D) and possess a dual optimal solution Λ. Equation (1) allows us to determine w for the associated primal optimal solution. Further suppose that there exists an i such that 0 < Λᵢ < C. Then, by equation (3), μᵢ > 0, and by equation (6), ξᵢ = 0. Because ξᵢ = 0, equation (5) tells us that yᵢ(xᵢ·w + b) − 1 + ξᵢ = 0. Using ξᵢ = 0, we see that we can determine the threshold b using the equation b = yᵢ − xᵢ·w. Once b is known, we can determine the ξᵢ by noting that ξᵢ = 0 if Λᵢ ≠ C (by equations (3) and (6)), and that ξᵢ = 1 − yᵢ(xᵢ·w + b) otherwise (by equation (5)). However, this is not strictly necessary, as it is w and b that must be known in order to classify future instances.

We note that our ability to determine b and ξ is crucially dependent on the existence of a Λᵢ strictly between 0 and C. Additionally, the optimality conditions, and therefore the SVM training algorithm derived in Osuna's thesis [3], depend on the existence of such a Λᵢ as well. On page 49 of his thesis Osuna states that "We have not found a proof yet of the existence of such Λᵢ, or conditions under which it does not exist." Other discussions of SVMs ([1], [2]) also implicitly assume the existence of such a Λᵢ.

In this paper, we show that there need not exist a Λᵢ strictly between bounds. Such cases are a subset of degenerate SVM training problems: those problems
² SVMs in general use a nonlinear kernel mapping. In this note, we explore the linear simplification in order to gain insight into SVM behavior. Our analysis holds identically in the nonlinear case.
where the optimal separating hyperplane is w = 0, and the optimal solution is to assign all future points to the same class. We derive a strong characterization of SVM degeneracy in terms of conditions on the input data. We go on to show that any SVM training problem can be made degenerate via the addition of a single training point, and that, assuming the two classes are of different cardinalities, this new training point can fall anywhere in a certain unbounded polyhedron. We provide a strong characterization of this polyhedron, and give a mild condition which will ensure non-degeneracy.
2
Support Vector Machine Degeneracy
In this section, we explore SVM training problems with a dual optimal solution satisfying Λᵢ ∈ {0, C} for all i.

We begin by noting and dismissing the trivial example where all training points belong to the same class, say class 1. In this case, it is easily seen that Λ = 0, ξ = 0, w = 0, and b = 1 represent primal and dual optimal solutions, both with objective value 0.

Definition 1. A vector Λ is a {0, C}-solution for an SVM training problem P if Λ solves (D), Λᵢ ∈ {0, C} for all i, and Λ ≠ 0 (note that this includes cases where Λᵢ = C for all i).

We demonstrate the existence of problems having {0, C}-solutions with an example where the data lie in IR²:

    x       y
    (2, 3)    1
    (2, 2)   −1
    (1, 2)    1
    (1, 3)   −1

    D = (  13  −10    8  −11
          −10    8   −6    8
            8   −6    5   −7
          −11    8   −7   10 )

Suppose C = 10. The reader may easily verify that Λ = (10, 10, 10, 10), w = 0, b = 1, ξ = (0, 2, 0, 2) are feasible primal and dual solutions, both with objective value 40, and are therefore optimal. Actually, given our choice of Λ and w, we may set b anywhere in the closed interval [−1, 1], and set ξ = (1 − b, 1 + b, 1 − b, 1 + b).

We have demonstrated the possibility of {0, C}-solutions, but the above example seems highly abnormal. The data are distributed at the four corners of a unit square centered at (1.5, 2.5), with opposite corners being of the same class. The optimal separating hyperplane is w = 0, which is not a hyperplane at all. We now proceed to formally show that all SVM training problems which admit {0, C}-solutions are degenerate in this sense. The following lemma is obvious from inspection of the KKT conditions:

Lemma 1. Suppose that Λ is a {0, C}-solution to an SVM training problem P₁ with C = C₁. Given a new SVM training problem P₂ with identical input data and C = C₂, (C₂/C₁)Λ is dual optimal for P₂. The corresponding primal optimal solution(s) is (are) unchanged.
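The matrix D and the claimed optimal objective values in this example are easy to check numerically. A minimal sketch, assuming numpy (the variable names are ours; with w = 0 any b in [−1, 1] is optimal, and we take b = 1 here):

```python
import numpy as np

X = np.array([[2, 3], [2, 2], [1, 2], [1, 3]], dtype=float)
y = np.array([1, -1, 1, -1], dtype=float)
C = 10.0

# D_ij = y_i * y_j * (x_i . x_j)
D = (y[:, None] * y[None, :]) * (X @ X.T)

alpha = np.full(4, C)               # the claimed {0, C}-solution
w = (alpha * y) @ X                 # equation (1): w = sum_i alpha_i y_i x_i
dual = alpha.sum() - 0.5 * alpha @ D @ alpha

b = 1.0                             # with w = 0, slacks are xi_i = 1 - y_i b
xi = np.maximum(0.0, 1.0 - y * b)
primal = 0.5 * w @ w + C * xi.sum()
```

Both objectives come out equal (to 40), confirming optimality by the equal-objective-values criterion stated above.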
We see that {0, C}-solutions are not dependent on a particular choice of C. This in turn implies the following:

Lemma 2. If Λ is a {0, C}-solution to an SVM training problem P, then DΛ = 0.

Proof: Since D is symmetric positive semidefinite, we can write D = RΣRᵀ, where Σ is a diagonal matrix with the (nonnegative) eigenvalues σⱼ of D in descending order on the diagonal, R is an orthogonal basis of corresponding eigenvectors of D, and RRᵀ = I. If DΛ ≠ 0, then for some index k, σ_k > 0 and R_k·Λ ≠ 0. For any value of C, let Λ_C be the {0, C}-solution obtained by adjusting Λ appropriately. This solution is dual optimal for a problem having input data identical to P, with a new value of C, by Lemma 1. Then

    Λ_C·DΛ_C = Σⱼ₌₁ˡ σⱼ(Rⱼ·Λ_C)² ≥ σ_k(R_k·Λ_C)² = σ_k C²(R_k·Λ₁)²

Define S to be the number of non-zero elements in Λ. As we vary C, the optimal dual objective value of our family of {0, C}-solutions is given by:

    f(C) = Λ_C·1 − (1/2)Λ_C·DΛ_C ≤ SC − (1/2)σ_k C²(R_k·Λ₁)²

However, if C > 2S/(σ_k(R_k·Λ₁)²), then f(C) < 0. This is a contradiction, for Λ = 0 is feasible in (D) with objective value zero, and zero is therefore a lower bound on the value of any optimal solution to (D), regardless of the value of C.

Theorem 1. If Λ is a {0, C}-solution to an SVM training problem P, then w = 0 in all primal optimal solutions.

Proof: Any optimal solution must, along with Λ, satisfy the KKT conditions. Exploiting this, we see:

    0 = DΛ
    ⇒ 0 = Λ·DΛ = Σᵢ₌₁ˡ Σⱼ₌₁ˡ ΛᵢΛⱼyᵢyⱼxᵢ·xⱼ = (Σᵢ₌₁ˡ Λᵢyᵢxᵢ)·(Σⱼ₌₁ˡ Λⱼyⱼxⱼ) = w·w
    ⇒ w = 0
This is a key result. It states that if our dual problem admits a {0, C}-solution, the optimal separating hyperplane is w = 0. In other words, it is of no value to construct a hyperplane at all, no matter how expensive misclassifications are, and the optimal classifier will classify all future data points using only the threshold b. Our data must be arranged in such a way that we may as well de-metrize our space by throwing away all information about where our data points are located, and classify all points identically. The converse of this statement is false: given an SVM training problem P that admits a primal solution with w = 0, it is not necessarily the case that all dual optimal solutions are {0, C}-solutions, nor even that a {0, C}-solution necessarily exists, as the following example, constructed from the first example by splitting a data point into two new points whose average is one of the original points, shows:

    x         y
    (2, 3)      1
    (2, 2)     −1
    (1, 1.5)    1
    (1, 2.5)    1
    (1, 3)     −1

    D = (  13   −10    6.5   9.5  −11
          −10     8   −5    −7     8
            6.5  −5    3.25  4.75 −5.5
            9.5  −7    4.75  7.25 −8.5
          −11     8   −5.5  −8.5  10  )

Again letting C = 10, the reader may verify that setting Λ = (10, 10, 5, 5, 10), w = 0, b = 1, ξ = (0, 2, 0, 0, 2) gives feasible primal and dual solutions, both with objective value 40, which are therefore optimal. With more effort, the reader may verify that Λ = (10, 10, 5, 5, 10) is the unique optimal solution to the dual problem, and therefore no {0, C}-solution exists.

Although our initial motivation was to study problems with optimal solutions having every dual coefficient at bounds, we gain additional insight by studying the following, broader class of problems.

Definition 2. An SVM training problem P is degenerate if there exists an optimal primal solution to P in which w = 0.

By Theorem 1, any problem that admits a {0, C}-solution is degenerate. As in the {0, C}-solution case, one can use the KKT conditions to easily show that
the degeneracy of an SVM training problem is independent of the particular choice of the parameter C, and that w = 0 in all primal optimal solutions of a degenerate training problem.

For degenerate SVM training problems, even though there is no optimal separating hyperplane in the normal sense, we still call those data points that contribute to the expansion w = 0 with Λᵢ ≠ 0 support vectors. Given an SVM training problem P, define Kᵢ to be the index set of points in class i, i ∈ {1, −1}.

Lemma 3. Given a degenerate SVM training problem P, assume without loss of generality that |K₋₁| ≤ |K₁|. Then all points in class −1 are support vectors; furthermore, Λᵢ = C if i ∈ K₋₁. Additionally, if |K₋₁| = |K₁|, the (unique) dual optimal solution is Λ = C·1.

Proof: Because w = 0, the primal constraints reduce to:

    yᵢb ≥ 1 − ξᵢ

If |K₋₁| < |K₁|, the optimal value of b is 1, and ξᵢ is positive for i ∈ K₋₁. Therefore, Λᵢ = C for i ∈ K₋₁ (by Equations 6 and 3). Assume |K₋₁| = |K₁|. We may (optimally) choose b anywhere in the range [−1, 1]. If b ≤ 0, all points in class 1 have Λᵢ = C, and if b ≥ 0, all points in class −1 have Λᵢ = C. In either case, there are at least |K₋₁| points in a single class satisfying Λᵢ = C. But equation (2) says that the sums of the Λᵢ for each class are equal, and since no Λᵢ may be greater than C, we conclude that every Λᵢ is equal to C in both classes.

Finally, we derive conditions on the input data for a degenerate SVM training problem P.

Theorem 2. Given an SVM training problem P, assume without loss of generality that |K₋₁| ≤ |K₁|. Then:

a. P is degenerate if and only if there exists a set Ω of multipliers θᵢ for the points in K₁ satisfying:

    0 ≤ θᵢ ≤ 1,   Σ_{i∈K₁} θᵢxᵢ = Σ_{i∈K₋₁} xᵢ,   Σ_{i∈K₁} θᵢ = |K₋₁|

b. P admits a {0, C}-solution if and only if P is degenerate and the θᵢ in part (a) may all be chosen to be 0 or 1.

Proof: (a, ⇒) Suppose P is degenerate. Consider a modification of P with identical input data, but C = 1; this problem is also degenerate. All points in class −1 are support vectors, and their associated Λᵢ are at 1, by Lemma 3. Letting Λ be any dual optimal solution to P, we see that letting θᵢ = Λᵢ for i ∈ K₁ and applying Equation (2) demonstrates the existence of the θᵢ.
(a, ⇐) Given θᵢ satisfying the condition, we easily see that Λᵢ = C for i ∈ K₋₁ and Λᵢ = θᵢC for i ∈ K₁ induces a pair of optimal primal and dual solutions to P with w = 0, using the KKT conditions.
(b, ⇒) Given a {0, C}-solution, w = 0 in an associated primal solution by Theorem 1, and setting θᵢ = Λᵢ/C for i ∈ K₁ satisfies the requirements on Ω.
(b, ⇐) Let Λᵢ = θᵢC for i ∈ K₁, and apply the KKT conditions.
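Theorem 2(a) is a linear feasibility problem in the multipliers θᵢ, so degeneracy of a given data set can be checked directly with an LP solver. A sketch, assuming scipy is available (the function name is ours):

```python
import numpy as np
from scipy.optimize import linprog

def is_degenerate(X, y):
    """Check Theorem 2(a): do there exist theta_i in [0,1] over the larger
    class K1 with sum_i theta_i x_i = sum_{K-1} x_i and sum_i theta_i = |K-1|?
    """
    X, y = np.asarray(X, float), np.asarray(y)
    # make class -1 the smaller (or equal) class, as in the theorem
    if (y == 1).sum() < (y == -1).sum():
        y = -y
    K1, Km1 = X[y == 1], X[y == -1]
    n = len(Km1)
    # equality constraints: rows are coordinates of sum theta_i x_i, plus
    # one row forcing sum theta_i = n
    A_eq = np.vstack([K1.T, np.ones(len(K1))])
    b_eq = np.append(Km1.sum(axis=0), n)
    res = linprog(c=np.zeros(len(K1)), A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0.0, 1.0)] * len(K1))
    return res.status == 0   # feasible <=> degenerate

X1 = [[2, 3], [2, 2], [1, 2], [1, 3]]   # the four-corner example above
y1 = [1, -1, 1, -1]
X2 = [[0, 0], [1, 0], [2, 0]]           # V/n outside conv(K1): non-degenerate
y2 = [1, 1, -1]
```

For the four-corner example, θ = (1, 1) over K₁ is feasible, so the LP reports degeneracy; for the second data set the single-point class's center of mass lies outside the convex hull of the larger class, and the LP is infeasible.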
3
The Degenerating Polyhedron
Theorem 2 indicates that it is always possible to make an SVM training problem degenerate by adding a single new data point. We now proceed to characterize the set of individual points whose addition will make a given problem degenerate. For the remainder of this section, we assume that |K₋₁| < |K₁|, and we denote Σ_{i∈K₋₁} xᵢ by V, and |K₋₁| by n.

Suppose we choose, for each i ∈ K₁, a θᵢ ∈ [0, 1] satisfying n − 1 ≤ Σ_{i∈K₁} θᵢ < n. It is clear from the conditions of Theorem 2 that if we add a new data point

    x_c = (V − Σ_{i∈K₁} θᵢxᵢ) / (n − Σ_{i∈K₁} θᵢ)        (8)

then the problem becomes degenerate, where the new point has a multiplier given by θ_c = n − Σ_{i∈K₁} θᵢ, and that all single points whose additions would make the problem degenerate can be found in such a manner. We denote the set of points so obtained by X_D.

We introduce the following notation. For k ≤ n, we let S_k denote the set containing all possible sums of k points in K₁. Given a point s ∈ S_k, we define an indicator function χ_s : K₁ → {0, 1} with the property χ_s(xᵢ) = 1 if and only if xᵢ is one of the k points of K₁ that were summed to make s. The region X_D is in fact a polyhedron whose extreme points and extreme rays are of the form V − s for s ∈ S_{n−1} and s ∈ S_n, respectively. More specifically, we have the following theorem; the proof is not difficult, but it is rather technical, and we defer it to Appendix A:

Theorem 3. Given a non-degenerate problem P, consider the polyhedron

    P_D ≡ { Σ_{s_p∈S_{n−1}} λ_{s_p}(V − s_p) + Σ_{s_r∈S_n} λ_{s_r}(V − s_r)  |  λ_{s_p}, λ_{s_r} ≥ 0,  Σ_{s_p∈S_{n−1}} λ_{s_p} = 1 }

Then P_D = X_D.

An example is shown in Figure 1. The dark region represents the set of those points that, when added to class 1, will make the problem degenerate. This set can be obtained following the construction in Appendix A.
On the one hand, the idea that the addition of a single data point can make an SVM training problem degenerate seems to bode ill for the usefulness of the method. Indeed, SVMs are in some sense not robust. This is a consequence of the fact that, because errors are penalized in the L1 norm, a single outlier can have arbitrarily large effects on the separating hyperplane. However, the fact that we are able to precisely characterize the degenerating polyhedron allows us to provide a positive result as well.

We begin by noting that in the example of Figure 1, the entire polyhedron of points whose addition makes the problem degenerate is located well away from the initial data. This is not a coincidence. Indeed, using Theorem 3, we may easily derive the following theorem:

Theorem 4. Given a non-degenerate problem P with |K₋₁| < |K₁|, suppose there exists a hyperplane w through V/n, the center of mass of K₋₁, such that all points in K₁ lie on one side of w, and the closest distance between a point in K₁ and w is d. Then all points in the degenerating polyhedron P_D lie at least (|K₋₁| − 1)d from w, on the other side of w from K₁.

Using Theorem 2, we can easily show that if the center of mass of the points in the smaller class (V/n) does not lie in the convex hull of the points in the larger class, our problem is not degenerate, and we may apply Theorem 4 to bound below the distance at which an outlier would have to lie from V/n in order to make the problem degenerate. We conclude that if the class with larger cardinality lies well away from, and entirely to one side of, a hyperplane through the center of mass of the class of smaller cardinality, our problem is non-degenerate, and any single point we could add to make the problem degenerate would be an extreme outlier, lying on the opposite side of the smaller class from the larger class.
4
Nonlinear SVMs and Further Remarks
The conditions we have derived so far apply to the construction of a linear decision surface. It should be clear that similar arguments apply to nonlinear kernels. In particular, degenerate SVMs will occur if and only if the data satisfy the conditions of Theorem 2 after undergoing the nonlinear mapping to the high-dimensional space. It is not necessary that the data be degenerate in the original input space, although examples could be derived where they were degenerate in both spaces, for a particular kernel choice. The important message of Theorem 2, however, is that while degenerate SVMs are possible, the requirements on the input data are so stringent that one should never expect to encounter them in practice.

On another note, if a degenerate SVM does occur, one simply sets the threshold b to 1 or −1, depending on which class contributes more points to the training set. Thus in all cases, we are able to determine the threshold b. Of course, the wisdom of this approach depends on the data distribution. If our two classes lie largely on top of each other, then classifying according to the larger class may indeed be the best we can do (assuming our examples were drawn
Fig. 1. A sample problem, and the degenerating polyhedron: whenever a point in the polyhedron is added to class 1 (circle), the problem has the degenerate solution w = b = 0.

randomly from the input distribution). If, instead, our dataset looks more like that of Figure 1, we are better off removing outliers and resolving. Finally, a brief remark on complexity is in order. The quadratic program (D) can be solved in polynomial time, and solving this program will allow us to determine whether a given SVM training problem P is degenerate. However, the problem of determining whether or not a {0, C}-solution exists is not so easy. Certainly, if P is not degenerate, no {0, C}-solution exists, but the converse is false. Determining the existence of a {0, C}-solution may be quite difficult: if we require the xᵢ to lie in IR¹, determining whether a {0, C}-solution exists is already equivalent to solving the weakly NP-complete problem SUBSET-SUM (see [4] for more information on NP-completeness).³
³ Because the problem is only weakly NP-complete, given a bound on the size of the numbers involved, the problem is polynomially solvable.
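The weak NP-completeness mentioned in the footnote means SUBSET-SUM is solvable in pseudopolynomial time by the textbook dynamic program, so bounded-size instances of the {0, C}-solution question in IR¹ remain tractable. A standard sketch (our illustration, not from the paper):

```python
def subset_sum(values, target):
    """Textbook pseudopolynomial DP: is there a subset of `values`
    (nonnegative integers) summing exactly to `target`?  Runs in
    O(len(values) * target) time and space."""
    reachable = {0}                       # sums achievable so far
    for v in values:
        reachable |= {s + v for s in reachable if s + v <= target}
    return target in reachable
```

The running time is polynomial in the magnitude of the target, not in the length of its binary encoding, which is exactly the distinction between weak and strong NP-completeness.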
References
1. C. Burges. A tutorial on support vector machines for pattern recognition. In Data Mining and Knowledge Discovery. Kluwer Academic Publishers, Boston, 1998. (Volume 2).
2. C. Cortes and V. Vapnik. Support vector networks. Machine Learning, 20:1-25, 1995.
3. E. Osuna. Support Vector Machines: Training and Applications. PhD thesis, Massachusetts Institute of Technology, 1998.
4. M. Garey and D. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. Freeman and Company, San Francisco, 1979.
A
Proof of Theorem 3
Theorem 5. Given a non-degenerate problem P, consider the polyhedron

    P_D ≡ { Σ_{s_p∈S_{n−1}} λ_{s_p}(V − s_p) + Σ_{s_r∈S_n} λ_{s_r}(V − s_r)  |  λ_{s_p}, λ_{s_r} ≥ 0,  Σ_{s_p∈S_{n−1}} λ_{s_p} = 1 }

Then P_D = X_D.

Proof: (a, P_D ⊆ X_D) Given a set of λ_{s_p} and λ_{s_r} satisfying λ_{s_p}, λ_{s_r} ≥ 0 and Σ_{s_p∈S_{n−1}} λ_{s_p} = 1, we define

    A ≡ Σ_{s_r∈S_n} λ_{s_r}

and set θ_c = 1/(1 + A) and, for i ∈ K₁,

    θᵢ = (1/(1 + A)) ( Σ_{s_p∈S_{n−1}} λ_{s_p} χ_{s_p}(xᵢ) + Σ_{s_r∈S_n} λ_{s_r} χ_{s_r}(xᵢ) )

Then 0 ≤ θᵢ ≤ 1 for each i ∈ K₁, and

    Σ_{i∈K₁} θᵢ = (n − 1 + nA)/(1 + A) = n − 1/(1 + A)

which is in [n − 1, n), so we conclude that the assigned θᵢ are valid. Finally, substituting into Equation (8), we find:

    (V − Σ_{i∈K₁} θᵢxᵢ) / (n − Σ_{i∈K₁} θᵢ)
      = (1 + A)V − Σ_{i∈K₁} ( Σ_{s_p∈S_{n−1}} λ_{s_p} χ_{s_p}(xᵢ) + Σ_{s_r∈S_n} λ_{s_r} χ_{s_r}(xᵢ) ) xᵢ
      = (1 + A)V − Σ_{s_p∈S_{n−1}} λ_{s_p} s_p − Σ_{s_r∈S_n} λ_{s_r} s_r
      = Σ_{s_p∈S_{n−1}} λ_{s_p}(V − s_p) + Σ_{s_r∈S_n} λ_{s_r}(V − s_r)
We conclude that PD XD . PD ) Our proof is by construction: iven a set of (b, XD show how to choose sp and sr so that: sp
0
sp sp 2Sn−1
=1
sr
0
V − n−
sp
Sn−1
xr
Sn
,i
K1 , we
x 2K1
=
− sp ) +
sp (V sp 2Sn−1
2K1
sr (V
− sr )
sr 2Sn
If we impose the reasonable separability conditions: V n−
=
2K1
sp V sp 2Sn−1
+
sr V
sr 2Sn
x 2K1
n−
=
sp s
p
+
sp 2Sn−1
2K1
sr s
r
sr 2Sn
we can easily derive the followin : ( sr
2K1
=
n−
sr 2Sn
+ 1) − n A 2K1
We are now ready to describe the actual construction. We will rst assi n the then the sp . We describe in detail the assi nment of the sr , the assi nment of the sp is essentially similar. We be in by initializin each sp to 0. At each step of the al orithm, we consider the residual : sr ,
V − n−
x 2K1 2K1
−
sr (V sr 2S
− sr )
(9)
n
Note that by expandin each sr in the n points of K1 which sum to it, we can represent (9) as a multiple of V minus a linear combination of the points of we will maintain the invariant that this linear combination is actually a K1 nonne ative combination. Durin a step of the al orithm, we select the n points of K1 that have the lar est coe cients in this expansion. If there is a tie, we expand the set to include all points with coe cients equal to the nth lar est coe cient. Let j be the number of points in the set that share the nth lar est
A Note on Support Vector Machine De eneracy
coe cient, and let k ( j n−k+j
263
n) be the total size of the selected set. We select the
r
points s containin the remainin max(k−j 0) points with the lar est coe cients, and n−k +j of the j points which contain the nth lar est coe cient. We will then add equal amounts of each of these sr to our representation until some pair of coe cients in the residual that were unequal become equal. This can happen in one of two ways: either the smallest of the coe cients in our set can become equal to a new, still smaller coe cient, or the second smallest coe cient in the set can become equal to the smallest (this can only happen in the case where k > n.) At each step of the al orithm, the total number of di erent coe cients in the residual is reduced by at least one, so, within K1 steps, we will be able to assi n all the sr (note that at each step of our al orithm, we j of the sr ). The only way the al orithm could break down is increase n−k+j if, at some step, there were fewer than n points in K1 with nonzero coe cients in the residual. Trivially, the al orithm does not break down at the rst step there must always be at least n points with non-zero coe cients initially. To show that the al orithm does not break down at a later step, assume that after assi nin coe cients to the sr totalin k ( A), we are left with j ( n) non-zero coe cients. Notin that our al orithm requires that each of the j remainin points with non-zero coe cients is part of each sr with a non-zero coe cient, we can see that the the residual value of each of these j points is no more then 1 n−w − k. We derive the followin bound on the initial sum of the coe cients, which we call Isum : j(
Isum =
1 n−
2K1
j n−
− k) + kn + k(n − j)
2K1
n−1 n−
+k
n−1 n−
+
2K1
2K1
=
2K1
n−
+1−n 2K1
2K1
n−
2K1 i
But this is a contradiction, Isum must be equal to
i K1
n−
i
. We conclude
i K1
that we are able to assi n the for the sp .
sr
successfully. Extremely similar ar uments hold
Learnability of Enumerable Classes of Recursive Functions from Typical Examples

Jochen Nessel

University of Kaiserslautern, Postfach 3049, D-67653 Kaiserslautern, Germany
nessel@informatik.uni-kl.de
Abstract. The paper investigates whether it is possible to learn every enumerable class of recursive functions from typical examples. Typical means there is a computable family of finite sets, such that for each function in the class there is one set of examples that can be used in any suitable hypothesis space for this class of functions. As it will turn out, there are enumerable classes of recursive functions that are not learnable from typical examples. The learnable classes are characterized. The results are proved within an abstract model of learning from examples, introduced by Freivalds, Kinber and Wiehagen. Finally, the results are interpreted and possible connections of this theoretical work to the situation in real life classrooms are pointed out.
1
Introduction
This work started with the following question. Assume a teacher has to teach five pupils the concepts from a given concept class. Is it always possible to come up with one finite set of examples for each concept, such that all pupils will learn the intended concept when given the corresponding set of examples? This seems to be a very natural question; in fact every teacher will ask herself the question as to which examples she should present to her pupils to have all of them learn, for example, arithmetic.

In inductive inference, a pupil is mostly represented by a learning machine M and a hypothesis space ψ, which is usually some numbering. The machine is fed examples of some unknown concept, which has to be represented in ψ, and outputs one or more hypotheses, which are interpreted with respect to ψ. The machine has learned the concept successfully if its last hypothesis is a correct description of what it had to learn.

Some of the results in learning theory depend fundamentally on the choice of an appropriate hypothesis space; see for example [6], [7]. This problem of finding a suitable hypothesis space for a target class could be avoided if there were examples so typical for the concepts in the target class that they would suffice to learn in any hypothesis space.

In this paper, we will investigate this problem for the relatively manageable case of enumerable sets of recursive functions. The learning model used was introduced by Freivalds, Kinber and Wiehagen; cf. [2]. One of their intentions was to

O. Watanabe, T. Yokomori (Eds.): ALT'99, LNAI 1720, pp. 264-275, 1999.
© Springer-Verlag Berlin Heidelberg 1999
model the teacher-pupil scenario encountered in real life classrooms; more on this in the next section. Even in this rather easy case, there are enumerable classes of recursive functions that do not have typical examples. The paper gives a recursion theoretic characterization of the enumerable classes learnable from typical examples. It would be nice, though, to have a deeper understanding of which classes fall within this characterization.

The next section gives formal definitions, some basic results and further motivation regarding the models used for this investigation. In section three we give the main results. Due to lack of space, only a few proofs could be included entirely in the main part of the paper; the rest is sketched. The last section is devoted to conclusions, open problems and an attempt to give connections of this theoretical work to real life teaching. It might be argued that this has no place in work on machine learning. But the author is convinced that if machine learning might give insights into how humans learn or problems they might face while learning, the possibility to discuss these insights should not be passed by.
2 Definitions, Notations, and Basic Results
In the following, familiarity with standard mathematical and recursion theoretic notation and concepts, as given for example in [10], is assumed. P^n will denote the set of partial recursive functions of n arguments; by definition P = P^1. R will stand for the set of recursive functions. Sometimes, functions are identified with their graphs; for example 0201 may stand for the function that is everywhere zero with the exception of argument 1, where it assumes value 2. For a function f, f(x)↓ means that f is defined on argument x. A function f is an initial segment of a function g ∈ R, if domain(f) = {0, ..., n} for some n and furthermore f ⊆ g. For any recursive f, let f^n stand for a standard encoding of the initial segment of f with length n. This will later ease the definition of the learning machines, which can then be thought of as machines with a fixed number of input parameters.

All functions in P^2 are called numberings. Let ψ and η range over numberings and let φ be any fixed standard acceptable numbering; cf. [10]. A numbering ψ has decidable equality, if the set {(i, j) | ψ_i = ψ_j} is recursive. A numbering is called one-one, if every function appears at most once in it. A numbering ψ is said to be reducible to a numbering η, written ψ ≤ η, if there exists r ∈ R such that ψ_i = η_{r(i)} for all i. The function r is then called a reduction. Two numberings ψ and η are called equivalent, written ψ ≡ η, if both ψ ≤ η and η ≤ ψ hold. Define P_ψ = {f | there exists i such that ψ_i = f}. If P_ψ ⊆ P_η, then ψ is a subnumbering of η.

Let U range over subsets of R. U is said to be enumerable, if there is a numbering ψ satisfying P_ψ = U. Let ψ be any numbering. A numbering ex contains good examples for ψ if the following conditions are fulfilled: (1) ex_i ⊆ ψ_i for all i, and (2) there is e ∈ R such that e(i) = card(domain(ex_i)) for all i.
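The conventions above, identifying a function with its graph and encoding an initial segment f^n as a single input, can be illustrated concretely. The following Python sketch is purely illustrative; the helper names and the particular encoding are assumptions of this sketch, not the paper's:

```python
# Illustrative sketch of the notational conventions: finite initial segments
# stand in for (infinite) recursive functions; helper names are hypothetical.

def initial_segment(f, n):
    """The graph of f restricted to arguments 0..n-1 (the segment f^n)."""
    return {x: f(x) for x in range(n)}

# "0201" denotes the function that is everywhere zero except at argument 1,
# where it assumes value 2:
g = lambda x: 2 if x == 1 else 0
assert initial_segment(g, 4) == {0: 0, 1: 2, 2: 0, 3: 0}

def encode(segment):
    """One of many possible encodings of a segment as a single number, so a
    learning machine can be viewed as a device with a fixed number of numeric
    input parameters.  Assumes all function values are below 1000."""
    code = 0
    for x in sorted(segment):
        code = code * 1000 + segment[x]
    return code

assert encode(initial_segment(g, 4)) == 2000000
```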
Jochen Nessel
In particular, part (2) implies that domain(ex_i) is finite and recursive. Therefore, the set of good examples {(x, ex_i(x)) | x ∈ domain(ex_i)} can easily be computed from i, i.e. there is an effective algorithm that computes on input i the good examples given by ex_i, returns them and stops; cf. [2]. We will say "ex are good examples for ψ" synonymously for "there is a numbering ex containing good examples for ψ".

An inference machine is any computable device that maps finite sets of examples to natural numbers, which will be interpreted as programs with respect to some previously selected numbering.

Definition 1. (See [2].) U is called learnable from good examples with respect to ψ (written U ∈ Gex-Fin_ψ) if there exist good examples ex for ψ and an inference machine M, such that for all i with ψ_i ∈ U and all finite A with ex_i ⊆ A ⊆ ψ_i the following holds: M(A)↓ and ψ_{M(A)} = ψ_i. Let Gex-Fin = {U | U ∈ Gex-Fin_ψ for some ψ}.

Note that we require only functions in the target class U to be identified. Furthermore, we require M to learn from any finite superset of the good examples, as long as this set is contained in the target function. This is done in order to avoid some coding tricks, like presenting the only example (i, φ_i(i)) in order to learn φ_i, which would have nothing in common with the learning problem we would like to model.

Other models consider learning to be a limiting process.

Definition 2. (See [3].) An inference machine M is said to identify a recursive function f with respect to ψ in the limit if, for all n, M(f^n) = i_n↓, there exists an n_0 such that i_m = i_{n_0} for all m ≥ n_0, and furthermore ψ_{i_{n_0}} = f.

In other words, an inference machine learning some f can change its mind about a correct description for f a finite number of times, but must eventually converge to a ψ-index for f.

Definition 3. U is learnable in the limit with respect to ψ (written U ∈ EX_ψ) if there is an inference machine that identifies every f ∈ U with respect to ψ.
Define EX = {U | U ∈ EX_ψ for some ψ}.

There are two major differences between these two definitions: (1) a good-example-machine receives one finite set of examples, whereas a Gold-machine over time will see every value of the target function; and (2) a good-example-machine is only allowed one guess, whereas a Gold-machine can change its guess finitely many times. So it would be natural to expect the relation Gex-Fin ⊆ EX. But surprisingly, Freivalds, Kinber and Wiehagen in [2] proved

Theorem 1. EX ⊆ Gex-Fin.
Beside the formal proof, there is an intuitive argument supporting this result: when the good-example-machine gets its examples, which are computed from
a program of the function to be learned, it knows that these examples are the important part of the target function and can concentrate its efforts on processing this important information. On the other hand, the Gold-machine might not be able to distinguish the important part of what it sees from unimportant information and therefore cannot learn a function it is presented, even though it may change its mind a finite number of times. Furthermore, the result states that for every function in every EX-learnable class U, there is such an important part of that function which suffices to identify it with respect to the other functions in U.

Theorem 1 is one reason the Gex-model was used for this study, since it contains everything that can be learned in the limit if all examples are demonstrated. The other reason is that the Gex-model seems to model the usual pupil-teacher situation pretty well, which is interesting in its own right. To see this, imagine the good examples ex_i = {(x, ex_i(x)) | x ∈ domain(ex_i)} to be computed by a teacher, who wants a pupil (M, ψ), i.e. the learning algorithm M the pupil knows together with the pupil's knowledge base ψ, to learn concept ψ_i. The pupil is then given a superset of those examples by the teacher, processes them, and hopefully comes up with a representation of the intended concept. We will follow these thoughts in the conclusions some more.

Suppose we want to learn an enumerable class U. It is rather easy to see that there exist infinitely many enumerations for any such U. Furthermore, the following holds:

Proposition 1. Let U be any enumerable class and ψ any numbering satisfying P_ψ = U. Then U ∈ Gex-Fin_ψ iff {(i, j) | ψ_i = ψ_j} is recursive.

Proof. (Sketch) Assume (a) U ∈ Gex-Fin_ψ with good examples ex, where (b) ψ ∈ R² (which holds since P_ψ = U ⊆ R). Then (a) is easily seen to imply that ψ_i = ψ_j iff ex_i ⊆ ψ_j and ex_j ⊆ ψ_i, while (b) together with the properties of good examples implies that the latter test is recursive. □
If equality in ψ is decidable, it is very easy to compute good examples, since ψ_i ≠ ψ_j yields the existence of an x such that ψ_i(x) ≠ ψ_j(x).

Let us now consider an arbitrary enumerable class U of recursive functions. The following is a well known result from recursion theory; see [5] for references.

Proposition 2. For every enumerable class U there exists a numbering ψ such that P_ψ = U and {(i, j) | ψ_i = ψ_j} is recursive.
So, by Proposition 1, we get that for any enumerable class U there are inference machines able to learn U in the Gex-Fin sense with respect to some ψ. Now it is easy to see that there are infinitely many numberings enumerating U which all have recursive equality. Let f be any function from U. Is there a set of examples so typical for f, such that a teacher could present those examples to any machine able to learn U (and hence f), and it comes up with a correct description for f? The results in the next section will show that there exist classes of recursive functions that do not have typical examples.
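The observation that decidable equality makes good examples easy to compute, since separating points exist whenever two functions differ, can be made concrete in a finite toy setting. The sketch below is an illustration of this idea with a bounded search and a finite stand-in for a numbering; it is not the paper's construction:

```python
# Toy illustration: with decidable equality, good examples for psi_i can be
# assembled from witness points separating psi_i from each distinct psi_j.
# The bounded search and the finite family are simplifications for the sketch.

def separating_examples(numbering, i, bound=100):
    """Collect, for each j != i, one argument where numbering[i] and
    numbering[j] differ (such an argument exists whenever the functions
    differ at all; here the search is bounded for illustration)."""
    f = numbering[i]
    examples = {}
    for j in range(len(numbering)):
        if j == i:
            continue
        for x in range(bound):
            if f(x) != numbering[j](x):
                examples[x] = f(x)
                break
    return examples

# Finite stand-in for a one-one numbering: f_i is 0 everywhere except at i.
numbering = [lambda x, i=i: 1 if x == i else 0 for i in range(5)]

# The examples for f_2 pin it down within the family: any finite superset of
# them that is contained in f_2 is consistent with f_2 only.
assert separating_examples(numbering, 2) == {0: 0, 1: 0, 2: 1}
```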
3 Results
Definition 4. Let U be an enumerable class of recursive functions. Then Hyp(U) = {ψ | U ∈ Gex-Fin_ψ}.
The abbreviation Hyp should remind the reader that this set contains all suitable hypothesis spaces for the class U. Now Propositions 2 and 1 imply that Hyp(U) ≠ ∅ for all enumerable classes U. Next we define what it means for an enumerable class of functions to be learnable with typical examples.

Definition 5. Let U be an enumerable class of recursive functions. We say U ∈ Gex-Fin with typical examples, if there exist ψ ∈ Hyp(U) and good examples ex for U with respect to ψ such that U ∈ Gex-Fin_ψ with the good examples given by ex, and for all η ∈ Hyp(U) there exist good examples ex′ such that U ∈ Gex-Fin_η with examples ex′ and furthermore, for all i, j, we have that ψ_i = η_j implies ex_i = ex′_j.

In other words, U is learnable from typical examples, if there is a way to choose the examples for any f ∈ U to be equal in all suitable hypothesis spaces. This seems to capture our intention pretty well, since a teacher can now present this set of examples to all inference machines and all of them will learn f. So, the examples are so typical for f with respect to the other functions in U, that every machine able to learn U can do so when given the typical examples. Note that this definition implies the typical examples to be uniformly computable for each admissible hypothesis space.

The first result in this section characterizes the enumerable sets of recursive functions that may be learned with typical examples.

Theorem 2. Let U be an enumerable class of recursive functions. U ∈ Gex-Fin with typical examples iff ψ ≤ η holds for all ψ, η ∈ Hyp(U).
Proof. (⇒) The reduction r to be defined takes ψ-indices to η-indices by just searching for a function with the same set of good examples, which exists by assumption and the definition of typical examples.

(⇐) From Proposition 1 we get that the set Hyp(U) contains every one-one numbering of U, since for those numberings equality is obviously decidable. Let ψ be any one-one numbering of U. Proposition 1 implies U ∈ Gex-Fin_ψ; let ex be the good examples witnessing this. Pick any η ∈ Hyp(U). Then, by assumption, η ≤ ψ via some r ∈ R. Define ex′_i = ex_{r(i)} for all i. Since ex are good examples for ψ and η_i = ψ_{r(i)} for all i, ex′ contains good examples for η. Furthermore, if η_i = η_j then r(i) = r(j), because ψ is one-one, and therefore ex′_i = ex′_j. This yields that equality in η is decidable and, applying Proposition 1, we get U ∈ Gex-Fin with typical examples. □

Proofs for the following theorem are already known; see [5] and the references therein for a short survey, especially [9]. A new proof is given here that allows the observations we need in order to make our point concerning the existence of typical examples.
Theorem 3. There exists an enumerable class of recursive functions that has one-one numberings ψ and η such that ψ ≰ η.

The two constructed numberings being one-one implies ψ, η ∈ Hyp(P_ψ). But now Theorem 2 implies that P_ψ is not learnable with typical examples.

Proof. We will construct two numberings ψ and η in parallel and diagonalize against all functions possibly witnessing ψ ≤ η. Recall that φ is a numbering of all partial recursive functions.
Initialize: For all i, set ψ_i := η_i := ↑ and a_i := 0. Set D := ∅ and n := 0.

Step s: Set ψ_n := 1^s 0, η_n := 1^s 0, a_n := s, n := n + 1.
For i = 1 to n − 1, i ∉ D, do:
  Compute j = φ_i(i) for at most s steps.
  (1) If j is still undefined, then ψ_i := ψ_i 0 and η_i := η_i 0, i.e. extend both with another zero.
  (2) If j = i, then ψ_i := ψ_i 1 1^∞ and η_i := η_i 0 1^∞ and terminate both; set ψ_n := η_i, η_n := ψ_i, a_n := a_i, n := n + 1, D := D ∪ {i}.
  (3) If j ≠ i, then ψ_i := η_i := ψ_i 0 1^∞ and terminate both; D := D ∪ {i}.
next i

(a) First note that ψ and η are computable, everywhere defined and P_ψ = P_η; this follows immediately from the construction.
(b) Furthermore, ψ and η are one-one: the functions in both numberings take as values only 0 and 1. For every s, at most two functions start with 1^s 0. Furthermore, every function begins with 1^s 0 for some suitable s. If there are two such functions, one continues with 0 1^∞ and the other one is terminated by ending it with 1 1^∞; cf. (2) in the construction above.
(c) Finally, {a_i | i ∈ N} = N: obviously, in every step s some a_n will assume the value s.

So we have by (a) and (b) that P_ψ = P_η is an enumerable set of functions and both ψ and η are one-one. It remains to prove that ψ ≰ η. Assume by way of contradiction that there is a recursive function c satisfying ψ_n = η_{c(n)} for all n. There exists i such that φ_i = c. Since φ_i is recursive, φ_i(i) = c(i) = j is defined. There are two possible cases:

i = j: Then ψ_i = 1^{a_i} 0^k 1 1^∞ for some suitable k, since the construction will be terminated in (2). But η_j = 1^{a_i} 0^k 0 1^∞ and therefore ψ_i ≠ η_j. Contradiction.

i ≠ j: In this case ψ_i = η_i = 1^{a_i} 0^k 0 1^∞ by (3) of the construction. Since η is one-one, η_j ≠ η_i = ψ_i follows. Contradiction.

So, ψ ≰ η, and hence the theorem follows. □

Corollary 1. Let ψ and η be the numberings constructed in the proof of Theorem 3. Assume P_ψ ∈ Gex-Fin_ψ with good examples ex and P_η ∈ Gex-Fin_η with good examples ex′. Then there exist infinitely many i such that
(1) ex_i ∪ ex′_i ⊆ ψ_i ∩ η_i, and
(2) ψ_i ≠ η_i.
Proof. Due to space restrictions we only give an informal argument. If ψ_i ≠ η_i, then there exists j such that ψ_i = η_j and ψ_j = η_i. The set {i | ψ_i ≠ η_i} is not recursive, and therefore neither is the question of whether there is a j as mentioned above. The functions ψ_i and η_i have the same beginning, and since it is not known whether they will differ later, the good examples, once for ψ_i and once for η_i, have to be selected from this common part. If the two functions do differ, i.e. if there exists j as indicated above, then the good examples ex_j (ex′_j, resp.) can be selected so as to contain a difference between ψ_j and η_j (η_j and ψ_j, resp.). So, ψ_i and η_i are not equal, but the examples were picked from the common part. This yields the assertion. □

Let us interpret this result. Pick any i satisfying conditions (1) and (2) of the corollary. Note that (1) implies ex_i ∪ ex′_i ⊆ ψ_i and ex_i ∪ ex′_i ⊆ η_i. Hence, ex_i ∪ ex′_i is an admissible input for any inference machine M1 witnessing P_ψ ∈ Gex-Fin_ψ as well as for any inference machine M2 witnessing P_η ∈ Gex-Fin_η. Since both numberings are one-one and by definition of the inference process, we get ψ_{M1(ex_i ∪ ex′_i)} = ψ_i and η_{M2(ex_i ∪ ex′_i)} = η_i. So, both machines learn a function from the input ex_i ∪ ex′_i, but since ψ_i ≠ η_i, they have learned different functions from the same set of examples. The corollary states that this will happen for infinitely many concepts in P_ψ, regardless of how the teacher selects the examples for the machines M1 and M2 and their respective hypothesis spaces ψ and η.

Now we will prove the theorems for the general case. Again we stress that the fact stated in Theorem 4 is already known. On the other hand, the proof given here had to be made from scratch in order to guarantee the properties we need to formulate and prove Corollary 2.
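The phenomenon behind Corollary 1 can be mimicked in a small finite setting: two hypothesis spaces over the same two functions, paired in swapped order, so that examples drawn from the common beginning lead two learners to the same index but different functions. This toy is an illustration only, not the paper's construction:

```python
# Toy analogue of the Corollary 1 situation (illustrative only): two
# "numberings" of the same two functions, indexed in swapped order.
f = lambda x: 0 if x < 3 else 1   # 0 0 0 1 1 1 ...
g = lambda x: 0                   # 0 0 0 0 0 0 ...
psi = [f, g]                      # psi_0 = f, psi_1 = g
eta = [g, f]                      # eta_0 = g, eta_1 = f  (same class, swapped)

# Examples drawn from the common beginning of f and g are contained in both:
shared = {0: 0, 1: 0, 2: 0}
assert all(f(x) == y and g(x) == y for x, y in shared.items())

# Each machine outputs index 0 in its own space, so the hypotheses look
# identical, yet the functions learned differ:
hyp1, hyp2 = 0, 0
assert psi[hyp1](5) != eta[hyp2](5)
```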
Theorem 4. For each a ≥ 2, a ∈ N, there exist numberings ψ^1, ..., ψ^a with the following properties for all i, j ∈ {1, ..., a}:
(1) ψ^i ∈ R² and ψ^i is one-one,
(2) P_{ψ^i} = P_{ψ^j}, and
(3) i ≠ j implies ψ^i ≰ ψ^j.

Proof. Let a ≥ 2 be given. We will construct the numberings ψ^1, ..., ψ^a in parallel by diagonalizing against all possible (a choose 2)-tuples of reductions among the ψ^j. There are easier ways to prove part (3) of the theorem, but this will enable us to prove an analogue to Corollary 1.

The construction below uses lists. Let [] denote the empty list. For some non-empty list L = [a, b, c, d], as you would expect, Head(L) = a and Tail(L) = [b, c, d]. Now we will give a short explanation of the variables used in the construction and hope this will increase readability a little bit. In γ_i we code the reductions we will diagonalize against with the set of functions beginning with 1^i 0. Of course we could get this information by scanning the beginning part of each function, but this keeps notation easier. In L_i we keep a list of reductions, in the order in which we want to diagonalize against them. The variable m_i will denote the biggest value for which all the functions starting
with 1^i 0 have already been defined. If we write c_{xy}, we mean the reduction which is supposed to reduce ψ^x to ψ^y. Finally, we use b_x and b_y to store the index of the biggest function beginning with 1^i 0 in ψ^x and ψ^y. By biggest we mean the function f with the largest x ∈ domain(f) such that f(x) = 1, up to the arguments defined until now. Note that it will be possible to compute those, since we have knowledge of m_i.

Initialize: ψ^j_i := ↑ for all j ∈ {1, ..., a} and all i ≥ 0. Set n := 0, and m_i := 0 and L_i := [] for all i ≥ 0.

Step s:
  γ_n := s = ⟨c_12, c_13, ..., c_{a−1,a}⟩, the s-th (a choose 2)-tuple of candidate reductions.
  L_n := [c_12, c_23, c_34, ..., c_{a−1,a} | rest], where rest is the list resulting from erasing c_12, c_23, c_34, ..., c_{a−1,a} from the list [c_12, c_13, ..., c_{a−1,a}].
  [For example, for a = 5, L_n = [c_12, c_23, c_34, c_45, c_13, c_14, c_15, c_24, c_25, c_35].]
  for j = 1 to a do: ψ^j_n := 1^n 0; end for; m_n := n; n := n + 1.
  for i = 1 to n − 1, for which the functions beginning with 1^i 0 have not yet been terminated, do:
    if L_i = [] then [all reductions have been taken care of]
      for all ψ^j_z, j ∈ {1, ..., a}, z ≤ n, that begin with 1^i 0, define ψ^j_z(m_i + 1) := 0;
      m_i := m_i + 1.
    else [there still are reductions we need to diagonalize against]
      c := Head(L_i) = c_{xy}.
      if x + 1 = y then [still in the first part of the diagonalization]
        compute r := φ_c(i) for at most s steps.
        if r = i then [so we have ψ^x_i = ψ^y_i = 1^i 0X0, where 'X' is some suitable sequence of zeros and ones of length m_i − i − 2]
          ψ^x_i := 1^i 0X000;
          ψ^j_i := 1^i 0X010 for all y ≤ j ≤ a;
          ψ^j_n := 1^i 0X010 for all 1 ≤ j ≤ x;
          ψ^j_n := 1^i 0X000 for all y ≤ j ≤ a;
          for all ψ^j_z, j ∈ {1, ..., a}, z ≤ n, that start with 1^i 0 and have not yet been defined for arguments m_i + 1 and m_i + 2, let ψ^j_z(m_i + 1) := ψ^j_z(m_i + 2) := 0;
          n := n + 1, m_i := m_i + 2; L_i := Tail(L_i).
      else [i.e., x + 1 < y, the second part of the diagonalization]
        b_x := index of the biggest function beginning with 1^i 0 in ψ^x.
        b_y := index of ψ^x_{b_x} in ψ^y.
        compute r := φ_c(b_x) for at most s steps.
        if r = b_y then [again ψ^x_{b_x} = ψ^y_{b_y} = 1^i 0X0; see above]
          ψ^x_{b_x} := 1^i 0X000;
          ψ^y_{b_y} := 1^i 0X010;
          ψ^j_n := 1^i 0X010 for all j ∈ {1, ..., a}, j ≠ y;
          ψ^y_n := 1^i 0X000;
          for all ψ^j_z, j ∈ {1, ..., a}, z ≤ n, that start with 1^i 0 and have not yet been defined for arguments m_i + 1 and m_i + 2, let ψ^j_z(m_i + 1) := ψ^j_z(m_i + 2) := 0;
          n := n + 1, m_i := m_i + 2; L_i := Tail(L_i).
      if the computation of φ_c(i), resp. φ_c(b_x), does not terminate within s steps, then
        for all ψ^j_z, j ∈ {1, ..., a}, z ≤ n, that begin with 1^i 0, define ψ^j_z(m_i + 1) := 0;
        m_i := m_i + 1.
  end for
(a) Note that ψ^j ∈ R² and P_{ψ^i} = P_{ψ^j} for all i, j ∈ {1, ..., a} follows immediately from the construction.
(b) Every ψ^j is one-one. This can be argued in the same way as in the proof of Theorem 3.
(c) For any (a choose 2)-tuple (c_12, c_13, ..., c_{a−1,a}) of indices of recursive functions, φ_{c_xy} does not reduce ψ^x to ψ^y. Again, this can easily be seen, since the construction takes care that every possible reduction is wrong.

It remains to prove that i ≠ j implies ψ^i ≰ ψ^j for i, j ∈ {1, ..., a}. Assume i < j, otherwise just exchange i and j. Suppose by way of contradiction that φ_k reduces ψ^i to ψ^j. Hence k is an index of a recursive function. Let (k, k, ..., k) be an (a choose 2)-tuple. Obviously, it only contains indices of recursive functions. By (c) we have that, for all indices c_xy in this tuple, φ_{c_xy} does not reduce ψ^x to ψ^y. This contradicts our choice of k. The theorem follows. □

Corollary 2. Let a ≥ 2 and let ψ^1, ..., ψ^a be the numberings constructed in the proof of Theorem 4. Assume P_{ψ^i} ∈ Gex-Fin_{ψ^i} with good examples ex^i for all i ∈ {1, ..., a}. Then there exist infinitely many n such that
(1) ex^i_n ⊆ ψ^j_n for all i, j ∈ {1, ..., a}, and
(2) ψ^i_n ≠ ψ^j_n for all i ≠ j, i, j ∈ {1, ..., a}.
Proof. (Sketch) A self-referential argument shows that the construction used in the proof of Theorem 4 would make a mistake if conditions (1) and (2) of the formulation of the corollary were not fulfilled. □

This corollary somewhat strengthens the remarks following Corollary 1. Here we have the situation that, for infinitely many n, (⋃_{j=1}^{a} ex^j_n) ⊆ ψ^i_n and ψ^i_n ≠ ψ^k_n for all i ≠ k ∈ {1, ..., a}. Condition (2) now implies that every machine learns a different function when presented with the examples ⋃_{j=1}^{a} ex^j_n. The other comments apply here as well.

Furthermore note that there exist families {ex^j_{j′} | j ∈ {1, ..., a}, j′ ≥ 0} of finite sets, which we also might call good examples for now, such that ex^i_{i′} = ex^j_{j′} iff ψ^i_{i′} = ψ^j_{j′}, for all i, j ∈ {1, ..., a} and i′, j′ ≥ 0.
To see this, one checks the proof of Theorem 4, which yields that for every function f in the generated numberings only finitely many x exist such that f(x) ≠ 0. So there exist families containing examples with all and only those x. It is easy to see that now these sets of good examples are equal if and only if so are the corresponding functions. This obviously yields the assertion stated above. But none of these families is computable, since this would clearly contradict the corollary. On the other hand, it would be easy to compute these families "in the limit". And hence, a teacher with lots of experience might know a good part of these typical examples for his area of expertise.

One might ask if this phenomenon can also be achieved for infinitely many numberings. As the following theorem, cf. [9], shows, there is indeed an infinite sequence of pairwise non-equivalent numberings, all of them enumerating the same class of recursive functions. The proof used here is conceptually much simpler than the original one in [9].

Theorem 5. There exists a sequence ψ^i, i ≥ 0, of numberings such that for all i ≠ j the following properties are satisfied:
(1) ψ^i ∈ R², ψ^j ∈ R² and both numberings are one-one,
(2) P_{ψ^i} = P_{ψ^j},
(3) i ≠ j implies ψ^i ≰ ψ^j.

Proof. (Sketch) This is again a diagonalization, but this time it is made sure that, for i < j, ψ^i cannot be reduced to ψ^j. In essence, the argument used to prove Theorem 3 is repeated an infinite number of times. The only thing to take care of is the accounting of which functions were used in the construction, in order to fulfill condition (2). □

But there is no analogue to Corollaries 1 and 2. An analogue to the corollaries in the finite case would require, for any choice of the good examples, a sequence of indices n_k, k ≥ 0, such that ψ^i_{n_k} ≠ ψ^j_{n_k} and ⋃_{k≥0} ex^k_{n_k} ⊆ ψ^i_{n_i} for all i ≠ j. Note that in the corollaries above one index n fulfills this requirement, and we do not need a whole sequence. But requiring one index n would complicate the previous proof and is not needed for the following argument.

For each ψ^i in the sequence constructed in Theorem 5, P_{ψ^i} ∈ Gex-Fin_{ψ^i} holds by Proposition 1. Furthermore, this can be achieved with good examples ex^i that satisfy, for all i, j, the following conditions: card(ex^i_j) > max{i, j} and furthermore ex^i_j is an initial segment of ψ^i_j. To see this, assume P_{ψ^i} ∈ Gex-Fin_{ψ^i} with good examples ex^i for some i. Define, for all j, ex^i_j := {(x, ψ^i_j(x)) | 0 ≤ x ≤ max{i, j, max(domain(ex^i_j))}}. Since the inference machine witnessing P_{ψ^i} ∈ Gex-Fin_{ψ^i} has to learn from all finite supersets of the good examples ex^i as well, these obviously also are good examples for P_{ψ^i} with respect to ψ^i. This yields that for every sequence n_k, the set E = ⋃_{k≥0} ex^k_{n_k} will be infinite. For every x there is at least one y such that (x, y) ∈ E, since the good examples are initial segments. Therefore E can be contained in at most one function. In fact, E might not even be a function, but a relation. An analogue to the corollaries in the finite case therefore cannot hold.
4 Conclusion, Open Problems, and Interpretation
Theorem 4 and Corollary 2 witness the following fact: for every n ≥ 2, there is an enumerable class of recursive functions and n hypothesis spaces, such that, no matter how the teacher chooses the examples for some concepts within the class, each machine witnessing the learnability for one of those hypothesis spaces will learn a different function when confronted with this set of examples. Even worse, they all produce seemingly identical hypotheses consistent with the given examples, so that it is impossible for the teacher to know whether the machines have successfully learned the intended function or not.

Of course, there are unanswered questions. What do concept classes look like that fulfill the corollaries? What are necessary and/or sufficient conditions for those? Or, rephrased in recursion theoretic terms: given two recursive numberings ψ and η, what is needed to be able to test pointwise equality between them? I.e., is there an f ∈ R such that f(i) = 1 iff ψ_i = η_i?

Another interesting question is which classes of recursive functions have the property that all their numberings with decidable equality are reducible to one another, or, equivalently, that all their one-one numberings are equivalent with respect to ≡. This would characterize the classes of functions that are learnable with typical good examples, by using Theorem 2, in a different way. For enumerable classes, necessary and sufficient conditions are known in order to assure that all of their numberings are equivalent with respect to ≡; cf. [5] and the references therein. But it is easy to see that there is an enumerable class U of functions, such that all one-one numberings of U are equivalent, but not all enumerations of U are. For example, let U = {f_i | i ∈ N} ∪ {constant zero function}, where f_i is always 0, with the exception of i, where it takes value 1.
Obviously, all one-one numberings of U are equivalent, and it is easy to construct two numberings of U that are not equivalent, by exploiting the fact that the constant zero function is an accumulation point for U. (The function f is an accumulation point for U, if for all n ∈ N there exists m_n such that m_n < m_{n+1} and f(x) = g_n(x) for all x < m_n and f(m_n) ≠ g_n(m_n) for some g_n in U.)

In the comments following Corollary 2 it is mentioned that typical examples could be computed in the limit. It could be interesting to find an intuitive definition for "limit computable examples", without shifting most of the inference process into this limiting computation.

The reader only interested in mathematical results may now stop reading. The following tries to give some connections of the obtained results to real life teaching and interpret them in that context.

As mentioned before, the pair (M, ψ) might be thought of as a pupil: the pupil's learning algorithm M and its knowledge base ψ. In a classroom the following might happen: a teacher gives two pupils (M1, ψ) and (M2, η) the examples for some concepts he wants them to learn. The pupils process those examples and return an identical hypothesis i, as it happens even in the theoretical case; see the comments following Corollary 2 on page 270. Now, if the teacher does not know which knowledge base a pupil is using, he might not know if both pupils
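The accumulation point property of the constant zero function in this class U can be checked directly in a finite setting. The sketch below is an illustration only, choosing the witnesses g_n = f_n and m_n = n:

```python
# Check: the constant zero function is an accumulation point of
# U = { f_i | i in N }, where f_i(x) = 1 iff x == i.
zero = lambda x: 0
fs = [lambda x, i=i: 1 if x == i else 0 for i in range(8)]

# For each n choose g_n = f_n and m_n = n: the m_n are strictly increasing,
# zero agrees with f_n below m_n, and zero differs from f_n at m_n.
for n in range(8):
    m_n = n
    assert all(zero(x) == fs[n](x) for x in range(m_n))
    assert zero(m_n) != fs[n](m_n)
```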
have learned the intended concept. As a consequence, he will have to submit them both to a series of tests in order to check what they learned. And eventually, he will have to correct mistakes the pupils made. Normally this is achieved by presenting new examples to the pupil in order to make it see its mistake and learn the correct concept. The corollaries show that there are concept classes and sets of pupils such that, for some concepts, the teacher has to compute a different set of examples for each pupil. But this is only feasible in small classrooms, since otherwise the teacher just will not have the time to dedicate enough attention to each student and compute a different set of examples for each one. So the model reflects the well known fact that smaller classrooms improve learning performance.

In addition one might notice that all students have the same learning potential, since all knowledge bases contain the same set of functions. Hence no pupil could be considered "stupid"; they just generate and test ideas in a different way. This might be a reason why some pupils like some teachers and learn better with them: teacher and pupil use a similar "knowledge base", the teacher for generating and the pupil for testing the examples. So it is to be expected that they will achieve good learning performance. Of course, there might, and most likely will, be other reasons, but it would be interesting to test this hypothesis in a real classroom.

I am grateful to R. Wiehagen, C. Smith and the referees for many valuable suggestions and to W. and A. Nessel for carefully reading a draft of this work.
References

1. Jain, S., Lange, S., Nessel, J. (1997), Learning of r.e. Languages from Good Examples, in Li, Maruoka, Eds., Eighth International Workshop on Algorithmic Learning Theory, 32-47, Lecture Notes in Artificial Intelligence 1316, Springer Verlag.
2. Freivalds, R., Kinber, E.B., Wiehagen, R. (1993), On the power of inductive inference from good examples. Theoretical Computer Science 110, 131-144.
3. Gold, E.M. (1967), Language identification in the limit. Information and Control 10, 447-474.
4. Goldman, S.A., Mathias, H.D. (1996), Teaching a smarter learner. Journal of Computer and System Sciences 52, 255-267.
5. Kummer, M. (1995), A Learning-Theoretic Characterization of Classes of Recursive Functions, Information Processing Letters 54, 205-211.
6. Lange, S., Nessel, J., Wiehagen, R. (1998), Learning recursive languages from good examples, Annals of Mathematics and Artificial Intelligence 23, 27-52.
7. Lange, S., Zeugmann, T. (1993), Language Learning in Dependence on the Space of Hypotheses, Proceedings of the Sixth Annual ACM Conference on Computational Learning Theory, 127-136, ACM Press.
8. Lange, S., Zeugmann, T. (1993), Learning recursive languages with bounded mind-changes, International Journal of Foundations of Computer Science 2, 157-178.
9. Marchenkov, S.S. (1972), The computable enumerations of families of recursive functions, Algebra i Logika 11, 588-607; English translation in Algebra and Logic 15, 128-141, 1972.
10. Rogers, H. Jr. (1987), Theory of Recursive Functions and Effective Computability, MIT Press, Cambridge, Massachusetts.
On the Uniform Learnability of Approximations to Non-recursive Functions

Frank Stephan¹ and Thomas Zeugmann²
¹ Mathematisches Institut, Universität Heidelberg, Im Neuenheimer Feld 294, 69120 Heidelberg, Germany
[email protected]
² Department of Informatics, Kyushu University, Kasuga 816-8580, Japan
[email protected]
Abstract. Blum and Blum (1975) showed that a class B of suitable recursive approximations to the halting problem is reliably EX-learnable. These investigations are carried on by showing that B is neither in NUM nor robustly EX-learnable. Since the definition of the class B is quite natural and does not contain any self-referential coding, B serves as an example that the notion of robustness for learning is quite more restrictive than intended. Moreover, variants of this problem obtained by approximating any given recursively enumerable set A instead of the halting problem K are studied. All corresponding function classes U(A) are still EX-inferable but may fail to be reliably EX-learnable, for example if A is non-high and hypersimple. Additionally, it is proved that U(A) is neither in NUM nor robustly EX-learnable provided A is part of a recursively inseparable pair, A is simple but not hypersimple, or A is neither recursive nor high. These results provide more evidence that there is still some need to find an adequate notion for naturally learnable function classes.
1. Introduction

Though algorithmic learning of recursive functions has been intensively studied within the last three decades, there is still some need to elaborate this theory further. For the purpose of motivation, let us shortly recall the basic scenario. An algorithmic learner is fed growing initial segments of the graph of the target function f. Based on the information received, the learner computes a
⋆ Supported by the Deutsche Forschungsgemeinschaft (DFG) under Research Grant no. Am 60/9-2.
⋆⋆ Supported by the Grant-in-Aid for Scientific Research in Fundamental Areas from the Japanese Ministry of Education, Science, Sports, and Culture under grant no. 10558047. Part of this work was done while visiting the Laboratoire d'Informatique Algorithmique: Fondements et Applications, Université Paris 7. This author is gratefully indebted to Maurice Nivat for providing financial support and inspiring working conditions.
O. Watanabe, T. Yokomori (Eds.): ALT'99, LNAI 1720, pp. 276-290, 1999.
© Springer-Verlag Berlin Heidelberg 1999
hypothesis on each input. The sequence of all computed hypotheses has to converge to a correct, finite and global description of the target f. We shall refer to this scenario by saying that f is EX-learnable (cf. Definition 1).

Clearly, what one is really interested in are powerful learning algorithms that cannot only learn one function but all functions from a given class of functions. Gold [11] provided the first such powerful learner, i.e., the identification by enumeration algorithm, and showed that it can learn every class contained in NUM. Here NUM denotes the family of all function classes that are subsets of some recursively enumerable class of recursive functions. There are, however, learnable classes of recursive functions which are not contained in NUM. The perhaps most prominent example is the class SD of self-describing recursive functions, i.e., of all those functions that compute a program for themselves on input 0. Clearly, SD is EX-learnable.

Since Gold's [11] pioneering paper a huge variety of learning criteria have been proposed within the framework of inductive inference of recursive functions (cf., e.g., [3,6,8,9,15,19,21]). By comparing these inference criteria to one another, it became popular to show separation results by using function classes with self-referential properties. On the one hand, the proof techniques developed are mathematically quite elegant. On the other hand, these separating examples may be considered to be a bit artificial, because of the use of self-describing properties. Hence, Barzdins suggested to look at versions of learning that are closed under computable transformations (cf. [20,28]). For example, a class U is robustly EX-learnable, iff, for every computable operator Θ such that Θ(U) is a class of recursive functions, the class Θ(U) is EX-learnable, too (cf. Definition 4). There have been many discussions which operators are admissible in this context (cf., e.g., [10,14,16,20,23,28]).
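Gold's identification-by-enumeration strategy mentioned above admits a compact rendering: always conjecture the least index in the given enumeration that is consistent with the data seen so far. The sketch below uses a finite toy class standing in for a recursively enumerable class; the names are illustrative assumptions, not the paper's:

```python
# Sketch of identification by enumeration for a class in NUM (toy version):
# conjecture the least index whose function agrees with the observed segment.

def identify_by_enumeration(numbering, segment):
    """segment is the list of values f(0), ..., f(n-1) of the target f."""
    for i, g in enumerate(numbering):
        if all(g(x) == segment[x] for x in range(len(segment))):
            return i  # least consistent index; never abandoned once correct
    return None

# Toy class: the constant functions g_i(x) = i.
numbering = [lambda x, i=i: i for i in range(10)]
target = numbering[7]

# The sequence of hypotheses converges (here: immediately) to a correct index.
guesses = [identify_by_enumeration(numbering, [target(x) for x in range(n)])
           for n in range(1, 5)]
assert guesses == [7, 7, 7, 7]
```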
At the end, it turned out to be most suitable to consider only general recursive operators, that is, operators which map every total function to a total one. The resulting notion of robust EX-learning is the most general one among all notions of robust EX-inference. Next, we state the two main questions that are studied in the present paper. (1) What is the overall theory developed so far telling us about the learnability of naturally defined function classes? (2) What is known about the robust EX-learnability of such naturally defined function classes? Clearly, an answer to the first question should tell us something about the usefulness of the theory, and an answer to the second problem should, in particular, provide some insight into the naturalness of robust EX-learning. However, our knowledge concerning both questions has been severely limited. For function classes in NUM everything is clear, i.e., their learnability has been proved with respect to many learning criteria including robust EX-learning. Next, let us consider one of the few natural function classes outside NUM that have been considered in the literature, i.e., the class C of all recursive complexity functions. Then, using Theorem 2.4 and Corollary 2.6 in [23], one can conclude that C is not robustly EX-learnable for many complexity measures including space, since
Frank Stephan and Thomas Zeugmann
there is no recursive function that bounds every function in C for all but finitely many arguments. On the other hand, C itself is still learnable with respect to many inference criteria by using the identification by enumeration learner. The latter result already provides some evidence that the notion of robust EX-learning may be too restrictive. Nevertheless, the situation may be completely different if one looks at classes of {0,1}-valued recursive functions, since their learnability sometimes differs considerably from the inferability of arbitrary function classes (cf., e.g., [17,26]). As far as these authors are aware, one of the very few natural classes of {0,1}-valued recursive functions that may be a candidate not to be included in NUM has been proposed by Blum and Blum [6]. They considered a class B of approximations to the halting problem K and showed that B is reliably EX-learnable. This class B is quite natural and not self-describing. It remained, however, open whether or not B is in NUM. Within the present work, it is shown that B is neither in NUM nor robustly EX-learnable. Moreover, we study generalizations of Blum and Blum’s [6] original class by considering classes U(A) of approximations for any recursively enumerable set A. In particular, it is shown that all these classes remain EX-learnable but not necessarily reliably EX-inferable (cf. Theorems 14 and 16). Furthermore, we show U(A) to be neither in NUM nor robustly EX-learnable provided A is part of a recursively inseparable pair, A is simple but not hypersimple, or A is neither recursive nor high (cf. Theorems 13 and 17). Thus the results obtained enlarge our knowledge concerning the learnability of naturally defined function classes. Additionally, all those classes U(A) which are not in NUM, as well as B, are natural examples of classes which are on the one hand not self-describing and on the other hand not robustly learnable.
So all these U(A) provide some evidence that the presently discussed notions of robust and hyperrobust learning [1,7,10,14,16,23,28] destroy not only coding tricks but also the learnability of quite natural classes. Due to the lack of space, many proofs are only sketched or omitted. We refer the reader to [25] for a full version of this paper.
2. Preliminaries

Unspecified notations follow Rogers [24]. IN = {0, 1, 2, ...} and IN* denote the set of all natural numbers and the set of all finite sequences of natural numbers, respectively. {0,1}* stands for the set of all finite {0,1}-valued sequences, and for all x ∈ IN we use {0,1}^x for the set of all {0,1}-valued sequences of length x. The classes of all partial recursive and recursive functions of one and two arguments over IN are denoted by P, P², R and R², respectively. f ∈ P is said to be monotone provided for all x, y ∈ IN with x ≤ y we have: if both f(x) and f(y) are defined, then f(x) ≤ f(y). R_{0,1} and Rmon denote the set of all {0,1}-valued recursive functions and of all monotone recursive functions, respectively. Furthermore, we write f^n instead of the string (f(0), ..., f(n)), for any n ∈ IN and f ∈ R. Sometimes it will be suitable to identify a recursive function with
the sequence of its values, e.g., let α = (a0, ..., ak) ∈ IN*, j ∈ IN and p ∈ R_{0,1}; then we write αjp to denote the function f for which f(x) = a_x if x ≤ k, f(k+1) = j, and f(x) = p(x − k − 2) if x ≥ k + 2. Furthermore, let g ∈ P and α = (a0, ..., ak) ∈ IN*; we write α ⊑ g iff α is a prefix of the sequence of values associated with g, i.e., for all x ≤ k, g(x) is defined and g(x) = a_x.

Any function ψ ∈ P² is called a numbering. Moreover, let ψ ∈ P²; then we write ψ_i for the function x ↦ ψ(i, x) and set P_ψ = {ψ_i | i ∈ IN} as well as R_ψ = P_ψ ∩ R. Consequently, if f ∈ P_ψ, then there is a number i such that f = ψ_i. If f ∈ P and i ∈ IN are such that ψ_i = f, then i is called a ψ-program for f. Let ψ be any numbering, and i, x ∈ IN; if ψ_i(x) is defined (abbr. ψ_i(x)↓) then we also say that ψ_i(x) converges. Otherwise, ψ_i(x) is said to diverge (abbr. ψ_i(x)↑).

A numbering φ ∈ P² is called a Gödel numbering or acceptable numbering (cf. [24]) iff P_φ = P, and for any numbering ψ ∈ P² there is a c ∈ R such that ψ_i = φ_{c(i)} for all i ∈ IN. In the following, let φ be any fixed Gödel numbering. As usual, we define the halting problem to be the set K = {i | i ∈ IN, φ_i(i)↓}. Any function Φ ∈ P² satisfying dom(φ_i) = dom(Φ_i) for all i ∈ IN and such that {(i, x, y) | i, x, y ∈ IN, Φ_i(x) ≤ y} is recursive is called a complexity measure (cf. [5]). Furthermore, let NUM = {U | (∃ψ ∈ R²)[U ⊆ P_ψ]} denote the family of all subsets of all recursively enumerable classes of recursive functions.

Next, we define the concepts of learning mentioned in the introduction.

Definition 1. Let U ⊆ R and let M: IN* → IN be a recursive machine.
(a) (Gold [11]) M is an EX-learner for U iff, for each function f ∈ U, M converges syntactically to f in the sense that there is a j ∈ IN with φ_j = f and j = M(f^n) for all but finitely many n ∈ IN.
(b) (Angluin [2]) M is a conservative EX-learner for U iff M EX-learns U and M makes in addition only necessary hypothesis changes in the sense that, whenever M(σ) ≠ M(στ), then the program M(σ) is inconsistent with the data στ by either φ_{M(σ)}(x)↑ or φ_{M(σ)}(x) ≠ στ(x) for some x ∈ dom(στ).
(c) (Barzdins [4], Case and Smith [8]) M is a BC-learner for U iff, for each function f ∈ U, M converges semantically to f in the sense that φ_{M(f^n)} = f for all but finitely many n ∈ IN.

A class U is EX-learnable iff it has a recursive EX-learner, and EX denotes the family of all EX-learnable function classes. Similarly we define when a class is conservatively EX-learnable or BC-learnable. We write BC for the family of all BC-learnable function classes. Note that EX ⊂ BC (cf. [8]). As far as we are aware, it has been open whether or not conservative learning constitutes a restriction for EX-learning of recursive functions. The negative answer is provided by the next proposition.

Proposition 2. EX = conservative-EX.

Nevertheless, whenever suitable, we shall design a conservative learner instead of just an EX-learner, thus avoiding the additional general transformation given by the proof of Proposition 2.
Next, we define reliable inference. Intuitively, a learner M is reliable provided it converges if and only if it learns. There are several variants of reliable learning, so we will give a justification of our choice below.

Definition 3 (Blum and Blum [6], Minicozzi [21]). Let U ⊆ R; then U is said to be reliably EX-learnable if there is a machine M ∈ R such that (1) M EX-learns U and (2) for all f ∈ R, if the sequence (M(f^n))_{n∈IN} converges, say to j, then φ_j = f.
By REX we denote the family of all reliably EX-learnable function classes. Note that one can replace the condition f ∈ R in (2) of Definition 3 by f ∈ P or by all total f. This results in a different model of reliable learning, say PEX and T EX, respectively. Then for every U ⊆ R_{0,1} such that U ∈ PEX or U ∈ T EX one has U ∈ NUM (cf. [6,12,26]). On the other hand, there are classes U ⊆ R_{0,1} such that U ∈ REX \ NUM (cf. [12]). As a matter of fact, our Theorem 6 below together with Blum and Blum’s [6] result B ∈ REX provides a much easier proof of the same result than Grabowski [12].

Finally, we define robust EX-learning. This involves the notion of general recursive operators. A general recursive operator is a computable mapping that maps functions over IN to functions over IN and maps every total function to a total function. For a formal definition and more information about general recursive operators the reader is referred to [13,22,27].

Definition 4 (Jain, Smith and Wiehagen [16]). Let U ⊆ R; then U is said to be robustly EX-learnable if Θ(U) is EX-learnable for every general recursive operator Θ. By robust-EX we denote the family of all robustly EX-learnable function classes.
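For intuition about Definition 4, a general recursive operator can be pictured as a computable transformation in which each output value depends on only finitely many values of the input function, and total inputs yield total outputs. The particular operator below, Θ(f)(x) = f(x) + f(x+1), is an illustrative assumption, not one used in the paper.

```python
# A simple general recursive operator: Theta(f)(x) = f(x) + f(x + 1).
# Each output value reads finitely much of f, and every total input
# function is mapped to a total output function, which is the
# defining requirement on general recursive operators.

def theta(f):
    return lambda x: f(x) + f(x + 1)

# Robust EX-learning asks that Theta(U) stays EX-learnable for every
# such Theta, not just for the identity operator.
square = lambda x: x * x
g = theta(square)
print([g(x) for x in range(4)])  # [1, 5, 13, 25]
```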
3. Approximating the Halting Problem

Within this section, we deal with Blum and Blum’s [6] class B. First, we define the class of approximations to the halting problem considered in [6].

Definition 5. Let ψ_(i) for all i ∈ IN be such that for all x ∈ IN:

ψ_(i)(x) = 1, if Φ_i(x)↓ and Φ_x(x) ≤ Φ_i(x);
ψ_(i)(x) = 0, if Φ_i(x)↓ and ¬[Φ_x(x) ≤ Φ_i(x)];
ψ_(i)(x) is undefined otherwise.

Now, we set B = { ψ_(i) | i ∈ IN and Φ_i ∈ Rmon }.
Blum and Blum [6] have shown B ∈ REX but left it open whether or not B ∈ NUM. It is not, as our next theorem shows.

Theorem 6. B ∉ NUM.

Proof. First, recall that K is part of a recursively inseparable pair (cf. [22, Exercise III.6.23.(a)]). That is, there is an r.e. set H such that K ∩ H = ∅ and for
every recursive set A ⊇ H we have A ∩ K ≠ ∅. Now, we fix any enumeration k0, k1, k2, ... and h0, h1, h2, ... of K and H, respectively. Suppose to the contrary that there exists a numbering ψ ∈ R² such that B ⊆ R_ψ. Next, we define for each e a function Ψ_e ∈ P as follows. For all e, x ∈ IN let

Ψ_e(x) = search for the least n such that for n = s + y either (A), (B) or (C) happens:
(A) y = h_s and ψ_e(y) = 1;
(B) y = k_s, ψ_e(y) = 0 and y > x;
(C) ψ_e(y) > 1.

If (A) happens first, then set Ψ_e(x) = s + y.
If (B) happens first, then let Ψ_e(x) = Φ_y(y) + y.
If (C) happens first, then let Ψ_e(x) = 0.
Claim 1. Ψ_e ∈ R for all e ∈ IN.
If there is at least one y such that ψ_e(y) > 1, then Ψ_e ∈ R. Now let ψ_e ∈ R_{0,1} and suppose that there is an x ∈ IN with Ψ_e(x)↑. Then there are no s, y such that y = h_s and ψ_e(y) = 1. Hence, H ⊆ M = { y | y ∈ IN, ψ_e(y) = 0 } and M is recursive. Thus, M ∩ K ≠ ∅. So there must be a y > x with ψ_e(y) = 0 and an s ∈ IN with y = k_s. Thus (B) must happen, and since y = k_s, we conclude Φ_y(y)↓. Hence, Ψ_e(x)↓, too, a contradiction. This proves Claim 1.

Claim 2. Let e be any number such that ψ_e = ψ_(i) for some ψ_(i) ∈ B. Then Ψ_e(x) > Φ_i(x) for all x ∈ IN.
Assume any i, e as above, and consider the definition of Ψ_e(x). Suppose Ψ_e(x) = s + y for some s, y such that y = h_s and ψ_e(y) = 1. Since ψ_e(y) = ψ_(i)(y) = 1 implies Φ_y(y) ≤ Φ_i(y), and hence y ∈ K, we get a contradiction to K ∩ H = ∅. Thus, this case cannot happen. Consequently, in the definition of Ψ_e(x) condition (B) must have happened. Thus, some s, y such that y > x, y = k_s and ψ_e(y) = 0 have been found. Since y = k_s, we conclude Φ_y(y)↓ and thus Ψ_e(x) > Φ_y(y). Because of ψ_e(y) = ψ_(i)(y) = 0, we obtain Φ_i(y) < Φ_y(y) by the definition of ψ_(i). Now, putting it all together, we get Ψ_e(x) > Φ_y(y) > Φ_i(y) ≥ Φ_i(x), since y > x and Φ_i ∈ Rmon. This proves Claim 2.

Claim 3. For every b ∈ R there exists an i ∈ IN such that Φ_i ∈ Rmon and b(x) < Φ_i(x) for all x ∈ IN.
Let r ∈ R be such that for all j, x ∈ IN we have

φ_{r(j)}(0) = 0, if [Φ_j(0)↓ > b(0)];
φ_{r(j)}(0) = φ_j(0) + 1, otherwise;

and for x > 0

φ_{r(j)}(x) = 0, if φ_j(n) is defined for all n < x, [Φ_j(x)↓ > b(x)] and Φ_j(x − 1) ≤ Φ_j(x);
φ_{r(j)}(x) = φ_j(x) + 1, if φ_j(n) is defined for all n < x and [Φ_j(x)↓ ≤ b(x)] or [Φ_j(x)↓ and Φ_j(x) < Φ_j(x − 1)];
φ_{r(j)}(x) = φ_j(x), otherwise.
By the fixed point theorem [24] there is an i ∈ IN such that φ_{r(i)} = φ_i. Now, one inductively shows that φ_i = 0^∞, Φ_i ∈ Rmon and b(x) < Φ_i(x) for all x ∈ IN, and Claim 3 follows.
Finally, by Claim 1, all Ψ_e ∈ R, and thus there is a function b ∈ R such that b(x) ≥ Ψ_e(x) for all e ∈ IN and all but finitely many x ∈ IN (cf. [6]). Together with Claim 2, this function b contradicts Claim 3, and hence B ∉ NUM.

The next result can be obtained by looking at U(K) in Theorems 15 and 17.

Theorem 7. B is REX-inferable but not robustly EX-learnable.

Theorems 6 and 7 immediately allow the following separation, thus reproving Grabowski’s [12] Theorem 5.

Corollary 8. NUM ∩ ℘(R_{0,1}) ⊂ REX ∩ ℘(R_{0,1}), i.e., restricted to classes of {0,1}-valued recursive functions, NUM is properly contained in REX.
Finally, we ask whether or not the condition Φ_i ∈ Rmon in the definition of the class B is necessary. The affirmative answer is given by our next theorem. That is, instead of B, we now consider the class B′ = { ψ_(i) | i ∈ IN and Φ_i ∈ R }.

Theorem 9. B′ is not BC-learnable.

Next, we generalize the approach undertaken so far by considering classes U(A) of approximations to any recursively enumerable (abbr. r.e.) set A.
4. Approximating Arbitrary r.e. Sets

The definition of Blum and Blum’s [6] class uses implicitly the measure C_K defined as C_K(x) = Φ_x(x) for measuring the speed by which K is enumerated. Using this notion C_K, the class B of approximations of K is defined as

B = { f ∈ R_{0,1} | (∃Φ_e ∈ Rmon)(∀x)[ f(x) = 1 ⟺ C_K(x) ≤ Φ_e(x) ] }.
Our main idea is to replace K by an arbitrary r.e. set A and to replace C_K by a measure C_A of (the enumeration speed of) A. Such a measure satisfies the following two conditions:

The set { (x, y) | C_A(x) ≤ y } is recursive.
(∀x ∈ A)(∃y)[ C_A(x) ≤ y ].
Here, C_A is intended to be taken as the function Φ_i of some index i of A, but sometimes we might also take the freedom to look at some other functions C_A satisfying the two requirements above. The natural definition for a class U(A), corresponding to the class B in the case A = K, based on an underlying function C_A is the following.

Definition 10. Given an r.e. set A, an enumeration measure C_A and a total function Φ_e, let

f_e(x) = 1, if C_A(x) ≤ Φ_e(x);
f_e(x) = 0, if ¬[C_A(x) ≤ Φ_e(x)].
Now U(A) consists of all those f_e where Φ_e ∈ Rmon.
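Definition 10 can be made concrete with a toy finite enumeration: C_A(x) is read off as the stage at which x appears in A (treated as infinite for x outside A), and f_e labels x with 1 exactly when that stage fits within the budget Φ_e(x). The particular set, stages, and budget below are assumptions for illustration only.

```python
import math

# Toy enumeration of A: element -> stage at which it enters A.
# Elements outside A never appear, so their "stage" is infinite.
stage = {2: 5, 3: 1, 7: 12}           # assumed enumeration of A = {2, 3, 7}

def C_A(x):
    """Enumeration measure: the stage at which x is enumerated into A."""
    return stage.get(x, math.inf)

def f_e(x, phi_e):
    """Approximation from Definition 10: 1 iff x enters A within phi_e(x) stages."""
    return 1 if C_A(x) <= phi_e(x) else 0

# A monotone budget; larger budgets yield better approximations to A.
phi = lambda x: 2 * x + 1
labels = [f_e(x, phi) for x in range(8)]
print(labels)  # [0, 0, 1, 1, 0, 0, 0, 1]
```

Raising the budget Φ_e only ever turns 0-labels of elements of A into 1-labels, while elements outside A always keep label 0; in this sense the f_e approximate the characteristic function of A from below.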
Next, comparing U(K) to the original class B of Blum and Blum [6], one can easily prove the following. For every f ∈ B there is a function g ∈ U(K) such that for all x ∈ IN we have: f(x) = 1 implies g(x) = 1. Hence, the approximation g is at least as good as f. The converse is also true, i.e., for each g ∈ U(K) there is an f ∈ B such that g(x) = 1 implies f(x) = 1 for all x ∈ IN. Therefore, we consider our new classes of approximations as natural generalizations of Blum and Blum’s [6] original definition. Moreover, note that there is a function enA which computes for every index e of a monotone Φ_e a program enA(e) for the function f_e associated with Φ_e:

φ_{enA(e)}(x) = 1, if C_A(x) ≤ Φ_e(x) and (∀y < x)[ Φ_e(y) ≤ Φ_e(y + 1) ];
φ_{enA(e)}(x) = 0, if ¬[C_A(x) ≤ Φ_e(x)] and (∀y < x)[ Φ_e(y) ≤ Φ_e(y + 1) ];
φ_{enA(e)}(x) is undefined otherwise.
Now, if A is recursive, everything is clear, since we have the following.

Theorem 11. If A is recursive then U(A) ∈ NUM.
The direct generalization of Theorem 6 would be that U(A) is not in NUM for every non-recursive r.e. set A and every measure C_A. Unfortunately, there are some special cases where this is still unknown to us. We obtained many intermediate results which give evidence that U(A) is not in NUM for any non-recursive r.e. set A. First, every non-recursive set A has a sufficiently slow enumeration such that U(A) ∉ NUM for this underlying enumeration and the corresponding C_A. Second, for many classes of sets we can directly show that U(A) ∉ NUM, whatever measure C_A we choose. Besides the cases where A is part of a recursively inseparable pair or A is simple but not hypersimple, the case of the non-recursive and non-high sets A is interesting, in particular since the proof differs from that for the two previous cases.

Recall that a set A is simple iff A is r.e. and the complement of A is infinite, but there is no infinite recursive set R disjoint from A. A set A is hypersimple iff A is r.e., the complement of A is infinite, and there is no function f ∈ R such that f(n) ≥ a_n for all n ∈ IN, where a0, a1, ... is the enumeration of the complement of A in strictly increasing order (cf. Rogers [24]). Using this definition of hypersimple sets, one can easily show the following lemma.

Lemma 12. A set A ⊆ IN is hypersimple iff
(a) A is r.e. and both A and its complement are infinite;
(b) for all functions g ∈ R with g(x) ≥ x for all x ∈ IN there exist infinitely many x ∈ IN such that {x, x + 1, ..., g(x)} ⊆ A.

Now, we are ready to state the announced theorem.

Theorem 13. U(A) is not in NUM for the following r.e. sets A.
(a) A is part of a recursively inseparable pair.
(b) A is simple but not hypersimple.
(c) A is neither recursive nor high.
Proof. We sketch only the proof of Assertion (c) here. Assume by way of contradiction U(A) ∈ NUM. Thus, there is a ψ ∈ R² such that U(A) ⊆ R_ψ. Assume without loss of generality that 0 ∈ A. The A-recursive function d_A(x) = max{ C_A(y) | y ≤ x and y ∈ A } is total and recursive relative to A. If now m(x) ≥ d_A(x), then the function f_m generated by m in accordance with Definition 10 is equal to the characteristic function of A:

A(x) = f_m(x) = 1, if C_A(x) ≤ m(x);
A(x) = f_m(x) = 0, if ¬[C_A(x) ≤ m(x)].

So one can define the following A-recursive function h:

h(x) = min{ y ≥ x | (∀j ≤ x)(∃z)[ (x ≤ z ≤ y) and ψ_j(z) ≠ A(z) ] }.

Since A is not recursive, no function ψ_j can be a finite variant of A, and thus h is total. Using h we next define the following total A-recursive function g by g(x) = Σ_{y=x}^{h(x)} d_A(y). Since A is not high, there is a function b ∈ R such that b(x) ≥ g(x) for infinitely many x. By Claim 3 in the demonstration of Theorem 6, there exists an e ∈ IN such that Φ_e ∈ Rmon and Φ_e(x) > b(x) for all x ∈ IN. Thus, Φ_e(x) ≥ g(x) for infinitely many x.

Next, for every k ∈ IN there exists an x > k such that Φ_e(x) ≥ g(x). Consider all y ∈ {x, x + 1, ..., h(x)}. By the definition of g and by Φ_e ∈ Rmon, we have Φ_e(y) ≥ d_A(y) for all these y. Thus, by the choice of d_A and the definition of enA(e), we arrive at φ_{enA(e)}(y) = A(y) for all y ∈ {x, x + 1, ..., h(x)}. But now the definition of the function h guarantees that ψ_k(z) ≠ φ_{enA(e)}(z) for some z with x ≤ z ≤ h(x). Consequently, φ_{enA(e)} differs from all ψ_k, in contradiction to the assumption U(A) ⊆ R_ψ.
5. Reliable and EX-Learnability of U(A)

Blum and Blum [6] showed B ∈ REX. The EX-learnability of U(A) alone can be generalized to every r.e. set A, but this is not possible for reliability. But before dealing with REX-inference, we show that every U(A) is EX-learnable.

Theorem 14. U(A) is EX-learnable for all r.e. sets A.

Proof. If A is recursive, then U(A) ∈ NUM (cf. Theorem 11) and thus EX-learnable. So let A be non-recursive and let C_A be a recursive enumeration measure of A. An EX-learner for the class U(A) is given as follows. On input σ, disqualify all e such that there are x ∈ dom(σ) and y satisfying one of the following three conditions:
(a) φ_{enA(e)}(x)↓ and φ_{enA(e)}(x) ≠ σ(x);
(b) σ(x) = 0, C_A(x) ≤ y and ¬[Φ_e(x) ≤ y];
(c) Φ_e(x + 1) ≤ y and ¬[Φ_e(x) ≤ y].
Output enA(e) for the smallest e not yet disqualified.
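The learner in this proof is a "least surviving index" strategy. The skeleton below renders that shape in code, with the three tests (a)-(c) abstracted into a single computable disqualification predicate; the concrete predicate and data used here are toy assumptions, not the actual conditions of the proof.

```python
# Skeleton of a least-survivor learner: output the smallest index e
# not yet disqualified by the data seen so far.  'disqualified' stands
# in for the computable tests (a)-(c) of the proof.

def least_survivor_learner(data, disqualified, max_index=100):
    for e in range(max_index):
        if not disqualified(e, data):
            return e
    return None

# Toy predicate: index e is disqualified once some data point (x, label)
# contradicts the hypothesis "label(x) = 1 iff x is divisible by e + 1".
def toy_disqualified(e, data):
    return any(label != int(x % (e + 1) == 0) for x, label in data)

data = [(0, 1), (1, 0), (2, 1), (3, 0)]
result = least_survivor_learner(data, toy_disqualified)
print(result)  # 1: index 0 is refuted by (1, 0), index 1 survives
```

Because a disqualified index is never output again and the output changes only when the current index is refuted, this strategy is conservative in the sense of Definition 1(b).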
The algorithm disqualifies only such indices e where φ_{enA(e)} is either defined and false or undefined for some x ∈ dom(σ). Thus the learner is conservative. Since the correct indices are never disqualified, it remains to show that the incorrect ones are. This clearly happens if φ_{enA(e)}(y)↓ ≠ σ(y) for some y. Otherwise let z be the first undefined place of φ_{enA(e)}. This undefined place is either due to the fact that Φ_e(x) > Φ_e(x + 1) for some x < z or that Φ_e(z)↑. In the first case, e is eventually disqualified by condition (c). In the second case, either Φ_e(x + 1)↓ for some first x ≥ z, and then e is again eventually disqualified by condition (c), or C_A(x)↓ for some x ∈ A above z, and so e is disqualified by condition (b). Hence, the learning algorithm is correct.

The result that B is reliably EX-learnable can be generalized to halves of recursively inseparable pairs and to simple but not hypersimple sets.

Theorem 15. U(A) is reliably EX-learnable if
(a) A is part of a recursively inseparable pair, or
(b) A is simple but not hypersimple.

Proof. The central idea of the proof is that conditions (a) and (b) allow one to identify a class of functions which contains all recursive functions that are too difficult to learn, and on which the learner then signals divergence infinitely often. The recursive functions outside this class turn out to be EX-learnable and contain the class U(A). The learner M does not need to succeed on functions f ∉ R_{0,1} or with f(x) = 1 for almost all x outside A. Now, the second condition can be checked indirectly for f ∈ R_{0,1}.

In case (a), let A and B = {b0, b1, ...} form a recursively inseparable pair, with A the set in the precondition of the theorem. If f(x) = 1 for almost all x outside A then f(b_s) = 1 for some b_s. So one defines that σ disqualifies if σ(x) ≥ 2 for some x or if σ(b_s) = 1 for some s.

In case (b), the set A is simple but not hypersimple. By Lemma 12 there is a function g ∈ R with g(x) ≥ x for all x ∈ IN such that the complement of A intersects every interval {x, x + 1, ..., g(x)}.
But if f(x) = 1 for almost all x outside A then, by the simplicity of A, f(x) = 1 for almost all x, and there is an x with f(y) = 1 for all y ∈ {x, x + 1, ..., g(x)}. So one defines that σ disqualifies if σ(x) ≥ 2 for some x or if there is an x with σ(y) = 1 for all y ∈ {x, x + 1, ..., g(x)}.

The reliable EX-learner N is a modification of the learner M from Theorem 14: it copies M on all σ except those which disqualify; on those, N always outputs a guess for σ01^∞ and thus either converges to some σ01^∞ or diverges by infinitely many changes of the hypothesis. Let e(σ) be a program for σ01^∞ and let

N(σ) = e(σ), if σ is disqualified;
N(σ) = M(σ), otherwise.

For the verification, note that for every f ∈ U(A) we have f(x) = 0 for all x ∉ A. Thus, if f ∈ U(A) then no σ ⊑ f is disqualified and therefore N is an EX-learner for U(A).
Assume now that N converges to an e0 on some recursive function f. If this happens for a function f such that some σ0 ⊑ f has been disqualified, then f = σ01^∞ and so also φ_{e0} = σ01^∞ for some σ0 ⊑ f. Thus, N converges to a correct program for f in this case. Otherwise, no σ ⊑ f is disqualified. Since N copies the indices of M and those are all of the form enA(e), there is a least e with e0 = enA(e). If f(x) = 0 for infinitely many x outside A, then M converges to enA(e) only if φ_{enA(e)} = f, and the algorithm is correct in that case.

Finally, consider the subcase that f(x) = 0 for only finitely many x outside A. Consequently, in case (a) f(x) = 1 for some x ∈ B, and in case (b) there must be an x such that f(y) = 1 for all y ∈ {x, x + 1, ..., g(x)}. In both cases, some σ ⊑ f is disqualified, so this subcase cannot occur. Hence, N is reliable.

Theorem 16. If A is hypersimple and not high then U(A) ∉ REX.
Proof. Let A be a hypersimple non-high set, let C_A be a corresponding measure, and assume to the contrary that U(A) ∈ REX. Then also the union

U(A) ∪ { σ1^∞ | σ ∈ {0,1}* }

is EX-learnable, since every class in NUM is also in REX and REX is closed under union (cf. [6,21]). Given an EX-learner M for the above union, one can define the following function h1 by taking

h1(x) = min{ s | (∀σ ∈ {0,1}^x)(∀y ≤ x)[ M(σ1^s) ≠ M(σ) or (Φ_{M(σ)}(y) ≤ s and φ_{M(σ)}(y) = (σ1^∞)(y)) ] }.
The function h1 is total since any guess M(σ) either computes the function σ1^∞ or is eventually replaced by a new guess on σ1^∞. Note that h1 ∈ Rmon and h1(x) ≥ x for all x. Since A is not recursive, there is no total recursive function dominating C_A. Thus one can define a recursive function h2(x) by taking

h2(x) = the smallest s such that there is a y with x ≤ y ≤ s and C_A(y) + C_A(y + 1) + ... + C_A(y + h1(y)) < s.

Since A is hypersimple, we directly get from Lemma 12 that h2 ∈ R. Consider for every f ∈ U(A) the index i to which M converges and an index j with f = φ_{enA(j)}. Assume now that M has converged to i at z ≤ x. Consider the y, s from the definition of h2(x) and let σ = f(0)...f(y). If M(σ1^{h1(y)}) ≠ M(σ), then there is some y′ ∈ {y, y + 1, ..., y + h1(y)} with f(y′) = 0. As a consequence, Φ_j(y′) < C_A(y′) < h2(x). Since Φ_j ∈ Rmon, we know Φ_j(y) < s ≤ h2(x). Otherwise, M(σ1^{h1(y)}) = M(σ), so Φ_i(x) ≤ h1(y) and φ_i(x) has converged. Since y ≤ h2(x), we conclude Φ_i(x) ≤ h1(h2(x)).

So one can give the following definition for f by case-distinction, where the first applicable case is taken and where σ = f(0)...f(z).
φ_{e(i,j,σ)}(x) = σ(x), if x ∈ dom(σ);
φ_{e(i,j,σ)}(x) = φ_i(x), if Φ_i(x) ≤ h1(h2(x));
φ_{e(i,j,σ)}(x) = 1, if C_A(x) ≤ Φ_j(x) ≤ h2(x);
φ_{e(i,j,σ)}(x) = 0, otherwise.
Since the search-conditions in the second and third case are bounded by a recursive function in x, the family of all φ_{e(i,j,σ)} contains only total functions, and its universal function (i, j, σ, x) ↦ φ_{e(i,j,σ)}(x) is computable in all parameters. Furthermore, for the correct i, j, σ as chosen above, φ_{e(i,j,σ)} equals the given f since, for all x > z, either φ_i(x) converges within h1(h2(x)) steps to f(x) or C_A(x) ≤ Φ_j(x) ≤ h2(x). It follows that this family covers U(A) and that U(A) is in NUM, a contradiction to Theorem 13, since A is neither recursive nor high.
6. Robust Learning

A mathematically elegant proof method to separate learning criteria is the use of classes of self-describing functions. On the one hand, these examples are a bit artificial, since they use coding tricks. On the other hand, natural objects like cells contain a description of themselves. Nevertheless, from a learning-theoretical point of view some criticism remains in order, since a learner needs only to fetch some code from the input. Therefore, Barzdins suggested to look at restricted versions of learning: for example, a class S is robustly EX-learnable iff, for every operator Θ, the class Θ(S) is EX-learnable. There were many discussions which operators are admissible in this context and how to deal with those cases where Θ maps some functions in S to partial functions. At the end, it turned out that it is most suitable to consider only general recursive operators, which map every total function to a total one [16]. This notion is, among all notions of robust EX-learning, the most general one in the sense that every class S which is robustly EX-learnable with respect to any criterion considered in the literature is also robustly EX-learnable with respect to the model of Jain, Smith and Wiehagen [16].

Although the class B is quite natural and does not have any obvious self-referential coding, the class B is not robustly EX-learnable. So while on the one hand the notion of robust EX-learning still permits topological coding tricks [16,23], it does on the other hand already rule out the natural class B. The provided example gives some evidence that there is still some need to find an adequate notion for a natural EX-learnable class.

Every class in NUM is robustly EX-learnable, in particular the class U(A) for a recursive set A (cf. Theorem 11). The next theorem shows that U(A) is not robustly EX-learnable for any non-recursive set A which is part of a recursively inseparable pair, which is simple but not hypersimple, or which is neither recursive nor high. Thus, here the situation is parallel to the one of Theorem 13.
Theorem 17. U(A) is not robustly EX-learnable for the following r.e. sets A.
(a) A is part of a recursively inseparable pair.
(b) A is simple but not hypersimple.
(c) A is neither recursive nor high.
7. Conclusions

The main topic of the present investigations has been the class B of Blum and Blum [6] and the natural generalizations U(A) of it obtained by using r.e. sets A as a parameter. It has been shown that for large families of r.e. sets A, these classes U(A) are not in NUM. Furthermore, they can always be EX-learned. Moreover, for some but not all sets A there is also a REX-learner. Robust EX-learning is impossible for all non-recursive sets A that are part of a recursively inseparable pair, for simple but not hypersimple sets A, and for all sets A that are non-high and non-recursive. Since the classes U(A) are quite natural, this result adds some evidence that natural learnability does not coincide with robust learnability as defined in the current research.

Future work might address the remaining unsolved question whether U(A) is outside NUM for all non-recursive sets A. Additionally, one might investigate whether U(A) is robustly BC-learnable for some sets A such that U(A) is not robustly EX-inferable. It would also be interesting to know whether or not U(A) can be reliably BC-learned for sets A with U(A) ∉ REX (cf. [18] for more information concerning reliable BC-learning). Finally, there are some ways to generalize the notion of U(A) to every K-recursive set A, and one might investigate the learning-theoretic properties of the so obtained classes.
References
1. A. Ambainis and R. Freivalds. Transformations that preserve learnability. In Proceedings of the 7th International Workshop on Algorithmic Learning Theory (ALT’96) (S. Arikawa and A. Sharma, Eds.), Lecture Notes in Artificial Intelligence Vol. 1160, pages 299–311, Springer-Verlag, Berlin, 1996.
2. D. Angluin. Inductive inference of formal languages from positive data. Information and Control, 45:117–135, 1980.
3. D. Angluin and C.H. Smith. A survey of inductive inference: Theory and methods. Computing Surveys, 15:237–289, 1983.
4. J. Barzdins. Prognostication of automata and functions. Information Processing ’71, (1) 81–84. Edited by C.P. Freiman, North-Holland, Amsterdam, 1971.
5. M. Blum. A machine-independent theory of the complexity of recursive functions. Journal of the Association for Computing Machinery, 14:322–336, 1967.
6. L. Blum and M. Blum. Towards a mathematical theory of inductive inference. Information and Control, 28:125–155, 1975.
7. J. Case, S. Jain, M. Ott, A. Sharma and F. Stephan. Robust learning aided by context. In Proceedings of the 11th Annual Conference on Computational Learning Theory (COLT’98), pages 44–55, ACM Press, New York, 1998.
8. J. Case and C.H. Smith. Comparison of identification criteria for machine inductive inference. Theoretical Computer Science, 25:193–220, 1983.
9. R. Freivalds. Inductive inference of recursive functions: Qualitative theory. In Baltic Computer Science (J. Barzdins and D. Bjørner, Eds.), Lecture Notes in Computer Science Vol. 502, pages 77–110. Springer-Verlag, Berlin, 1991.
10. M. Fulk. Robust separations in inductive inference. In Proceedings of the 31st Annual Symposium on Foundations of Computer Science (FOCS), pages 405–410, St. Louis, Missouri, 1990.
11. M.E. Gold. Language identification in the limit. Information and Control, 10:447–474, 1967.
12. J. Grabowski. Starke Erkennung. In Strukturerkennung diskreter kybernetischer Systeme (R. Lindner, H. Thiele, Eds.), Seminarberichte der Sektion Mathematik der Humboldt-Universität Berlin Vol. 82, pages 168–184, 1986.
13. J.P. Helm. On effectively computable operators. Zeitschrift für mathematische Logik und Grundlagen der Mathematik (ZML), 17:231–244, 1971.
14. S. Jain. Robust Behaviourally Correct Learning. Technical Report TRA6/98, DISCS, National University of Singapore, 1998.
15. S. Jain, D. Osherson, J.S. Royer and A. Sharma. Systems That Learn: An Introduction to Learning Theory. MIT Press, Boston, MA, 1999.
16. S. Jain, C. Smith and R. Wiehagen. On the power of learning robustly. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory (COLT), pages 187–197, ACM Press, New York, 1998.
17. E.B. Kinber and T. Zeugmann. Inductive inference of almost everywhere correct programs by reliably working strategies. Journal of Information Processing and Cybernetics, 21:91–100, 1985.
18. E.B. Kinber and T. Zeugmann. One-sided error probabilistic inductive inference and reliable frequency identification. Information and Computation, 92:253–284, 1991.
19. R. Klette and R. Wiehagen. Research in the theory of inductive inference by GDR mathematicians: A survey. Information Sciences, 22:149–169, 1980.
20. S. Kurtz and C.H. Smith. On the role of search for learning. In Proceedings of the 2nd Annual Workshop on Computational Learning Theory (R. Rivest, D. Haussler and M. Warmuth, Eds.), pages 303–311, Morgan Kaufmann, 1989.
21. E. Minicozzi. Some natural properties of strong-identification in inductive inference. Theoretical Computer Science, 2:345–360, 1976.
22. P. Odifreddi. Classical Recursion Theory. North-Holland, Amsterdam, 1989.
23. M. Ott and F. Stephan. Avoiding coding tricks by hyperrobust learning. In Proceedings of the Fourth European Conference on Computational Learning Theory (EuroCOLT) (P. Fischer and H.U. Simon, Eds.), Lecture Notes in Artificial Intelligence Vol. 1572, pages 183–197, Springer-Verlag, Berlin, 1999.
24. H. Rogers, Jr. Theory of Recursive Functions and Effective Computability. McGraw-Hill, New York, 1967.
25. F. Stephan and T. Zeugmann. On the Uniform Learnability of Approximations to Non-Recursive Functions. DOI Technical Report DOI-TR-166, Department of Informatics, Kyushu University, July 1999.
26. T. Zeugmann. A-posteriori characterizations in inductive inference of recursive functions. Journal of Information Processing and Cybernetics (EIK), 19:559–594, 1983.
290
Frank S ephan and Thomas Zeugmann
27. T. Zeugmann. On he nonboundabili y of o al e ec ive opera ors. Zeitschrift f¨ ur mathematische Lo ik und Grundla en der Mathematik (ZML), 30:169 172, 1984. 28. T. Zeugmann. On Barzdins’ conjec ure. In Proceedin s of the International Workshop on Analo ical and Inductive Inference (AII’86) (K.P. Jan ke, Ed.), Lec ure No es in Compu er Science Vol. 265, pages 220 227. Springer-Verlag, Berlin, 1986.
Learning Minimal Covers of Functional Dependencies with Queries*

Montserrat Hermo(1) and Víctor Lavín(2)

(1) Dpto. Lenguajes y Sistemas Informáticos, UPV/EHU, P.O. Box 649, E-20080 San Sebastián, SPAIN. j phehum@s .ehu.es
(2) Dpto. Sistemas Informáticos y Programación, Universidad Complutense, E-28040 Madrid, SPAIN. vlav[email protected] m.ucm.es
Abstract. Functional dependencies play an important role in the design of databases. We study the learnability of the class of minimal covers of functional dependencies (MCFD) within the exact learning model via queries. We prove that neither equivalence queries alone nor membership queries alone suffice to learn the class. In contrast, we show that learning becomes feasible if both types of queries are allowed. We also give some properties concerning minimal covers.
1 Introduction
Functional dependencies were introduced by Codd [7] as a tool for designing relational databases. Based on this concept, a well-developed formalism has arisen, the theory of normalization. This formalism helps to build relational databases that lack undesirable features, such as redundancy in the data and update anomalies. We study the learnability of the class MCFD (minimal covers of functional dependencies) in the model of learning with queries due to Angluin [1,2]. In this model the learner's goal is to identify an unknown target concept c in some class C. In order to obtain information about the target, the learner has available two types of queries: membership and equivalence queries. In a membership query the learner supplies an instance x from the domain and gets answer YES if x belongs to the target, and NO otherwise. The input to an equivalence query is some hypothesis h, and the answer is either YES if h ≡ c or a counterexample in the symmetric difference of c and h. The class C is learnable if the learner can identify any target concept c in time polynomial in the size of c and the length of the largest counterexample received.
* Partially supported by the Spanish DGICYT through project PB95-0787 (KOALA).
O. Watanabe, T. Yokomori (Eds.): ALT'99, LNAI 1720, pp. 291-300, 1999. © Springer-Verlag Berlin Heidelberg 1999
We prove that neither equivalence queries alone nor membership queries alone suffice to learn MCFD. For these negative results we use techniques similar to those in [3,6,8]. On the other hand, we show that MCFD is learnable using both types of queries. Our algorithm is a modification of Angluin et al.'s algorithm [4] to learn conjunctions of Horn clauses. We also show that the size of equivalent minimal covers of functional dependencies is polynomially related. Some related work can be found in [9,10], where the authors study how prior knowledge can speed up the task of learning. They propose functional dependencies ("determinations" in their terminology) as a form of prior knowledge. They pose the question of whether prior knowledge can be learned. This paper investigates that direction. The paper is organized as follows. In Section 2 we introduce definitions related to functional dependencies, and some algorithms that are folk knowledge. We need them to prove some properties that will be used throughout the paper. In Sections 3 and 4 we prove negative results for membership queries and equivalence queries respectively. Finally, Section 5 shows the learning algorithm using membership and equivalence queries.
2 Preliminaries
In what follows, we give some definitions, properties and algorithms related to functional dependencies, most of which can be found in any database textbook (see [11,5]). For definitions concerning the model of learning via queries we refer the reader to [2].

2.1 Functional Dependencies and Minimal Covers
A relation scheme R = {A1, A2, ..., An} is a set of attributes. Each attribute Ai takes values from domain DOM(Ai). An instance r of relation scheme R is a subset of DOM(A1) × DOM(A2) × ... × DOM(An). The size of an instance r is the number of n-tuples of r. Given R and X, Y subsets of R, the functional dependency X → Y is a constraint on the values that instances of R can take. More precisely, we say X → Y, read "X functionally determines Y", if for every instance r of R, and for every pair t1, t2 of tuples of r, t1(X) = t2(X) implies t1(Y) = t2(Y) (where t1(Z) = t2(Z) means that tuples t1 and t2 coincide in the value of all attributes in Z). Given a functional dependency X → Y, we call X the antecedent of the functional dependency and Y the consequent. We say that a functional dependency X → Y is logically implied by a set of functional dependencies F if every instance r of R that satisfies the dependencies in F also satisfies X → Y. If r does not satisfy X → Y, then we say that r violates X → Y.
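The satisfaction condition above translates directly into code. The sketch below is our own illustration, not the paper's: an instance is a list of tuples (dicts from attribute names to values), and a dependency is a pair (X, Y) of attribute sets.

```python
def satisfies(r, fd):
    """True iff instance r satisfies the functional dependency X -> Y."""
    X, Y = fd
    for t1 in r:
        for t2 in r:
            # t1(X) = t2(X) must imply t1(Y) = t2(Y)
            if all(t1[a] == t2[a] for a in X) and \
               any(t1[a] != t2[a] for a in Y):
                return False   # r violates X -> Y
    return True
```

For example, the two-tuple instance [{'A': 0, 'B': 0}, {'A': 0, 'B': 1}] violates {A} → {B} but satisfies {B} → {A}.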
Definition 1. The closure of a set of dependencies F, denoted by F+, is the set of functional dependencies that are logically implied by F.

Definition 2. Let F and G be sets of dependencies. F is equivalent to G (F ≡ G) if F+ = G+.

Definition 3. The closure of a set of attributes X, written X+, with respect to a set of dependencies F, is the set of attributes A such that X → A is in F+.

Given a relation scheme R, X ⊆ R and a set of functional dependencies F, the following algorithm (see [5]) computes X+ with respect to F, in time polynomial in |R| and |F|.

Algorithm Closure
  input X;
  X+ := X;
  repeat
    OLDX+ := X+;
    for each dependency V → W in F do
      if V ⊆ X+ then
        X+ := X+ ∪ W;
      end if
    end for
  until OLDX+ = X+;

It is easy to test whether two sets of dependencies F and F′ are equivalent: for each dependency X → Y in F (F′), test whether X → Y is in F′+ (F+), using the above algorithm to compute X+ with respect to F′ (F) and then checking whether Y ⊆ X+. We will use this test in Subsection 2.2 to prove some properties concerning minimal covers.

Definition 4. Let F be a set of dependencies. A set of dependencies G is a minimal cover for F if:

1. G ≡ F.
2. The consequent of each dependency in G is a single attribute.
3. For no dependency X → A in G, G − {X → A} ≡ F.
4. For no dependency X → A in G and proper subset Y of X, (G − {X → A}) ∪ {Y → A} ≡ F.
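Algorithm Closure, the equivalence test, and conditions 1-4 of Definition 4 can be sketched together in Python. This is a rough illustration under our own encoding (a dependency is a pair (X, A) of frozensets; none of these names come from the paper):

```python
def closure(X, F):
    """Compute X+ with respect to F (Algorithm Closure)."""
    Xplus = set(X)
    changed = True
    while changed:
        changed = False
        for V, W in F:
            if V <= Xplus and not W <= Xplus:
                Xplus |= W          # X+ := X+ u W
                changed = True
    return Xplus

def implies(F, fd):
    """True iff X -> Y is in F+, i.e. Y is contained in X+ wrt F."""
    X, Y = fd
    return Y <= closure(X, F)

def equivalent(F, G):
    """F == G iff each dependency of one set is implied by the other."""
    return all(implies(G, fd) for fd in F) and all(implies(F, fd) for fd in G)

def minimal_cover(F):
    """One minimal cover for F, following conditions 1-4 of Definition 4."""
    # condition 2: decompose consequents into single attributes
    G = {(frozenset(X), frozenset({A})) for X, Y in F for A in Y}
    # condition 4: drop redundant attributes from antecedents
    changed = True
    while changed:
        changed = False
        for X, A in list(G):
            for a in X:
                H = (G - {(X, A)}) | {(X - {a}, A)}
                if equivalent(H, G):    # G == F is invariant
                    G, changed = H, True
                    break
            if changed:
                break
    # condition 3: drop redundant dependencies
    for fd in list(G):
        if equivalent(G - {fd}, G):
            G = G - {fd}
    return G
```

For instance, for F = {A → B, B → C, A → C} the transitively implied dependency A → C is eliminated, leaving a two-dependency cover.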
We outline a procedure to find a minimal cover (there can be several) for a given set of dependencies F (see [11] for more details). First, using the property that a functional dependency X → Y holds if and only if X → A holds for all A in Y, we decompose all dependencies in F so that condition 2 of the definition is fulfilled. Then, for conditions 3 and 4, we check repeatedly whether dropping a dependency (or some attribute in the antecedent of a dependency) from F yields a set F′ equivalent to F. If it does, we substitute F′ for F, and keep applying the procedure until neither dependencies nor attributes can be eliminated. Note that it follows from the above that the size of a minimal cover for F is never much bigger than the size of F itself (at most a multiplicative factor in the number of attributes), which makes learning minimal covers as interesting as learning sets of general functional dependencies.

2.2 Some Properties of Minimal Covers
Now we prove that the size of equivalent minimal covers is polynomially related. First we need a lemma.

Lemma 1. Let R be a relation scheme, let F and F′ be minimal covers of functional dependencies over R. If F ≡ F′ then any dependency X → A in F can be inferred from F′ using at most |R| dependencies of F′.

Proof. Let us assume, by way of contradiction, that more than |R| dependencies of F′ are needed to infer X → A. Then, at least two of them, say D1 and D2, must have the same consequent. If we run Algorithm Closure to compute X+, the last one of D1 and D2 examined by the algorithm does not force the inclusion of any new attribute into X+, and thus is unnecessary.

Corollary 1. Let R be a relation scheme, let F and F′ be minimal covers of functional dependencies over R. If F ≡ F′ then |F′| ≤ |R| · |F|.

Proof. Suppose, by contradiction, that |F′| > |R| · |F|. Then there is some dependency D ∈ F′ that, by Lemma 1, is not used to infer any of the dependencies in F. Let G be F′ − {D}. Clearly F+ can be inferred from G, and since F+ = (F′)+, D is redundant in F′, that is, F′ is not a minimal cover.

In Sections 3 and 4 we define some target classes containing sets of dependencies, which must be inequivalent and minimal covers. The following lemmas will allow us to ensure such requirements.

Lemma 2. Let F be a set of functional dependencies over R = {A1, ..., An, B} such that the consequent of each dependency in F is the attribute B. If for all X → B in F it holds that X does not contain the antecedent of any other dependency of F, then F is a minimal cover for F.

Proof. Obviously F satisfies conditions 1 and 2 of Definition 4. To check that F satisfies conditions 3 and 4, note that for every dependency X → B in F, and for every proper subset Y of X, X+ with respect to F − {X → B} and Y+ with respect to F do not contain B.

Lemma 3. Let F1 and F2 be sets of dependencies whose consequents are the single attribute B. If there exists a dependency Y → B in F1 such that Y does not contain the antecedent of any dependency of F2, then both sets are inequivalent.

Proof. Let Y → B be the dependency of the hypothesis. As B ∉ Y+ with respect to F2, F1 ≢ F2.
3 Membership Queries
To show that MCFD is not learnable using membership queries alone, a standard adversary technique is used. We define a large target class, from which the target cover will be selected, in such a way that the answer to any membership query eliminates few elements from the target class. This will force the learner to make a superpolynomial number of queries to identify the target cover.

Theorem 1. The class MCFD cannot be learned using a polynomial number of polynomially-sized membership queries.

Proof. Let R = {A1, A2, ..., An, B}, for n even, be a relation scheme and let p be any polynomial. The target class Tn will contain 2^(n/2) covers, all of them having dependencies

A1 A2 → B,  A3 A4 → B,  ...,  A(n−1) An → B.

Besides, each cover contains a distinguished dependency, whose antecedent has one attribute picked from the antecedent of each dependency above, and B as the consequent. By Lemma 2 all covers in the target class Tn are minimal, and by Lemma 3 they are logically inequivalent. Using Tn as the target class, suppose the learner makes a membership query with instance r of size at most p(n). The adversary considers every pair of tuples t1, t2 in r, and answers according to the following rule:

- If t1(B) = t2(B) for every pair t1, t2, then the answer is YES. No cover is eliminated from Tn.
- Otherwise, let S be the set of all pairs t1, t2 such that t1(B) ≠ t2(B):
  - If for some t1, t2 in S there exist attributes A(2i+1), A(2(i+1)) (0 ≤ i ≤ n/2 − 1) such that t1(A(2i+1) A(2(i+1))) = t2(A(2i+1) A(2(i+1))), then the answer is NO. Again no cover is eliminated from Tn.
  - If none of the conditions above hold then the answer is YES. Note that this answer removes at most p(n)/2 covers from Tn in the worst case, the case being that r can be partitioned into pairs t1, t2 where t1 and t2 coincide in the value of exactly n/2 attributes.

Therefore, to identify the target cover the learner must make at least 2^(n/2)/p(n) − 1 membership queries.

4 Equivalence Queries
As in the case of membership queries, to prove nonlearnability with equivalence queries alone we use an adversary argument. First, we need a lemma.

Lemma 4. Let F be a minimal cover defined over R = {A1, A2, ..., An, B}, containing p dependencies, each of them having consequent B, and antecedent with at least √n attributes from A1, A2, ..., An. There exists some instance r of R with the following properties:

- r has two tuples t1, t2.
- r satisfies F.
- The number of attributes Ai ∈ {A1, A2, ..., An} for which t1(Ai) ≠ t2(Ai) is at most 1 + √n(ln p).

Proof. Since all antecedents of dependencies in F have at least √n attributes from A1, A2, ..., An, there must be some attribute Ai that occurs in at least p/√n of them. We now can delete from F the dependencies that have Ai in their antecedent, and apply the same procedure to the remaining set of dependencies. After doing so k times we are left with a set of at most p(1 − 1/√n)^k dependencies. Taking k = 1 + √n(ln p), we obtain p(1 − 1/√n)^k < 1. Therefore, there is a set X with at most 1 + √n(ln p) attributes such that all the antecedents of dependencies in F have some attribute in X. The instance r = {t1, t2} where t1(A) ≠ t2(A) for all A ∈ X and t1(R − X) = t2(R − X) surely satisfies F.

Theorem 2. The class MCFD cannot be learned using a polynomial number of polynomially-sized equivalence queries.

Proof. Let R = {A1, A2, ..., An, B} be a relation scheme. We define the target class Tn that contains every cover G that satisfies the following:

- G has √n dependencies of the form Ai1 Ai2 ... Ai√n → B.
- The antecedents of the dependencies in G are pairwise disjoint.

By Lemma 2 and Lemma 3 all covers in Tn are minimal and logically inequivalent. The cardinality of Tn is

|Tn| = n! / ((√n)!)^(√n + 1)

Now, to an equivalence query on input F, of size at most p = n^c, where c is a constant, the adversary answers as follows:

- If there is some dependency X → Ai in F, where Ai ∈ {A1, A2, ..., An}, then give as a counterexample the instance r = {t1, t2} where t1(Ai) ≠ t2(Ai) and t1(R − {Ai}) = t2(R − {Ai}). Clearly this counterexample does not remove any cover from Tn.
- Otherwise, if there is some dependency X → B in F and |X| < √n, then return as a counterexample the instance r = {t1, t2} where t1(X) = t2(X) and t1(A) ≠ t2(A) for all A ∈ R − X. No cover in Tn is violated by this instance, although it violates F.
- If none of the cases above hold then Lemma 4 guarantees the existence of an instance r = {t1, t2} that satisfies F, and whose tuples disagree in the values of at most 1 + c√n(ln n) attributes. In this case give that instance as counterexample. The covers that will be eliminated by the counterexample are those for which the antecedent of every dependency has at least one attribute in which the tuples of r disagree. Therefore, the number of covers that r eliminates from Tn is at most

En = (1 + c√n(ln n))^√n (n − √n)! / (((√n − 1)!)^√n (√n)!)

The fraction of covers removed from Tn, that is, En/|Tn|, is at most

(1 + c√n(ln n))^√n (√n)^√n (n − √n)! / n!

which is superpolynomially small in n.
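The greedy construction in the proof of Lemma 4 (repeatedly pick an attribute occurring in many antecedents and discard the antecedents it hits) is easy to express in code. The sketch below is our own illustration under our own naming, not the paper's:

```python
from collections import Counter

def hitting_attributes(antecedents):
    """Greedily pick attributes until every antecedent contains one,
    mirroring the deletion argument in the proof of Lemma 4."""
    X, remaining = set(), [set(a) for a in antecedents]
    while remaining:
        # attribute occurring in the most remaining antecedents
        (best, _), = Counter(a for s in remaining for a in s).most_common(1)
        X.add(best)
        remaining = [s for s in remaining if best not in s]
    return X
```

With antecedents {A1,A2}, {A1,A3}, {A4,A5} it returns a two-attribute set such as {A1, A4}; the two tuples of the instance r then disagree exactly on the chosen attributes, so no dependency is triggered.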
5 The Learning Algorithm
In this section we show that a slight modification of Angluin et al.'s algorithm HORN for learning conjunctions of Horn clauses, using membership and equivalence queries, yields an algorithm that learns MCFD. First, we discuss the meaning of positive and negative counterexamples in the setting of functional dependencies. Let us assume that the counterexamples are instances of two tuples (obviously, a one-tuple instance can never be a counterexample). In this case, a positive counterexample {t1, t2} tells that no dependency having its antecedent contained in the set of attributes where t1 and t2 agree, and its consequent outside, can be in the target cover. In contrast, a negative counterexample indicates that at least one dependency satisfying the conditions just mentioned must be in the target. Note that the significance of these counterexamples is the same as the meaning of counterexamples in the case of Horn clauses, if we translate "set of attributes where t1 and t2 agree" into "set of variables assigned true". Also note that there is no syntactic difference between a conjunction of Horn clauses and a minimal cover for a set of functional dependencies(1), hence the input to equivalence queries has the same "shape", no matter which oracle, EQHORN or EQMCFD, we use.
(1) There is an exception to this statement. The counterpart to a Horn clause c with no positive literal should be a functional dependency f with the empty set as consequent. The difference is not merely syntactic but also semantic, since f does not impose any constraint on the instance space, that is, f is superfluous unlike c. This fact rules out the straightforward transformation of the target class used by Angluin [3] to prove approximate fingerprints for CNF (and implicitly for conjunctions of Horn clauses) into a target class for proving non-learnability with equivalence queries alone for MCFD. The reason is that, once transformed, the class would contain just one minimal cover: the empty one. Also the counterpart to a Horn clause c with no negative literals should be a functional dependency f with the empty set in the antecedent. However, in this case the meaning of both c and f is alike.
Therefore, were we to learn MCFD over an instance space containing only two-tupled examples, the transformation of HORN would be straightforward: substitute EQMCFD and MQMCFD for EQHORN and MQHORN respectively; whenever EQMCFD provides a counterexample {t1, t2}, convert {t1, t2} into a boolean vector by setting to true the attributes (variables) where t1 and t2 agree and false elsewhere; finally, perform the reverse mapping before asking any membership query. One last remark: if we wanted the learning algorithm to be proper, in the sense that inputs to equivalence queries be in the class MCFD, we should transform the hypotheses generated into minimal covers. This can be done in polynomial time. Now, we wish to address the problem of learning in the general case, that is, when the instance space is not restricted to contain only two-tupled instances. The key observation is that, to detect the violation of some dependency or the need of its inclusion in the current hypothesis, it suffices to consider pairs of tuples. When a k-tupled positive counterexample is received, we consider all k(k−1)/2 pairs of tuples, and for each of them proceed to remove from the current hypothesis the dependencies that are violated. If the counterexample is negative then we ask k(k−1)/2 membership queries to detect a pair of tuples (there must be at least one) that violates some dependency in the target cover, and proceed accordingly, that is, trying to identify the dependency in order to include it in the current hypothesis. Thus, we have reduced the problem of learning MCFD over an unrestricted instance space to that of learning when the instances have two tuples. We present now the algorithm that learns MCFD. We follow the notation in [4] as much as possible. For x and y boolean vectors, true(x) is the set of attributes assigned true by x; x ∧ y is the boolean vector such that true(x ∧ y) = true(x) ∩ true(y). Given a two-tupled relation r, sketch(r) is the boolean vector whose true values correspond to the attributes for which the tuples of r agree. For a boolean vector x, rel(x) maps x onto a relation r (there are many) such that sketch(r) = x. Finally, if x is a boolean vector such that true(x) = {A1, A2, ..., Ak}, then FD(x) denotes the set of functional dependencies

FD(x) = { A1 A2 ... Ak → B : B ∉ true(x) }
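These mappings are straightforward to implement. A small sketch follows (attribute order fixed in a list; all names are ours, not the paper's):

```python
# Boolean vectors are tuples of bools over the fixed attribute list ATTRS.
ATTRS = ['A1', 'A2', 'A3', 'B']

def sketch(t1, t2):
    """Vector marking the attributes on which the two tuples agree."""
    return tuple(t1[a] == t2[a] for a in ATTRS)

def true_set(x):
    """true(x): the set of attributes assigned true by x."""
    return {a for a, bit in zip(ATTRS, x) if bit}

def meet(x, y):
    """x ^ y: componentwise AND, so true(x ^ y) = true(x) n true(y)."""
    return tuple(a and b for a, b in zip(x, y))

def FD(x):
    """FD(x) = { true(x) -> B : B not in true(x) }."""
    ante = frozenset(true_set(x))
    return {(ante, b) for b in ATTRS if b not in ante}
```

For t1 = {'A1': 0, 'A2': 1, 'A3': 0, 'B': 1} and t2 = {'A1': 0, 'A2': 1, 'A3': 1, 'B': 0}, sketch gives (True, True, False, False), and FD of that vector is {A1 A2 → A3, A1 A2 → B}.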
The algorithm maintains a sequence S of boolean vectors that are sketches of negative counterexamples, each of them violating distinct dependencies of the target cover. This sequence is used to generate a new hypothesis F by taking the union of FD(x) for all x in S. Since we want a proper learning algorithm, we must transform the hypothesis F thus generated into a minimal cover G, prior to any equivalence query. Note that when a positive counterexample is received we eliminate dependencies from F instead of G. In doing so we preserve the parallelism with algorithm HORN, where a hypothesis may contain clauses that are implied by other clauses in the same hypothesis. This is to prevent the algorithm from possibly entering an infinite loop. On the other hand, it is obvious that the counterexample provided by an equivalence query is independent of whether the input to that query is a minimal cover or not, as long as they are equivalent. (For more explanations and ideas behind the algorithm see [4].)

Set S to be the empty sequence; /* s_i denotes the i-th boolean vector of S */
Set F to be the empty hypothesis;
Set G to be the empty hypothesis; /* G is a minimal cover for F */
while EQMCFD(G) != YES loop
  Let r be the counterexample relation returned by the equivalence query;
  if r violates at least one functional dependency of F then
    /* r is a positive example */
    remove from F every dependency that r violates;
  else
    /* r is a negative example */
    ask (at most |r|(|r|−1)/2) queries to MQMCFD until a negative answer
      is got for some {t1, t2} in r;
    x := sketch({t1, t2});
    for each s_i in S such that true(s_i ∧ x) is properly contained
        in true(s_i) loop
      MQMCFD(rel(s_i ∧ x));
    end loop;
    if any of these queries is answered NO then
      let i := min{ j : MQMCFD(rel(s_j ∧ x)) = NO };
      replace s_i with s_i ∧ x;
    else
      add x as the last element in the sequence S;
    end if;
    F := the union of FD(s) for all s in S;
  end if;
  set G to be a minimal cover for F;
end loop;
return G;
end;

The correctness of the algorithm follows from the correctness of HORN and the comments above. As for the query and time complexity, the algorithm makes as many equivalence queries as HORN makes. However, both the number of membership queries and the time complexity are increased, since the counterexamples can have an arbitrary number of tuples, and for each counterexample received the algorithm has to compute a minimal cover. In any case, the complexity is polynomial in the size of the target cover, the number of attributes of the relation scheme and the number of tuples of the largest counterexample.
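Restricted to two-tuple counterexamples (which, as explained above, is the core case), the loop can be simulated end to end against oracles answered from a hidden target. The sketch below is our own illustration, not the authors' code: it represents a two-tuple instance directly by its sketch vector, and the simulated EQ oracle simply enumerates all vectors over a small illustrative scheme.

```python
from itertools import product

ATTRS = ['A', 'B', 'C', 'D']       # illustrative relation scheme

def true_set(x):
    return {a for a, bit in zip(ATTRS, x) if bit}

def violates(x, F):
    """Does the two-tuple instance rel(x) violate some dependency of F?"""
    T = true_set(x)
    return any(X <= T and a not in T for X, a in F)

def FD(x):
    ante = frozenset(true_set(x))
    return {(ante, a) for a in ATTRS if a not in ante}

def learn(target):
    """HORN-style loop of Section 5, with simulated EQ/MQ oracles."""
    def MQ(x):                     # membership query on rel(x)
        return not violates(x, target)
    def EQ(F):                     # two-tuple counterexample, or None
        for x in product([True, False], repeat=len(ATTRS)):
            if violates(x, F) != violates(x, target):
                return x
        return None
    S, F = [], set()
    while (x := EQ(F)) is not None:
        if violates(x, F):         # positive counterexample
            F = {fd for fd in F if not (fd[0] <= true_set(x)
                                        and fd[1] not in true_set(x))}
        else:                      # negative counterexample
            for i, s in enumerate(S):
                m = tuple(a and b for a, b in zip(s, x))   # s ^ x
                if true_set(m) < true_set(s) and not MQ(m):
                    S[i] = m       # refine the first (minimum-index) s_i
                    break
            else:
                S.append(x)        # no refinement possible: extend S
            F = set().union(*(FD(s) for s in S))
    return F
```

Running learn on the hidden target cover {A → B} returns exactly that dependency set after a handful of simulated queries.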
References

1. D. Angluin. "Learning Regular Sets from Queries and Counterexamples". Information and Computation, 75, 87-106, 1987.
2. D. Angluin. "Queries and Concept Learning". Machine Learning, 2(4), 319-342, 1988.
3. D. Angluin. "Negative Results for Equivalence Queries". Machine Learning, 5, 121-150, 1990.
4. D. Angluin, M. Frazier and L. Pitt. "Learning Conjunctions of Horn Clauses". Machine Learning, 9, 147-164, 1992.
5. R. Elmasri and S. B. Navathe. Fundamentals of Database Systems. Benjamin-Cummings Pub., Redwood City, California, 1994.
6. J. Castro, D. Guijarro and V. Lavín. "Learning Nearly Monotone k-term DNF". Information Processing Letters, 67(2), 75-79, 1998.
7. E.F. Codd. "A Relational Model of Data for Large Shared Data Banks". Comm. of the ACM, 13(6), 377-387, 1970.
8. D. Guijarro, V. Lavín and V. Raghavan. "Learning Monotone Term Decision Lists". To appear in Theoretical Computer Science. Proceedings of EUROCOLT'97, 16-26, 1997.
9. S. Mahadevan and P. Tadepalli. "Quantifying Prior Determination Knowledge using PAC Learning Model". Machine Learning, 17(1), 69-105, 1994.
10. P. Tadepalli and S. Russell. "Learning from Examples and Membership Queries with Structured Determinations". Machine Learning, 32, 245-295, 1998.
11. J.D. Ullman. Principles of Database and Knowledge-Base Systems. Computer Science Press, Inc., 1988.
Boolean Formulas Are Hard to Learn for Most Gate Bases

Víctor Dalmau

Departament LSI, Universitat Politècnica de Catalunya, Mòdul C5, Jordi Girona Salgado 1-3, Barcelona 08034, Spain. dalmau@lsi.upc.es
Abstract. Boolean formulas are known not to be PAC-predictable even with membership queries under some cryptographic assumptions. In this paper, we study the learning complexity of some subclasses of boolean formulas obtained by varying the basis of elementary operations allowed as connectives. This broad family of classes includes, as a particular case, general boolean formulas, by considering the basis given by {AND, OR, NOT}. We completely solve the problem. We prove the following dichotomy theorem: for any set of basic boolean functions, the resulting set of formulas is either polynomially learnable from equivalence queries or membership queries alone, or else it is not PAC-predictable even with membership queries under cryptographic assumptions. We identify precisely which sets of basic functions are in which of the two cases. Furthermore, we prove that the learning complexity of formulas over a basis depends only on the absolute expressive power of the class, i.e., the set of functions that can be represented regardless of the size of the representation. In consequence, the same classification holds for the learnability of boolean circuits.
1 Introduction
The problem of learning an unknown boolean formula under some determined protocol has been widely studied. It is well known that, even restricted to propositional formulas, the problem is hard [4, 18] in the usual learning models. Therefore researchers have attempted to learn subclasses of propositional boolean formulas obtained by enforcing some restrictions on the structure of the formula, especially subclasses of boolean formulas in disjunctive normal form (DNF). For example, k-DNF formulas, k-term DNF formulas, monotone DNF formulas, Horn formulas, and their dual counterparts [1, 5, 2] have all been shown exactly learnable using membership and equivalence queries in Angluin's model [1], while the question of whether DNF formulas are learnable is still open. Another important class of problems can be obtained by restricting the number of occurrences of a variable. For example, whereas there is a polynomial-time algorithm to learn read-once formulas with equivalence and membership queries [3], the problem of learning read-thrice boolean formulas is hard under cryptographic assumptions [4].

O. Watanabe, T. Yokomori (Eds.): ALT'99, LNAI 1720, pp. 301-312, 1999. © Springer-Verlag Berlin Heidelberg 1999
In this paper we take a different approach. We study the complexity of learning subclasses of boolean formulas obtained by placing some restrictions on the elementary boolean functions that can be used to build the formulas. In general, boolean formulas are constructed by using elementary functions from a complete basis, generally {AND, OR, NOT}. In this paper we will allow formulas to use as a basis any arbitrary set of boolean functions. More precisely, let F = {f1, ..., fm} be a finite set of boolean functions. A formula in FOR(F) can be any of (a) a boolean variable, or (b) an expression of the form f(φ1, ..., φk) where f is a k-ary function in F and φ1, ..., φk are formulas in FOR(F). For example, consider the problem of learning a monotone boolean formula. Every such formula can be expressed as a formula in the class FOR({AND, OR}). The main result of this paper characterizes the complexity of learning FOR(F) for every finite set F of boolean functions. The most striking feature of this characterization is that for any F, FOR(F) is either polynomially learnable with equivalence or membership queries alone or, under some cryptographic assumptions, not polynomially predictable even with membership queries. This dichotomy is somewhat surprising since one might expect that any such large and diverse family of concept classes would include some representatives of the many intermediate learning models, such as exact learning with equivalence and membership queries, PAC learning with and without membership queries, and PAC-prediction without membership queries. Furthermore, we give an interesting classification of the polynomially learnable classes. We show that, in a sense that will be made precise later, FOR(F) is polynomially learnable if and only if at least one of the following conditions holds:

(a) Every function f(x1, x2, ..., xn) in F is definable by an expression of the form c0 ∨ (c1 ∧ x1) ∨ (c2 ∧ x2) ∨ ... ∨ (cn ∧ xn) for some boolean coefficients ci (0 ≤ i ≤ n).
(b) Every function f(x1, x2, ..., xn) in F is definable by an expression of the form c0 ∧ (c1 ∨ x1) ∧ (c2 ∨ x2) ∧ ... ∧ (cn ∨ xn) for some boolean coefficients ci (0 ≤ i ≤ n).
(c) Every function f(x1, x2, ..., xn) in F is definable by an expression of the form c0 ⊕ (c1 ∧ x1) ⊕ (c2 ∧ x2) ⊕ ... ⊕ (cn ∧ xn) for some boolean coefficients ci (0 ≤ i ≤ n).
There is another rather special feature of this result. Learnability of boolean formulas over a basis F depends only on the set of functions that can be expressed as a formula in FOR(F); it does not depend on the size of the representation. As a consequence of this fact, the same dichotomy holds for representation classes which, in terms of absolute expressive power, are equivalent to formulas, such as boolean circuits. As an intermediate tool for our study we introduce a link with a well-known algebraic structure, called clones in Universal Algebra. In particular, we make use of a very remarkable result on the structure of boolean functions proved by Post [21]. This approach has been successful in the study of other
computational problems such as satisfiability, tautology and counting problems of boolean formulas [23], learnability of quantified formulas [9] and Constraint Satisfaction Problems [13, 14, 15, 16, 12]. Finally we mention some similar results: a dichotomy result for satisfiability, tautology and some counting problems over closed sets of boolean functions [23], the circuit value problem [11, 10], the satisfiability of generalized formulas [24], the inverse generalized satisfiability problem [17], the generalized satisfiability counting problem [7], the approximability of minimization and maximization problems [6, 19, 20], the optimal assignments of Generalized Propositional Formulas [22] and the learnability of quantified boolean formulas [8].
2 Learning Preliminaries
Most of the terminology about learning comes from [4]. Strings over a fixed alphabet will represent both examples and concept names; X denotes the set of all such strings. A representation of concepts C is any subset of X × X. We interpret an element ⟨u, x⟩ of X × X as consisting of a concept name u and an example x. The example x is a member of the concept u if and only if ⟨u, x⟩ ∈ C. Define the concept represented by u as KC(u) = {x : ⟨u, x⟩ ∈ C}. The set of concepts represented by C is KC = {KC(u) : u ∈ X}. Along these pages we use two models of learning, both of them fairly standard: Angluin's model of exact learning with queries, defined by Angluin [1], and the model of PAC-prediction with membership queries, as defined by Angluin and Kharitonov [4]. To compare the difficulty of learning problems in the prediction model we use a slight generalization of the prediction-preserving reducibility with membership queries of [4].

Definition 1. Let C and C′ be representations of concepts. Let ⊤ and ⊥ be elements not in X. Then C is pwm-reducible to C′, denoted C ≤pwm C′, if and only if there exist four mappings g, f, h, and j with the following properties:

1. There is a nondecreasing polynomial q such that for all natural numbers s and n and for u ∈ X with |u| ≤ s, g(s, n, u) is a string u′ of length at most q(s, n, |u|).
2. For all natural numbers s and n, for every string u ∈ X with |u| ≤ s, and for every x ∈ X with |x| ≤ n, f(s, n, x) is a string x′ and x ∈ KC(u) if and only if x′ ∈ KC′(g(s, n, u)). Moreover, f is computable in time bounded by a polynomial in s, n, and |x|, hence there exists a nondecreasing polynomial t such that |x′| ≤ t(s, n, |x|).
3. For all natural numbers s and n, for every string u ∈ X with |u| ≤ s, for every x′ ∈ X, and for every b ∈ {⊤, ⊥}, h(s, n, x′) is a string x ∈ X, and j(s, n, x′, b) is either ⊤ or ⊥. Furthermore x′ ∈ KC′(g(s, n, u)) if and only if j(s, n, x′, b) = ⊤, where b = ⊤ if x ∈ KC(u) and b = ⊥ otherwise. Moreover, h and j are computable in time bounded by a polynomial in s, n, and |x′|.

In (2), and independently in (3), the expression "x ∈ KC(u)" can be replaced with "x ∉ KC(u)", as discussed in [4].
The following results are obtained by adapting slightly some proofs in [4].

Lemma 1. The pwm-reduction is transitive, i.e., let C, C′ and C′′ be representations of concepts; if C ≤pwm C′ ≤pwm C′′ then C ≤pwm C′′.

Lemma 2. Let C and C′ be representations of concepts. If C ≤pwm C′ and C′ is polynomially predictable with membership queries, then C is also polynomially predictable with membership queries.
3 Clones
Let D be a finite set called the domain. An n-adic function over D is a map f : Dⁿ → D. Let F_D be the set of all the functions over the domain D. Let P be a class of functions over D, and let Pn be the n-adic functions in P. We shall say that P is a clone if it satisfies the following conditions:

C1 For each n ≥ m ≥ 1, P contains the projection function proj^n_m, defined by

    proj^n_m(x1, ..., xn) = xm

C2 For each n, m ≥ 1, each f ∈ Pn and each g1, ..., gn ∈ Pm, P contains the composite function h = f[g1, ..., gn] defined by

    h(x1, ..., xm) = f(g1(x1, ..., xm), ..., gn(x1, ..., xm))
If F ⊆ F_D is any set of functions over D, there is a smallest clone containing all of the functions in F; this is the clone generated by F, and we denote it ⟨F⟩. If F = {f1, ..., fk} is a finite set, we may write ⟨f1, ..., fk⟩ for ⟨F⟩, and refer to f1, ..., fk as generators of ⟨F⟩. The set of clones over a finite domain D is closed under intersection and therefore constitutes a lattice, with meet (∧) and join (∨) operations defined by:

    ∧_{i∈I} C_i = ∩_{i∈I} C_i        ∨_{i∈I} C_i = ⟨∪_{i∈I} C_i⟩
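To make conditions C1 and C2 concrete, the following small sketch (our own illustration, not part of the paper) computes the 2-adic part of the clone generated by a set of boolean operations over D = {0, 1}, representing each 2-adic function by its truth table.

```python
from itertools import product

def table2(f):
    """Truth table of a 2-adic function over D = {0,1}:
    the tuple (f(0,0), f(0,1), f(1,0), f(1,1))."""
    return tuple(f(x, y) for x, y in product((0, 1), repeat=2))

def binary_part_of_clone(generators):
    """Close the 2-adic projections (condition C1) under composition
    with the generators (condition C2); returns the 2-adic functions
    of the generated clone, each as a truth table."""
    clone = {table2(lambda x, y: x), table2(lambda x, y: y)}
    changed = True
    while changed:
        changed = False
        for f, arity in generators:
            for gs in product(sorted(clone), repeat=arity):
                # h(x, y) = f(g1(x, y), ..., gk(x, y))
                h = tuple(f(*(g[i] for g in gs)) for i in range(4))
                if h not in clone:
                    clone.add(h)
                    changed = True
    return clone

AND = (lambda x, y: x & y, 2)
# <{AND}> contains, among the 2-adic functions, only x, y and x AND y
assert binary_part_of_clone([AND]) == {(0, 0, 1, 1), (0, 1, 0, 1), (0, 0, 0, 1)}
```

Running the same closure with a complete generator such as NAND instead grows the set to all sixteen 2-adic functions.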
There is a smallest clone, which is the intersection of all clones; we shall denote it I_D. It is easy to see that I_D contains exactly the projections over D.

3.1 Boolean Case
Operations on a 2-element set, say D = {0, 1}, are boolean operations. The lattice of the clones over the boolean domain was studied by Post, leading to a full description [21]. The proof is too long to be included here; we will give only the description of the lattice. A usual way to describe a poset (P, ≤) (and a lattice in particular) is by depicting a diagram with the coverage relation, where for every a, b ∈ P we say
Boolean Formulas Are Hard to Learn for Most Gate Bases

[Diagram of Post's lattice omitted; the clones are labeled by their standard names.]

Fig. 1. Post Lattice
that a covers b if b < a and, whenever c is an element in P such that b ≤ c ≤ a, either c = a or c = b. The diagram of the lattice of clones on D = {0, 1}, often called Post's lattice, is depicted in Figure 1. The clones are labeled according to their standard names. A clone C is join irreducible iff C = C1 ∨ C2 always implies C = C1 or C = C2. In Figure 1, the join irreducible clones of the diagram are marked. Since P = ∨_{f∈P} ⟨f⟩, it follows that the join irreducible clones are generated by a single operation; furthermore, it suffices to present a generating operation for each join irreducible clone of Post's lattice. Figure 2 associates to every such clone C its generating operation.

For an n-ary boolean function f define its dual, dual(f), by

    dual(f)(x1, ..., xn) = ¬f(¬x1, ..., ¬xn)

Obviously, dual(dual(f)) = f. Furthermore, f is self-dual iff dual(f) = f. For a class F of boolean functions define dual(F) = {dual(f) : f ∈ F}. The classes F and dual(F) are called dual. Notice that dual(⟨F⟩) = ⟨dual(F)⟩.
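These identities are easy to check mechanically; the sketch below (our own illustration) represents boolean functions as Python callables and verifies dual(dual(f)) = f, the duality of AND and OR, and the self-duality of the majority operation D2 that appears later in Section 4.1.

```python
from itertools import product

def dual(f, n):
    """dual(f)(x1, ..., xn) = NOT f(NOT x1, ..., NOT xn)."""
    return lambda *xs: 1 - f(*(1 - x for x in xs))

def equal(f, g, n):
    """Compare two n-ary boolean functions pointwise."""
    return all(f(*xs) == g(*xs) for xs in product((0, 1), repeat=n))

AND = lambda x, y: x & y
OR = lambda x, y: x | y
maj = lambda x, y, z: (x & y) | (x & z) | (y & z)   # the majority operation D2

assert equal(dual(dual(AND, 2), 2), AND, 2)   # dual(dual(f)) = f
assert equal(dual(AND, 2), OR, 2)             # AND and OR are dual
assert equal(dual(maj, 3), maj, 3)            # D2 is self-dual
```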
4 The Dichotomy Theorem
A base F is a finite set of boolean functions {f1, f2, ..., fn} (fi, 1 ≤ i ≤ n, denotes both the function and its symbol). We follow the standard definitions of boolean circuits and boolean formulas. The class of boolean circuits over the basis F, denoted by CIR(F), is defined to be the set of all the boolean circuits
[Table of generating operations omitted.]

Fig. 2. Generating operations for meet irreducible clones
where every gate is a function in F. The class of boolean formulas over the basis F, denoted by FOR(F), is the class of circuits in CIR(F) with fan-out 1. Given a boolean circuit C over the input variables x1, x2, ..., xn, we denote by [C] the function computed by C when the variables are taken as arguments in lexicographical order. In the same way, given a finite set of boolean functions F, we define [CIR(F)] as the class of functions computed by circuits in CIR(F). Similarly, given a boolean formula Φ over the variables x1, x2, ..., xn, we denote by [Φ] the function computed by Φ when the variables are taken as arguments in lexicographical order, and we define [FOR(F)] as the class of functions computed by formulas in FOR(F).

Given a boolean circuit C ∈ CIR(F), it is possible to construct a formula Φ ∈ FOR(F) computing the same function as C (the size of Φ can be exponentially bigger than the size of C, but we are not concerned with the size of the representation). Thus [FOR(F)] = [CIR(F)]. In fact, it is direct to verify that the class of functions [CIR(F)] contains the projections and is closed under composition. Therefore, it constitutes a clone. More precisely, [CIR(F)] is exactly the clone generated by F; in our terminology,

    [CIR(F)] = [FOR(F)] = ⟨F⟩    for all bases F
The use of clone theory to study computational problems over boolean formulas was introduced in [23], which studied the complexity of some computational problems over boolean formulas, such as satisfiability, tautology and some counting problems.

We say that an n-ary function f is disjunctive if there exist boolean coefficients ci (0 ≤ i ≤ n) such that

    f(x1, ..., xn) = c0 ∨ (c1 ∧ x1) ∨ (c2 ∧ x2) ∨ ... ∨ (cn ∧ xn)
Similarly, we say that an n-ary function f is conjunctive if there exist boolean coefficients ci (0 ≤ i ≤ n) such that

    f(x1, ..., xn) = c0 ∧ (c1 ∨ x1) ∧ (c2 ∨ x2) ∧ ... ∧ (cn ∨ xn)
Accordingly, we say that an n-ary function f is linear if there exist boolean coefficients ci (0 ≤ i ≤ n) such that

    f(x1, ..., xn) = c0 ⊕ (c1 ∧ x1) ⊕ (c2 ∧ x2) ⊕ ... ⊕ (cn ∧ xn)
Let F be a set of boolean functions. We will say that F is disjunctive (resp. conjunctive, linear) iff every function in F is disjunctive (resp. conjunctive, linear). We will say that F is basic iff F is disjunctive, conjunctive or linear.

For any set of boolean formulas (circuits) B, we define C_B as the representation of concepts formed from formulas (circuits) in B. More precisely, C_B contains all the tuples of the form ⟨C, x⟩ where C represents a formula (circuit) in B and x is a model satisfying C.

In this section we state and prove the main result of the paper.

Theorem 1. (Dichotomy Theorem for the Learnability of Boolean Circuits and Boolean Formulas) Let F be a finite set of boolean functions. If F is basic, then C_CIR(F) is both polynomially exactly learnable with n+1 equivalence queries, and polynomially exactly learnable with n+1 membership queries. Otherwise, C_FOR(F) is not polynomially predictable with membership queries under the assumption that any of the following three problems is intractable: testing quadratic residues modulo a composite, inverting RSA encryption, or factoring Blum integers.

We refer the reader to Angluin and Kharitonov [4] for definitions of the cryptographic concepts. We mention that the non-learnability results for boolean circuits hold also under the weaker assumption of the existence of public-key encryption systems secure against CC-attack, as discussed in [4].
Proof of Theorem 1. Learnability of formulas over disjunctive, conjunctive and linear bases is rather straightforward. Notice that if a basis F is disjunctive (resp. conjunctive, linear), then every function computed by a circuit in CIR(F) is also disjunctive (resp. conjunctive, linear). Thus, learning a boolean circuit over a basic basis reduces to finding the boolean coefficients of the canonical expression. It is easy to verify that this task can be done in polynomial time with n + 1 equivalence queries or n + 1 membership queries, as stated in the theorem.

Now, we study which portion of Post's lattice is covered by these cases. Consider clone L1 with generating set {O4, O5, O6, L4}. Since all the operations in the generating set of L1 are linear, the clone L1 is linear. Similarly, clone P6, generated by operations O5, O6 and P1, is conjunctive, since so is every operation in its generating set. Finally, clone S6, generated by operations O5, O6 and S1, is disjunctive, since so is every operation in the
generating set. Thus, every clone contained in L1, P6, or S6 is linear, conjunctive or disjunctive, respectively. Actually, clone L1 (resp. P6, S6) has been chosen carefully among the linear (resp. conjunctive, disjunctive) clones: it corresponds to the maximal linear (resp. conjunctive, disjunctive) clone in Post's lattice. It has been obtained as the join of all the linear (resp. conjunctive, disjunctive) clones and, in consequence, has the property that every linear (resp. conjunctive, disjunctive) clone is contained in L1 (resp. P6, S6).

Let us study non-basic clones. By a simple inspection of Post's lattice we can infer that any clone not contained in L1, S6 or P6 contains one of the following operations: F2, F6, D2. In Section 4.1 it is proved that if ⟨F⟩ contains any of these functions, then the class FOR(F) is not PAC-predictable even with membership queries under the assumptions of the theorem.
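To illustrate the first half of the theorem in the linear case: the coefficients of f = c0 ⊕ (c1 ∧ x1) ⊕ ... ⊕ (cn ∧ xn) can be recovered with exactly n + 1 membership queries, by probing the all-zeros instance and then each unit instance. This is a sketch of ours, not the paper's pseudocode:

```python
def learn_linear(mq, n):
    """Recover the coefficients of a linear target
    f = c0 XOR (c1 AND x1) XOR ... XOR (cn AND xn)
    with exactly n + 1 membership queries."""
    c0 = mq([0] * n)            # query the all-zeros instance: f(0,...,0) = c0
    cs = []
    for i in range(n):
        e = [0] * n
        e[i] = 1                # flipping bit i alone gives c0 XOR ci
        cs.append(mq(e) ^ c0)
    return c0, cs

# toy target: f(x) = 1 XOR x1 XOR x3  (c0 = 1, coefficients [1, 0, 1, 0])
target = lambda x: 1 ^ x[0] ^ x[2]
assert learn_linear(target, 4) == (1, [1, 0, 1, 0])
```

The disjunctive and conjunctive cases admit analogous coefficient-probing learners.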
4.1 Three Fundamental Non-learnable Functions
Let B = {AND, OR, NOT} be the usual complete basis for boolean formulas. In [4] it is proved that CFOR(B) is not polynomially predictable under the assumptions of Theorem 1. In this section we generalize this result to all bases F able to simulate any of the following three basic non-learnable functions: F2, F6, and D2. These three functions can be regarded as the basic causes of non-learnability in formulas.

The technique used to prove non-learnability results is a two-stage pwm-reduction from FOR(B). First, we prove as an intermediate result that monotone boolean formulas are as hard to learn as general boolean formulas.

Lemma 3. The class CFOR(B) is pwm-reducible to CFOR({AND, OR}).

Proof. Let φ be a boolean formula with x1, ..., xn as input variables. We can assume that NOT operations are applied only to input variables; otherwise, by repeatedly using De Morgan's laws, we can move every NOT function towards the input variables. Consider the formula Ψ with x1, ..., xn, y1, ..., yn as input variables, obtained by modifying formula φ slightly as follows: replace every NOT function applied to variable xi by the new input variable yi. Finally, we define the formula Φ with x1, ..., xn, y1, ..., yn as input variables to be:

    Φ(x1, ..., xn, y1, ..., yn) = ( Ψ(x1, ..., xn, y1, ..., yn) ∨ ∨_{1≤i≤n} (xi ∧ yi) ) ∧ ∧_{1≤i≤n} (xi ∨ yi)

Formula Φ evaluates the following function:

    Φ(x1, ..., xn, y1, ..., yn) =
        φ(x1, ..., xn)   if ∀i, 1 ≤ i ≤ n: xi = ¬yi
        0                if ∃i, 1 ≤ i ≤ n: xi = yi = 0
        1                otherwise
For every natural number s and for every concept representation u of CFOR(B) such that |u| ≤ s, let φ be the boolean formula with n input variables represented by u, and let u′ be the representation of the monotone boolean formula Φ obtained from φ as described above. We define g(s, n, u) = u′. For all assignments x and y of length n we define f(s, n, x) = x¬x (the assignment x followed by its bitwise complement ¬x), h(s, n, xy) = x, and

    j(s, n, xy, b) =
        b   if x = ¬y
        ⊥   if ∃i, 1 ≤ i ≤ n: xi = yi = 0
        ⊤   otherwise
Clearly, f, g, h and j satisfy the conditions (1), (2) and (3) in Definition 1 and therefore define a pwm-reduction.

Technical note: In the proof of this prediction-with-membership reduction and in the next ones, the functions f, g, h, j have been defined only partially to keep the proof clear. It is trivial to extend them to complete functions preserving conditions (1), (2) and (3).

Now, we have to see that CFOR({AND, OR}) is pwm-reducible to CFOR(F) if ⟨F⟩ contains any of the following functions: F2, F6 or D2. Let us do a case analysis.

Theorem 2. Let F be a set of boolean functions. If F2 ∈ ⟨F⟩, then CFOR({AND, OR}) is pwm-reducible to CFOR(F).
Proof. Clone ⟨F⟩ includes the operation F2(x, y, z) = x ∨ (y ∧ z). Thus, there exists some formula φF2 in FOR(F) over three variables x, y, z such that [φF2] = F2. Clearly, with the operation F2 and the additional help of the constant 0 it is possible to simulate the functions AND and OR:

    AND(x, y) = φF2(0, x, y)
    OR(x, y) = φF2(x, y, y)
Let Ψ be an arbitrary monotone boolean formula in FOR({AND, OR}) over the input variables x1, x2, ..., xn. Let Φ2 be the boolean formula over the input variables x1, ..., xn, c0 obtained from Ψ by replacing every occurrence of AND(x, y) by φF2(c0, x, y) and, similarly, every occurrence of OR(x, y) by φF2(x, y, y). Finally, let Φ1 be the boolean formula defined by:

    Φ1(x1, x2, ..., xn, c0) = φF2(Φ2(x1, x2, ..., xn, c0), c0, c0)

By construction we have Φ2(x1, x2, ..., xn, 0) = Ψ(x1, x2, ..., xn) and

    Φ1(x1, x2, ..., xn, c0) =
        Ψ(x1, x2, ..., xn)   if c0 = 0
        1                    otherwise
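The gate-replacement argument can be checked mechanically. In the sketch below (our own illustration; formulas are nested tuples), to_F2 performs the substitution above, and the displayed identity for Φ1 is verified exhaustively on a small monotone formula, assuming F2(x, y, z) = x ∨ (y ∧ z).

```python
from itertools import product

F2 = lambda x, y, z: x | (y & z)     # F2(x, y, z) = x OR (y AND z)

def evaluate(phi, env):
    """phi is a variable name or a tuple (op, subformulas...)."""
    if isinstance(phi, str):
        return env[phi]
    op, *args = phi
    vals = [evaluate(a, env) for a in args]
    if op == 'AND':
        return vals[0] & vals[1]
    if op == 'OR':
        return vals[0] | vals[1]
    return F2(*vals)                  # op == 'F2'

def to_F2(phi):
    """Replace AND(x, y) by F2(c0, x, y) and OR(x, y) by F2(x, y, y)."""
    if isinstance(phi, str):
        return phi
    op, a, b = phi
    a, b = to_F2(a), to_F2(b)
    return ('F2', 'c0', a, b) if op == 'AND' else ('F2', a, b, b)

psi = ('OR', ('AND', 'x1', 'x2'), 'x3')     # a small monotone formula
phi2 = to_F2(psi)
phi1 = ('F2', phi2, 'c0', 'c0')             # Phi1 = F2(Phi2, c0, c0)

for xs in product((0, 1), repeat=3):
    env = dict(zip(('x1', 'x2', 'x3'), xs))
    assert evaluate(phi1, dict(env, c0=0)) == evaluate(psi, env)  # c0 = 0
    assert evaluate(phi1, dict(env, c0=1)) == 1                   # c0 = 1
```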
Now we are in position to define the pwm-reduction. Let g be the function assigning to every monotone boolean formula Ψ the associated formula Φ1 in FOR(F) constructed as described above. Let f be the function appending the value of the constant zero to the end of the string, i.e.,

    f(s, n, x1 x2 ... xn) = x1 x2 ... xn 0

Mapping h produces the inverse result; that is, given a string, it removes the last value (corresponding to the constant 0):

    h(s, n, x1 ... xn c0) = x1 ... xn

Finally, function j is defined by

    j(s, n, x1 ... xn c0, b) =
        b   if c0 = 0
        ⊤   otherwise

Thus, it is immediate to verify that f, g, h, and j define a pwm-reduction from CFOR({AND, OR}) to CFOR(F).

By duality we have:

Theorem 3. Let F be a set of boolean functions. If F6 ∈ ⟨F⟩, then CFOR({AND, OR}) is pwm-reducible to CFOR(F).

Finally, we study clones containing D2.

Theorem 4. Let F be a set of boolean functions. If D2 ∈ ⟨F⟩, then CFOR({AND, OR}) is pwm-reducible to CFOR(F).
Proof. For this proof we will use the fact that operation D2 satisfies the self-duality property; thus, every function in ⟨D2⟩ is self-dual. Clone ⟨F⟩ includes the majority operation D2(x, y, z) = (x ∧ y) ∨ (x ∧ z) ∨ (y ∧ z). Thus, there exists some formula φD2 in FOR(F) over three variables x, y, z such that the function computed by φD2 is D2. Clearly, with the operation D2 and the additional help of constants it is possible to simulate the functions AND and OR:

    AND(x, y) = φD2(0, x, y)
    OR(x, y) = φD2(1, x, y)
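A quick exhaustive check of these two simulations (our own illustration):

```python
from itertools import product

maj = lambda x, y, z: (x & y) | (x & z) | (y & z)   # the majority operation D2

for x, y in product((0, 1), repeat=2):
    assert maj(0, x, y) == x & y    # fixing the first input to 0 gives AND
    assert maj(1, x, y) == x | y    # fixing the first input to 1 gives OR
```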
For every monotone boolean formula Ψ in FOR({AND, OR}) over the input variables x1, x2, ..., xn we construct an associated formula Φ1 over the variables x1, x2, ..., xn, c0, c1 defined by

    Φ1(x1, x2, ..., xn, c0, c1) = φD2(Φ2(x1, x2, ..., xn, c0, c1), c0, c1)

where Φ2 is the formula over the input variables x1, x2, ..., xn, c0, c1 obtained from Ψ in a similar way to the previous proof: we replace every occurrence of AND(x, y) by φD2(c0, x, y) and every occurrence of OR(x, y) by φD2(c1, x, y).
By construction we have (the case c0 = 1, c1 = 0 is a consequence of the self-duality of D2):

    Φ2(x1, x2, ..., xn, c0, c1) =
        Ψ(x1, x2, ..., xn)         if c0 = 0, c1 = 1
        ¬Ψ(¬x1, ¬x2, ..., ¬xn)     if c0 = 1, c1 = 0
        undetermined               otherwise

    Φ1(x1, x2, ..., xn, c0, c1) =
        0                          if c0 = 0, c1 = 0
        1                          if c0 = 1, c1 = 1
        Ψ(x1, x2, ..., xn)         if c0 = 0, c1 = 1
        ¬Ψ(¬x1, ¬x2, ..., ¬xn)     if c0 = 1, c1 = 0
Now we are in position to define the pwm-reduction. Let g be the function assigning to every monotone boolean formula Ψ in FOR({AND, OR}) the associated formula Φ1 in FOR(F) constructed as described above. Let f be the function appending the values of the constants to the end of the string, i.e.,

    f(s, n, x1 x2 ... xn) = x1 x2 ... xn 0 1

Mapping h removes the last two values (corresponding to the constants) and, moreover, negates the values of the assignment if the values of the constants are flipped:

    h(s, n, x1 ... xn c0 c1) =
        x1 ... xn       if c0 = 0, c1 = 1
        ¬x1 ... ¬xn     otherwise

Finally, function j is defined by

    j(s, n, x1 ... xn c0 c1, b) =
        ⊥    if c0 = 0, c1 = 0
        b    if c0 = 0, c1 = 1
        ¬b   if c0 = 1, c1 = 0
        ⊤    if c0 = 1, c1 = 1

Thus, it is immediate to verify that f, g, h, and j define a pwm-reduction from CFOR({AND, OR}) to CFOR(F).
References

[1] D. Angluin. Queries and Concept Learning. Machine Learning, 2:319-342, 1988.
[2] D. Angluin, M. Frazier, and L. Pitt. Learning Conjunctions of Horn Clauses. Machine Learning, 9:147-164, 1992.
[3] D. Angluin, L. Hellerstein, and M. Karpinski. Learning Read-Once Formulas with Queries. Journal of the ACM, 40:185-210, 1993.
[4] D. Angluin and M. Kharitonov. When Won't Membership Queries Help? Journal of Computer and System Sciences, 50:336-355, 1995.
[5] U. Berggren. Linear Time Deterministic Learning of k-term DNF. In 6th Annual ACM Conference on Computational Learning Theory, COLT'93, pages 37-40, 1993.
[6] N. Creignou. A Dichotomy Theorem for Maximum Generalized Satisfiability Problems. Journal of Computer and System Sciences, 51(3):511-522, 1995.
[7] N. Creignou and M. Hermann. Complexity of Generalized Satisfiability Counting Problems. Information and Computation, 125:1-12, 1996.
[8] V. Dalmau. A Dichotomy Theorem for Learning Quantified Boolean Formulas. Machine Learning, 35(3):207-224, 1999.
[9] V. Dalmau and P. Jeavons. Learnability of Quantified Formulas. In 4th European Conference on Computational Learning Theory, EuroCOLT'99, volume 1572 of Lecture Notes in Artificial Intelligence, pages 63-78, Berlin/New York, 1999. Springer-Verlag.
[10] L.M. Goldschlager. A Characterization of Sets of n-Input Gates in Terms of their Computational Power. Technical Report 216, Basser Department of Computer Science, The University of Sydney, 1983.
[11] L.M. Goldschlager and I. Parberry. On the Construction of Parallel Computers from Various Bases of Boolean Circuits. Theoretical Computer Science, 43:43-58, 1986.
[12] P. Jeavons, D. Cohen, and M.C. Cooper. Constraints, Consistency and Closure. Artificial Intelligence, 101:251-265, 1998.
[13] P. Jeavons, D. Cohen, and M. Gyssens. A Unifying Framework for Tractable Constraints. In 1st International Conference on Principles and Practice of Constraint Programming, CP'95, Cassis (France), September 1995, volume 976 of Lecture Notes in Computer Science, pages 276-291. Springer-Verlag, 1995.
[14] P. Jeavons, D. Cohen, and M. Gyssens. A Test for Tractability. In 2nd International Conference on Principles and Practice of Constraint Programming, CP'96, volume 1118 of Lecture Notes in Computer Science, pages 267-281, Berlin/New York, August 1996. Springer-Verlag.
[15] P. Jeavons, D. Cohen, and M. Gyssens. Closure Properties of Constraints. Journal of the ACM, 44(4):527-548, July 1997.
[16] P. Jeavons and M. Cooper. Tractable Constraints on Ordered Domains. Artificial Intelligence, 79:327-339, 1996.
[17] D. Kavvadias and M. Sideri. The Inverse Satisfiability Problem. In 2nd International Conference on Computing and Combinatorics, COCOON'96, volume 1090 of Lecture Notes in Computer Science, pages 250-259. Springer-Verlag, 1996.
[18] M. Kearns and L. Valiant. Cryptographic Limitations on Learning Boolean Formulae and Finite Automata. Journal of the ACM, 41(1):67-95, January 1994.
[19] S. Khanna, M. Sudan, and L. Trevisan. Constraint Satisfaction: The Approximability of Minimization Problems. In 12th IEEE Conference on Computational Complexity, 1997.
[20] S. Khanna, M. Sudan, and P. Williamson. A Complete Classification of the Approximability of Maximization Problems Derived from Boolean Constraint Satisfaction. In 29th Annual ACM Symposium on Theory of Computing, 1997.
[21] E.L. Post. The Two-Valued Iterative Systems of Mathematical Logic, volume 5 of Annals of Mathematics Studies. Princeton, N.J., 1941.
[22] S. Reith and H. Vollmer. The Complexity of Computing Optimal Assignments of Generalized Propositional Formulae. Technical Report TR196, Department of Computer Science, Universität Würzburg, 1999.
[23] S. Reith and K.W. Wagner. The Complexity of Problems Defined by Subclasses of Boolean Functions. Technical Report TR218, Department of Computer Science, Universität Würzburg, 1999.
[24] T.J. Schaefer. The Complexity of Satisfiability Problems. In 10th Annual ACM Symposium on Theory of Computing, pages 216-226, 1978.
Finding Relevant Variables in PAC Model with Membership Queries

David Guijarro¹, Jun Tarui², and Tatsuie Tsukiji³

¹ Department LSI, Universitat Politècnica de Catalunya, Jordi Girona Salgado, 1-3, Barcelona 08034, Spain. [email protected]
² Department of Communications and Systems, University of Electro-communications, Chofugaoka, Chofu-shi, Tokyo 182, Japan. [email protected]
³ School of Informatics and Sciences, Nagoya University, Nagoya 464-8601, Japan. [email protected]
Abstract. A new research frontier in AI and data mining seeks to develop methods to automatically discover relevant variables among many irrelevant ones. In this paper, we present four algorithms that output such crucial variables in the PAC model with membership queries. The first algorithm executes the task under any unknown distribution by measuring the distance between virtual and real targets. The second algorithm exhausts the virtual version space under an arbitrary distribution. The third algorithm exhausts a universal set under the uniform distribution. The fourth algorithm measures the influence of variables under the uniform distribution. Knowing the number r of relevant variables, the first algorithm runs in time almost linear in r. The second and third ones use fewer membership queries than the first, but run in time exponential in r. The fourth one enumerates highly influential variables in time quadratic in r.
1 Introduction: Terminology and Strategy
We propose several algorithms, each with its own character, for automatically finding relevant variables in the presence of many irrelevant ones. Recent applications of such algorithms range from data mining in genome analysis to information filtering in network computing. In these applications, sample data consist of a huge volume of variables, although the target phenomenon may depend on only a few of them. In order for a machine to find such crucial variables, we study algorithms in the PAC model with membership queries, and analyze their query and time complexities.

To learn an unknown Boolean function that depends on only a small number of the potential variables, the learner may (1) find a set of relevant variables and

* Author supported in part by the EC through the Esprit Program EU BRA program under project 20244 (ALCOM-IT) and the EC Working Group EP27150 (NeuroColt II) and by the Spanish DGES PB95-0787 (Koala).

O. Watanabe, T. Yokomori (Eds.): ALT'99, LNAI 1720, pp. 313-322, 1999.
© Springer-Verlag Berlin Heidelberg 1999
(2) combine them into a hypothesis that approximates the target concept with an arbitrarily high accuracy. Most inductive learning algorithms do not separate these conceptually different tasks, but rather let them depend on each other. On the other hand, some AI learning papers separate them and execute (1) as a preprocess for (2) (see [5] for a survey). This paper pursues (1) as the learning goal.

Let us begin by fixing the notion of relevant variables. We say that two Boolean instances are neighbors of each other if they have the same bit length and differ by exactly one bit.

Definition 1. A variable x is relevant to a function f if there exist neighbor instances A and A′ such that x(A) ≠ x(A′) and f(A) ≠ f(A′).

Let Rel(r, n) be the class of Boolean functions with n variables that depend on only r of them, or equivalently, have at most r relevant variables. In this paper, the target concept is an arbitrary function in Rel(r, n). In other words, the learner is given the n Boolean variables, say x1, ..., xn, and told that there are at most r relevant variables to the target concept.¹

In this paper we work on Probably Approximately Correct (PAC) learning, introduced by Valiant [11], where the target concept f and the target distribution D are postulated and hidden from the learner. The learner receives a sequence of examples (A, f(A)), (B, f(B)), ..., drawn independently and randomly from D. From these examples, the learner must build a hypothesis that can approximate the target concept to an arbitrary accuracy with respect to D.

This paper sets a weaker goal for learning. For a Boolean function h, let rel(h) denote the set of variables relevant to h.

Definition 2. A set V of variables is called an ε-dominator, 0 ≤ ε ≤ 1, if there exists a function h such that rel(h) ⊆ V and Prob_A[h(A) = f(A)] ≥ ε, where A is randomly chosen according to D.
A weaker goal of learning is to find a (1−ε)-dominator. Our algorithms may use inner coin flips and achieve this weaker goal with probability 1−δ, for arbitrary given constants 0 < ε, δ < 1. Unfortunately, random examples are not enough as an information resource for identifying relevant variables of a given Boolean concept, because:

There can be computationally hard problems even though we can information-theoretically identify the function. For instance, consider the class of conjunctions of a (log n)-bit majority and a (log n)-bit parity. This class (contained in Rel(2 log n, n)) is believed to be hard to predict in polynomial time even under the uniform distribution [3]. The problem there is that we may have enough information to identify the function (using a universal set), but it is computationally hard to discover the set of relevant variables.
¹ We can tune our algorithms up into adaptive versions for r by guessing and doubling r, that is, executing the algorithms with the guesses r = 2⁰, 2¹, 2², ... until they succeed.
Moreover, due to a reduction from set cover, it is NP-hard in general to discover a set of relevant variables on which one can build an accurate hypothesis [8].

We will thus allow algorithms to use a more active type of query in addition to random examples: membership queries, introduced by Angluin [1].

Definition 3. A membership query for an instance A about the target concept f is a request for the value MQf(A) = f(A) from the membership oracle MQf.

In empirical testing, reverse engineering or information search, however, membership queries are much more expensive than random examples. We will thus present economical algorithms in Section 3 that spend at most r log n membership queries.

The algorithms in this paper follow a common learning strategy: find witnesses for relevance and execute binary search on them. The algorithms differ in their methods to discover witnesses for relevance.

Definition 4. For an instance A and a set V of variables, A_V and Ā_V are the instances such that

    x(A_V) = x(A) if x ∈ V, and 0 otherwise;    x(Ā_V) = 1 − x(A) if x ∈ V, and x(A) otherwise.

Definition 5. A witness for relevance outside of V about the target concept f is a pair (A, B) of instances A and B such that A_V = B_V and f(A) ≠ f(B).

Given a witness (A, B) for relevance, binary search finds a relevant variable as follows. It flips half of the differing bits between A and B, and asks the membership query for the obtained instance C. If f(A) ≠ f(C), then it repeats the argument on the new witness (A, C), otherwise on (B, C). Either case halves the number of differing bits in the witness. This divide-and-query argument repeats until reaching a variable x such that x(A) ≠ x(B).

Lemma 1. Given an arbitrary witness for relevance outside of V, the binary search outputs a relevant variable x ∉ V by using at most log n membership queries in O(n log n) time.

The proof is folklore (see e.g. [7, Lemma 2.4]).
Note that log n is a lower bound on the query complexity of finding one relevant variable from a given witness for relevance; hence r log n is a lower bound for finding r relevant variables.
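The divide-and-query procedure of Lemma 1 can be sketched as follows (our own rendering; mq stands for the membership oracle):

```python
def binary_search_relevant(mq, A, B):
    """Given a witness (A, B) with mq(A) != mq(B), return the index of
    one relevant variable using at most ceil(log2 n) further queries."""
    fA = mq(A)
    diff = [i for i in range(len(A)) if A[i] != B[i]]
    while len(diff) > 1:
        half = diff[:len(diff) // 2]
        C = list(A)
        for i in half:            # flip half of the differing bits
            C[i] = B[i]
        if mq(C) != fA:           # (A, C) is the new, smaller witness
            B, diff = C, half
        else:                     # (C, B) is the new, smaller witness
            A, diff = C, diff[len(diff) // 2:]
    return diff[0]

# toy target depending only on variable 5 (0-indexed)
mq = lambda x: x[5]
assert binary_search_relevant(mq, [0] * 8, [1] * 8) == 5
```

Each iteration keeps the invariant that the two endpoints have different oracle values and differ exactly on `diff`, so the surviving index must be relevant.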
2 Measuring the Distance between Virtual and Real Targets
This section provides a distribution-free algorithm that runs in time almost linear in r. It measures the distance between the virtual and real targets; if the distance is large, then the algorithm discovers a new relevant variable.
Definition 6. The virtual target fV on a set V of variables about the target concept f is the Boolean function fV with rel(fV) ⊆ V such that fV(A) = f(A_V) holds for any instance A.

In the mistake bound model with membership queries, Blum, Hellerstein and Littlestone [4] measure Prob_A[f(A) ≠ hV(A)] for a temporal hypothesis h to find a new relevant variable that is not yet implemented in h. Here we measure Prob_A[f(A) ≠ fV(A)], the distance between f and fV. If it is large, then our algorithm finds a witness for relevance (A, A_V) with high probability.

Lemma 2. Let m be any integer with (1 − ε)^m ≤ δ. Draw a sample O of m examples. If V is not a (1−ε)-dominator, then f(A) ≠ f(A_V) happens for some (A, f(A)) ∈ O with probability greater than 1 − δ.

Proof. Suppose that V is not a (1−ε)-dominator. Then the distance between f and fV is greater than ε; hence f(A) = f(A_V) happens for every (A, f(A)) ∈ O with probability less than (1 − ε)^m ≤ δ.

Now we implement this lemma in an algorithm MsrDist and analyze its performance.

Procedure MsrDist
Input: Integers n, r and m0.
Output: A set of relevant variables.
  Set V := ∅ and m := 0.
  Do until either |V| = r or m = m0
    Draw a random example (A, f(A)) and increment m by 1.
    Let b := MQf(A_V).
    If f(A) ≠ b then Do
      Set m := 0.
      Execute the binary search on (A, A_V) and put the obtained variable into V.
    EndDo
  EndDo
  Output V.
EndProc

Theorem 1. MsrDist finds a (1−ε)-dominator (a set of relevant variables) with probability greater than 1 − δ. For some m0 = O(ε⁻¹ log(r/δ)), MsrDist uses at most m0 · r random examples, at most r(m0 + log n) membership queries, and runs in O(nr(m0 + log n)) time.

Proof. Fix an arbitrary m0 = O(ε⁻¹ log(r/δ)) such that (1 − ε)^{m0} ≤ δ/r. Then, in view of Lemma 2, in each stage of the Do loop in the procedure MsrDist where
V is not a (1−ε)-dominator, each example (A, f(A)) yields a witness (A, A_V) for relevance outside of V with probability greater than ε. Since MsrDist can draw m0 examples in the stage, it will succeed in finding a witness with probability greater than 1 − (1 − ε)^{m0} ≥ 1 − δ/r. Since there are at most r stages, due to the union bound every stage will succeed with probability greater than 1 − r · (δ/r) = 1 − δ, so with this probability MsrDist outputs a (1−ε)-dominator.
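A runnable rendering of MsrDist (our own sketch; the target and the example oracle are toy stand-ins drawing from the uniform distribution):

```python
import random

def binary_search_relevant(mq, A, B):
    """Divide-and-query (Lemma 1): shrink the set of differing bits."""
    fA = mq(A)
    diff = [i for i in range(len(A)) if A[i] != B[i]]
    while len(diff) > 1:
        C = list(A)
        for i in diff[:len(diff) // 2]:
            C[i] = B[i]
        if mq(C) != fA:
            B, diff = C, diff[:len(diff) // 2]
        else:
            A, diff = C, diff[len(diff) // 2:]
    return diff[0]

def msr_dist(mq, draw, n, r, m0):
    """MsrDist: an example A disagreeing with its projection A_V
    (bits outside V zeroed) is a witness for relevance outside V."""
    V, m = set(), 0
    while len(V) < r and m < m0:
        A = draw()
        m += 1
        A_V = [A[i] if i in V else 0 for i in range(n)]
        if mq(A) != mq(A_V):
            m = 0
            V.add(binary_search_relevant(mq, A, A_V))
    return V

random.seed(0)
n = 20
mq = lambda x: x[3] & x[7]   # toy target with relevant variables 3 and 7
draw = lambda: [random.randint(0, 1) for _ in range(n)]
V = msr_dist(mq, draw, n, r=2, m0=100)
print(V)
```

Any variable added to V is guaranteed relevant; with m0 = 100 the stage failure probability (3/4)^100 is negligible, so both relevant variables are found.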
3 Exhausting Virtual Version Spaces

This section presents two algorithms that use fewer membership queries than the one in the previous section. One algorithm is distribution-free and exhausts virtual version spaces, while the other works under the uniform distribution and exhausts an r-universal set.

Intuitively, a version space is the set of hypotheses under consideration in inductive learning. Haussler [9] proposed to exhaust the version space by throwing away the hypotheses that are inconsistent with the drawn examples, until only accurate hypotheses may remain. In this section, for a set V of variables:

Definition 7. The virtual version space on V is the set of Boolean functions h with rel(h) ⊆ V.

Let V be the set of already found relevant variables. Then the virtual version space may expand at each moment that a new relevant variable is discovered and added to V. Unless V is a (1−ε)-dominator, exhausting the virtual version space on V is shown to provide a witness for relevance outside of V with high probability.
Lemma 3. Let m be any integer such that (1 − ε)^m · 2^{2^r} ≤ δ. Draw a sample O of m random examples. If V is not a (1−ε)-dominator, then there exists a witness (A, B) for relevance outside of V with (A, f(A)), (B, f(B)) ∈ O, with probability at least 1 − δ.

Proof. Suppose that V is not a (1−ε)-dominator. Then for every function h in the virtual version space on V, h(A) ≠ f(A) holds with probability greater than ε, so O is consistent with h with probability less than (1 − ε)^m. The union bound over the at most 2^{2^r} functions in the virtual version space thus implies that O does not provide any witness for relevance outside of V with probability less than (1 − ε)^m · 2^{2^r} ≤ δ.

We now implement this lemma in an algorithm ExhVVS that finds a (1−ε)-dominator. For each new example (A, f(A)), ExhVVS checks over the old examples (B, f(B)) in a stock whether (A, B) is a witness for relevance outside of V. If ExhVVS finds a new witness, it discards all the old examples and makes the stock empty.

Procedure ExhVVS
Input: Integers n and r.
Output: A set of relevant variables.
  Set O := ∅, V := ∅, s := 1 and m := 0.
  Initialize m0 := the minimum integer such that (1 − ε)^{m0} · 2^{2^s} ≤ δ/r.
  Do until either |V| = r or m = m0
    Draw a random example (A, f(A)).
    For each B ∈ O Do
      If A_V = B_V and f(A) ≠ f(B) then Do
        Update O := ∅, m := 0 and s := s + 1.
        Update m0 := the minimum integer such that (1 − ε)^{m0} · 2^{2^s} ≤ δ/r.
        Execute the binary search on (A, B) and put the obtained variable into V.
      EndDo
    EndDo
    Put (A, f(A)) into O and increment m by 1.
  EndDo
  Output V.
EndProc
Theorem 2. The procedure ExhVVS outputs a (1 − ε)-dominator (a set of relevant variables) with probability greater than 1 − δ. ExhVVS uses at most m0 = O(r2^r log(1/ε) log(r/δ)) random examples, r log n membership queries and runs in O(n(m0 + r log n)) time.
Proof. If |V| = s then ExhVVS draws m0 random examples so that (1 − ε)^(m0) 2^(2^s) ≤ δ/r. Therefore, due to Lemma 3, if V is not a (1 − ε)-dominator then it derives a witness for relevance outside of V with probability at least 1 − δ/r. The remaining argument is the same as in the proof of Theorem 1.

Under the uniform distribution, a modification of ExhVVS can find all the relevant variables by exhausting an r-universal set.

Definition 8. A set U of instances is called r-universal if every pattern of r bits on every set of r variables occurs in some instance A in U.

Damaschke [7] worked in the exact learning model with only membership queries and studied the numbers of (adaptive and non-adaptive) membership queries for exhausting an r-universal set. Here we show that, under the uniform distribution, a large enough number of random examples provides an r-universal set and that any such set is a sufficient source of witnesses for all relevant variables.

Lemma 4. Let m be any integer such that (1 − 2^(−r))^m 2^r (n choose r) ≤ δ. Draw m instances independently and randomly under the uniform distribution over the n-bit instance space. Then they form an r-universal set with probability at least 1 − δ.
Finding Relevant Variables in PAC Model with Membership Queries
Proof. We say that the sample hits an r-bit pattern on a set of r variables if the assignment A of some example (A, f(A)) in the sample sets those variables to those bits. Then, due to probabilistic independence of examples, the sample does not hit a given r-bit pattern with probability (1 − 2^(−r))^m. Since there are 2^r (n choose r) possibilities for such patterns, the union bound implies that the sample does not hit some pattern with probability at most (1 − 2^(−r))^m 2^r (n choose r) ≤ δ. Or, equivalently, the sample hits every r-bit pattern with probability at least 1 − δ.

Let ExhUniv be a modification of ExhVVS that does not discard old examples at all; ExhUniv omits updating O and m0 in the lines 9 and 10 of the procedure ExhVVS.

Theorem 3. Under the uniform distribution, ExhUniv finds all the relevant variables with probability at least 1 − δ by at most m0 = O(r2^r log n log(1/δ)) random examples and r log n membership queries in O(n(m0 + r log n)) time.

Proof. Let R be the set of relevant variables and let s = |R| ≤ r. Let m0 satisfy (1 − 2^(−s))^(m0) 2^s (n choose s) ≤ δ. Then, due to Lemma 4, the sample of size m0 presents an s-universal set with probability 1 − δ. Such a sample induces every s-bit pattern on R, so in particular ExhUniv discovers a witness (A, B) for relevance of every variable x ∈ R such that A and B are neighbors at x, and hence finds x itself by binary searching on (A, B).
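Definition 8 can be checked by brute force; a small illustration of our own (exponential in r, so only usable on tiny instances):

```python
from itertools import combinations, product

def is_r_universal(sample, r):
    """Definition 8 by brute force: every pattern of r bits on every
    set of r variables must occur in some instance of the sample."""
    n = len(sample[0])
    for variables in combinations(range(n), r):
        seen = {tuple(a[i] for i in variables) for a in sample}
        if len(seen) < 2 ** r:
            return False
    return True
```

For instance, all 2^3 assignments on three variables form a 3-universal set, and dropping the all-ones assignment leaves a set that is still 2-universal but no longer 3-universal.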
4 Measuring Influence of Variables
An algorithm in this section measures the influence of variables on the target concept under the uniform distribution and outputs highly influential ones. Such an approach to learning has been taken in many AI and data mining research papers and achieves good empirical success (see [5, Section 2.4]). Based on this approach, we will design an algorithm that finds a (1 − ε)-dominator under the uniform distribution in time almost quadratic in r, the number of relevant variables.

Definition 9. The influence Inf(x) of a variable x is the number of instances A, as a fraction of the set of all instances, such that f(A) ≠ f(A') for the neighbor instance A' of A at x.

Therefore, Inf(x) = 0 if and only if the variable x is irrelevant, and Inf(x) = 1 if and only if f = x ⊕ g for some function g irrelevant with respect to x. To show the existence of highly influential variables, Ben-Or and Linial [2] applied the edge-isoperimetric inequality (see [6, Section 16] for edge-isoperimetric inequalities).

Lemma 5 (Edge Isoperimetric Inequality). For a given Boolean function f(x1, ..., xn), choose b ∈ {0, 1} and k ≥ 1 such that

  2^(−k−1) ≤ Prob_A[ f(A) = b ] ≤ 2^(−k)

under the uniform distribution on A. We then have

  Σ_{i=1}^{n} Inf_f(x_i) ≥ k 2^(1−k).
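Since the influence of Definition 9 is a fraction of the instance space, it can be estimated under the uniform distribution by sampling; a sketch with a helper of our own (not part of the paper's procedures):

```python
import random

def estimate_influence(f, n, x, trials=10000, seed=0):
    """Monte-Carlo estimate of Inf(x) from Definition 9: the fraction
    of instances A with f(A) != f(A'), where A' is the neighbor of A
    at x (A with bit x flipped)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        A = [rng.randint(0, 1) for _ in range(n)]
        B = list(A)
        B[x] ^= 1                       # the neighbor of A at x
        if f(A) != f(B):
            hits += 1
    return hits / trials
```

For a parity of two variables every flip of a relevant bit changes the value (influence 1), while flipping an irrelevant bit never does (influence 0); for a conjunction of two variables each has influence 1/2.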
In order to implement this inequality in a learning algorithm, we need to relativize it to a given set V of variables.

Lemma 6. If V is not a (1 − ε)-dominator then Σ_{x∉V} Inf(x) > 2ε.

Proof. Let h be a Bayes optimal predictor of f on V. That is, (1) rel(h) ⊆ V and (2) h(A) = 1 if and only if Prob_B[ f(B) = 1 | B_V = A_V ] ≥ 1/2. We suppose Σ_{x∉V} Inf(x) ≤ 2ε and prove that h approximates f with accuracy 1 − ε. For any instance A we let

  p(A) = 1 − max{ Prob_B[ f(B) = 1 | B_V = A_V ], Prob_B[ f(B) = 0 | B_V = A_V ] }.

Then the expectation of p(A) over A is E_A[p(A)] = Prob_A[ h(A) ≠ f(A) ], so it is enough to claim that E_A[p(A)] ≤ ε. For any instance A let f_{A,V} be the function that fixes the input bits of f on V by A. Hence rel(f_{A,V}) ⊆ rel(f) − V. Lemma 5 then promises 2p(A) ≤ Σ_{x∉V} Inf_{f_{A,V}}(x), so taking the average over A on both sides derives

  2 E_A[p(A)] ≤ Σ_{x∉V} Inf_f(x) ≤ 2ε,

so we obtain E_A[p(A)] ≤ ε.
Uehara et al. [10] apply the following lemma for finding relevant variables of fundamental Boolean functions (conjunctions, parities, etc.).

Lemma 7 (Uehara et al. [10]). Let R be any set of r elements in the n-element set X. Choose a subset W of X with |W| = n/r uniformly at random. Then |R ∩ W| = 1 happens with probability > 1/e.

Now, we present an algorithm that applies Lemma 6 and Lemma 7 and finds highly influential variables.

Procedure MsrInf
Input: Integers n, r and m0.
Output: A set of relevant variables.
Set X := {x1, ..., xn}, V := ∅, m := 0, n' := n and r' := r.
Do until either r' = 0 or m = m0
  Draw a random example (A, f(A)) and increment m by 1.
  Choose a set of variables W ⊆ X − V with |W| = n'/r' uniformly at random.
  Let A^W be the instance obtained from A by altering the bits on W, and query f(A^W) := MQ_f(A^W).
  If f(A) ≠ f(A^W) then Do
    Update m := 0, n' := n' − 1 and r' := r' − 1.
    Execute the binary search on (A, A^W), put the obtained variable in V and remove the variable from X.
  EndDo
EndDo
Output V.
EndProc
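Lemma 7 is easy to check numerically; a Monte-Carlo sketch of our own (fixing R to the first r elements, which is without loss of generality under a uniform choice of W):

```python
import math
import random

def prob_hit_exactly_one(n, r, trials=100000, seed=1):
    """Monte-Carlo check of Lemma 7: R is a fixed set of r elements out
    of n, W a uniform random subset of size n // r; estimate
    Prob[|R intersect W| = 1], which the lemma bounds below by 1/e."""
    rng = random.Random(seed)
    R = set(range(r))
    hits = 0
    for _ in range(trials):
        W = set(rng.sample(range(n), n // r))
        hits += len(R & W) == 1
    return hits / trials
```

With n = 100 and r = 10 the estimate comes out close to 0.4, comfortably above 1/e ≈ 0.368.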
Theorem 4. Under the uniform distribution, MsrInf finds a (1 − ε)-dominator (a set of relevant variables) with probability greater than 1 − δ by at most m0 = O((r/ε) log(r/δ)) random examples and r(m0 + log n) membership queries in O(nr(m0 + log n)) time.

Proof. Fix an arbitrary m0 = O((r/ε) log(r/δ)) such that (1 − ε/(er))^(m0) ≤ δ/r. Let R be any set of r variables containing all the relevant variables. In view of Lemma 7, choosing W as in the procedure, we have |(R − V) ∩ W| = 1 with probability greater than 1/e. Moreover, if it is so, (R − V) ∩ W = {x} happens equally likely for each x ∈ R − V. Therefore, drawing A according to D and choosing W as in the procedure, f(A) ≠ f(A^W) happens with probability greater than (1/(2er)) Σ_{x∉V} Inf_f(x). Thus if V is not a (1 − ε)-dominator then Lemma 6 promises Σ_{x∉V} Inf(x) > 2ε, so (A, A^W) happens to be a witness for relevance outside of V with probability greater than ε/(er). Therefore, in each stage of the Do loop, if V is not a (1 − ε)-dominator then a witness for relevance is successfully found with probability greater than 1 − (1 − ε/(er))^(m0) ≥ 1 − δ/r. The remaining argument is the same as in the proof of Theorem 1.

The procedure MsrInf enumerates variables in order of their influence on the target. For example, suppose that there are 100 relevant variables where only three of them have influence 0.1 and the other 97 have only 0.001. MsrInf gets the three in precedence over the other 97 with probability > (0.9)^3 = 0.729.
Acknowledgments

Osamu Watanabe contributed to this research from the beginning. We would like to thank him for motivation and many technical improvements. We would also like to thank Ryuhei Uehara for his helpful comments.
References
[1] D. Angluin. Queries and concept learning. Machine Learning, 2:319, 1987.
[2] M. Ben-Or and N. Linial. Collective coin flipping. Advances in Computing Research, 5, 1989.
[3] A. Blum, M. Furst, M. Kearns, and R. J. Lipton. Cryptographic primitives based on hard learning problems. In Proc. CRYPTO 93, pages 278-291, 1994. LNCS 773.
[4] A. Blum, L. Hellerstein, and N. Littlestone. Learning in the presence of finitely or infinitely many irrelevant attributes. Journal of Computer and System Sciences, 50(1):32-40, 1995.
[5] A. L. Blum and P. Langley. Selection of relevant features and examples. Machine Learning. To appear.
[6] B. Bollobas. Combinatorics. Cambridge Univ. Press, Cambridge, 1986.
[7] P. Damaschke. Adaptive versus nonadaptive attribute-efficient learning. In Proceedings of the 30th Annual ACM Symposium on Theory of Computing (STOC98), pages 590-596, 1998.
[8] A. Dhagat and L. Hellerstein. PAC learning with irrelevant attributes. In Proceedings of the 35th Annual Symposium on Foundations of Computer Science, pages 64-74, 1994.
[9] D. Haussler. Quantifying inductive bias: AI learning algorithms and Valiant's model. Artificial Intelligence, 36(2):177-221, 1988.
[10] R. Uehara, K. Tsuchida, and I. Wegener. Optimal attribute-efficient learning of disjunction, parity and threshold functions. In Proceedings of the 3rd European Conference on Computational Learning Theory, volume 1208 of LNAI, pages 171-184, 1997.
[11] L. G. Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134-1142, 1984.
General Linear Relations among Different Types of Predictive Complexity

Yuri Kalnishkan

Department of Computer Science, Royal Holloway, University of London, Egham, Surrey TW20 0EX, United Kingdom, [email protected]
Abstract. In this paper we introduce a general method that allows us to prove tight linear inequalities between different types of predictive complexity, and thus we generalise our previous results. The method relies upon probabilistic considerations and allows us to describe (using geometrical terms) the sets of coefficients which correspond to true inequalities. We also apply this method to the square-loss and logarithmic complexity and describe their relations which were not covered by our previous research.
1 Introduction
This paper generalises the author's paper [4]. In [4] we proved tight inequalities between the square-loss and logarithmic complexities. The key point of that paper is a lower estimate on the logarithmic complexity, which follows from the coincidence of the logarithmic complexity and a variant of Kolmogorov complexity. That estimate could not be extended to other types of predictive complexity. The main theorem (Theorem 7) of [4] pointed out the accidental coincidence of two real-valued functions but [4] did not explain the deeper reasons behind this fact.
In this paper we use a different approach to lower estimates of predictive complexity. It relies upon probabilistic considerations and may be applied to a broader class of games. The results of [4] become a particular case of more general statements and receive a more profound explanation. As an application of the general method we establish linear relations between the square-loss and logarithmic complexity that we had not covered in [4].
When giving the motivations for considering predictive complexity, we will briefly repeat some points from [4]. We work within an on-line learning model. In this model, a learning algorithm makes a prediction of a future event, then observes the actual event, and suffers loss due to the discrepancy between the prediction and the actual outcome. The total loss suffered by an algorithm over a sequence of several events can be regarded as the complexity of this sequence with respect to this algorithm. An
Supported partially by EPSRC through the grant GR/M14937 ("Predictive complexity: recursion-theoretic variants") and by ORS Awards Scheme.
O. Watanabe, T. Yokomori (Eds.): ALT'99, LNAI 1720, pp. 323-334, 1999.
Springer-Verlag Berlin Heidelberg 1999
optimal or universal measure of complexity cannot be defined within the class of losses of algorithms, so we need to consider a broader class of "complexities", namely, the class of optimal superloss processes. In many reasonable cases, this class contains an optimal element, a function which provides us with the intrinsic measure of complexity of a sequence of events with respect to no particular learning strategy.
The concept of predictive complexity is a natural development of the theory of prediction with expert advice (see [1, 3, 5]) and it was introduced in the paper [7]. In the theory of prediction with expert advice we merge some given learning strategies. Roughly speaking, predictive complexity may be regarded as a mixture of all possible strategies and some "superstrategies".
The paper [10] introduces a method that allows one to prove the existence of predictive complexity for many natural games. This method relies upon the Aggregating Algorithm (see [8]) and works for all so-called mixable games. It is still an open problem whether mixability is a necessary condition for the existence of predictive complexity but, in this paper, we restrict ourselves to mixable games.
In Sect. 2 we give the precise definition of the environment our learning algorithms work in. A particular kind of environment (a particular game) is specified by choosing an outcome space, a hypothesis space, and a loss function. A loss function measures the loss suffered by a prediction algorithm in this environment and thus it is of interest to compare games with the same outcome space and the same hypothesis space but different loss functions. In this paper we compare the values of predictive complexity of strings w.r.t. different games and therefore we compare the inherent learnability of an object in different environments.
Our goal is to describe the set of pairs (a, b) such that the inequality aK1(x) + b|x| ≥+ K2(x) holds for complexities K1 and K2 specified by mixable games G1 and G2. In Sect.
3 we formulate necessary and sufficient conditions for the inequalities aK1(x) + b|x| ≥+ K2(x) and a1 K1(x) + a2 K2(x) ≤+ b|x| to hold. We establish both geometrical and probabilistic criteria. It is remarkable that both inequalities hold if and only if their counterparts hold "on the average". In Sect. 4 we apply our results to relations between the square-loss and logarithmic complexity we have not investigated before.
2 Definitions
The notations in this paper are generally the same as in the paper [4] but some extra notations will also be introduced.
We will denote the binary alphabet {0, 1} by B and finite binary strings by bold lowercase letters, e.g. y. The expression |y| denotes the length of y and B* denotes the set of all finite binary strings. We use the following notations, typical for works on Kolmogorov complexity. We will write f(x) ≤+ g(x) for real-valued functions f and g if there is a constant C > 0 such that f(x) ≤ g(x) + C for all x from the domain of these functions
(the set B* throughout this paper). We consider mostly logarithms to the base 2 and we denote log2 by log.
We begin with the definition of a game. A game G is a triple (Ω, Γ, λ), where Ω is called an outcome space, Γ stands for a hypothesis space, and λ : Ω × Γ → IR ∪ {+∞} is a loss function. We suppose that a definition of computability over Ω and Γ is given and λ is computable according to this definition. Admitting the possibility of λ(ω, γ) = +∞ is essential (cf. [9]). We need this assumption to take the very interesting logarithmic game into consideration. The continuity of a function f : M → IR ∪ {+∞} in a point x0 ∈ M such that f(x0) = +∞ is the property lim_{x→x0, x∈M} f(x) = +∞ (continuity in the extended topology).
Throughout this paper, we let Ω = B = {0, 1} and Γ = [0, 1]. We will consider the following examples of games: the square-loss game with

  λ(ω, γ) = (ω − γ)²     (1)

and the logarithmic game with

  λ(ω, γ) = − log(1 − γ) if ω = 0,
            − log γ      if ω = 1.     (2)
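The two loss functions (1) and (2), together with the total loss over a sequence of trials defined below in (3), can be written down directly; a minimal sketch (the function names are ours):

```python
import math

def sq_loss(omega, gamma):
    """Square-loss game, eq. (1): lambda(omega, gamma) = (omega - gamma)^2."""
    return (omega - gamma) ** 2

def log_loss(omega, gamma):
    """Logarithmic game, eq. (2), log to the base 2; the loss is
    +infinity when the prediction is deterministic and wrong."""
    p = 1 - gamma if omega == 0 else gamma
    return math.inf if p == 0 else -math.log2(p)

def total_loss(loss, outcomes, predictions):
    """Total loss suffered over the first T trials, eq. (3)."""
    return sum(loss(w, g) for w, g in zip(outcomes, predictions))
```

For example, the constant prediction 1/2 suffers square loss 1/4 per trial and logarithmic loss exactly 1 bit per trial, which is the quantity λ(0, 1/2) appearing later in Theorem 17.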
A prediction algorithm A works according to the following protocol:

FOR t = 1, 2, ...
  (1) A chooses a hypothesis γt ∈ Γ
  (2) A observes the actual outcome ωt ∈ Ω
  (3) A suffers loss λ(ωt, γt)
END FOR.

Over the first T trials, A suffers the total loss

  Loss_A(ω1 ω2 ... ωT) = Σ_{t=1}^{T} λ(ωt, γt).     (3)

By definition, put Loss_A(Λ) = 0, where Λ denotes the empty string. A function L : Ω* → IR ∪ {+∞} is called a loss process w.r.t. G if it coincides with the loss Loss_A of some algorithm A. Note that any loss process is computable.
We say that a pair (s0, s1) ∈ [−∞, +∞]² is a superprediction if there exists a hypothesis γ ∈ Γ such that s0 ≥ λ(0, γ) and s1 ≥ λ(1, γ). If we consider the set P = { (p0, p1) | ∃γ ∈ Γ : p0 = λ(0, γ) and p1 = λ(1, γ) } ⊆ [−∞, +∞]² (cf. the canonical form of a game in [8]), the set S of all superpredictions is the set of points that lie north-east of P. We will loosely call P the set of predictions. The set of predictions P = { (γ², (1 − γ)²) | γ ∈ [0, 1] } and the set of superpredictions S for the square-loss game are shown in Fig. 1.
A function L : Ω* → IR ∪ {+∞} is called a superloss process w.r.t. G if the following conditions hold:
Fig. 1. The sets of predictions and superpredictions for the square-loss game.

- L(Λ) = 0;
- for any x ∈ Ω*, the pair (L(x0) − L(x), L(x1) − L(x)) is a superprediction w.r.t. G; and
- L is semicomputable from above.

We will say that a superloss process K is universal if it is minimal to within an additive constant in the class of all superloss processes. In other words, a superloss process K is universal if for any other superloss process L there exists a constant C such that

  ∀x ∈ Ω* : K(x) ≤ L(x) + C.     (4)
The difference between two universal superloss processes w.r.t. G is bounded by a constant. If universal superloss processes w.r.t. G exist we may pick one and denote it by KG. It follows from the definition that, for any L which is a superloss process w.r.t. G and any prediction algorithm A, we have

  KG(x) ≤+ L(x),     (5)
  KG(x) ≤+ Loss_A^G(x),     (6)

where Loss^G denotes the loss w.r.t. G. One may call KG the complexity w.r.t. G. Note that universal processes are defined for concrete games only. Two games G1 = (Ω, Γ, λ1) and G2 = (Ω, Γ, λ2) with the same outcome and hypothesis spaces but different loss functions may have different sets of universal superloss processes (e.g. G1 may have universal processes while G2 may have none).
We now proceed to the definition of a mixable game. For any A ⊆ [−∞, +∞]² and any (u, v) ∈ IR², the shift A + (u, v) is the set { (x + u, y + v) | (x, y) ∈ A }. For the sequel, we also need the definition of the expansion aA = { (ax, ay) | (x, y) ∈ A }, where a ∈ IR. For any B ⊆ [−∞, +∞]², the A-closure of B is the set

  clA(B) = ∩ { A + (u, v) | (u, v) ∈ IR² such that B ⊆ A + (u, v) }.     (7)
We let

  A0 = { (x, y) ∈ [−∞, +∞]² | x ≤ 0 or y ≤ 0 },     (8)
  Aβ = { (x, y) ∈ [−∞, +∞]² | β^x + β^y ≤ 1 }.     (9)

Fig. 2 shows the sets A0 and Aβ for β = 1/3.
Fig. 2. The sets A0 and A_{1/3}.
Definition 1 ([8, 9]). A game G is mixable if there exists β ∈ (0, 1) such that for any straight line l ⊆ IR² passing through the origin the set (clAβ S \ clA0 S) ∩ l contains no more than one element, where S is the set of superpredictions for G.

Proposition 2 ([8]). For any mixable game, there exists a universal superloss process.
Proposition 3 ([2, 7]). The logarithmic and the square-loss games are mixable and therefore the complexities Klog and Ksq exist.
3 General Linear Inequalities
In this section we prove some general results on linear inequalities. Throughout this section G1 and G2 are any games with loss functions λ1 and λ2, sets of predictions P1 and P2, and sets of superpredictions S1 and S2, respectively. The closure and the boundary of M ⊆ IR² in the standard topology of IR² are denoted by M̄ and ∂M, respectively.
3.1 Case aK1(x) + b|x| ≥+ K2(x)
The following theorem is the main result of the paper.

Theorem 4. Suppose that the games G1 and G2 are mixable and specify the complexities K1 and K2; suppose that the loss function λ1(ω, γ) is continuous in the second argument; then the following statements are equivalent:
(i) ∃C > 0 ∀x ∈ B* : K1(x) + C ≥ K2(x),
(ii) P1 ⊆ S2,
(iii) ∀p ∈ [0, 1] ∃C > 0 ∀n ∈ IN : EK1(ξ1^(p) ... ξn^(p)) + C ≥ EK2(ξ1^(p) ... ξn^(p)),
where ξ1^(p), ..., ξn^(p) are results of n independent Bernoulli trials with the probability of 1 being equal to p.

Loosely speaking, the inequality K1(x) ≥+ K2(x) holds if and only if the graph

  { (λ1(0, γ), λ1(1, γ)) | γ ∈ [0, 1] }     (10)

lies north-east of the graph

  { (λ2(0, γ), λ2(1, γ)) | γ ∈ [0, 1] }.     (11)

Proof. The implication (i) ⇒ (iii) is trivial.
Let us prove that (ii) ⇒ (i). Suppose that P1 ⊆ S2 and therefore S1 ⊆ S2. Let L be a superloss process w.r.t. G1. It follows from the definition that, for any x ∈ B*, we have (L(x0) − L(x), L(x1) − L(x)) ∈ S1 ⊆ S2. One can easily check that L(x) + c(1 − 2^(−|x|)) is a superloss process w.r.t. G2. If we take L = K1 and apply (5) we obtain (i).
It remains to prove that (iii) ⇒ (ii). Let us assume that condition (ii) is violated, i.e. there exists γ(0) ∈ [0, 1] such that

  (λ1(0, γ(0)), λ1(1, γ(0))) = (u0, v0) ∉ S2.     (12)

Since λ1 is continuous, without loss of generality we may assume that γ(0) is a computable number. We will now find p0 ∈ [0, 1] such that

  EK2(ξ1^(p0) ... ξn^(p0)) − EK1(ξ1^(p0) ... ξn^(p0)) = Ω(n).     (13)
We need the following lemmas.

Lemma 5. Let G be a mixable game; then the set of superpredictions for G is convex.
Proof (of the lemma). Let S be the set of superpredictions for G and let β ∈ (0, 1) be the number from Definition 1. Consider two points D, E ∈ S and the line segment [D, E] ⊆ IR². We will prove that [D, E] ⊆ S.
If one of the points lies north-east of the other, i.e.

  {D, E} = {(x1, y1), (x2, y2)},     (14)

where x1 ≤ x2 and y1 ≤ y2, then there is nothing to prove. Now suppose that D = (x1, y1), E = (x2, y2), x1 < x2, and y1 > y2 (see Fig. 3). There exists a shift A′ of the set Aβ such that D, E ∈ A′. Let us prove that the closed set M bounded by the line segment [D, E] and the segment of ∂A′ lying between D and E is a subset of clAβ S.
Fig. 3. The set Q ∩ M from the proof of Lemma 5 is coloured grey.

Consider a shift A′′ of the set Aβ such that D, E ∈ ∂A′′. Trivially, two different shifts of Aβ can have no more than one point in common. It follows from D, E ∈ A′′ that A′′ intersects the rays r1 = { (x1, y) | y ≥ y1 } and r2 = { (x2, y) | y ≤ y2 }. The continuity of ∂Aβ yields M ⊆ A′′.
If there exists a point F = (x0, y0) ∈ [D, E] such that F ∉ S, then the whole quadrant Q = { (x, y) | x ≤ x0 and y ≤ y0 } (see Fig. 3) has no common points with S, and the set Q ∩ M ⊆ (clAβ S \ clA0 S) violates the condition of Definition 1.
Lemma 6. Let M ⊆ IR² be a convex set closed in the standard topology of IR² and (u0, v0) ∉ M. Suppose that for any u, v ≥ 0 we have M + (u, v) ⊆ M. Then there exist p0 ∈ [0, 1] and m2 ∈ IR such that, for any (u, v) ∈ M, we have

  p0 v + (1 − p0)u ≥ m2 > m1 = p0 v0 + (1 − p0)u0.     (15)
Proof (of the lemma). The lemma can be derived from the Separation Theorem for convex sets (see e.g. [6]) but we will give a self-contained proof.
Let us denote (u0, v0) by D. Since M is closed, there exists a point E ∈ M which is closest to D. Clearly, all the points of M lie on one side of the straight line l
which is perpendicular to DE and passes through E, while D lies on the other side (see Fig. 4). The straight line l must run from the north-west to the south-east, and therefore, normalising its equation, one may reduce it to the form p0 v + (1 − p0)u = m2, where p0 ∈ [0, 1].
Fig. 4. The set M from the proof of Lemma 6 is coloured grey.
Lemma 7. Suppose that a game G is mixable and the set S of superpredictions for G lies north-east of the straight line pv + (1 − p)u = m, i.e.

  ∀(u, v) ∈ S : pv + (1 − p)u ≥ m,     (16)

where p ∈ [0, 1]. If K is the complexity w.r.t. G, then

  EK(ξ1^(p) ... ξn^(p)) ≥ mn.     (17)

Proof (of the lemma). Consider a superloss process L and a string x. The point (L(x0) − L(x), L(x1) − L(x)) = (s1, s2) is a superprediction. We have

  E(L(xξ^(p)) − L(x)) = p s2 + (1 − p)s1 ≥ m,     (18)

where ξ^(p) is a result of one Bernoulli trial with the probability of 1 being equal to p.

Lemma 8. Let G be a mixable game with the loss function λ and the complexity K. Suppose that γ(0) ∈ [0, 1] is a computable number such that p0 λ(1, γ(0)) + (1 − p0) λ(0, γ(0)) = m, where p0 ∈ [0, 1]; then there exists C > 0 such that

  EK(ξ1^(p0) ... ξn^(p0)) ≤ mn + C.     (19)
Proof (of the lemma). The proof is by considering the strategy which makes the prediction γ(0) on each trial and applying (6).

The theorem follows.

For any game G with the loss function λ, any positive real a, and any real b, one may consider the game G_{a,b} with the loss function λ_{a,b} = aλ + b. Any L is a superloss process w.r.t. G if and only if aL(x) + b|x| is a superloss process w.r.t. G_{a,b}. This implies the following corollary.

Corollary 9. Under the conditions of Theorem 4 the following statements are equivalent:
(i) ∃C > 0 ∀x ∈ B* : aK1(x) + b|x| + C ≥ K2(x),
(ii) aP1 + (b, b) ⊆ S2,
(iii) ∀p ∈ [0, 1] ∃C > 0 ∀n ∈ IN : aEK1(ξ1^(p) ... ξn^(p)) + bn + C ≥ EK2(ξ1^(p) ... ξn^(p)),
where ξ1^(p), ..., ξn^(p) are results of n independent Bernoulli trials with the probability of 1 being equal to p.

It is natural to ask whether the extra term b|x| can be replaced by a smaller term. The next corollary follows from the proof of Theorem 4 and clarifies the situation.

Corollary 10. Suppose that under the conditions of Theorem 4 the following statement holds: for any p ∈ [0, 1] there exists a function φp : IN → IR such that φp(n) = o(n) (n → +∞) and, for any n ∈ IN, the inequality

  aEK1(ξ1^(p) ... ξn^(p)) + bn + φp(n) ≥ EK2(ξ1^(p) ... ξn^(p))     (20)

holds, where ξ1^(p), ..., ξn^(p) are results of n independent Bernoulli trials with the probability of 1 being equal to p. Then the inequality

  aK1(x) + b|x| ≥+ K2(x)

holds.

Proof. The corollary follows from (13).

Corollary 11. If under the conditions of Theorem 4 there exists a function f : IN → IR such that f(n) = o(n) (n → +∞) and, for any x ∈ B*, the inequality

  aK1(x) + b|x| + f(|x|) ≥ K2(x)

holds, then the inequality

  aK1(x) + b|x| ≥+ K2(x)     (21)

holds.
The next statement shows a property of the set of all pairs (a, b) such that a ≥ 0 and the inequality aK1(x) + b|x| ≥+ K2(x) holds.
Corollary 12. Under the conditions of Theorem 4 the set

  { (a, b) | a ≥ 0 and ∃C > 0 ∀x ∈ B* : aK1(x) + b|x| + C ≥ K2(x) }     (22)

is closed in the topology of IR².

Proof. The proof is by continuity of λ1.

3.2 Case a1 K1(x) + a2 K2(x) ≤+ b|x|
In the previous subsection we considered nonnegative values of a. In this subsection we study the inequality aK1(x) + b|x| ≥+ K2(x) with negative a or, in other words, the inequality a1 K1(x) + a2 K2(x) ≤+ b|x| with a1, a2 ≥ 0.

Theorem 13. Suppose that games G1 and G2 are mixable and, for any γ ∈ [0, 1], we have

  λ1(0, γ) = λ1(1, 1 − γ),     (23)
  λ2(0, γ) = λ2(1, 1 − γ),     (24)

where λ1 and λ2 are the loss functions. Then, for any a1, a2 ≥ 0, the following statements are equivalent:
(i) ∃C > 0 ∀x ∈ B* : a1 K1(x) + a2 K2(x) ≤ b|x| + C,
(ii) a1 λ1(0, 1/2) + a2 λ2(0, 1/2) ≤ b,
(iii) ∃C > 0 ∀n ∈ IN : a1 EK1(ξ1^(1/2) ... ξn^(1/2)) + a2 EK2(ξ1^(1/2) ... ξn^(1/2)) ≤ bn + C,
where ξ1^(1/2), ..., ξn^(1/2) are results of n independent Bernoulli trials with the probability of 1 being equal to 1/2.

Proof. The proof is similar to the one of Theorem 4 but a little simpler.

Lemma 14. Suppose that a game G is mixable, λ is its loss function, and K is the complexity w.r.t. G. If, for any γ ∈ [0, 1], we have λ(0, γ) = λ(1, 1 − γ), then, for any x ∈ B*, we have

  K(x) ≤+ λ(0, 1/2) |x|.     (25)

Proof (of the lemma). The proof is by considering the strategy which makes the prediction 1/2 on each trial and applying (6).

Clearly, the sets of superpredictions S1 and S2 for G1 and G2 lie north-east of the straight lines x/2 + y/2 = λ1(0, 1/2) and x/2 + y/2 = λ2(0, 1/2), respectively. It follows from Lemma 7 that, for any n ∈ IN, the inequalities

  EK1(ξ1^(1/2) ... ξn^(1/2)) ≥ λ1(0, 1/2) n,     (26)
  EK2(ξ1^(1/2) ... ξn^(1/2)) ≥ λ2(0, 1/2) n     (27)

hold. The theorem follows.
4 Application to the Square-Loss and Logarithmic Complexity
In this section we will apply our general results to the square-loss and logarithmic games.

Theorem 15. If a ≥ 0, then the inequality

  aKlog(x) + b|x| ≥+ Ksq(x)     (28)

holds if and only if b ≥ max(1/4 − a, 0).

Proof. We apply Corollary 9. Let p ∈ [0, 1]. To estimate the expectations, we need the values

  Ep^sq := min_{0 ≤ γ ≤ 1} ( p(1 − γ)² + (1 − p)γ² )     (29)
         = p(1 − p)     (30)

and

  Ep^log := min_{0 ≤ γ ≤ 1} ( −p log γ − (1 − p) log(1 − γ) )     (31)
          = −p log p − (1 − p) log(1 − p).     (32)

It follows from Lemmas 7 and 8 that there are C1, C2 > 0 such that, for any p ∈ [0, 1] and for any n ∈ IN, we have

  | EKlog(ξ1^(p) ... ξn^(p)) − Ep^log n | ≤ C1,     (33)
  | EKsq(ξ1^(p) ... ξn^(p)) − Ep^sq n | ≤ C2.     (34)

Therefore the inequality aKlog(x) + b|x| ≥+ Ksq(x) holds if and only if for any p ∈ [0, 1] the inequality aEp^log + b ≥ Ep^sq holds.
Lemma 16. For any a ≥ 0, we have

  sup_{p ∈ [0, 1]} ( Ep^sq − aEp^log ) = max(1/4 − a, 0).

The theorem follows.

The next theorem corresponds to Subsect. 3.2.

Theorem 17. For any a1, a2 > 0 and any b, the inequality

  a1 Ksq(x) + a2 Klog(x) ≤+ b|x|     (35)

holds for any x ∈ B* if and only if a1/4 + a2 ≤ b.

Proof. The proof is by applying Theorem 13.
5 Acknowledgements
I would like to thank Prof. V. Vovk and Prof. A. Gammerman for providing guidance to this work. I am also grateful to Dr. A. Shen for helpful discussions.
References
[1] N. Cesa-Bianchi, Y. Freund, D. Haussler, D. P. Helmbold, R. E. Schapire, and M. K. Warmuth. How to use expert advice. Journal of the ACM, (44):427-485, 1997.
[2] A. DeSantis, G. Markowski, and M. N. Weigman. Learning probabilistic prediction functions. In Proceedings of the 1988 Workshop on Computational Learning Theory, pages 312-328, 1988.
[3] D. Haussler, J. Kivinen, and M. K. Warmuth. Tight worst-case loss bounds for predicting with expert advice. Technical Report UCSC-CRL-94-36, University of California at Santa Cruz, revised December 1994.
[4] Y. Kalnishkan. Linear relations between square-loss and Kolmogorov complexity. In Proceedings of the Twelfth Annual Conference on Computational Learning Theory, pages 226-232. Association for Computing Machinery, 1999.
[5] N. Littlestone and M. K. Warmuth. The weighted majority algorithm. Information and Computation, 108:212-261, 1994.
[6] F. A. Valentine. Convex Sets. McGraw-Hill Book Company, 1964.
[7] V. Vovk. Probability theory for the Brier game. To appear in Theoretical Computer Science. Preliminary version in M. Li and A. Maruoka, editors, Algorithmic Learning Theory, vol. 1316 of Lecture Notes in Computer Science, pages 323-338.
[8] V. Vovk. Aggregating strategies. In M. Fulk and J. Case, editors, Proceedings of the 3rd Annual Workshop on Computational Learning Theory, pages 371-383, San Mateo, CA, 1990. Morgan Kaufmann.
[9] V. Vovk. A game of prediction with expert advice. Journal of Computer and System Sciences, (56):153-173, 1998.
[10] V. Vovk and C. J. H. C. Watkins. Universal portfolio selection. In Proceedings of the 11th Annual Conference on Computational Learning Theory, pages 12-23, 1998.
Predicting Nearly as Well as the Best Pruning of a Planar Decision Graph

Eiji Takimoto¹ and Manfred K. Warmuth²

¹ Graduate School of Information Sciences, Tohoku University, Sendai, 980-8579, Japan.
² Computer Science Department, University of California, Santa Cruz, Santa Cruz, CA 95064, U.S.A.
Abstract. We design efficient on-line algorithms that predict nearly as well as the best pruning of a planar decision graph. We assume that the graph has no cycles. As in the previous work on decision trees, we implicitly maintain one weight for each of the prunings (exponentially many). The method works for a large class of algorithms that update their weights multiplicatively. It can also be used to design algorithms that predict nearly as well as the best convex combination of prunings.
1 Introduction
Decision trees are widely used in Machine Learning. Frequently a large tree is produced initially and then this tree is pruned for the purpose of obtaining a better predictor. A pruning is produced by deleting some nodes and with them all their successors. Although there are exponentially many prunings, a recent method developed in coding theory [WST95] and machine learning [Bun92] makes it possible to (implicitly) maintain one weight per pruning. In particular Helmbold and Schapire [HS97] use this method to design an elegant algorithm that is guaranteed to predict nearly as well as the best pruning of a decision tree. Pereira and Singer [PS97] modify this algorithm to the case of edge-based prunings instead of the node-based prunings defined above. Edge-based prunings are produced by cutting some edges of the original decision tree and then removing all nodes below the cuts. Both definitions are closely related. Edge-based prunings have been applied to statistical language modeling [PS97], where the out-degree of nodes in the tree may be very large.
In this paper we generalize the methods from decision trees to planar directed acyclic graphs (dags). Trees, upside-down trees and series-parallel dags are all special cases of planar dags. We define a notion of edge-based prunings of a planar dag. Again we find a way to efficiently maintain one weight for each of the exponentially many prunings.
In Fig. 1, the tree T′ represents a node-based pruning of the decision tree T. Each node in the original tree T is assumed to have a prediction value in

This work was done while the author visited University of California, Santa Cruz. Supported by NSF grant CCR 9700201.
O. Watanabe, T. Yokomori (Eds.): ALT'99, LNAI 1720, pp. 335-346, 1999.
© Springer-Verlag Berlin Heidelberg 1999
Eiji Takimoto and Manfred K. Warmuth
Fig. 1. An example of a decision tree, a pruning, and a decision dag.

some prediction space Ŷ. Here, we assume Ŷ = [0, 1]. In the usual setting each instance (from some instance space) induces a path in the decision tree from the root to a leaf. The path is based on decisions done at the internal nodes. Thus w.l.o.g. our instances are paths in the original decision tree. For a given path, a tree predicts the value at the leaf at which the path ends. For example, for the path a, c, f, i, the original tree T predicts the value 0.2 and the pruning T' predicts 0.6.

In what follows, we consider the prediction values to be associated with the edges. In Fig. 1 the prediction value of each edge is given at its lower endpoint. For example, the edges a, b and c have prediction values 0.4, 0.6 and 0.3, respectively. Moreover, we think of a pruning as the set of edges that are incident to the leaves of the pruning. So, T and T' are represented by {d, e, g, h, i} and {b, f, g}, respectively. Note that for any pruning R and any path P, R intersects P at exactly one edge. That is, a pruning cuts each path at an edge. The pruning R predicts on path P with the prediction value of the edge that is cut.

The notion of pruning can easily be generalized to directed acyclic graphs. We define decision dags as dags with a special source and sink node where each edge is assumed to have a prediction value. A pruning R of the decision dag is defined as a set of edges such that for any s-t path P, R intersects P with exactly one edge. Again the pruning R predicts on the instance/path P with the value of the edge that is cut. It is easily seen that the rightmost graph G in Fig. 1 is a decision dag that is equivalent to T.

We study learning in the on-line prediction model where the decision dag is given to the learner. At each trial t = 1, 2, …, the learner receives a path P_t and must produce a prediction ŷ_t ∈ Ŷ. Then an outcome y_t in the outcome space Y is observed (which can be thought of as the correct value of P_t).
Finally, at the end of the trial the learner suffers loss L(y_t, ŷ_t), where L : Y × Ŷ → [0, ∞] is a fixed loss function. Since each pruning R has a prediction value for P_t, the loss of R at this trial is defined analogously. The goal of the learner is to make predictions so that its total loss ∑_t L(y_t, ŷ_t) is not much worse than the total
Predicting Nearly as Well as the Best Pruning of a Planar Decision Graph
loss of the best pruning or that of the best mixture (convex combination) of prunings.

It is straightforward to apply any of a large family of on-line prediction algorithms to our problem. To this end, we just consider each pruning R of G as an expert that predicts as the pruning R does on all paths. Then, we can make use of any on-line algorithm that maintains one weight per expert and forms its own prediction ŷ_t by combining the predictions of the experts using the current weights (see for example [Lit88, LW94, Vov90, Vov95, CBFH+97, KW97]). Many relative loss bounds have been proven for this setting, bounding the additional total loss of the algorithm over the total loss of the best expert or best weighted combination of experts. However, this direct implementation of the on-line algorithms is inefficient because one weight/expert would be used for each pruning of the decision dag and the number of prunings is usually exponentially large. So the goal is to find efficient implementations of the direct algorithms so that the exponentially many weights are implicitly maintained. In the case of trees this is possible [HS97]. Other applications that simulate algorithms with exponentially many weights are given in [HPW99, MW98].

We now sketch how this can be done when the decision dag is planar. Recall that each pruning and path intersect at one edge. Therefore in each trial the edges on the path determine the predictions of all the prunings as well as their losses. So in each trial the edges on the path also incur a loss, and the total loss of a pruning is always the sum of the total losses of all of its edges. Under a very general setting the weight of a pruning is then a function of the weights of its edges. Thus the exponentially many weights of the prunings collapse to one weight per edge.
It is obvious that if we can efficiently update the edge weights and compute the prediction of the direct algorithm from the edge weights, then we have an efficient algorithm that behaves exactly as the direct algorithm. One of the most important families of on-line learning algorithms is the one that does multiplicative updates of its weights. For this family the weight w_R of a pruning R is always the product of the weights of its edges, i.e. w_R = ∏_{e∈R} v_e. The most important computation in determining the prediction and updating the weights is summing the current weights of all the prunings, i.e. ∑_R ∏_{e∈R} v_e. We do not know how to efficiently compute this sum for arbitrary decision dags. However, for planar decision dags, computing this sum is reduced to computing another sum for the dual planar dag. The prunings in the primal graph correspond to paths in the dual, and paths in the primal to prunings in the dual. Therefore the above sum is equivalent to ∑_P ∏_{e∈P} v_e, where P ranges over all paths in the dual dag. Curiously enough, the same formula appears as the likelihood of a sequence of symbols in a Hidden Markov Model where the edge weights are the transition probabilities. So we can use the well known forward-backward algorithm for computing the above formula efficiently [LRS83]. The overall time per trial is linear in the number of edges of the decision dag. For the case where the dag is series-parallel, we can improve the time per trial to grow linearly in the size of the instance (a path in the dag).
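The path-sum ∑_P ∏_{e∈P} v_e can be computed in O(|E|) time by a single forward pass over the dag in topological order. The sketch below (function and graph names are our own, not from the paper) illustrates this on an arbitrary edge-weighted dag:

```python
from collections import defaultdict

def sum_over_paths(edges, source, sink):
    """Sum over all source->sink paths of the product of edge weights,
    computed by dynamic programming over a topological order.
    edges: list of (u, v, weight) triples of a DAG."""
    out = defaultdict(list)
    indeg = defaultdict(int)
    nodes = set()
    for u, v, w in edges:
        out[u].append((v, w))
        indeg[v] += 1
        nodes.update((u, v))
    # Kahn's algorithm for a topological order
    order = [n for n in nodes if indeg[n] == 0]
    i = 0
    while i < len(order):
        for v, _ in out[order[i]]:
            indeg[v] -= 1
            if indeg[v] == 0:
                order.append(v)
        i += 1
    f = defaultdict(float)        # f[u] = sum over source->u paths
    f[source] = 1.0
    for u in order:
        for v, w in out[u]:
            f[v] += f[u] * w
    return f[sink]
```

With uniform weight 1 on every edge, this counts the s-t paths; with the multiplicative weights v_e of the text, it is exactly the normalization sum over all paths (equivalently, over all prunings of the primal dag).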
Another approach for solving the on-line pruning problem is to use the specialist framework developed by Freund, Schapire, Singer and Warmuth [FSSW97]. Now each edge is considered to be a specialist. In trial t only the edges on the path are awake and all others are asleep. The predictions of the awake edges are combined to form the prediction of the algorithm. The redeeming feature of their algorithm is that it works for arbitrary sets of prunings and paths over some set of edges with the property that any pruning and any path intersect at exactly one edge. They can show that their algorithm performs nearly as well as any mixture of specialists, that is, essentially as well as the best single pruning. However, even in the case of decision trees the loss bound of their algorithm is quadratic in the size of the pruning. In contrast, the loss bound for the direct algorithm grows only linearly in the size of the pruning. Also when we use for example the EG algorithm [KW97] as our direct algorithm, then the direct algorithm (as well as its efficient simulation) predicts nearly as well as the best convex combination of prunings.
2
On-Line Prunin of a Decision Da
A decision dag is a directed acyclic graph G = (V, E) with a designated start node s and a terminal node t. We call s and t the source and the sink of G, respectively. An s-t path is a set of edges of G that forms a path from the source to the sink. In the decision dag G, each edge e ∈ E is assumed to have a predictor that, when given an instance (s-t path) that includes the edge e, makes a prediction from the prediction space Ŷ. In a typical setting, the predictions would be real numbers from Ŷ = [0, 1]. Although the predictor at edge e may make different predictions whenever the path passes through e, we write its prediction as γ(e) ∈ Ŷ. A pruning R of G is a set of edges such that for any s-t path P, R intersects P with exactly one edge, i.e., |R ∩ P| = 1. Let e_{R∩P} denote the edge at which R and P intersect. Because of the intersection property, a pruning R can be thought of as a well-defined function from any instance P to a prediction γ(e_{R∩P}) ∈ Ŷ. Let P(G) and R(G) denote the set of all paths and all prunings of G, respectively. For example, the decision dag G in Fig. 1 has four prunings, i.e., R(G) = {{a}, {b, c}, {b, f, g}, {d, e}}. Assume that we are given an instance P = {a, b, e}. Then, the pruning R = {b, f, g} predicts 0.6 for this instance P, which is the prediction of the predictor at edge b = e_{R∩P}.

We study learning in the on-line prediction model, where an algorithm is required not to actually produce prunings but to make predictions for a given instance sequence based on a given decision dag G. The goal is to make predictions that are competitive with those made by the best pruning of G or with those by the best mixture of prunings of G.

We will now state our learning model more precisely. A prediction algorithm A is given a decision dag G as its input. At each trial t = 1, 2, …, algorithm A receives an instance/path P_t ∈ P(G) and generates a prediction ŷ_t ∈ Ŷ. After that, an outcome y_t ∈ Y is observed. Y is a set called the outcome space. Typically, the outcome space Y would be the same
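For concreteness, the pruning property |R ∩ P| = 1 and the induced prediction can be checked directly once paths and prunings are represented as edge sets. The sketch below uses our reconstruction of the dag G of Fig. 1 (the concrete edge sets and values are read off the figure and should be treated as illustrative):

```python
# Paths and prunings of a decision dag, represented as sets of edge names.
# Reconstruction of the dag G of Fig. 1 (illustrative, not authoritative).
paths = [{'a', 'b', 'd'}, {'a', 'b', 'e'}, {'a', 'c', 'f', 'd'},
         {'a', 'c', 'f', 'e'}, {'a', 'c', 'g', 'd'}, {'a', 'c', 'g', 'e'}]
gamma = {'a': 0.4, 'b': 0.6, 'c': 0.3, 'f': 0.6,
         'g': 0.1, 'd': 0.9, 'e': 0.2}          # per-edge predictions

def is_pruning(R, paths):
    """R is a pruning iff it meets every s-t path in exactly one edge."""
    return all(len(R & P) == 1 for P in paths)

def predict(R, P):
    """The pruning R predicts on path P with the value of the edge it cuts."""
    (e,) = R & P          # the intersection property guarantees a unique edge
    return gamma[e]
```

With this encoding, {b, f, g} is a pruning and predicts 0.6 on the instance {a, b, e}, matching the example in the text.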
as Ŷ. At this trial, the algorithm A suffers loss L(y_t, ŷ_t), where L : Y × Ŷ → [0, ∞] is a fixed loss function. For example the square loss is L(y, ŷ) = (y − ŷ)² and the relative-entropic loss is given by L(y, ŷ) = y ln(y/ŷ) + (1 − y) ln((1 − y)/(1 − ŷ)).

For any instance-outcome sequence S = ((P_1, y_1), …, (P_T, y_T)) ∈ (P(G) × Y)^T, the cumulative loss of A is defined as L_A(S) = ∑_{t=1}^T L(y_t, ŷ_t). In what follows, the cumulative loss of A is simply called the loss of A. Similarly, for a pruning R of G, the loss of R for S is defined as

L_R(S) = ∑_{t=1}^T L(y_t, γ(e_{R∩P_t})).
The performance of A is measured in two ways. The first one is to compare the loss of A to the loss of the best R. In other words, the goal of algorithm A is to make predictions so that its loss L_A(S) is close to min_{R∈R(G)} L_R(S). The other goal (that is harder to achieve) is to compare the loss of A to the loss of the best mixture of prunings. To be more precise, we introduce a mixture vector u indexed by R so that u_R ≥ 0 for R ∈ R(G) and ∑_R u_R = 1. Then the goal of A is to achieve a loss L_A(S) that is close to min_u L_u(S), where

L_u(S) = ∑_{t=1}^T L(y_t, ∑_{R∈R(G)} u_R γ(e_{R∩P_t})).
Note that the former goal can be seen as the special case of the latter one where the mixture vector u is restricted to unit vectors (i.e., u_R = 1 for some particular R).
3 Dual Problem for a Planar Decision Dag
In this section, we show that our problem of on-line pruning has an equivalent dual problem provided that the underlying graph G is planar. The duality will be used to make our algorithms efficient.

An s-t cut of G is a minimal set of edges of G such that its removal from G results in a graph where s and t are disconnected. First we point out that a pruning of G is an s-t cut of G as well. The converse is not necessarily true. For instance, the set {a, e, f} is an s-t cut of G in Fig. 2 but it is not a pruning because the path {a, d, e} intersects the cut with more than one edge. So, the set of prunings R(G) is a subset of all s-t cuts of G, and our problem can be seen as an on-line min-cut problem where cuts are restricted to R(G). To see this, let us consider the cumulative loss ℓ_e = ∑_{t: e∈P_t} L(y_t, γ(e)) at edge e as the capacity of e. Then, the loss of a pruning R, L_R(S) = ∑_t L(y_t, γ(e_{R∩P_t})) = ∑_{e∈R} ℓ_e, can be interpreted as the total capacity of the cut R. This implies that a pruning of minimum loss is a minimum capacity cut from R(G).

It is known in the literature that the (unrestricted) min-cut problem for an s-t planar graph can be reduced to the shortest path problem for its dual graph (see, e.g., [Hu69, Law70, Has81]). A slight modification of the reduction gives
Fig. 2. A decision dag G and its dual dag G^D.

us a dual problem for the best pruning (restricted min-cut) problem. Below we show how to construct the dual dag G^D from a planar decision dag G that is suitable for our purpose.

Assume we have a planar decision dag G = (V, E) with source s and sink t. Since the graph G is acyclic, we have a planar representation of G so that the vertices in V are placed on a vertical line with all edges downward. In this linear representation, the source s and the sink t are placed on the top and the bottom of the line, respectively (see Fig. 2). The vertical line (the dotted line) bisects the plane and defines two outer faces s' and t' of G. Let s' be the right face. The dual dag G^D = (V^D, E^D) is constructed as follows. The set of vertices V^D consists of all faces of G. Let e ∈ E be an edge of G which is common to the boundaries of two faces f_r and f_l in G. By virtue of the linear representation, we can let f_r be the right face on e and f_l be the left face on e. Then, let E^D include the edge e' = (f_r, f_l) directed from f_r to f_l. It is clear that the dual dag G^D is a planar directed acyclic graph with source s' and sink t', and the dual of G^D is G. The following proposition is crucial in this paper.

Proposition 1. Let G be a planar decision dag and G^D be its dual dag. Then, there is a one-to-one correspondence between s-t paths P(G) in G and prunings R(G^D) in G^D, and there is also a one-to-one correspondence between prunings R(G) in G and s'-t' paths P(G^D) in G^D.

Thus there is a natural dual problem associated with the on-line pruning problem. We now describe this dual on-line shortest path problem. An algorithm A is given as input a decision dag G. At each trial t = 1, 2, …, algorithm A receives a pruning R_t ∈ R(G) as the instance and generates a prediction ŷ_t ∈ Ŷ. The loss of A, denoted L_A(S), for an instance-outcome sequence S = ((R_1, y_1), …, (R_T, y_T)) ∈ (R(G) × Y)^T is defined as L_A(S) = ∑_{t=1}^T L(y_t, ŷ_t).
The class of predictors to which the performance of A is now compared consists of all paths. For a path P of G, the loss of P for S is defined as
L_P(S) = ∑_{t=1}^T L(y_t, γ(e_{R_t∩P})).

Similarly, for a mixture vector u indexed by P so that u_P ≥ 0 for P ∈ P(G) and ∑_P u_P = 1, the loss of the u-mixture of paths is defined as

L_u(S) = ∑_{t=1}^T L(y_t, ∑_{P∈P(G)} u_P γ(e_{R_t∩P})).
The objective of A is to make its loss as small as the loss of the best path P, i.e., min_P L_P(S), or the best mixture of paths, i.e., min_u L_u(S). It is natural to call this the on-line shortest path problem because if we consider the cumulative loss ℓ_e = ∑_{t: e∈R_t} L(y_t, γ(e)) at edge e as the length of e, then the loss of P, L_P(S) = ∑_{e∈P} ℓ_e, can be interpreted as the total length of P. It is clear from the duality that the on-line pruning problem for a decision dag G is equivalent to the on-line shortest path problem for its dual dag G^D. In what follows, we consider only the on-line shortest path problem.
4 Inefficient Direct Algorithm
In this section, we show the direct implementation of the algorithms for the on-line shortest path problem. Namely, the algorithm considers each path P of G as an expert that makes a prediction x_{t,P} = γ(e_{R_t∩P}) for a given pruning R_t. Note that this direct implementation would be inefficient because the number of experts (the number of paths in this case) can be exponentially large.

In general, such direct algorithms have the following generic form: They maintain a weight w_{t,P} ∈ [0, 1] for each path P ∈ P(G); when given the predictions x_{t,P} (= γ(e_{R_t∩P})) of all paths P, they combine these predictions based on the weights to make their own prediction ŷ_t, and then update the weights after the outcome y_t is observed. In what follows, let w_t and x_t denote the weight and prediction vectors indexed by P ∈ P(G), respectively. Let N be the number of experts, i.e., the cardinality of P(G).

More precisely, the generic algorithm consists of two parts: a prediction function pred : [0, 1]^N × Ŷ^N → Ŷ which maps the current weight and prediction vectors (w_t, x_t) of the experts to a prediction ŷ_t; and an update function update : [0, 1]^N × Ŷ^N × Y → [0, 1]^N which maps (w_t, x_t) and the outcome y_t to a new weight vector w_{t+1}. Using these two functions, the generic on-line algorithm behaves as follows: For each trial t = 1, 2, …,

1. Observe predictions x_t from the experts.
2. Predict ŷ_t = pred(w_t, x_t).
3. Observe outcome y_t and suffer loss L(y_t, ŷ_t).
4. Calculate the new weight vector according to w_{t+1} = update(w_t, x_t, y_t).

Vovk's Aggregating Algorithm (AA) [Vov90] is a seminal on-line algorithm of generic form and has the best possible loss bound for a very wide class of loss functions. It updates its weights as w_{t+1} = update(w_t, x_t, y_t), where w_{t+1,P} = w_{t,P} exp(−L(y_t, x_{t,P})/c_L) for any P ∈ P(G). Here c_L is a constant that depends on the loss function L. Since AA uses a complicated prediction function, we only discuss a simplified algorithm called the Weighted Average Algorithm (WAA) [KW99]. The latter algorithm uses the same updates with a slightly worse constant c_L and predicts with the weighted average based on the normalized weights:

ŷ_t = pred(w_t, x_t) = ∑_{P∈P(G)} w̄_{t,P} x_{t,P}, where w̄_{t,P} = w_{t,P} / ∑_P w_{t,P}.
The following theorem gives an upper bound on the loss of the WAA in terms of the loss of the best path.

Theorem 1 ([KW99]). Assume Y = Ŷ = [0, 1]. Let the loss function L be monotone, convex and twice differentiable with respect to the second argument. Then, for any instance-outcome sequence S ∈ (R(G) × Y)^*,

L_WAA(S) ≤ min_{P∈P(G)} ( L_P(S) + c_L ln(1/w̄_{1,P}) ),

where w̄_{1,P} is the normalized initial weight of P.

We can obtain a more powerful bound in terms of the loss of the best mixture of paths using the exponentiated gradient (EG) algorithm due to Kivinen and Warmuth [KW97]. The EG algorithm uses the same prediction function pred as the WAA and uses the update function w_{t+1} = update(w_t, x_t, y_t) so that for any P ∈ P(G),

w_{t+1,P} = w_{t,P} exp( −η x_{t,P} ∂L(y_t, z)/∂z |_{z=ŷ_t} ).

Here η is a positive learning rate. Kivinen and Warmuth show the following loss bound of the EG algorithm for the square loss function L(y, ŷ) = (y − ŷ)². Note that, for the square loss, the update above becomes w_{t+1,P} = w_{t,P} exp(−2η(ŷ_t − y_t) x_{t,P}).

Theorem 2 ([KW97]). Assume Y = Ŷ = [0, 1]. Let L be the square loss function. Then, for any instance-outcome sequence S ∈ (R(G) × Y)^* and for any probability vector u ∈ [0, 1]^N indexed by P,

L_EG(S) ≤ (2/(2 − η)) L_u(S) + (1/η) RE(u, w̄_1),

where RE(u, w̄_1) = ∑_P u_P ln(u_P/w̄_{1,P}) is the relative entropy between u and the initial normalized weight vector w̄_1.
Now we give the two conditions on the direct algorithms that are required for our efficient implementation.

Definition 1. Let w ∈ [0, 1]^N and x ∈ Ŷ^N be a weight and a prediction vector. Let P_1 ∪ … ∪ P_k = P(G) be a partition of P(G) such that P_i ∩ P_j = ∅ for any i ≠ j and all paths in the same class have the same prediction. That is, for each class P_i, there exists x'_i ∈ Ŷ such that x_P = x'_i for any P ∈ P_i. In other words, x' = (x'_1, …, x'_k) and w' = (w'_1, …, w'_k), where w'_i = ∑_{P∈P_i} w_P, can be seen as a projection of the original prediction vector x and weight vector w onto the partition P_1, …, P_k. The prediction function pred is projection-preserving if pred(w, x) = pred(w', x') for any w and x.

Definition 2. The update function update is multiplicative if there exists a function f : Ŷ × Y × Ŷ → ℝ such that for any w ∈ [0, 1]^N, x ∈ Ŷ^N and y ∈ Y, the new weight w' = update(w, x, y) is given by w'_P = w_P f(x_P, y, ŷ) for any P, where ŷ = pred(w, x).

These conditions are natural. In fact, they are actually satisfied by the prediction and update functions used in many families of algorithms such as AA [Vov90], WAA [KW99], EG and EGU [KW97]. Note that the projection used may change from trial to trial.
5 Efficient Implementation of the Direct Algorithm
Now we give an efficient implementation of a direct algorithm that consists of a projection-preserving prediction function pred and a multiplicative update function update. Clearly, it is sufficient to show how to efficiently compute the functions pred and update. Obviously, we cannot explicitly maintain all of the weights w_{t,P} as the direct algorithm does, since there may be exponentially many paths P in G. Instead, we maintain a weight v_{t,e} for each edge e, which requires only linear space. We will give indirect algorithms below for computing pred and update so that the weights v_{t,e} for edges implicitly represent the weights w_{t,P} for all paths P as follows:

w_{t,P} = ∏_{e∈P} v_{t,e}.    (1)
First we show an indirect algorithm for update, which is simpler. Suppose that, given a pruning R_t as instance, we have already calculated a prediction ŷ_t and observe an outcome y_t. Then, the following update for the weights of the edges is equivalent to the update w_{t+1} = update(w_t, x_t, y_t) of the weights of the paths. Recall that w_t is a weight vector indexed by P given by (1) and x_t is a prediction vector given by x_{t,P} = γ(e_{R_t∩P}). Let f be the function associated with our multiplicative update (see Definition 2). For any edge e ∈ E, the weight of e is updated according to

v_{t+1,e} = v_{t,e} f(γ(e), y_t, ŷ_t) if e ∈ R_t, and v_{t+1,e} = v_{t,e} otherwise.    (2)
Lemma 1. The update rule for edges given by (2) is equivalent to update(w_t, x_t, y_t).

Proof. It suffices to show that the relation (1) is preserved after updating. That is, w_{t+1,P} = ∏_{e∈P} v_{t+1,e} for any P ∈ P(G). Let e' = e_{R_t∩P}. Since update is multiplicative, we have

w_{t+1,P} = w_{t,P} f(x_{t,P}, y_t, ŷ_t) = ( ∏_{e∈P} v_{t,e} ) f(γ(e'), y_t, ŷ_t)
= ( ∏_{e∈P\{e'}} v_{t,e} ) · v_{t,e'} f(γ(e'), y_t, ŷ_t) = ∏_{e∈P} v_{t+1,e},

as required.
as required. Next we show an indirect al orithm for pred. Let the iven prunin be Rt = ek . For 1 i k, let P = P P(G) e P . Since Rt P = 1 e1 Pk = P(G) forms a partition of P(G) and clearly for any for any P , P1 path P P , we have xt P = (e ). So, x0 = ( (e1 )
(ek ))
(3)
is a projected prediction vector of x_t. Therefore, if we have the corresponding projected weight vector w', then by the projection-preserving property of pred we can obtain ŷ_t as pred(w', x'), which equals ŷ_t = pred(w_t, x_t). Now what we have to do is to efficiently compute the projected weights for 1 ≤ i ≤ k:

w'_{t,i} = ∑_{P∈P_i} w_{t,P} = ∑_{P: e_i∈P} w_{t,P} = ∑_{P: e_i∈P} ∏_{e∈P} v_{t,e}.    (4)
Surprisingly, the sum-product formula above is similar to the formula for the likelihood of a sequence of symbols in a Hidden Markov Model where the edge weights are the transition probabilities [LRS83]. Thus we can compute (4) with the forward-backward algorithm. For a node u ∈ V, let P_{s→u} and P_{u→t} be the set of paths from s to the node u and the set of paths from the node u to t, respectively. Define

α(u) = ∑_{P∈P_{s→u}} ∏_{e∈P} v_{t,e}  and  β(u) = ∑_{P∈P_{u→t}} ∏_{e∈P} v_{t,e}.

Suppose that e_i = (u_1, u_2). Then, the set of all paths in P(G) through e_i is represented as {P_1 ∪ {e_i} ∪ P_2 : P_1 ∈ P_{s→u_1}, P_2 ∈ P_{u_2→t}}, and therefore the formula (4) is given by

w'_{t,i} = α(u_1) v_{t,e_i} β(u_2).    (5)

We summarize this result as the following lemma.

Lemma 2. Let x' and w' be given by (3) and (5), respectively. Then pred(w', x') = pred(w, x).
The forward-backward algorithm [LRS83] efficiently computes α and β by dynamic programming as follows: α(u) = 1 if u = s, and α(u) = ∑_{u'∈V: (u',u)∈E} α(u') v_{t,(u',u)} otherwise. Similarly, β(u) = 1 if u = t, and β(u) = ∑_{u'∈V: (u,u')∈E} β(u') v_{t,(u,u')} otherwise. It is clear that both α and β can be computed in time O(|E|).
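The whole per-trial computation for a general dag can be sketched as follows: one forward pass for α, one backward pass for β, and the projected weights (5) for the edges of the pruning. The graph encoding and names are ours; the edge weights v are passed in as a dict keyed by edge name.

```python
from collections import defaultdict

def topo_order(edges):
    """Topological order of the nodes of a dag given as (u, v, name) triples."""
    out, indeg, nodes = defaultdict(list), defaultdict(int), set()
    for u, v, _ in edges:
        out[u].append(v)
        indeg[v] += 1
        nodes.update((u, v))
    order = [n for n in nodes if indeg[n] == 0]
    i = 0
    while i < len(order):
        for v in out[order[i]]:
            indeg[v] -= 1
            if indeg[v] == 0:
                order.append(v)
        i += 1
    return order

def projected_weights(edges, v, s, t, pruning):
    """w'_i = alpha(u1) * v_e * beta(u2) for each edge e = (u1, u2) of the
    pruning (formula (5)), where alpha/beta are the forward/backward sums."""
    order = topo_order(edges)
    alpha = defaultdict(float)
    alpha[s] = 1.0
    for u1, u2, e in sorted(edges, key=lambda x: order.index(x[0])):
        alpha[u2] += alpha[u1] * v[e]      # forward pass
    beta = defaultdict(float)
    beta[t] = 1.0
    for u1, u2, e in sorted(edges, key=lambda x: -order.index(x[1])):
        beta[u1] += beta[u2] * v[e]        # backward pass
    return {e: alpha[u1] * v[e] * beta[u2]
            for u1, u2, e in edges if e in pruning}
```

The WAA prediction at the trial is then ∑_i w'_i γ(e_i) / ∑_i w'_i. (The sorts above are only for brevity; a single linear scan per pass gives the O(|E|) bound of the text.)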
6 A More Efficient Algorithm for Series-Parallel Dags
In the case of decision trees, there is a very efficient algorithm with per-trial time linear in the size of the instance (a path in the decision tree) [HS97]. We now give an algorithm with the same improved time per trial for series-parallel dags, which include decision trees.

A series-parallel dag G(s, t) with source s and sink t is defined recursively as follows: An edge (s, t) is a series-parallel dag. If G_1(s_1, t_1), …, G_k(s_k, t_k) are disjoint series-parallel dags, then the series connection G(s, t) = s(G_1, …, G_k) of these dags, where s = s_1, t_i = s_{i+1} for 1 ≤ i ≤ k − 1 and t = t_k, or the parallel connection G(s, t) = p(G_1, …, G_k) of these dags, where s = s_1 = … = s_k and t = t_1 = … = t_k, is a series-parallel dag. Note that a series-parallel dag has a parse tree, where each internal node represents a series or a parallel connection of the dags represented by its child nodes.

It suffices to show that the projected weights (4) can be calculated in time linear in the size of the instance/pruning R_t. To do so the algorithm maintains one weight v_{t,G_i} per node G_i of the parse tree so that

v_{t,G} = ∑_{P∈P(G)} ∏_{e∈P} v_{t,e}

holds. Note that if G consists of a single edge e, then v_{t,G} = v_{t,e}; if G = s(G_1, …, G_k), then v_{t,G} = ∏_{i=1}^k v_{t,G_i}; if G = p(G_1, …, G_k), then v_{t,G} = ∑_{i=1}^k v_{t,G_i}. Now (4), i.e., W(G, e) = ∑_{P∈P(G): e∈P} ∏_{e'∈P} v_{t,e'}, is recursively computed as

W(G, e) = v_{t,e} if G consists of e;
W(G, e) = W(G_i, e) v_{t,G} / v_{t,G_i} if G = s(G_1, …, G_k) and e is in G_i;
W(G, e) = W(G_i, e) if G = p(G_1, …, G_k) and e is in G_i.

The weights v_{t,G} are also recursively updated as

v_{t+1,G} = v_{t+1,e} if G consists of e;
v_{t+1,G} = v_{t+1,G_i} v_{t,G} / v_{t,G_i} if G = s(G_1, …, G_k) and R_t is in G_i;
v_{t+1,G} = ∑_{i=1}^k v_{t+1,G_i} if G = p(G_1, …, G_k).
It is not hard to see that the prediction and the update can be calculated in time linear in the size of R_t. Note that the dual of a series-parallel dag is also a series-parallel dag that has the same parse tree with the series and the parallel connections exchanged. So we can solve the primal on-line pruning problem using the same parse tree.
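The parse-tree recursion above can be sketched as follows, with a series-parallel dag encoded as nested tuples ('edge', name), ('series', [...]), ('parallel', [...]); the encoding and names are ours, and the series case assumes nonzero child totals, as in the division form of the text.

```python
def total_weight(G, v):
    """v_G = sum over all paths of G of the product of edge weights:
    the edge weight for a leaf, the product over a series connection,
    the sum over a parallel connection."""
    kind = G[0]
    if kind == 'edge':
        return v[G[1]]
    vals = [total_weight(c, v) for c in G[1]]
    if kind == 'series':
        prod = 1.0
        for x in vals:
            prod *= x
        return prod
    return sum(vals)                       # parallel connection

def contains_edge(G, e):
    if G[0] == 'edge':
        return G[1] == e
    return any(contains_edge(c, e) for c in G[1])

def W(G, e, v):
    """W(G, e): total weight of the paths of G that pass through edge e."""
    kind = G[0]
    if kind == 'edge':
        return v[G[1]]
    # e lies in exactly one child G_i of a series or parallel connection
    (i,) = [j for j, c in enumerate(G[1]) if contains_edge(c, e)]
    if kind == 'series':
        # scale by the siblings' totals: W(G_i, e) * v_G / v_{G_i}
        return W(G[1][i], e, v) * total_weight(G, v) / total_weight(G[1][i], v)
    return W(G[1][i], e, v)                # parallel: only G_i contributes
```

Caching the totals v_{t,G} at the parse-tree nodes, as the text describes, is what brings the per-trial time down to the size of R_t; the sketch recomputes them for clarity.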
Acknowledgments

We would like to thank Hiroshi Nagamochi for calling our attention to the duality of planar graphs.
References

[Bun92] W. Buntine. Learning classification trees. Statistics and Computing, 2:63-73, 1992.
[CBFH+97] N. Cesa-Bianchi, Y. Freund, D. Haussler, D. Helmbold, R. Schapire, and M. Warmuth. How to use expert advice. Journal of the ACM, 44(3):427-485, 1997.
[FSSW97] Y. Freund, R. Schapire, Y. Singer, and M. Warmuth. Using and combining predictors that specialize. 29th STOC, 334-343, 1997.
[Has81] R. Hassin. Maximum flow in (s, t) planar networks. Information Processing Letters, 13(3):107, 1981.
[HPW99] D. Helmbold, S. Panizza, and M. Warmuth. Direct and indirect algorithms for on-line learning of disjunctions. 4th EuroCOLT, 138-152, 1999.
[HS97] D. Helmbold and R. Schapire. Predicting nearly as well as the best pruning of a decision tree. Machine Learning, 27(1):51-68, 1997.
[Hu69] T. Hu. Integer Programming and Network Flows. Addison-Wesley, 1969.
[KW97] J. Kivinen and M. Warmuth. Additive versus exponentiated gradient updates for linear prediction. Information and Computation, 132(1):1-64, 1997.
[KW99] J. Kivinen and M. Warmuth. Averaging expert predictions. 4th EuroCOLT, 153-167, 1999.
[Law70] E. Lawler. Combinatorial Optimization: Networks and Matroids. Holt, Rinehart and Winston, New York, 1970.
[Lit88] N. Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2:285-318, 1988.
[LRS83] S. Levinson, L. Rabiner, and M. Sondhi. An introduction to the application of the theory of probabilistic functions of a Markov process to automatic speech recognition. Bell System Technical Journal, 62(4):1035-1074, 1983.
[LW94] N. Littlestone and M. Warmuth. The weighted majority algorithm. Information and Computation, 108(2):212-261, 1994.
[MW98] M. Maass and M. Warmuth. Efficient learning with virtual threshold gates. Information and Computation, 141(1):66-83, 1998.
[PS97] F. Pereira and Y. Singer. An efficient extension to mixture techniques for prediction and decision trees. 10th COLT, 114-121, 1997.
[Vov90] V. Vovk. Aggregating strategies. 3rd COLT, 371-383, 1990.
[Vov95] V. Vovk. A game of prediction with expert advice. 8th COLT, 51-60, 1995.
[WST95] F. Willems, Y. Shtarkov, and T. Tjalkens. The context tree weighting method: basic properties. IEEE Transactions on Information Theory, 41(3):653-664, 1995.
On Learning Unions of Pattern Languages and Tree Patterns

Sally A. Goldman¹ and Stephen S. Kwek²

¹ Washington University, St. Louis MO 63130-4899, USA,
[email protected], http://www.cs.wustl.edu/~sg
² Washington State University, Pullman WA 99164-1035, USA,
[email protected], http://www.eecs.wsu.edu/~kwek
Abstract. We present efficient on-line algorithms for learning unions of a constant number of tree patterns, unions of a constant number of one-variable pattern languages, and unions of a constant number of pattern languages with fixed length substitutions. By fixed length substitutions we mean that each occurrence of variable x_i must be substituted by terminal strings of fixed length l(x_i). We prove that if arbitrary unions of pattern languages with fixed length substitutions can be learned efficiently then DNFs are efficiently learnable in the mistake bound model. Since we use a reduction to Winnow, our algorithms are robust against attribute noise. Furthermore, they can be modified to handle concept drift. Also, our approach is quite general and may be applicable to learning other pattern related classes. For example, we could learn a more general pattern language class in which a penalty (i.e. weight) is assigned to each violation of the rule that a terminal symbol cannot be changed or that a pair of variable symbols, of the same variable, must be substituted by the same terminal string. An instance is positive iff the penalty incurred for violating these rules is below a given tolerable threshold.
1 Introduction
A pattern p is a string in (T ∪ S)^+ for sets T of terminal symbols and S of variable symbols. The number of terminal symbols could be infinite. For a pattern p, let L(p) denote the set of strings from T^+ that can be obtained by substituting nonempty strings from T^+ for the variables in p. We call L(p) the pattern language generated by p. The strings in L(p) are positive instances and the others are negative instances. For example, p = 1x₁0x₂1x₃x₁0x₂ is a 3-variable pattern. The instance 11010111001101011 is in L(p) since it can be obtained by the substitutions x₁ = 101, x₂ = 11, x₃ = 001.

Pattern languages were first introduced by Angluin [4, 5]. Since then, they have been extensively investigated in the identification in the limit framework [44, 41, 40, 21, 31, 45, 20, 1, 16, 38, 46]. They have also been studied in the PAC

Supported in part by NSF Grant CCR-9734940.

O. Watanabe, T. Yokomori (Eds.): ALT'99, LNAI 1720, pp. 347-363, 1999.
© Springer-Verlag Berlin Heidelberg 1999
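Under the fixed-length-substitution restriction from the abstract, membership testing is straightforward, since each variable occurrence consumes a known number of symbols; only the consistency of repeated variables must be checked. A sketch (the token encoding is ours, not from the paper):

```python
def matches_fixed(pattern, lengths, s):
    """Does s belong to L(p) when each variable x must be substituted by a
    terminal string of fixed length lengths[x]?  'pattern' is a list of
    tokens; a token counts as a variable iff it appears in 'lengths'."""
    subst, i = {}, 0
    for tok in pattern:
        if tok in lengths:                       # variable occurrence
            seg = s[i:i + lengths[tok]]
            if len(seg) < lengths[tok]:
                return False                     # ran out of input
            if subst.setdefault(tok, seg) != seg:
                return False                     # repeated variable mismatch
            i += lengths[tok]
        else:                                    # terminal symbol
            if i >= len(s) or s[i] != tok:
                return False
            i += 1
    return i == len(s)
```

With p = 1x₁0x₂1x₃x₁0x₂ and l(x₁) = 3, l(x₂) = 2, l(x₃) = 3, the string 11010111001101011 from the example above is accepted. Note that general pattern languages allow substitutions of any nonempty length, which is what makes their membership problem NP-complete.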
learning [33, 24, 39] and exact learning [11, 19, 27, 32, 33] frameworks. They are applicable to text processing [36], automated data entry systems [41], case-based reasoning [22] and genome informatics [7, 8, 9, 13, 35, 42, 43].

Learning general pattern languages is a very difficult problem. In fact, even if the learner knows the target pattern, deciding whether a string can be generated by that pattern is NP-complete [4, 25]. Ko and Tzeng [26] showed that the consistency problem of pattern languages is Σ₂^P-complete. Schapire [39] proved a stronger result. He showed that pattern languages cannot be learned efficiently in the PAC-model assuming P/poly ≠ NP/poly, regardless of the representation used by the learning algorithm. In the exact model, Angluin [6] proved that learning with membership and equivalence queries requires exponential time.

A natural approach to making pattern languages learnable is to restrict the number of occurrences of each variable symbol in the pattern to one [40] or at most some constant k [33]. Another approach is to bound the number of variables by some constant (though there is no restriction on the number of times each variable symbol can be used). Kearns and Pitt [24] gave a polynomial-time PAC-learning algorithm for learning such k-variable patterns under the assumption that examples are drawn from a product distribution. However, for arbitrary distributions, the problem seems to be difficult even if k = 2 [4, 18]. We present an efficient algorithm that does not place any restrictions on k or the number of times each variable symbol occurs (albeit at the cost of only allowing fixed length substitutions). Furthermore, we can also learn the union of a constant number of patterns even with attribute noise.

For k = 1, Angluin [4] presented a learner that produces a descriptive pattern in O(l⁴ log l) update time, where l is the length of all the examples seen so far.
A pattern p is said to be descriptive if, given a sample S that can be generated by p, no other pattern that generates S can generate a proper subset of the language generated by p. Erlebach et al. [16] gave a more efficient algorithm that outputs a descriptive pattern in expected total learning time O(|p|² log |p|), where |p| is the length of the target pattern p. Recently, Reischuk and Zeugmann [38] proved that if the sample S is drawn from some fixed distribution satisfying certain benign restrictions and the learner is not required to output a descriptive pattern, then one can learn one-variable patterns with expected total time linear in the length of the pattern while converging within a constant number of rounds. In their paper, Reischuk and Zeugmann [38] suggested several research directions in learning one-variable patterns. First, they pointed out that even with two variables (i.e. k = 2) the situation becomes considerably more complicated and will require additional tools. One open problem they suggested is to construct efficient algorithms for learning unions of a constant number of one-variable pattern languages. In Section 5, we present an efficient algorithm to learn the union of L one-variable pattern languages in the mistake bound model. Our algorithm tolerates attribute errors but requires that the learner be given one positive example, which does not contain attribute noise, for each pattern. The number of attribute errors of a labeled string ⟨s, y⟩, with respect to a target pattern, is the number of (terminal) symbols of s that have to be changed so that the
On Learning Unions of Pattern Languages and Tree Patterns
classification of the resulting string by the target pattern is consistent with y. The update time is polynomial in the length of the noise-free positive example of each pattern and of the current instance that we want to classify. However, it is exponential in L. When L = 1, our algorithm is less efficient than Reischuk and Zeugmann's algorithm. However, our analysis is a worst-case analysis which does not assume the sample is drawn from a fixed distribution. It also tolerates concept drift. A concept class that closely resembles pattern languages is the class of tree patterns. A tree pattern p is a rooted tree where the internal nodes are labeled using a set T of terminal symbols while the leaves may be labeled using T or a set S of variable symbols. An instance t is a ground tree if all the nodes are labeled by terminal symbols. An instance t is in the language L(p) generated by a tree pattern p if t can be obtained from p by substituting the leaves labeled with the same variable symbol by the same ground tree. Those tree patterns where the siblings are distinguishable from each other are referred to as ordered, and otherwise as unordered. A union of ordered (resp. unordered) tree patterns is called an ordered forest (resp. unordered forest). In this paper, we consider only ordered trees and forests. For recent results on learning unordered forests see Amoth, Cull and Tadepalli [3]. The study of tree patterns is motivated by natural language processing [15] and symbolic integration [34], where instances are represented as parse trees and expressions [34], respectively. Tree patterns are also closely related to logic program representations [10, 23]. Using the exact learning model with membership and equivalence queries, Arimura, Ishizaka and Shinohara [11] showed that ordered forests with a bounded number of trees can be learned efficiently.
Subsequently, Amoth, Cull and Tadepalli [2] showed that ordered forests with an infinite alphabet are exactly learnable using equivalence and membership queries. They also showed that ordered trees are exactly learnable with only equivalence queries. We give an efficient algorithm to learn unions of a constant number of ordered tree patterns (in the mistake bound model without membership queries) in the presence of attribute noise. The number of attribute errors of a labeled ground tree ⟨t, y⟩, with respect to a target pattern, is the number of (terminal) symbols in the nodes of t that have to be changed so that the classification of the resulting tree by the target tree pattern is consistent with y. Our algorithm does not require any restrictions on the alphabet size for the terminal symbols or on the number of children per node.
2 Our Results
In this paper, we present algorithms to learn unions of pattern languages and tree patterns. We obtain all of our algorithms by reductions of the following flavor. We introduce two sets of boolean attributes. One set is to ensure that the terminal symbols have not been changed. The other set is for ensuring all variable symbols are substituted properly. The target concept is then represented as a conjunction of a relatively small number of these attributes. More specifically, the number of
relevant attributes depends only on the number of patterns in the target union, the number of variables in the patterns and the number of occurrences of the variable symbols in the patterns. We achieve this goal while keeping the total number of attributes polynomial in the length of the examples (which could be arbitrarily longer than the number of variable symbols). Furthermore, since the target concept is represented as a conjunction of boolean attributes, we can employ Winnow to obtain a small mistake bound and to handle attribute noise. Finally, since a disjunction of a constant number of terms can be reduced to a conjunction (with size exponential in the number of terms) we can use our technique to learn unions of a constant number of patterns. This approach seems to be quite general and was employed to learn geometric patterns [17]. It is possibly applicable to learning other pattern-related concept classes as well. In Section 4, we apply our technique to learn a union of a constant number of pattern languages with the only restriction being that there are fixed length substitutions. A pattern language L(p) is said to have fixed length substitutions if each variable x_i can only be substituted by terminal strings of constant length l(x_i). The constant l(x_i) depends only on x_i and can be different for different variables. Trivially, this means that all strings in L(p) must be of the same length. The resulting algorithm learns a union of pattern languages L(p_1), …, L(p_L) with fixed length substitutions using polynomial time (for L constant) for each prediction and with a worst-case mistake bound of

    O( (∏_{i=1}^{L} (2V_i − k_i + 1)) (∑_{i=1}^{L} log n_i) + ∑_{j=1}^{T} ∏_{i=1}^{L} min(2A_j, 2V_i − k_i + 1) )

where k_i is the number of variables in p_i, V_i is the total number of occurrences of variable symbols in p_i, A_j is the worst-case number of attribute errors in trial j, and n_i is the length of the given positive example for p_i (which must have no attribute errors). Note that the mistake bound only has a logarithmic dependence on the length of the examples. In addition, we could assign a penalty (i.e. weight) to each violation of the rule that a terminal symbol cannot be changed. The weights can be different for different terminal symbols. Similarly, we can also assign a penalty to each violation of the rule that a pair of variable symbols, of the same variable, must be substituted by the same terminal string. If the penalty incurred by an instance for violating these rules is below a given tolerable threshold then it is in the target concept L′(p) generated by p. If the penalty is above the threshold then it is not in L′(p). Since Winnow can learn linear threshold functions, the algorithms we present here can be extended to this more general class of pattern languages. Contrasting this positive result, we prove that if unions of an arbitrary number of such patterns can be learned efficiently in the mistake bound model then DNFs can be learned efficiently in the mistake bound model. Whether or not DNF formulas can be efficiently learned is one of the more challenging open problems. The problem remains open even for the easier PAC learning model.
Next, in Section 5, we present an algorithm to learn L(p_1) ∪ ⋯ ∪ L(p_L) where each p_i is a one-variable pattern. Our algorithm makes each prediction in polynomial time (for constant L) and has a worst-case mistake bound of

    O( (∏_{i=1}^{L} 2V_i) (∑_{i=1}^{L} log n_i) + ∑_{j=1}^{T} ∏_{i=1}^{L} min(2A_j, 2V_i) )

where V_i is the number of occurrences of the variable symbol in p_i, A_j is the worst-case number of attribute errors in trial j, n_i is the length of the first positive example for p_i (which must have no attribute errors) and m is the length of the example to be classified. In Section 6, we apply our technique to obtain an algorithm to learn ordered forests composed of tree patterns p_1, …, p_L. Our algorithm makes each prediction in polynomial time (for constant L) and has a worst-case mistake bound of

    O( (∏_{i=1}^{L} |p_i|) (∑_{i=1}^{L} log n_i) + ∑_{j=1}^{T} ∏_{i=1}^{L} min(2A_j, |p_i|) )

where A_j is the worst-case number of attribute errors in trial j, and n_i is the length of the first positive example for p_i (which must have no attribute errors). As in the case of pattern languages with fixed length substitutions, it has been shown that in the exact learning model with equivalence queries only, efficient learnability of ordered forests implies efficient learnability of DNFs [2]. Thus, it seems unlikely that unions of an arbitrary number of ordered tree patterns can be learned in the mistake bound model. In all of our algorithms, the requirement that the learner is initially given a noise-free positive example for each pattern or tree pattern in the target can be relaxed. One way is to sample the instance space for positively labeled instances. If the attribute noise rate is low then with high probability, we can obtain one noise-free positive example for each (tree) pattern unless the positive instances for a particular (tree) pattern do not occur frequently. In the latter case, we can ignore that (tree) pattern. We can then run our algorithms for each L-subset of these positive examples and use the weighted majority algorithm [30] of Littlestone and Warmuth to filter out the optimum algorithm.
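The filtering step just described can be sketched as follows. This is a minimal rendition of the weighted majority idea, not the paper's implementation: the function name and the demotion factor beta are our choices, and the experts stand in for the per-subset copies of the learning algorithm.

```python
def weighted_majority(experts, examples, beta=0.5):
    """Sketch of the weighted majority algorithm: keep one weight per expert,
    multiply the weight of each expert that errs by beta, and predict with
    the weighted vote.  Returns the number of mistakes made."""
    w = [1.0] * len(experts)
    mistakes = 0
    for x, y in examples:
        votes = [e(x) for e in experts]
        yes = sum(wi for wi, v in zip(w, votes) if v)
        no = sum(wi for wi, v in zip(w, votes) if not v)
        if (yes > no) != y:
            mistakes += 1
        for i, v in enumerate(votes):
            if v != y:               # wrong experts decay; a good expert keeps high weight
                w[i] *= beta
    return mistakes
```

On a sequence where one expert is perfect, the total number of mistakes is logarithmic in the number of experts plus a constant times the best expert's mistakes, which is what makes the O(log n²) overhead in Section 5 affordable.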
3 Preliminaries
In concept learning, each instance in an instance space X is labeled according to some target concept f. The target concept is assumed to be in some concept class C. The model used here is the on-line (a.k.a. mistake-bound) learning model [28, 6]. In this model, learning proceeds in a, possibly infinite, sequence of trials. In each trial, the learner is presented with an instance X_t from some domain X. The learner is required to make, in polynomial time, a prediction on the classification of X_t. In return, the learner receives the desired output f(X_t) as feedback.
A mistake is made if the prediction does not match the desired output. The learner's objective is to minimize the total number of mistakes. An important result in this model is Littlestone's algorithm Winnow for learning small conjunctions (or disjunctions) of boolean attributes when there is a large number of irrelevant attributes. Winnow maintains a linear threshold function ∑_{i=1}^{n} w_i x_i, where w_i is a weight that is associated with the boolean attribute x_i. Initially, all the weights are equal to 1. Upon receiving an input v_1, …, v_n, the algorithm predicts true if the sum ∑_{i=1}^{n} w_i v_i is greater than the fixed threshold θ, and false otherwise. Typically, the threshold is set to θ = n. If the prediction is wrong then the weights are updated as follows. Suppose the algorithm predicts false but the instance is in the target concept. Winnow promotes the weight w_i, for each attribute x_i in the instance that is set to 1, by multiplying w_i by some constant update factor α, for α > 1 (typically, we set α = 2). Otherwise, the algorithm must have predicted true but the instance is not in the target concept. In this case, for each literal x_i in the instance that is set to 1, Winnow demotes the weight w_i by dividing it by α. The number of attribute errors of a labeled example ⟨X_t, y_t⟩, with respect to the target disjunction, is the number of attributes of X_t that have to be changed so that the classification of the resulting example by the target is consistent with y_t. In the presence of attribute noise, Littlestone offers the following performance guarantee for Winnow.

Theorem 1. [29] Suppose the target concept is a k-conjunction (or k-disjunction) over N attributes and the example sequence contains at most A attribute errors. Then Winnow makes at most O(A + k log N) mistakes on any sequence of trials.

Auer and Warmuth [12] suggested a version of Winnow which tolerates concept drift. Here the target disjunction may drift (change slowly) in time. The idea is that when a weight is sufficiently small, we do not demote it any further.
We restrict our discussion in this paper to the original version of Winnow, but remark that we could use the drift-tolerant version of Winnow to yield results that tolerate concept shifts (details omitted).
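For concreteness, here is a minimal sketch of the Winnow update rule as described above (threshold θ = n, update factor α = 2). The class and method names are ours, not the paper's.

```python
class Winnow:
    """Minimal Winnow for monotone disjunctions: promote active weights by a
    factor alpha on a false negative, demote them by alpha on a false positive."""
    def __init__(self, n, alpha=2.0):
        self.alpha = alpha
        self.w = [1.0] * n       # all weights start at 1
        self.theta = float(n)    # fixed threshold, typically n

    def predict(self, x):
        """x is a 0/1 attribute vector; predict true iff the weighted sum exceeds theta."""
        return sum(wi * xi for wi, xi in zip(self.w, x)) > self.theta

    def update(self, x, y):
        """One trial: predict, then adjust the active weights if the prediction was wrong."""
        yhat = self.predict(x)
        if yhat == y:
            return
        for i, xi in enumerate(x):
            if xi:
                if y:                       # false negative: promote
                    self.w[i] *= self.alpha
                else:                       # false positive: demote
                    self.w[i] /= self.alpha

learner = Winnow(n=4)
learner.update([1, 0, 0, 0], True)   # false negative: w[0] is promoted to 2
```

Only the weights of attributes set to 1 ever move, which is what yields the logarithmic dependence on the total number of attributes in Theorem 1.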
4 Learning Unions of Pattern Languages with Fixed Length Substitutions
Although general pattern languages are difficult to learn, we prove the following theorem which states that if the target concept is a union of L (a constant number of) pattern languages that have fixed length substitutions then we can learn it efficiently in the on-line model in the presence of attribute noise. We note that for the case of a single pattern with fixed length substitutions without any attribute noise, one can use a direct application of the halving algorithm [14, 28] to obtain an algorithm with a polynomial mistake bound. Along with the restrictions mentioned above, when directly using the halving algorithm exponential time is required to make each prediction. The algorithm
we present handles a union of a constant number of patterns, is robust against attribute noise, and each prediction is made in polynomial time. We first prove our result for the case where the target is a single pattern with no attribute noise. Then, we generalize our result to unions of patterns in the presence of attribute noise.

Lemma 2. Suppose the target concept is the pattern language L(p) with fixed length substitutions. Further, suppose that p is composed of variables x_1, …, x_k with V total occurrences of the variable symbols in p. Then the target concept can be efficiently learned in the mistake bound model with O((2V − k + 1) log n) mistakes in the worst case. The time complexity per trial is O(n³), where n is the length of the first (positive) counterexample.

Proof. Our algorithm obtains its first positive counterexample by predicting negative until it gets a positive counterexample s_0. Let n be the length of s_0. Since all substitutions of the same variable in the target have the same length, we know that if an instance has length different from n then it is a negative instance. Thus, without loss of generality, we assume all the instances are exactly of length n. We denote the substring of a string s that begins at position i and ends at position j by s[i, j], and the ith symbol of s is denoted by s[i]. To make a prediction on an instance s, we transform s to a new instance with the following sets of boolean attributes:

X[i, j, l], 1 ≤ i < j ≤ n, 1 ≤ l ≤ n − j + 1. Each variable X[i, j, l] is set to 1 if and only if the two substrings s[i, i + l − 1] and s[j, j + l − 1] are identical.

C[i, j], 1 ≤ i ≤ j ≤ n. The variable C[i, j] is set to 1 if and only if the substrings s[i, j] and s_0[i, j] are the same.

We note that our reduction is a refinement of a more direct reduction that uses O(n²) variables (versus the O(n³) variables used above) for the case where the length of the substitutions must always be one. The following claim shows that by introducing the n³/6 + o(n³) variables of the form X[i, j, l], the target concept can be represented as a conjunction where the number of relevant variables is independent of n (versus having a linear dependence on n). By applying Winnow to learn this conjunction, we obtain a mistake bound with a logarithmic dependence on n versus a linear dependence on n.

Claim. The target k-variable pattern p can be expressed in the transformed instance space as a conjunction of 2V − k + 1 attributes. Here, V is the total number of occurrences of the variable symbols in p.

Proof. Since the substitutions of the same variable x must be of the same length l(x), the substitution of a particular variable symbol in all positive instances must appear in the same locations. That is, for the variable symbol x to appear in a particular location in p, its substitution in a positive instance must appear in positions i to i + l(x) − 1 for some fixed i. The substitutions for a variable x that appears in two distinct positions i and j are the same iff X[i, j, l(x)] = 1.
Consider a particular variable, say x_i. Suppose that x_i appears in a positive instance at positions j_1, …, j_{v_i}. Then for an instance to be positive, the (v_i − 1) transformed variables X[j_1, j_2, l(x_i)], …, X[j_{v_i − 1}, j_{v_i}, l(x_i)] must all be set to 1. Conversely, if one of these transformed variables is set to 0 then the instance must be negative. Further, suppose s_0[i, j] is a substring of s_0 that corresponds to a maximal substring in p consisting of only terminal symbols. In other words, all symbols in s_0[i, j] are terminal symbols in p, but s_0[i − 1] and s_0[j + 1] are symbols obtained from substituting a variable with a string of terminal symbols. Notice again that the substitution of a variable symbol must appear at a specific location and be of the same length. Therefore, for an instance s to be positive, the substrings s[i, j] and s_0[i, j] must be the same. The latter means that C[i, j] must be set to 1. Conversely, if for some s_0[i, j] that corresponds to a maximal substring in p consisting of only terminal symbols, the substring s[i, j] of some instance s does not match s_0[i, j] (i.e. C[i, j] = 0) then s must be negative. There are at most V + 1 of the C[i, j]'s that are positive (since each one, except the last, must end with one of the V variables). An instance s is positive if and only if (1) all the occurrences of the same variable symbol are substituted by the same strings of terminal symbols and (2) none of the substrings in p consisting of terminal symbols only are substituted. The above discussion implies that (1) and (2) can be ensured by checking that at most ∑_{i=1}^{k} (v_i − 1) = V − k variables X[i, j, l] and V + 1 variables C[i, j], respectively, are all 1s.

Consider the pattern p = x_1 1 x_3 01 x_2 001 x_1 x_2 11 x_1 with l(x_1) = 3, l(x_2) = 4, l(x_3) = 2 as an example. The proof of the above claim says that it can be represented as the conjunction

    (C[4, 4] ∧ C[7, 8] ∧ C[13, 15] ∧ C[23, 24]) ∧ (X[1, 16, 3] ∧ X[16, 25, 3]) ∧ X[9, 19, 4]

The variable x_1 must appear at positions 1, 16 and 25. Thus, the target conjunction must contain the variables X[1, 16, 3] and X[16, 25, 3]. The substring s[4, 4] of any positive instance s must always be the same as s_0[4, 4], which is the string 1. Thus, C[4, 4] must be present. The presence of the other attributes can be similarly explained. From the above claim we know that there are at most 2V − k + 1 relevant attributes. Combined with the fact that there are O(n³) boolean attributes, we obtain the desired mistake bound of Lemma 2 by applying Winnow. (Since Winnow learns a disjunction of boolean attributes, we apply Winnow to learn the negation of the target concept, which is represented as a disjunction of the negations of the attributes.) A straightforward implementation of the above idea would have time complexity O(n⁴) per trial. To reduce the time complexity, for each distinct pair of i and j, 1 ≤ i < j ≤ n, the learner first finds the longest common prefix of the suffix of s that begins at position i and the suffix that begins at position j. Say this common prefix is of length l_0. Then the learner sets all variables X[i, j, l], 1 ≤ l ≤ l_0, to 1 and X[i, j, l], l > l_0, to 0. The
C[i, j]'s can be evaluated in a similar way. This implementation reduces the time complexity to O(n³) per trial. This completes the proof of Lemma 2.

We now extend this result to the case of a union of a constant number of patterns with fixed length substitutions under attribute noise.

Theorem 3. Suppose the target concept is a union of pattern languages L(p_1), …, L(p_L) with fixed length substitutions. Further, suppose that for 1 ≤ i ≤ L, p_i has k_i variables and V_i total occurrences of variable symbols. Then the target concept can be efficiently learned in the mistake bound model. The number of mistakes made after T trials is bounded by

    O( (∏_{i=1}^{L} (2V_i − k_i + 1)) (∑_{i=1}^{L} log n_i) + ∑_{j=1}^{T} ∏_{i=1}^{L} min(2A_j, 2V_i − k_i + 1) )

in the worst case. The time complexity per trial is O((n_1 ⋯ n_L)³). We assume that initially the learner is given a noise-free positive example, of length n_i, for each pattern p_i. Here, A_j is the number of attribute errors in the jth trial. (For this bound to be meaningful, we assume A_j is zero in most of the trials.)

Proof. First we consider the case where the target is a union of L patterns satisfying the condition of Theorem 3, but with no attribute noise. In this case, each pattern p_i can be represented as a conjunction C_i of 2V_i − k_i + 1 attributes. The target is a disjunction f of the C_i's. Thus, its complement can be represented as a ∏_{i=1}^{L} (2V_i − k_i + 1)-term DNF which we denote by f′. Each term in f′ must contain exactly one literal from the set of transformed attributes corresponding to a pattern p_i, for each i = 1, …, L. Since there are at most O(n_i³) attributes for each pattern p_i, there are at most O((n_1 ⋯ n_L)³) possible terms to consider. Each such candidate term can be treated as a new attribute. Applying Winnow would then give us the desired mistake bound. Further, as before, the transformed attributes corresponding to the pattern p_i can be computed in O(n_i³) time. Thus, the time complexity to update the O((n_1 ⋯ n_L)³) attributes is O((n_1 ⋯ n_L)³). Finally, we introduce attribute errors. Suppose A_j symbol errors occur at trial j. Each symbol error can result in at most two relevant attributes of C_i being complemented. There are at most 2V_i − k_i + 1 literals in C_i. Thus, at most min(2A_j, 2V_i − k_i + 1) of the attributes in C_i are complemented. This implies that at most ∏_{i=1}^{L} min(2A_j, 2V_i − k_i + 1) attributes in f′ are complemented. This gives us the second term in the mistake bound that is due to attribute errors.

The next theorem suggests that it appears necessary to bound the number of patterns in the target for it to be efficiently learnable.

Theorem 4.
In the mistake bound model, if unions of an arbitrary number of pattern languages with the fixed length substitution restriction can be learned efficiently, then DNFs can be learned efficiently.

Proof. Suppose the learner is asked to learn a DNF f in the mistake bound model. Without loss of generality, we can assume f is monotone and there are n
variables x_1, …, x_n. Let {0, 1} and {α_1, …, α_n} be sets of terminal and variable symbols, respectively. Each term t in f can be represented as a pattern p(t) with n characters. The ith character is set to 1 if the literal x_i is in term t, and to α_i otherwise. We represent an instance x as an n-bit vector (string). If we restrict l(α_i) = 1 for all α_i's then clearly, t(x) = 1 iff x ∈ L(p(t)). This is a polynomial-time prediction-preserving reduction [37], which completes the proof.
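The transformation behind Lemma 2 can be made concrete. The sketch below is our illustration, not the paper's implementation: it uses the naive O(n³)-attribute construction rather than the faster longest-common-prefix one, and the first counterexample s_0 is a hypothetical string generated from the worked example p = x_1 1 x_3 01 x_2 001 x_1 x_2 11 x_1 with l(x_1) = 3, l(x_2) = 4, l(x_3) = 2.

```python
def attributes(s0, s):
    """Naive construction of the Section 4 attributes (1-indexed positions):
    X[(i, j, l)] : s[i..i+l-1] == s[j..j+l-1];  C[(i, j)] : s[i..j] == s0[i..j]."""
    n = len(s0)
    assert len(s) == n          # instances of a different length are negative anyway
    X, C = {}, {}
    for i in range(1, n + 1):
        C[(i, i)] = s[i - 1] == s0[i - 1]
        for j in range(i + 1, n + 1):
            C[(i, j)] = s[i - 1:j] == s0[i - 1:j]
            for l in range(1, n - j + 2):
                X[(i, j, l)] = s[i - 1:i - 1 + l] == s[j - 1:j - 1 + l]
    return X, C

def classify(s0, s, x_conj, c_conj):
    """Evaluate the conjunction of X- and C-attributes representing the target pattern."""
    X, C = attributes(s0, s)
    return all(X[a] for a in x_conj) and all(C[a] for a in c_conj)

# Hypothetical first counterexample: substitute x1 = "aaa", x2 = "bbbb", x3 = "cc".
s0 = "aaa" + "1" + "cc" + "01" + "bbbb" + "001" + "aaa" + "bbbb" + "11" + "aaa"
# The conjunction from the worked example in the proof of the claim.
x_conj = [(1, 16, 3), (16, 25, 3), (9, 19, 4)]
c_conj = [(4, 4), (7, 8), (13, 15), (23, 24)]
```

Any length-27 string whose variable slots are filled consistently satisfies the conjunction; changing one occurrence of x_1 falsifies X[(1, 16, 3)] and so is classified negative.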
5 Learning Unions of One-Variable Pattern Languages
We now consider the case of one-variable patterns without the fixed length substitution requirement. As in the last section, we first prove our result for the case where the target is a single pattern with no attribute noise. Then, we apply the same technique to generalize this result to unions of patterns in the presence of attribute noise.

Lemma 5. Suppose the target concept is a one-variable pattern p with V occurrences of the variable symbol. Then the target concept can be efficiently learned in the mistake bound model with O(V log n) mistakes in the worst case. The time complexity per trial is O(n⁴mV) = O(n⁵m), where n is the length of the first (positive) counterexample and m is the length of the example to be classified.

Proof. The learner guesses negative until obtaining a positive counterexample s_0. Denote the length of s_0 by n, and the starting position¹ of the ith (counting from the leftmost end of the pattern) substitution of the variable x by α_i. For the moment, we assume the learner is told the number of occurrences V of the variable symbol in the target, and the length β of the substituted terminal string in s_0. Suppose the learner is asked to classify a given unlabeled instance s of length m. If the difference in length of s_0 and s is not divisible by V then we can conclude immediately that s must be classified negative. Henceforth, we assume the difference between the lengths of s_0 and s is divisible by V. If s is positive then the substitution of x in s has length β′ = β + (m − n)/V. The ith substitution of x in s must be at location α′_i = α_i + (i − 1)(m − n)/V, and the ith substitution for the variable x is the substring s[α′_i, α′_i + β′ − 1]. In other words, to see if all substitutions of x in s are the same, we simply check for all i = 2, …, V whether s[α′_{i−1}, α′_{i−1} + β′ − 1] = s[α′_i, α′_i + β′ − 1]. If this is not so then we can immediately conclude that s ∉ L(p). Unfortunately, we do not know the α_i's.
To circumvent this problem, we introduce new attributes X[δ, γ, i], 2 ≤ δ ≤ γ ≤ n, 2 ≤ i ≤ V, such that X[δ, γ, i] is set to true if and only if the substring s[δ + (i − 2)(m − n)/V, δ + (i − 2)(m − n)/V + β′ − 1] is the same as s[γ + (i − 1)(m − n)/V, γ + (i − 1)(m − n)/V + β′ − 1]; that is, X[α_{i−1}, α_i, i] holds iff the (i − 1)st and ith substitutions of x in s agree. Clearly, if all substitutions of x in s are the same then the (V − 1)-conjunction

    C_X = X[α_1, α_2, 2] ∧ ⋯ ∧ X[α_{i−1}, α_i, i] ∧ ⋯ ∧ X[α_{V−1}, α_V, V]

is satisfied, and vice versa.

¹ These positions are not known to the learner.
To classify an instance correctly as positive, we also need to ensure that the terminal symbols in p remain the same. Let α_0 = 1 − β and α_{V+1} = n + 1. Then clearly, the ith substring of terminal symbols between the ith variable symbol and the (i + 1)st variable symbol is the string s_0[α_i + β, α_{i+1} − 1] (which is defined to be the empty string if α_i + β > α_{i+1} − 1). If none of the terminal symbols in this maximal substring of terminal symbols is changed in s then it must appear in s[α′_i + β′, α′_{i+1} − 1]. In other words, to check that none of the terminal symbols in the target has been replaced, it is sufficient and necessary to verify that

    s[α′_i + β′, α′_{i+1} − 1] = s_0[α_i + β, α_{i+1} − 1],   i = 0, …, V.    (1)

As before, since we do not know where the α_i's are, we introduce new attributes C[i, B, E], 0 ≤ i ≤ V, 1 ≤ B ≤ E ≤ n. We set C[i, B, E] to 1 when s[B + i(m − n)/V, E + i(m − n)/V] = s_0[B, E]. It is easy to verify that saying Equation 1 is satisfied is the same as saying the conjunction C_T (shown below) is satisfied:

    C_T = ⋀_{i=0}^{V} C[i, α_i + β, α_{i+1} − 1]

Therefore, the target pattern p can be represented as a conjunction C_T ∧ C_X of 2V boolean attributes. There are O(n²V) possible attributes to consider. Thus, running Winnow to learn C_T ∧ C_X guarantees that at most O(2V(log n + log V)) = O(2V log n) mistakes are made (since V ≤ n). The question remains of guessing β and V correctly. There are only O(n²) such guesses. We can run one copy of the above algorithm for each guess and run the weighted majority algorithm [30] on these algorithms. The mistake bound is O(log(n²) + 2V log n) = O(V log n) with running time O(n⁴mV) = O(n⁵m).

Lemma 5 can be extended to learn unions of one-variable pattern languages in the presence of attribute noise (except for the first counterexample for each pattern, which must be noise free). The bound obtained is shown in the next theorem.

Theorem 6. Suppose the target concept p is a union of one-variable pattern languages L(p_1), …, L(p_L). Further, suppose the number of occurrences of the variable symbol in p_i is V_i. Then p can be efficiently learned in the mistake bound model. The number of mistakes made after T trials is bounded by

    O( (∏_{i=1}^{L} 2V_i) (∑_{i=1}^{L} log n_i) + 2 ∑_{j=1}^{T} ∏_{i=1}^{L} min(A_j, V_i) )

in the worst case. The time complexity per trial is O(m(n_1 ⋯ n_L)⁴ ∑_{i=1}^{L} V_i) = O(m(n_1 ⋯ n_L)⁵). Here, we assume that initially the learner is given a noise-free positive example, of length n_i, for each pattern p_i; m is the length of the unlabeled example to be classified in the (T + 1)st trial.
A_j is the number of attribute errors in the jth trial. (For this bound to be meaningful, we assume that A_j is zero in most of the trials.)

Proof. (Sketch) We obtain this result by extending Lemma 5 to unions of languages with attribute noise using the same technique as that used in extending Lemma 2 to Theorem 3.
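The length arithmetic in the proof of Lemma 5 determines the substitution length from the instance length alone. A minimal noise-free membership check built on that observation (our sketch, not the attribute-based learner itself) looks like:

```python
def one_var_match(p, s):
    """Membership test for a one-variable pattern (noise-free sketch).
    p is a list of tokens, each either a terminal string or the variable "x".
    All occurrences of "x" must take the same nonempty substitution, whose
    length is forced by |s|, as in the position arithmetic of Section 5."""
    V = sum(1 for tok in p if tok == "x")
    t = sum(len(tok) for tok in p if tok != "x")
    if V == 0:
        return s == "".join(p)
    d, r = divmod(len(s) - t, V)
    if r != 0 or d <= 0:        # substitution length must be a positive integer
        return False
    sub = None
    pos = 0
    for tok in p:
        if tok == "x":
            w = s[pos:pos + d]
            if sub is None:
                sub = w          # first occurrence fixes the substitution
            elif w != sub:       # every later occurrence must agree
                return False
            pos += d
        else:
            if s[pos:pos + len(tok)] != tok:
                return False     # a terminal segment was altered
            pos += len(tok)
    return pos == len(s)
```

The two failure branches correspond exactly to the two attribute families in the proof: the substitution-agreement checks (C_X) and the terminal-segment checks (C_T).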
6 Learning Ordered Forests
We have demonstrated how the problem of learning unions of pattern languages can be reduced to learning conjunctions of boolean attributes. Next, we apply this idea to learning ordered forests with a bounded number of trees. No restrictions are needed on the number of children per node or the alphabet size for the terminal symbols.

Theorem 7. Ordered forests composed of tree patterns p_1, …, p_L can be efficiently learned in the on-line model. The number of mistakes made after T trials is bounded by

    O( (∏_{i=1}^{L} |p_i|) (∑_{i=1}^{L} log n_i) + ∑_{j=1}^{T} ∏_{i=1}^{L} min(2A_j, |p_i|) )

in the worst case. The time complexity per trial is O((n_1 ⋯ n_L)³). Here, we assume that initially the learner is given a noise-free positive example, of length n_i, for each tree pattern p_i. A_j is the number of attribute errors in the jth trial. (For this bound to be meaningful, we assume that A_j is zero in most of the trials.)

Proof. (Sketch) We present only the proof for the case of learning a single ordered tree pattern. The extension of the proof to the case of learning ordered forests in the presence of attribute errors is like that used to prove Theorem 3. Suppose t is a tree and u is a node in t. Let path_t(u) denote the labeled path obtained by traversing from the root of t to u. Given two distinct trees t and t′, we say path_t(u) = path_{t′}(u′) if and only if the sequences of the node labels (except for the last) and the branches taken as we traverse from the root of t to u and from the root of t′ to u′ are the same. As before, we simply keep predicting negative until we get a positive counterexample t_0. Let n denote the number of nodes in t_0. To make a prediction on an instance t, we transform t to a new instance with the following set of O(n²) attributes (see Figure 1 for an illustration). For each vertex u′ in t_0, we introduce a new attribute C[u′]. This attribute is set to 1 if and only if path_{t_0}(u′) = path_t(u) for some node u in t, and the labels of the nodes u in t and u′ in t_0 are the same.
[Figure 1 appears here: on the left, a tree pattern p; on the right, a (positive) ground instance t_0 that can be generated by p, with its nodes numbered 1-12. The artwork is not reproduced.]
Fig. 1. The figure on the left shows a tree pattern p. The figure on the right is a tree instance t_0 that can be generated by p. If t_0 is the first counterexample obtained then the conjunctive representation of p is C[1] ∧ C[2] ∧ C[3] ∧ C[6] ∧ X[4, 7].

For each distinct pair of nodes u′ and v′ in t_0, we introduce a new attribute X[u′, v′]. X[u′, v′] is set to 1 if and only if there are two distinct nodes u and v in t that satisfy:

1. path_{t_0}(u′) = path_t(u) and path_{t_0}(v′) = path_t(v);
2. the two subtrees in t that are rooted at u and v are identical. (Since the siblings are distinguishable, we can check that the subtrees are identical in linear time.)

Let t′ be the new instance with the O(n²) boolean attributes obtained by the above transformation.
Claim. The target tree pattern p can be represented as a conjunction f of at most |p| of the new boolean attributes such that, given an instance t, the transformed instance t′ is classified positive by f iff t is classified as positive by p.

Proof. To verify that an instance t is in L(p), it is necessary and sufficient to ensure that the following two conditions are satisfied.

1. For each node ū in p that is labeled by a terminal symbol, there is a corresponding node u in t such that path_p(ū) = path_t(u) and both ū and u have the same terminal label.
2. For each pair of distinct leaves ū and v̄ in p labeled by the same variable, there are two nodes u and v in t such that path_p(ū) = path_t(u) and path_p(v̄) = path_t(v). Furthermore, the subtrees in t rooted at u and v are identical. That is, the substitutions in t for ū and v̄ are the same.
Sally A. Goldman and Stephen S. Kwek
Clearly, path_p(u) = path_t(u') is equivalent to path_t_0(u_0) = path_t(u') for the node u_0 in t_0 that corresponds to u in p. Condition 1 is satisfied if and only if, for each node u_0 in t_0 that corresponds to a node in p labeled using a terminal symbol, the attribute C[u_0] is set to 1. To ensure Condition 2 is satisfied, it is sufficient to check that X[u_0, v_0] = 1 for each pair of distinct nodes u_0 and v_0 in t_0 that correspond to some pair of distinct leaves in p labeled by the same variable symbol. Suppose the leaves in t_0 that correspond to substituting a variable symbol x in p are l_1, ..., l_k. Then it suffices to check that X[l_1, l_2] = X[l_2, l_3] = ··· = X[l_{k-1}, l_k] = 1; equality of all k substitutions then follows by transitivity. Therefore, the target concept can be represented as a conjunction of at most |p| of the transformed attributes. This completes the proof of the claim.

Combining the above claim with Theorem 1 and the technique used in Theorem 3 completes the proof of Theorem 7.

Amoth, Cull and Tadepalli [2] have shown that DNF and the class of ordered forests with bounded² label alphabet size and bounded number of children per node are equivalent. Hence, it seems unlikely that unions of an arbitrary number of tree patterns can be learned in the mistake bound model.
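The representation just described, one C literal per terminal-labeled node of p and a chain of X literals per variable, can be sketched as follows. This is an illustrative reconstruction under the assumed (label, children) tree encoding; literals are returned as tagged tuples rather than the paper's C[·]/X[·,·] notation.

```python
# Sketch of the conjunction f representing a tree pattern p:
# a ('C', path) literal for each terminal-labeled node, and a chain of
# ('X', path_i, path_{i+1}) literals over each variable's occurrences.

def conjunction_of(p, is_variable):
    by_var, literals = {}, []

    def walk(t, prefix):
        label, children = t
        if is_variable(label):
            by_var.setdefault(label, []).append(prefix)
        else:
            literals.append(('C', prefix))
        for i, child in enumerate(children):
            walk(child, prefix + ((label, i),))

    walk(p, ())
    for occurrences in by_var.values():
        # equality of all substitutions follows from the chain by transitivity
        for a, b in zip(occurrences, occurrences[1:]):
            literals.append(('X', a, b))
    return literals
```

For the pattern of Figure 1 this yields one C literal per terminal node and one X literal linking the two x1 leaves, i.e. at most |p| literals in total.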
7 Conclusion
In this paper, we demonstrated how learning unions of pattern languages and pattern-related concepts can be reduced to learning disjunctions of boolean attributes. In particular, we presented efficient on-line algorithms for learning unions of a constant number of tree patterns, unions of a constant number of one-variable pattern languages, and unions of a constant number of pattern languages with fixed length substitutions. All of our algorithms are robust against attribute noise and can be modified to handle concept drift. Further, our mistake bounds have only a logarithmic dependence on the length of the examples. The requirement that the learner be given a noise-free example for each pattern can be removed by sampling, as discussed in Section 1.

There are several interesting future directions suggested by this work. As we have discussed, we could generalize the class of pattern languages by assigning a penalty (i.e., a weight) to each violation of the rule that a terminal symbol cannot be changed. The weights can be different for different terminal symbols. Similarly, we can also assign a penalty to each violation of the rule that a pair of occurrences of the same variable must be substituted by the same terminal string. If the penalty incurred by an instance for violating these rules is below a given tolerable threshold, then it is in the target concept L'(p) generated by p. If the penalty is above the threshold, then it is not in L'(p). It would be very interesting to explore applications for this extension and compare our approach to those currently in use.
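A minimal sketch of this penalty-based membership test for L'(p) might look as follows. Everything here is a hypothetical illustration of the idea, not a construction from the paper: the weights, the restriction to fixed (length-1) substitutions, and the majority-based scoring of variable disagreements are all assumptions.

```python
# Hypothetical sketch of the penalty-thresholded language L'(p) sketched
# above, for string patterns with fixed length-1 substitutions (each
# variable is replaced by a single terminal symbol). Weights are illustrative.

TERMINAL_WEIGHT = 1.0   # cost of changing a terminal symbol of p
VARIABLE_WEIGHT = 1.0   # cost per occurrence disagreeing with a variable's majority value

def penalty(pattern, instance, is_variable):
    """Total penalty of matching instance against pattern (same length)."""
    assert len(pattern) == len(instance)
    cost = 0.0
    seen = {}  # variable symbol -> symbols substituted for it
    for p_sym, i_sym in zip(pattern, instance):
        if is_variable(p_sym):
            seen.setdefault(p_sym, []).append(i_sym)
        elif p_sym != i_sym:
            cost += TERMINAL_WEIGHT  # a terminal of p was changed
    for subs in seen.values():
        # charge each occurrence that disagrees with the most common value
        majority = max(set(subs), key=subs.count)
        cost += VARIABLE_WEIGHT * sum(1 for s in subs if s != majority)
    return cost

def in_extended_language(pattern, instance, threshold, is_variable):
    """Instance is in L'(p) iff its penalty is within the tolerable threshold."""
    return penalty(pattern, instance, is_variable) <= threshold
```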
² They showed that ordered forests can be learned using subset queries and equivalence queries. Further, if the alphabet size or number of children per node is unbounded, then subset queries can be simulated using membership queries by using a unique label or a subtree to stand for each variable.
In this paper, we solved one of the open problems suggested by Reischuk and Zeugmann [38]. Namely, we gave an efficient algorithm to learn unions of a constant number of one-variable pattern languages. We were also able to learn unions of a constant number of pattern languages (with no restriction on the number of variables) when we restricted the substitutions to fixed length substitutions. A challenging open problem from Reischuk and Zeugmann that we did not resolve here is learning the class of 2-variable pattern languages (in the mistake bound model). While additional tools will be needed to solve this problem, we feel that the techniques proposed here may be applicable to it.
References

[1] Andris Ambainis, Sanjay Jain, and Arun Sharma. Ordinal mind change complexity of language identification. In Computational Learning Theory: Eurocolt '97, pages 301-315. Springer-Verlag, 1997.
[2] Thomas R. Amoth, Paul Cull, and Prasad Tadepalli. Exact learning of tree patterns from queries and counterexamples. In Proc. 11th Annu. Conf. on Comput. Learning Theory, pages 175-186. ACM Press, New York, NY, 1998.
[3] Thomas R. Amoth, Paul Cull, and Prasad Tadepalli. Exact learning of unordered tree patterns from queries. In Proc. 12th Annu. Conf. on Comput. Learning Theory, pages 323-332. ACM Press, New York, NY, 1999.
[4] D. Angluin. Finding patterns common to a set of strings. J. of Comput. Syst. Sci., 21:46-62, 1980.
[5] D. Angluin. Inductive inference of formal languages from positive data. Inform. Control, 45(2):117-135, May 1980.
[6] D. Angluin. Queries and concept learning. Machine Learning, 2(4):319-342, April 1988.
[7] S. Arikawa, S. Kuhara, S. Miyano, Y. Mukouchi, A. Shinohara, and T. Shinohara. A machine discovery from amino acid sequences by decision trees over regular patterns. In Intern. Conference on Fifth Generation Computer Systems, 1992.
[8] S. Arikawa, S. Miyano, A. Shinohara, T. Shinohara, and A. Yamamoto. Algorithmic learning theory with elementary formal systems. In IEICE Trans. Inf. and Syst., volume E75-D No. 4, pages 405-414, 1992.
[9] S. Arikawa, A. Shinohara, S. Miyano, and A. Shinohara. More about learning elementary formal systems. In Nonmonotonic and Inductive Logic, Lecture Notes in Artificial Intelligence, volume 659, pages 107-117. Springer-Verlag, 1991.
[10] H. Arimura, H. Ishizaka, T. Shinohara, and S. Otsuki. A generalization of the least general generalization. In Machine Learning, volume 13, pages 59-85. Oxford Univ. Press, 1994.
[11] Hiroki Arimura, Hiroki Ishizaka, and Takeshi Shinohara. Learning unions of tree patterns using queries. In Proc. 6th Int. Workshop on Algorithmic Learning Theory, pages 66-79. Springer-Verlag, 1995.
[12] Peter Auer and Manfred Warmuth. Tracking the best disjunction. In Proceedings of the 36th Annual Symposium on Foundations of Computer Science, pages 312-321. IEEE Computer Society Press, Los Alamitos, CA, 1995.
[13] A. Bairoch. Prosite: A dictionary of sites and patterns in proteins. In Nucleic Acid Research, volume 19, pages 2241-2245, 1991.
[14] J. M. Barzdin and R. V. Frievald. On the prediction of general recursive functions. Soviet Math. Doklady, 13:1224-1228, 1972.
[15] C. Cardie. Empirical methods in information extraction. In AI Magazine, volume 18, pages 65-80, 1997.
[16] Thomas Erlebach, Peter Rossmanith, Hans Stadtherr, Angelika Steger, and Thomas Zeugmann. Learning one-variable pattern languages very efficiently on average, in parallel, and by asking queries. In Algorithmic Learning Theory: ALT '97, pages 260-276. Springer-Verlag, 1997.
[17] Sally A. Goldman, Stephen S. Kwek, and Stephen D. Scott. Agnostic learning of geometric patterns. In Proc. 10th Annu. Conf. on Comput. Learning Theory, pages 325-333. ACM Press, New York, NY, 1997.
[18] C. Hua and K. Ko. A note on the pattern-finding problem. Technical Report UH-CS-84-4, Department of Computer Science, University of Houston, 1984.
[19] O. H. Ibarra and T. Jiang. Learning regular languages from counterexamples. In Proc. 1st Annu. Workshop on Comput. Learning Theory, pages 371-385, San Mateo, CA, 1988. Morgan Kaufmann.
[20] Sanjay Jain and Arun Sharma. Elementary formal systems, intrinsic complexity, and procrastination. In Proc. 9th Annu. Conf. on Comput. Learning Theory, pages 181-192. ACM Press, New York, NY, 1996.
[21] K. P. Jantke. Polynomial-time inference of general pattern languages. In Proceedings of the Symposium of Theoretical Aspects of Computer Science, Lecture Notes in Computer Science, volume 166, pages 314-325. Springer, 1984.
[22] K. P. Jantke and S. Lange. Case-based representation and learning of pattern languages. In Proc. 4th Internat. Workshop on Algorithmic Learning Theory, pages 87-100. Springer-Verlag, 1993. Lecture Notes in Artificial Intelligence 744.
[23] C. Page Jr. and A. Frisch. Generalization and learnability: A study of constrained atoms. In Inductive Logic Programming, pages 29-61, 1992.
[24] M. Kearns and L. Pitt. A polynomial-time algorithm for learning k-variable pattern languages from examples. In Proc. 2nd Annu. Workshop on Comput. Learning Theory, pages 57-71, San Mateo, CA, 1989. Morgan Kaufmann.
[25] K. Ko, A. Marron, and W. Tzeng. Learning string patterns and tree patterns from examples. Abstract. State University of New York at Stony Brook, 1989.
[26] K. Ko and W. Tzeng. Three Σ₂ᵖ-complete problems in computational learning theory. Computational Complexity, 1(3):269-310, 1991.
[27] S. Lange and R. Wiehagen. Polynomial time inference of arbitrary pattern languages. New Generation Computing, 8:361-370, 1991.
[28] N. Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2:285-318, 1988.
[29] N. Littlestone. Redundant noisy attributes, attribute errors, and linear threshold learning using Winnow. In Proc. 4th Annu. Workshop on Comput. Learning Theory, pages 147-156, San Mateo, CA, 1991. Morgan Kaufmann.
[30] N. Littlestone and M. K. Warmuth. The weighted majority algorithm. Information and Computation, 108(2):212-261, 1994.
[31] A. Marron. Learning pattern languages from a single initial example and from queries. In Proc. 1st Annu. Workshop on Comput. Learning Theory, pages 345-358, San Mateo, CA, 1988. Morgan Kaufmann.
[32] Satoshi Matsumoto and Ayumi Shinohara. Learning pattern languages using queries. In Computational Learning Theory: Eurocolt '97, pages 185-197. Springer-Verlag, 1997.
[33] Andrew R. Mitchell. Learnability of a subclass of extended pattern languages. In Proc. 11th Annu. Conf. on Comput. Learning Theory, pages 64-71. ACM Press, New York, NY, 1998.
[34] T. Mitchell, P. Utgoff, and R. Banerji. Learning by experimentation: Acquiring and refining problem solving heuristics. In R. Michalski, J. Carbonell, T. Mitchell, eds., Machine Learning, pages 163-190. Palo Alto, CA: Tioga, 1983.
[35] S. Miyano, A. Shinohara, and T. Shinohara. Which classes of elementary formal systems are polynomial-time learnable? In Proc. 2nd Int. Workshop on Algorithmic Learning Theory, pages 139-150. IOS Press, 1992.
[36] R. Nix. Editing by example. In Proc. 11th ACM Symposium on Principles of Programming Languages, pages 186-195. ACM Press, 1984.
[37] L. Pitt and M. K. Warmuth. Prediction preserving reducibility. J. of Comput. Syst. Sci., 41(3):430-467, December 1990. Special issue for the Third Annual Conference on Structure in Complexity Theory (Washington, DC, June 1988).
[38] Rüdiger Reischuk and Thomas Zeugmann. Learning one-variable pattern languages in linear average time. In Proc. 11th Annu. Conf. on Comput. Learning Theory, pages 198-208. ACM Press, New York, NY, 1998.
[39] R. E. Schapire. Pattern languages are not learnable. In Proc. 3rd Annu. Workshop on Comput. Learning Theory, pages 122-129, San Mateo, CA, 1990. Morgan Kaufmann.
[40] T. Shinohara. Polynomial time inference of extended regular pattern languages. In RIMS Symposia on Software Science and Engineering, Kyoto, Japan, pages 115-127. Springer-Verlag, 1982. Lecture Notes in Computer Science 147.
[41] T. Shinohara. Polynomial time inference of pattern languages and its applications. Proceedings, 7th IBM Symp. on Math. Foundations of Computer Science, 1982.
[42] E. Tateishi, O. Maruyama, and S. Miyano. Extracting motifs from positive and negative sequence data. In Proc. 13th Symposium on Theoretical Aspects of Computer Science, Lecture Notes in Computer Science 1046, pages 219-230, 1996.
[43] E. Tateishi and S. Miyano. A greedy strategy for finding motifs from positive and negative examples. In Proc. First Pacific Symposium on Biocomputing, pages 599-613. World Scientific Press, 1996.
[44] R. Wiehagen and T. Zeugmann. Ignoring data may be the only way to learn efficiently. Journal of Experimental and Theoretical Artificial Intelligence, 6:131-144, 1994.
[45] K. Wright. Identification of unions of languages drawn from an identifiable class. In Proc. 2nd Annu. Workshop on Comput. Learning Theory, pages 328-333. Morgan Kaufmann, 1989. (See also the correction by Motoki, Shinohara and Wright in the Proceedings of the Fourth Annual Workshop on Computational Learning Theory, page 375, 1991.)
[46] T. Zeugmann. Lange and Wiehagen's pattern language learning algorithm: An average-case analysis with respect to its total learning time. Technical Report RIFIS-TR-CS-111, RIFIS, Kyushu University 33, 1995.
Author Index
Balcázar, José L., 77
Bshouty, Nader H., 206
Castro, Jorge, 77
Cheung, Dennis, 231
Dalmau, Víctor, 301
De Comité, Francesco, 219
Denis, François, 219
Domingo, Carlos, 241
Eiron, Nadav, 206
Evgeniou, Theodoros, 106
Fukumizu, Kenji, 51
Gilleron, Rémi, 219
Goldman, Sally A., 347
Grieser, Gunter, 118
Guijarro, David, 77, 313
Haraguchi, Makoto, 194
Hermo, Montserrat, 291
Hirata, Kouichi, 157
Kalnishkan, Yuri, 323
Kushilevitz, Eyal, 206
Kwek, Stephen S., 347
Lange, Steffen, 118
Lavín, Víctor, 291
Letouzey, Fabien, 219
Mitchell, Andrew, 93
Morik, Katharina, 1
Morita, Nobuhiro, 194
Nessel, Jochen, 264
Nock, Richard, 182
Okubo, Yoshiaki, 194
Pontil, Massimiliano, 106, 252
Rifkin, Ryan, 252
Rossmanith, Peter, 132
Sasaki, Yutaka, 169
Schapire, Robert E., 13
Scheffer, Tobias, 93
Sharma, Arun, 93
Simon, Hans-Ulrich, 77
Stephan, Frank, 93, 276
Takimoto, Eiji, 335
Tarui, Jun, 313
Tsukiji, Tatsuie, 313
Verri, Alessandro, 252
Warmuth, Manfred K., 335
Watanabe, Sumio, 39
Watson, Phil, 145
Wiedermann, Jiří, 63
Yamanishi, Kenji, 26
Zeugmann, Thomas, 276